R scripts for ADMIXTOOLS 2 [Nganasankhan from Anthrogenica]

Archetype0ne

Regular Member
Messages
1,696
Reaction score
578
Points
113
Ethnic group
Albanian
Y-DNA haplogroup
L283>Y21878>Y197198
I saw a very interesting post by Nganasankhan on Anthrogenica. I asked him whether he would like to post this on Eupedia, or for his permission to post it here with credit in case he wasn't interested. He said he is already getting bored with genetics and had no interest in joining another forum. But here are some of his posts. If any of you can run scripts/code, his published code could be of great use for other populations/examples. The following post is copied straight from his thread with his permission.



The scripts in this post use the 1240K+HO dataset, which can be downloaded from here: https://reich.hms.harvard.edu/allen-...cient-dna-data.

This installs ADMIXTOOLS 2 (https://github.com/uqrmaie1/admixtools):

Code:
install.packages("devtools")
devtools::install_github("uqrmaie1/admixtools")
library("admixtools")

When I tried to install `devtools`, it failed at first because of an error installing `gert`, a dependency of its dependency `usethis`. By running `install.packages("gert")`, I saw that I needed to run `brew install libgit2`. Even then, `admixtools` failed to install because of an error installing its dependency `igraph`. By googling the error, I saw that I needed to run `brew uninstall --ignore-dependencies suite-sparse`.


f3 as percentages with min-max scaling

This shows the amount of shared drift with Han when using Yoruba as an outgroup. The values are min-max scaled so that they are displayed on a scale where Villabruna is 0 and Nganasan is 100.

Code:
$ printf %s\\n Bashkir Besermyan Chuvash Enets Estonia_N_CombCeramic.SG Estonian.DG Finnish Finnish.DG Hungarian.DG Italy_North_Villabruna_HG Karelian Mansi Mari.SG Mordovian Nganasan Norway_Viking_o1.SG Russia_Bolshoy Russia_Chalmny_Varre Russia_EHG Russia_MLBA_Krasnoyarsk_o Russia_Shamanka_Eneolithic.SG Saami.DG Selkup Tatar_Mishar Tatar_Siberian_Zabolotniye Todzin Tofalar Udmurt>pops
$ R -e 'library(tidyverse);library(admixtools);f3=f3("path/to/v44.3_HO_public","Yoruba","Han",readLines("pops"))%>%arrange(est);est=f3$est-min(f3$est);est=est/max(est);cat(paste(round(100*est),f3$pop3),sep="\n")'
[...]
0 Italy_North_Villabruna_HG
2 Hungarian.DG
6 Estonian.DG
12 Finnish.DG
12 Mordovian
13 Finnish
13 Russia_EHG
14 Karelian
20 Tatar_Mishar
26 Besermyan
27 Chuvash
28 Mari.SG
29 Russia_Chalmny_Varre
30 Udmurt
33 Saami.DG
37 Russia_MLBA_Krasnoyarsk_o
38 Bashkir
40 Norway_Viking_o1.SG
54 Russia_Bolshoy
55 Tatar_Siberian_Zabolotniye
57 Mansi
70 Selkup
80 Enets
90 Tofalar
94 Todzin
99 Russia_Shamanka_Eneolithic.SG
100 Nganasan
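The min-max step in the one-liner can be reproduced on its own; `est` here is a made-up vector standing in for the f3 estimates:

```r
# scale so the smallest value maps to 0 and the largest to 100
est=c(.20,.23,.26)
est=est-min(est)
est=est/max(est)
round(100*est) # 0 50 100
```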

If only a low number of SNPs remains after filtering, try removing samples with a low SNP count. For example, in a run before the one shown above, my SNP count was 12245. I ran a command like this to find samples with a low SNP count:

Code:
$ awk -F\\t 'NR==FNR{a[$0];next}$8 in a{print$15,$8,$2}' pops path/to/v44.3_HO_public.anno|sort -n|head -n8
30001 Estonia_N_CombCeramic.SG MA974.SG
58972 Estonia_N_CombCeramic.SG MA975.SG
201719 Russia_Chalmny_Varre CHV002.A0101
205900 Norway_Viking_o1.SG VK518.SG
218757 Russia_Shamanka_Eneolithic.SG DA251.SG
287462 Russia_Bolshoy BOO002.A0101
289954 Russia_EHG UzOO77
300800 Russia_Shamanka_Eneolithic.SG DA250.SG


After I removed Estonia_N_CombCeramic.SG, my SNP count increased to 73054.
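The join can be tried on a toy tab-separated file first; the sample rows below are made up, and real anno files have many more columns:

```shell
# toy anno: field 2 = sample ID, field 8 = group, field 15 = SNP count
printf 'G1\nG2\n' > pops
printf 'v1\tS1\ta\tb\tc\td\te\tG1\tf\tg\th\ti\tj\tk\t100\n' > anno
printf 'v1\tS2\ta\tb\tc\td\te\tG3\tf\tg\th\ti\tj\tk\t200\n' >> anno
awk -F'\t' 'NR==FNR{a[$0];next}$8 in a{print $15,$8,$2}' pops anno | sort -n
# -> 100 G1 S1
```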


f3 as a heatmap

This applies min-max scaling to f3 values and displays the scaled values as percentages:

Code:
library(tidyverse)
library(admixtools)
library(pheatmap)
library(colorspace)

minmax=function(x)(x-min(x,na.rm=T))/(max(x,na.rm=T)-min(x,na.rm=T))

p1="Yoruba.DG"
p2=c("Nganasan","Russia_HG_Karelia","Norway_N_HG.SG","Turkey_N","Russia_HG_Tyumen","Han","Estonia_CordedWare","Russia_Bolshoy","Luxembourg_Loschbour","Russia_Samara_EBA_Yamnaya","Georgia_Kotias.SG")
p3=c("Nganasan","Estonian.DG","Selkup","Hungarian.DG","Karelian","Mansi","Udmurt","Mordovian","Saami.DG","Besermyan","Finnish","Finland_Levanluhta","Tofalar","Russia_MLBA_Krasnoyarsk_o","Russia_Bolshoy","Enets","Russia_Chalmny_Varre","Norwegian")

f3=f3("path/to/v44.3_HO_public",p1,p2,p3)

t=f3%>%select(pop2,pop3,est)%>%pivot_wider(names_from=pop2,values_from=est)%>%data.frame(row.names=1)

# don't show the f3 values of populations with themselves in order to not stretch the color scale
makena=names(which(sapply(rownames(t),function(x)any(names(t)==x))))
lapply(makena,function(x)t[x,x]<<-NA)

range=apply(t,2,function(x)paste(sub("^0","",sprintf("%.3f",range(x,na.rm=T))),collapse="-"))
names(t)=paste0(names(t)," (",range,")")

t=minmax(t) # use the same scale for the whole table
# t=apply(t,2,function(x)minmax(x)) # scale each column individually

pheatmap (
t*100,filename="a.png",
# cluster_cols=F,cluster_rows=F,
treeheight_row=70,treeheight_col=70,
legend=F,cellwidth=16,cellheight=16,fontsize=8,border_color=NA,
display_numbers=T,number_format="%.0f",fontsize_number=7,number_color="black",
colorRampPalette(colorspace::hex(HSV(c(210,170,120,60,40,20,0),.5,1)))(256)
)


Here min-max scaling was applied to the whole table on the left side and to each column individually on the right side:

f3heat.jpg

The script below displays the original f3 values without scaling. It also demonstrates how `vegan::reorder.hclust` can be used to reorder the branches of the heatmap.

Code:
library(tidyverse)
library(admixtools)
library(pheatmap)
library(colorspace) # for hex
library(vegan) # for reorder.hclust

p1="Yoruba.DG"
p2=c("Nganasan","Russia_HG_Karelia","Norway_N_HG.SG","Turkey_N","Russia_HG_Tyumen","Han","Estonia_CordedWare","Russia_Bolshoy","Luxembourg_Loschbour","Russia_Samara_EBA_Yamnaya","Georgia_Kotias.SG","Finnish","Finland_Levanluhta","Russia_Shamanka_Eneolithic.SG")

f3=f3("path/to/v44.3_HO_public",p1,p2,p2)

t=f3%>%select(pop2,pop3,est)%>%pivot_wider(names_from=pop2,values_from=est)%>%data.frame(row.names=1)

diag(t)=NA # don't show the f3 values of populations with themselves in order to not stretch the color scale

pheatmap (
t,filename="a.png",
clustering_callback=function(...){reorder(hclust(dist(t)),wts=t[,"Han"]-t[,"Turkey_N"])},
display_numbers=apply(t,1,function(x)sub("^0","",sprintf("%.3f",x))),
legend=F,border_color=NA,cellwidth=16,cellheight=16,fontsize=8,fontsize_number=7,number_color="black",
# breaks=seq(.160,.190,.03/256), # use fixed limits for color scale
colorRampPalette(hex(HSV(c(210,170,120,60,40,20,0),.5,1)))(256)
)

Result:

f3heat2.png
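The `reorder.hclust` call rotates branches of the dendrogram so that the leaf order follows a weight vector where the tree topology allows it. A minimal sketch with made-up data (assuming `vegan` is installed):

```r
library(vegan) # provides the reorder method for hclust objects
m=matrix(rnorm(40),8)
h=hclust(dist(m))
h2=reorder(h,wts=1:8) # same tree, leaves rotated toward increasing weights
h2$order
```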


f3-outgroup nMonte

This saves an f3 table as a CSV file so it can be used with Vahaduo or nMonte. It uses a shell wrapper for my simplified version of nMonte3.R.

Code:
$ printf %s\\n Estonia_CordedWare Finland_Levanluhta Finnish Georgia_Kotias.SG Han Luxembourg_Loschbour Nganasan Norway_N_HG.SG Russia_Bolshoy Russia_HG_Karelia Russia_HG_Tyumen Russia_Samara_EBA_Yamnaya Russia_Shamanka_Eneolithic.SG Turkey_N>pops
$ R -e 'library(tidyverse);library(admixtools);pops=readLines("pops");f3=f3("path/to/v44.3_HO_public","Yoruba.DG",pops,pops);f3%>%select(pop2,pop3,est)%>%pivot_wider(names_from=pop3,values_from=est)%>%mutate(across(where(is.double),round,6))%>%write_csv("f3")'
$ cat f3
pop2,Estonia_CordedWare,Finland_Levanluhta,Finnish,Georgia_Kotias.SG,Han,Luxembourg_Loschbour,Nganasan,Norway_N_HG.SG,Russia_Bolshoy,Russia_HG_Karelia,Russia_HG_Tyumen,Russia_Samara_EBA_Yamnaya,Russia_Shamanka_Eneolithic.SG,Turkey_N
Estonia_CordedWare,0.189574,0.048826,0.050368,0.046842,0.040376,0.05185,0.042084,0.049919,0.04784,0.050933,0.047266,0.051533,0.040938,0.049494
Finland_Levanluhta,0.048826,0.148686,0.051402,0.045953,0.045448,0.052605,0.048901,0.053542,0.052522,0.053938,0.054203,0.051491,0.046715,0.048399
Finnish,0.050368,0.051402,0.052422,0.047057,0.042304,0.053757,0.04443,0.052616,0.048967,0.053264,0.049454,0.050761,0.04272,0.0495
Georgia_Kotias.SG,0.046842,0.045953,0.047057,0.185421,0.0394,0.045323,0.041384,0.046218,0.044534,0.047698,0.044243,0.048927,0.040304,0.046263
Han,0.040376,0.045448,0.042304,0.0394,0.061217,0.040942,0.057035,0.041632,0.049274,0.043826,0.044467,0.040925,0.057394,0.038767
Luxembourg_Loschbour,0.05185,0.052605,0.053757,0.045323,0.040942,0.188232,0.043033,0.062498,0.050054,0.057409,0.049533,0.051483,0.04226,0.049539
Nganasan,0.042084,0.048901,0.04443,0.041384,0.057035,0.043033,0.070154,0.04448,0.054056,0.047517,0.047917,0.043077,0.059516,0.039822
Norway_N_HG.SG,0.049919,0.053542,0.052616,0.046218,0.041632,0.062498,0.04448,0.189739,0.053043,0.062429,0.053055,0.052822,0.043558,0.048115
Russia_Bolshoy,0.04784,0.052522,0.048967,0.044534,0.049274,0.050054,0.054056,0.053043,0.063846,0.057001,0.054654,0.049657,0.051358,0.04458
Russia_HG_Karelia,0.050933,0.053938,0.053264,0.047698,0.043826,0.057409,0.047517,0.062429,0.057001,0.170124,0.062403,0.055766,0.046348,0.048079
Russia_HG_Tyumen,0.047266,0.054203,0.049454,0.044243,0.044467,0.049533,0.047917,0.053055,0.054654,0.062403,0.18867,0.054148,0.046497,0.043883
Russia_Samara_EBA_Yamnaya,0.051533,0.051491,0.050761,0.048927,0.040925,0.051483,0.043077,0.052822,0.049657,0.055766,0.054148,,0.041908,0.047615
Russia_Shamanka_Eneolithic.SG,0.040938,0.046715,0.04272,0.040304,0.057394,0.04226,0.059516,0.043558,0.051358,0.046348,0.046497,0.041908,0.064461,0.039068
Turkey_N,0.049494,0.048399,0.0495,0.046263,0.038767,0.049539,0.039822,0.048115,0.04458,0.048079,0.043883,0.047615,0.039068,0.055858
$ nm()(grep -v ^, "$1">/tmp/nm1;grep -v ^, "$2">/tmp/nm2;Rscript -e 'source("~/bin/nmon.R");a=as.numeric(commandArgs(T));getMonte("/tmp/nm1","/tmp/nm2",Nbatch=a[1],Ncycles=a[2],pen=a[3])' "${3-500}" "${4-1000}" "${5-.001}")
$ nm <(sed 1d f3|grep -v Finnish) <(sed 1d f3|grep Finnish)
Target: Finnish (d=.0028)
45.2 Russia_Samara_EBA_Yamnaya
35.6 Turkey_N
5.0 Russia_Bolshoy
3.4 Nganasan
3.0 Han
2.6 Luxembourg_Loschbour
1.4 Finland_Levanluhta
1.4 Norway_N_HG.SG
1.2 Russia_Shamanka_Eneolithic.SG
0.8 Russia_HG_Karelia
0.4 Estonia_CordedWare
$ cat ~/bin/nmon.R
getMonte=function(datafile,targetfile,Nbatch=500,Ncycles=1000,pen=0){
do_algorithm=function(selection,targ){
mySel=as.matrix(selection,rownames.force=NA)
myTg=as.matrix(targ,rownames.force=NA)
dif2targ=sweep(mySel,2,myTg,"-")
Ndata=nrow(dif2targ)
kcol=ncol(dif2targ)
rowLabels=rownames(dif2targ)
matPop=matrix(NA_integer_,Nbatch,1,byrow=T)
dumPop=matrix(NA_integer_,Nbatch,1,byrow=T)
matAdmix=matrix(NA_real_,Nbatch,kcol,byrow=T)
dumAdmix=matrix(NA_real_,Nbatch,kcol,byrow=T)
matPop=sample(1:Ndata,Nbatch,replace=T)
matAdmix=dif2targ[matPop,]
colM1=colMeans(matAdmix)
eval1=(1+pen)*sum(colM1^2)
for(c in 1:Ncycles){
dumPop=sample(1:Ndata,Nbatch,replace=T)
dumAdmix=dif2targ[dumPop,]
for(b in 1:Nbatch){
store=matAdmix[b,]
matAdmix[b,]=dumAdmix[b,]
colM2=colMeans(matAdmix)
eval2=sum(colM2^2)+pen*sum(matAdmix[b,]^2)
if(eval2<=eval1){
matPop=dumPop
colM1=colM2
eval1=eval2
}else{matAdmix[b,]=store}
}
}
fitted=t(colMeans(matAdmix)+myTg[1,])
popl=unlist(lapply(strsplit(rowLabels[matPop],";"),function(x)x[1]))
populations=factor(popl)
return(list("estimated"=fitted,"pops"=populations))
}

source=read.csv(datafile,header=F,row.names=1)
target=read.csv(targetfile,header=F,row.names=1)

for(i in 1:nrow(target)){
row=target[i,]
result=do_algorithm(source,row)

if(i>=2)cat("\n")
dist=sqrt(sum((result$estimated-row)^2))
cat(paste0("Target: ",row.names(row)," (d=",sub("^0","",sprintf("%.4f",dist)),")\n"))

tb=table(result$pops)
tb=sort(tb,decreasing=T)
tb=as.matrix(100*tb/Nbatch)
cat(paste(sprintf("%.1f",tb),row.names(tb)),sep="\n")
}
}

qpGraph with predefined node list

Code:
t=read.table(text="R Yoruba.DG
R A
A O
A AA
O E
O CH
E I
E L1
I WH
I EH
AA EH
EH EH2
EH EH1
EH AAA
AA AAA
WH CCE
WH Italy_North_Villabruna_HG
CH YM
CH CH1
CH ON
EH1 YM
EH1 EH0
AAA AAN
CH1 EH0
CH1 Georgia_Kotias.SG
L1 Turkey_N_published
L1 LB
L1 YM1
L1 CW
EH2 LB
EH2 CWC
YM YM1
YM CW
EH0 Russia_HG_Karelia
AAN Nganasan
TO CCA
LB Germany_EN_LBK
LB CCW
YM1 Russia_Samara_EBA_Yamnaya
CW CCW
CW Estonia_CordedWare
CCW CWC
CWC CCE
CCE CCA
CCA Finnish.DG")

pop=c("Finnish.DG","Estonia_CordedWare","Germany_EN_LBK","Italy_North_Villabruna_HG","Nganasan","Russia_HG_Karelia","Russia_Samara_EBA_Yamnaya","Turkey_N_published","Yoruba.DG","Georgia_Kotias.SG")

f2=f2_from_geno("path/to/v44.3_HO_public",pops=pop)
gr=qpgraph(f2,t)

plot_graph(gr$edges)
ggsave("a.png",width=9,height=9)
system("mogrify -trim -border 64 -bordercolor white a.png")

plotly_graph(gr$edges) # view Plotly graph in browser

In order to generate a black-and-white graph, run `write_dot(gr$edges)` and paste the output here: http://viz-js.com.
qpgraph.png

qpAdm popdrop table as a stacked bar chart

The `popdrop` table contains alternative models for each non-empty subset of the left populations. The script below selects all models from the `popdrop` table with zero negative weights (where the `feasible` column is not false) and with at least two left populations (where `f4rank` is not 0). It then plots the models sorted by their p-values.

In qpAdm, the number of right populations needs to be greater than or equal to the number of left populations. Otherwise you'll get an error like this: `Error in qpadm_weights(f4_est, qinv, rnk = length(left) - 2, fudge = fudge: Mat::cols(): indices out of bounds or incorrectly used`.
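That constraint can be checked up front before an expensive run; `left` and `right` below are hypothetical stand-ins for the vectors passed to `qpadm`:

```r
left=c("srcA","srcB","srcC")          # hypothetical left (source) populations
right=c("out1","out2","out3","out4")  # hypothetical right (outgroup) populations
stopifnot(length(right)>=length(left)) # fail early instead of inside qpadm_weights
```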

I'm still afraid of qpAdm because I have no idea how to choose the outgroups, but I just selected a subset of the outgroups that were used in the paper titled "The Genomic History of Southeastern Europe": https://anthrogenica.com/showthread....pains-of-qpAdm. When the genetic distance between the right populations and the left populations is high, as in my model, the p-values also tend to be higher.

Code:
library(admixtools)
library(tidyverse)
library(reshape2) # for melt
library(colorspace) # for hex
library(cowplot) # for plot_grid

left=c("Turkey_Boncuklu_N.SG","Estonia_CordedWare.SG","Latvia_HG","Russia_HG_Karelia","Finland_Levanluhta","Nganasan")
right=c("Ethiopia_4500BP_published.SG","Mbuti.DG","Russia_Ust_Ishim_HG_published.DG","Karitiana.DG","Papuan.DG","ONG.SG")
target="Mari.SG"

qp=qpadm("path/to/v44.3_HO_public",left,right,target)

t=qp$popdrop%>%filter(feasible==T&f4rank!=0)%>%arrange(desc(p))%>%select(1,4,5,7:last_col(5))
t2=melt(t[-c(2,3)],id.vars="pat") # this uses `melt` because `pivot_longer` doesn't preserve the order of columns

lab=sub("e-0","e-",sub("^0","",sprintf(ifelse(t$p<.001,"%.0e","%.3f"),t$p)))
lab=paste0(lab," (",ifelse(t$chisq<10,sub("^0","",sprintf("%.1f",t$chisq)),round(t$chisq)),")")

subtit=str_wrap(paste("Right:",paste(sort(right),collapse=", ")),width=58)

p=ggplot(t2,aes(x=fct_rev(factor(pat,level=t$pat)),y=value,fill=variable))+
geom_bar(stat="identity",width=1,position=position_fill(reverse=T))+
geom_text(aes(label=round(100*value)),position=position_stack(vjust=.5,reverse=T),size=3.5)+
ggtitle(paste("Target:",target),subtitle=subtit)+
coord_flip()+
scale_x_discrete(expand=c(0,0),labels=rev(lab))+
scale_y_discrete(expand=c(0,0))+ # remove gray margins on the left and right side of the plot
scale_fill_manual("legend",values=hex(HSV(c(30,60,180,210,270,330),.5,1)))+ # use manual colors
# scale_fill_manual(values=colorspace::hex(HSV(head(seq(0,360,length.out=1+ncol(t)-3),-1),.5,1)))+ # use automatic colors
guides(fill=guide_legend(ncol=2,byrow=F))+
theme(
axis.text.x=element_blank(),
axis.text.y=element_text(size=11),
axis.text=element_text(color="black"),
axis.ticks=element_blank(),
axis.title=element_blank(),
legend.direction="horizontal",
legend.key=element_rect(fill=NA), # remove gray border around color squares
legend.margin=margin(-4,0,1,0),
legend.text=element_text(size=11),
legend.title=element_blank(),
plot.subtitle=element_text(size=12),
plot.title=element_text(size=16)
)

ggdraw(p)
hei=c(.5+.2*(str_count(subtit,"\n")+1)+.25*nrow(t),.1+.23*ceiling(length(unique(t2[!is.na(t2$value),2]))/2))
ggdraw(plot_grid(p+theme(legend.position="none"),get_legend(p),ncol=1,rel_heights=hei))
ggsave("a.png",width=5.5,height=sum(hei))

In the image below, the number in parentheses after the p-value is the chi-squared statistic. Models with a large number of left populations tend to get a low chi-squared relative to the p-value. Models with only two left populations tend to get a high chi-squared relative to the p-value.

popdrop.png


Print qpAdm popdrop table as plain text

Code:
library(admixtools)
library(tidyverse)

left=c("Turkey_Boncuklu_N.SG","Russia_Samara_EBA_Yamnaya","Latvia_HG","Russia_HG_Karelia","Nganasan","Finland_Levanluhta","Russia_Bolshoy")
right=c("Russia_Ust_Ishim_HG_published.DG","Russia_HG_Tyumen","Belgium_UP_GoyetQ116_1_published_all","Russia_Kostenki14","Iran_Wezmeh_N.SG","Georgia_Kotias.SG","Turkey_Epipaleolithic","China_YR_LN")
target="Finnish"

qp=qpadm("path/to/v44.3_HO_public",left,right,target)

t=qp$popdrop%>%filter(feasible==T&f4rank!=0)%>%arrange(desc(p))
p=apply(t%>%select(7:last_col(5)),1,function(x){y=round(100*sort(na.omit(x),decreasing=T));paste(paste0(y,"% ",names(y)),collapse=" + ")})
paste0(sub("^0","",sprintf(ifelse(t$p<.001,"%.0e","%.3f"),t$p)),": ",p)%>%cat(sep="\n")

Output:

.630: 45% Russia_Samara_EBA_Yamnaya + 28% Turkey_Boncuklu_N.SG + 15% Latvia_HG + 11% Nganasan
.335: 41% Finland_Levanluhta + 35% Russia_Samara_EBA_Yamnaya + 24% Turkey_Boncuklu_N.SG
Russia_Samnaya_25_Boncuklu_N% % Turkey_Boncuklu_N.SG + 2% Nganasan
.011: 38% Russia_Samara_EBA_Yamnaya + 34% Turkey_Boncuklu_N.SG + 23% Russia_Bolshoy + 5% Latvia_HG
.010: 46% Russia_Samara_EBA_Yamnaya + 34%
Turkey_NGoncuklukluk : 42% Russia_Samara_EBA_Yamnaya + 35% Turkey_Boncuklu_N.SG + 24% Russia_Bolshoy
.004: 49% Russia_Samara_EBA_Yamnaya + 33% Turkey_Boncuklu_N.SG + 11% Nganasan + 7% Russia_HG_Karelia
.002: 58% Russia_Samara_EBA_Yamnaya + 31% Turkey_Boncuklu_N.SG + 12% Nganasan
7e-11: 50% Finland_Levanluhta + 35% Turkey_Boncuklu_N.SG + 16% Russia_HG_Karelia
5e-12: 59% Finland_Levanluhta + 41% Latvia_HG
7e-13: 52% Turkey_Boncuklu_N.SG + 39% Russia_HG_Karelia + 9% Nganasan
4e-13: 69% Finland_Levanluhta + 31% Turkey_Boncuklu_N.SG
2e-14: 52% Turkey_Boncuklu_N.SG + 27% Russia_HG_Karelia + 21% Russia_Bolshoy
1e-19: 46% Turkey_Boncuklu_N.SG + 28% Russia_Bolshoy + 26% Latvia_HG
1e-23: 71% Finland_Levanluhta + 29% Russia_Samara_EBA_Yamnaya
4e-24: 54% Turkey_Boncuklu_N.SG + 46% Russia_HG_Karelia
-26_G Latvia_HG + 12% Nganasan
7e-27: 63% Turkey_Boncuklu_N.SG + 37% Russia_Bolshoy
8e-36: 55% Russia_Samara_EBA_Yamnaya + 36% Turkey_Boncuklu_N.SG + 9% Latvia_HG
6e-38: 63% Russia_Samara_EBA_Yamnaya + 37% Turkey_Boncuklu_N.SG
5% .SG + Nganasan 15%
4e-53: 56% Turkey_Boncuklu_N.SG + 44% Latvia_HG
7e-61: 85% Russia_Samara_EBA_Yamnaya + 10% Nganasan + 5% Latvia_HG
4e-62: 90% Russia_Samara_EBA_Yamnaya + 10% Nganasan
3e-76: 98% Russia_HG_Karelia + 2% Nganasan
5e-78: 88% Russia_Samara_EBA_Yamnaya + 12% Russia_Bolshoy
2e-116: 91% Latvia_HG + 9% Nganasan
1e-128: 92% Latvia_HG + 8% Russia_Bolshoy

plot_map

The `plot_map` function uses Plotly to generate an interactive map for entries from an anno file. You can see the code of the function by running `plot_map`.

Code:
library(admixtools)
library(tidyverse)

t=as.data.frame(read_tsv(gsub("\u00a0"," ",read_file("path/to/v44.3_HO_public.anno")))) # remove non-breaking spaces so `as.numeric` won't fail
t=t[,c(8,11,12,6)] # group, latitude, longitude, years BP
t=t[!grepl("\\.REF|rel\\.|fail\\.|Ignore_|_dup|_contam|_lc|_father|_mother|_son|_daughter|_brother|_sister|_sibling|_twin|Neanderthal|Denisova|Vindija_light|Gorilla|Macaque|Marmoset|Orangutang|Primate_Chimp|hg19ref|_o|_outlier",t[,1]),]
t=t[t[,2]!=".."&t[,3]!=".."&t[,2]!="-"&t[,3]!="-",] # remove lines with missing lat or long field (can be `..` or `-`)
t[t[,4]==0,4]=1 # change years BP 0 to 1 because entries with years BP 0 are not plotted when using a log scale (which is the default option)
a=aggregate(apply(t[,-1],2,as.numeric),list(t[,1]),mean)
a[,5]=a[,1]
names(a)=c("iid","lat","lon","yearsbp","group")

admixtools::plot_map(a)

admixtoolsmap.jpg


A second option is to first run this in a shell:

Code:
$ igno()(grep -Ev '\.REF|rel\.|fail\.|Ignore_|_dup|_contam|_lc|_father|_mother|_son|_daughter|_brother|_sister|_sibling|_twin|Neanderthal|Denisova|Vindija_light|Gorilla|Macaque|Marmoset|Orangutang|Primate_Chimp|hg19ref')
$ tav()(awk '{n[$1]++;for(i=2;i<=NF;i++)a[$1,i]+=$i}END{for(i in n){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i,j]/n[i]);print o}}' "FS=${1-$'\t'}")
$ awk -F\\t 'NR>1&&$11!=".."&&$11!="-"{print$8,$11,$12,$6}' path/to/v44.3_HO_public.anno|igno|grep -v _o|tav \ >map
$ head -n2 map
Ireland_EBA.SG 55.292132 -6.191685 3765.333333
Germany_EN_LBK 49.674249 9.582716 7066.370370
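As an aside, the intended behavior of `tav` (per-group means of each numeric column) can be checked on toy input. This sketch inlines a variant of the function that keys its accumulator by group and column:

```shell
tav()(awk '{n[$1]++;for(i=2;i<=NF;i++)a[$1,i]+=$i}END{for(i in n){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i,j]/n[i]);print o}}' "FS=${1-$'\t'}")
printf '%s\n' 'A 1 10' 'A 3 20' 'B 5 30' | tav ' ' | sort
# -> A 2.000000 15.000000
# -> B 5.000000 30.000000
```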

Then run this in R:

Code:
library(admixtools);t=read.table("map");t[t[,4]==0,4]=1;t[,5]=t[,1];names(t)=c("iid","lat","lon","yearsbp","group");plot_map(t)

A third option is to use Plotly directly:

Code:
library(plotly)

a=read.table("map")
label=ifelse(a[,4]==0,a[,1],paste0(a[,1],"\u00a0",round(a[,4])))
pl=plot_ly(a,type="scattermapbox",mode="markers",x=a[,3],y=a[,2],text=a[,1],hovertext=label,hoverinfo="text",marker=list(size=8,color=log10(a[,4]),colorscale="Portland",colorbar=list(thickness=15,outlinewidth=0)))
plotly::layout(pl,mapbox=list(style="stamen-terrain"),margin=list(t=0,r=0,b=0,l=0,pad=0))

In order to display the names of populations in addition to dots, you can use `mode="markers+text"`. It is currently only supported by the mapbox map styles which require a free access token: https://community.plotly.com/t/scatt...-as-text/35065. The `satellite-streets` style also includes the names of cities but `satellite` doesn't.

Code:
library(plotly)

a=read.table("map")
label=ifelse(a[,4]==0,a[,1],paste0(a[,1],"\u00a0",round(a[,4])))
pl=plot_ly(a,type="scattermapbox",mode="markers+text",x=a[,3],y=a[,2],text=label,textfont=list(color="white"),textposition="top",marker=list(color=log10(a[,4]),size=8,colorscale="Jet"))
plotly::layout(pl,mapbox=list(style="satellite-streets",accesstoken="youraccesstoken"),margin=list(t=0,r=0,b=0,l=0,pad=0))


mapbox.jpg


FST or f2 heatmap

The default clustering method used by `pheatmap` is equivalent to `hclust(dist(t))`. Here the input is already a distance matrix, so you can use this instead: `clustering_callback=function(...)hclust(as.dist(t))`.

Code:
library(admixtools)
library(pheatmap)
library(vegan) # for reorder.hclust
library(colorspace) # for hex

pop=unlist(strsplit("Besermyan Enets Estonian Finnish Hungarian Karelian Mansi Mordovian Nganasan Saami.DG Selkup Udmurt"," "))

f=fst("path/to/v44.3_HO_public",pop)
# f=f2("path/to/v44.3_HO_public",pop)

# convert FST or f2 pairs to square matrix
df=as.data.frame(f[,-4])
df2=rbind(df,setNames(df[,c(2,1,3)],names(df)))
t=xtabs(df2[,3]~df2[,1]+df2[,2])

weights=t["Hungarian",]-t["Nganasan",]
# weights=rowMeans(t) # sort by average distance to other populations

pheatmap (
1000*t,filename="a.png",
clustering_callback=function(...)reorder(hclust(as.dist(t)),weights),
# clustering_callback=function(...)hclust(as.dist(t)),
legend=F,border_color=NA,cellwidth=18,cellheight=18,
treeheight_row=80,treeheight_col=80,
display_numbers=T,number_format="%.0f",number_color="black",
colorRampPalette(hex(HSV(c(210,210,130,60,40,20,0),c(0,.5,.5,.5,.5,.5,.5),1)))(256)
)

Here the FST values are multiplied by 1,000 and not by 10,000 as usual:

fstheat.png
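The pairs-to-square-matrix step in the script can be tried with made-up values: each pair is duplicated with the two populations swapped, and `xtabs` then pivots the long table into a symmetric matrix (the absent self-pairs stay 0 on the diagonal):

```r
df=data.frame(pop1=c("A","A","B"),pop2=c("B","C","C"),est=c(.1,.2,.3))
df2=rbind(df,setNames(df[,c(2,1,3)],names(df))) # add swapped copies
xtabs(df2[,3]~df2[,1]+df2[,2]) # 3x3 symmetric matrix with zero diagonal
```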
 
The script in this post generates qpAdm models for a set of target populations, selects the model with the highest p-value for each target population, and displays the models as a stacked bar chart.

Code:
library(admixtools)
library(tidyverse)
library(reshape2) # for melt
library(colorspace) # for hex
library(cowplot) # for plot_grid

left=c("Turkey_N_published","Estonia_CordedWare.SG","Latvia_HG","Russia_HG_Samara","Russia_HG_Tyumen","Nganasan")
right=c("Russia_Ust_Ishim_HG_published.DG","Karitiana.DG","Papuan.DG","ONG.SG","Iran_Wezmeh_N.SG","Belgium_UP_GoyetQ116_1_published_all","Turkey_Epipaleolithic","Georgia_Kotias.SG","Japanese")
target=c("Finnish.DG","Estonian.DG","Selkup","Hungarian.DG","Karelian","Mansi","Udmurt","Mordovian","Saami.DG","Besermyan","Finnish","Enets","Mari.SG","Finland_Saami_Modern.SG","Russia_Bolshoy","Norway_Viking_o1.SG","Russia_Chalmny_Varre")

f2=f2_from_geno("path/to/v44.3_HO_public",pops=c(left,right,target))
qp=sapply(target,function(x)qpadm(f2,left,right,x))

# select model with highest p-value
qp2=apply(qp,2,function(x)x$popdrop%>%filter(feasible==T&f4rank!=0)%>%top_n(1,p)%>%select(5,7:last_col(5)))

# select model with highest number of left populations; choose highest p-value when tied
# qp2=apply(qp,2,function(x)x$popdrop%>%filter(feasible==T)%>%arrange(desc(f4rank),desc(p))%>%head(1)%>%select(5,7:last_col(5)))

qp3=do.call("rbind",qp2)
qp3$target=rownames(qp3)
qp3=qp3[order(qp3[,"Nganasan"],na.last=F),]
# nona=qp3;nona[is.na(nona)]=0;qp3=qp3[order(3*nona[,"Nganasan"]+nona[,"Russia_HG_Tyumen"]+.8*nona[,"Russia_HG_Samara"]),]
qp4=melt(qp3[,-1],id.var="target")

lab=sub("e-0","e-",sub("^0","",sprintf(ifelse(qp3$p<.001,"%.0e","%.3f"),qp3$p)))
capt=str_wrap(paste("Outgroups:",paste(sort(right),collapse=", ")),width=80)

p=ggplot(qp4,aes(x=fct_rev(factor(target,level=qp3$target)),y=value,fill=variable))+
geom_bar(stat="identity",width=1,position=position_fill(reverse=T))+
geom_text(aes(label=round(100*value)),position=position_stack(vjust=.5,reverse=T),size=3.5)+
labs(caption=capt)+
coord_flip()+
scale_x_discrete(expand=c(0,0),labels=rev(paste0(qp3$target," (",lab,")")))+
scale_y_discrete(expand=c(0,0))+ # remove gray area on left and right side of bars
scale_fill_manual(values=hex(HSV(c(30,60,180,210,120,330),.5,1)))+ # manual colors
# scale_fill_manual(values=hex(HSV(head(seq(0,360,length.out=length(unique(qp4$variable))+1),-1),.5,1)))+ # automatic colors
guides(fill=guide_legend(ncol=3,byrow=F))+
theme(
axis.text=element_text(color="black",size=11),
axis.text.x=element_blank(),
axis.ticks=element_blank(),
axis.title=element_blank(),
legend.direction="horizontal",
legend.key=element_rect(fill=NA), # remove gray border around color squares
legend.spacing.y=unit(.03,"in"), # reduce vertical space between color squares
legend.margin=margin(0,0,-4,0),
legend.text=element_text(size=10),
legend.title=element_blank(),
plot.caption=element_text(size=10)
)

hei=c(.05+.25*ceiling(length(unique(qp4[!is.na(qp4$value),2]))/3),.25*nrow(qp3)+.05+.18*(str_count(capt,"\n")+1))
ggdraw(plot_grid(get_legend(p),p+theme(legend.position="none"),ncol=1,rel_heights=hei))
ggsave("a.png",width=6,height=sum(hei))

qpmulti.png





The script in this post calculates a matrix of f3 distances between each pair of specified populations. It displays the distance to one specified population on the x axis and to another specified population on the y axis, but the clustering is based on the full matrix. The distances are displayed as min-max scaled percentages.


Code:
library(admixtools)
library(tidyverse)
library(ggrepel)

p1=c("Altaian","Altaian_Chelkan","Azeri","Balkar","Bashkir","Besermyan","Buryat","Chuvash","Daur","Dolgan","Enets","Estonian","Even","Evenk_FarEast","Evenk_Transbaikal","Finnish","Gagauz","Hungarian","Kalmyk","Karachai","Karakalpak","Karelian","Kazakh","Kazakh_China","Khakass","Khakass_Kachin","Kumyk","Kyrgyz_China","Kyrgyz_Kyrgyzstan","Kyrgyz_Tajikistan","Mansi","Mongol","Mordovian","Nanai","Negidal","Nganasan","Nogai_Astrakhan","Nogai_Karachay_Cherkessia","Nogai_Stavropol","Salar","Selkup","Shor_Khakassia","Shor_Mountain","Tatar_Kazan","Tatar_Mishar","Tatar_Siberian","Tatar_Siberian_Zabolotniye","Todzin","Tofalar","Tubalar","Turkish","Turkish_Balikesir","Turkmen","Tuvinian","Udmurt","Uyghur","Uzbek","Veps","Xibo","Yakut","Yukagir_Forest","Yukagir_Tundra")
p2=c("Korean","Yakut")
p3="Khomani"

f3=f3("path/to/v44.3_HO_public",c(p1,p2),c(p1,p2),p3)

t=f3%>%select(pop1,pop2,est)%>%pivot_wider(names_from=pop1,values_from=est)%>%data.frame(row.names=1)
t2=t[,p2]

t2=as.data.frame(apply(t2,2,function(x)100*(x-min(x))/(max(x)-min(x))))

t2$k=cutree(hclust(dist(t)),12)
names(t2)=c("x","y","k")

ggplot(t2,aes(x,y))+
geom_abline(linetype="dashed",color="gray80",size=.2)+
geom_polygon(data=t2%>%group_by(k)%>%slice(chull(x,y)),alpha=.2,aes(color=as.factor(k),fill=as.factor(k)),size=.2)+
geom_point(aes(color=as.factor(k)),size=.7)+
geom_text_repel(aes(label=rownames(t2),color=as.factor(k)),max.overlaps=Inf,force=2,size=2.2,segment.size=.2,min.segment.length=.2,box.padding=.05)+
xlab(paste0("100 × minmax(f3(x, ",p2[1],", ",p3,"))"))+
ylab(paste0("100 × minmax(f3(x, ",p2[2],", ",p3,"))"))+
scale_x_continuous(breaks=seq(0,100,10))+
scale_y_continuous(breaks=seq(0,100,10))+
scale_color_manual(values=rainbow_hcl(n_distinct(t2$k),100,50))+
scale_fill_manual(values=rainbow_hcl(n_distinct(t2$k),100,50))+
theme(
axis.text=element_text(color="black",size=6),
axis.text.y=element_text(angle=90,vjust=1,hjust=.5),
axis.ticks=element_blank(),
axis.ticks.length=unit(0,"pt"),
axis.title=element_text(size=8),
legend.position="none",
panel.background=element_rect(fill="white"),
panel.border=element_rect(color="gray90",fill=NA,size=.4),
panel.grid.major=element_line(color="gray90",size=.2)
)

ggsave("a.png",width=6,height=6)

f3minmax.jpg
 
This is from another thread by the same author:


Shell and R scripts for SmartPCA and ADMIXTURE


Running SmartPCA and ADMIXTURE

The series of shell commands below downloads the 1240K+HO dataset and converts it to the PLINK format. It removes duplicate samples by selecting unique samples by version ID, which is the third column in the anno file. It uses `plink --genome` to calculate IBD and IBS values, and it removes related samples by excluding samples with a PI_HAT value of .25 or above after LD pruning. It then runs SmartPCA with outlier removal enabled, because without outlier removal, dimensions after the first few are often dominated by populations with outliers. It generates G25-style CSV files from the output of SmartPCA. Finally it runs ADMIXTURE with the outliers detected by SmartPCA removed.

Code:
curl -LsO reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.tar;tar -xf v44.3_HO_public.tar
f=v44.3_HO_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)

printf %s\\n Altaian Altaian_Chelkan Azeri Balkar Bashkir Besermyan Buryat Chuvash Daur Daur.DG Daur.SDG Dolgan Enets Estonian.DG Even.DG Evenk_FarEast Evenk_Transbaikal Finnish Finnish.DG Gagauz Hungarian Hungarian.DG Kazakh Kazakh_China Kyrgyz_China Kyrgyz_Kyrgyzstan Kyrgyz_Kyrgyzstan.DG Kyrgyz_Tajikistan Kyrgyz_Tajikistan.DG Mansi Mongol Mordovian Nanai Negidal Nganasan Nogai Nogai_Astrakhan Nogai_Karachay_Cherkessia Nogai_Stavropol Oroqen.DG Oroqen.SDG Saami.DG Salar Selkup Shor_Khakassia Shor_Mountain Tatar_Kazan Tatar_Mishar Tatar_Siberian Tatar_Siberian_Zabolotniye Todzin Tofalar Tubalar Turkish Turkish.DG Turkish_Balikesir Turkmen Tuvinian Udmurt Uyghur Uyghur.DG Uzbek Veps Xibo Xibo.DG Xibo.SDG Yakut Yakut.DG Yakut.SDG Yukagir_Forest Yukagir_Tundra>pop
x=uralaltaic;sed 1d v44.3_HO_public.anno|sort -t$'\t' -rnk15|awk -F\\t '!a[$3]++{print$2,$8}'|awk 'NR==FNR{a[$0];next}$2 in a' pop ->$x.temp.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.temp.pick v44.3_HO_public.fam) --indep-pairwise 50 10 .1 --make-bed --out $x.temp
plink --allow-no-sex --bfile $x.temp --extract $x.temp.prune.in --genome --out $x
awk 'FNR>1&&$10>=.25{print$2<$4?$2:$4}' $x.genome|awk 'NR==FNR{a[$0];next}!($1 in a)' - $x.temp.pick>$x.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x

smartpca -p <(printf %s\\n genotypename:\ $x.bed snpname:\ $x.bim indivname:\ $x.fam evecoutname:\ $x.evec evaloutname:\ $x.eval autoshrink:\ YES lsqproject:\ YES numoutevec:\ 10 numoutlieriter:\ 5)|tee $x.smartpca
sed 1d $x.evec|tr : \ |awk 'NR==FNR{a[$1]=$2;next}{$1=a[$2]}1' $x.pick -|sed 's/ /:/'|sed 's/ [^ ]*$//'|tr ' ' ,>$x.csv
tav()(awk '{n[$1]++;for(i=2;i<=NF;i++)a[$1,i]+=$i}END{for(i in n){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i,j]/n[i]);print o}}' "FS=${1-$'\t'}")
sed 's/:[^,]*//' $x.csv|tav ,>$x.ave
for y in ave csv;do awk -F, 'NR==FNR{a[NR]=$0;next}{printf"%s",$1;for(i=2;i<=NF;i++)printf",%.6f",$i*a[i-1]^.5;print""}' $x.eval $x.$y>$x.$y.scaled;done

plink --allow-no-sex --bfile $x --indep-pairwise 50 10 .1 --out $x
plink --bfile $x --extract $x.prune.in --remove <(grep ^REMOVED\ outlier $x.smartpca|cut -d' ' -f3|cut -d: -f2|awk 'NR==FNR{a[$0];next}$2 in a' - $x.fam) --make-bed --out $x.pruned
k=3
admixture -j4 -C .1 $x.pruned.bed $k
paste -d' ' <(awk 'NR==FNR{a[$1]=$2;next}{print$2,a[$2]}' $x.pick $x.pruned.fam) $x.pruned.$k.Q>$x.$k
cut -d' ' -f2- $x.$k|tav \ >$x.$k.ave
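The duplicate removal above works by sorting the anno file by SNP count (column 15) in descending order and then keeping only the first row per version ID (column 3), so the highest-coverage copy of each sample wins. Here is a minimal sketch of the same idiom on a made-up four-column stand-in for the anno file (instance ID, version ID, population, SNP count; the real anno file has many more columns, so the field numbers differ):

Code:
```shell
# toy stand-in for the anno file: instance ID, version ID, population, SNP count
printf 'I001.DG\tMA1\tKarelian\t900000\nI001\tMA1\tKarelian\t1100000\nI002\tMA2\tKarelian\t800000\n' > toy.anno
# sort by SNP count descending, then keep the first (highest-coverage) row per version ID
sort -t$'\t' -rnk4 toy.anno | awk -F'\t' '!a[$2]++{print $1,$3}'
# → I001 Karelian
# → I002 Karelian
```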

Making ADMIXTURE run faster

ADMIXTURE runs in a single-threaded mode by default, but if your processor has 4 cores, you can add the option `-j4` to run it on 4 threads, which makes it roughly 4 times faster.

With the option `-C .1`, the ADMIXTURE run typically finishes about 2-4 times faster. It terminates the run after the log likelihood increases by less than .1 between iterations (instead of the default .0001). When I tried the same run with and without the `-C .1` option, the average difference in admixture weights was only about .15%:

Code:
$ Rscript -e 'mean(abs(as.matrix(read.table("uralaltaic.pruned.3.Q")-read.table("uralaltaic2.pruned.3.Q"))))'

[1] 0.001502451
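The same comparison works without R; here is a sketch in awk over two hypothetical two-column Q files (the file names and values below are made up for illustration):

Code:
```shell
# two toy Q matrices standing in for the .Q files of the two runs
printf '%s\n' '0.10 0.90' '0.50 0.50' > a.Q
printf '%s\n' '0.12 0.88' '0.49 0.51' > b.Q
# mean of the absolute element-wise differences
paste a.Q b.Q | awk '{n+=2;s+=($1>$3?$1-$3:$3-$1)+($2>$4?$2-$4:$4-$2)}END{printf"%.3f\n",s/n}'
# → 0.015
```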
Performing more aggressive LD pruning also makes ADMIXTURE run faster. An example in the ADMIXTURE manual uses a fairly aggressive R^2 threshold (.1) for the `--indep-pairwise` option of PLINK (http://dalexander.github.io/admixtur...ure-manual.pdf):

### 2.3 Do I need to thin the marker set for linkage disequilibrium?

We tend to believe this is a good idea, since our model does not explicitly take LD into consideration, and since enormous data sets take more time to analyze. It is impossible to "remove" all LD, especially in recently-admixed populations, which have a high degree of "admixture LD". Two approaches to mitigating the effects of LD are to include markers that are separated from each other by a certain genetic distance, or to thin the markers according to the observed sample correlation coefficients. The easiest way is the latter, using the `--indep-pairwise` option of PLINK. For example, if we start with a file `rawData.bed`, we could use the following commands to prune according to a correlation threshold and store the pruned dataset in `prunedData.bed`:

> `% plink --bfile rawData --indep-pairwise 50 10 0.1`
>
> (output indicating number of SNPs targeted for inclusion/exclusion)
>
> `% plink --bfile rawData --extract plink.prune.in --make-bed --out prunedData`


Specifically, the first command targets for removal each SNP that has an _R²_ value of greater than 0.1 with any other SNP within a 50-SNP sliding window (advanced by 10 SNPs each time). The second command copies the remaining (untargeted) SNPs to `prunedData.bed`.

This approach is imperfect but seems to work well in practice. Please read our paper for more information.
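To make the R² threshold concrete, the sketch below computes the squared sample correlation between two hypothetical SNPs coded as 0/1/2 allele counts; with `--indep-pairwise 50 10 0.1`, PLINK would target one SNP of such a pair for removal whenever this value exceeds 0.1 (the genotype values here are toy data, not from the dataset):

Code:
```shell
# each line: allele counts of one individual at SNP A and SNP B
printf '%s\n' '0 0' '1 1' '2 2' '1 0' '0 1' | awk '
{n++;sx+=$1;sy+=$2;sxx+=$1*$1;syy+=$2*$2;sxy+=$1*$2}
END{
  # squared Pearson correlation: r^2 = cov^2/(varA*varB), with the n factors cancelling
  cov=n*sxy-sx*sy
  printf"%.3f\n",cov*cov/((n*sxx-sx*sx)*(n*syy-sy*sy))
}'
# → 0.413
```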

Sources of the samples

Code:
$ curl -sO https://reichdata.hms.harvard.edu/p...4/V44.3/SHARE/public.dir/v44.3_HO_public.anno
$ awk -F\\t 'NR==FNR{a[$0];next}$8 in a&&!a[$4,$8]++{print$8,$4}' pop v44.3_HO_public.anno|sort|awk '{a[$2]=a[$2]" "$1}END{for(i in a)print i":"a[i]}'|sort
BergstromScience2020: Daur.SDG Oroqen.SDG Xibo.SDG Yakut.SDG
FlegontovNature2019: Enets Nganasan Selkup
JeongNatureEcologyEvolution2019: Altaian Altaian_Chelkan Azeri Bashkir Besermyan Buryat Chuvash Evenk_FarEast Evenk_Transbaikal Gagauz Karachai Karakalpak Karelian Kazakh Khakass Khakass_Kachin Kyrgyz_Tajikistan Mongol Mordovian Nanai Negidal Nogai_Astrakhan Nogai_Stavropol Shor_Khakassia Shor_Mountain Tatar_Kazan Tatar_Mishar Tatar_Siberian Tatar_Siberian_Zabolotniye Todzin Tubalar Tuvinian Udmurt Uzbek Veps
LazaridisNature2014: Altaian Azeri.WGA Balkar Chuvash Dolgan Even Finnish Hungarian Kalmyk Kumyk Kyrgyz_Kyrgyzstan Mansi Mordovian Nganasan Nogai_Karachay_Cherkessia Selkup Tofalar Tubalar Turkish Turkish_Balikesir Turkmen Tuvinian Uzbek Yukagir_Forest Yukagir_Tundra
MallickNature2016: Daur.DG Estonian.DG Even.DG Finnish.DG Hungarian.DG Kyrgyz_Kyrgyzstan.DG Oroqen.DG Saami.DG Turkish.DG Uyghur.DG Xibo.DG Yakut.DG
PattersonGenetics2012: Uyghur Xibo Yakut
WangBioRxiv2020: Kazakh_China Kyrgyz_China Salar

Plotting the output of SmartPCA with Voronoi tessellation

The script below uses the `ggforce` package to plot the result of SmartPCA using Voronoi tessellation: https://ggforce.data-imaginist.com/r...om_delvor.html.

When I tried to install `ggforce`, it failed at first because its dependency `units` failed to install. By running `install.packages("units")`, I saw that I needed to run `brew install udunits` (`libudunits2-dev` on Debian and `udunits2-devel` on RPM).

Code:
library(tidyverse)
library(ggforce)

n="uralaltaic"
t=read.csv(paste0(n,".csv"),header=F,row.names=1)
eig=head(as.double(readLines(paste0(n,".eval"))),ncol(t))
pct=paste0("PC",seq(length(eig))," (",sprintf("%.1f",100*eig/sum(eig)),"%)")

names(t)=paste0("PC",1:ncol(t))
ranges=apply(t,2,function(x)abs(max(x)-min(x)))

pop=sub(":.*","",rownames(t))
pop=sub("\\.(DG|SDG|SG|WGA)","",pop)
centers=aggregate(t,by=list(pop),mean)
t$pop=pop

set.seed(1234)
color=as.factor(sample(seq(1,length(unique(t$pop)))))
col=rbind(c(60,80),c(25,95),c(30,70),c(70,50),c(60,100),c(20,50),c(15,40))
hues=max(ceiling(length(color)/nrow(col)),2)
pal1=as.vector(apply(col,1,function(x)hcl(seq(15,375,length=hues+1)[1:hues],x[1],x[2])))
pal2=as.vector(apply(col,1,function(x)hcl(seq(15,375,length=hues+1)[1:hues],ifelse(x[2]>50,.8*x[1],.2*x[1]),ifelse(x[2]>50,.3*x[2],100))))

for(i in seq(1,7,2)){
p1=sym(paste0("PC",i))
p2=sym(paste0("PC",i+1))

maxrange=max(ranges[c(i,i+1)])
lims=sapply(c(i,i+1),function(x)mean(range(t[,x]))+c(-.535*maxrange,.535*maxrange))

p=ggplot(t,aes(!!p1,!!p2,group=-1L))+ # the group needs to be constant for geom_voronoi_tile; -1L is a convention
geom_voronoi_tile(aes(fill=color[as.factor(pop)],color=color[as.factor(pop)]),size=.07,max.radius=maxrange/40)+
geom_label(data=centers,aes(x=!!p1,y=!!p2,label=Group.1,color=color,fill=color),alpha=.7,size=2,label.r=unit(.08,"lines"),label.padding=unit(.08,"lines"),label.size=0)+
coord_fixed(xlim=lims[,1],ylim=lims[,2],expand=F)+
labs(x=pct[i],y=pct[i+1])+
scale_x_continuous(breaks=seq(-1,1,.05))+
scale_y_continuous(breaks=seq(-1,1,.05))+
scale_fill_manual(values=pal1)+
scale_color_manual(values=pal2)+
theme(
axis.text.y=element_text(angle=90,vjust=1,hjust=.5),
axis.text=element_text(color="black",size=6),
axis.ticks=element_blank(),
axis.ticks.length=unit(0,"pt"),
axis.title=element_text(color="black",size=8),
legend.position="none",
panel.background=element_rect(fill="white"),
panel.border=element_rect(color="gray85",fill=NA,size=.4),
panel.grid.major=element_line(color="gray85",size=.2)
)

ggsave(paste0(i,".png"),p,width=7,height=7) # pass the plot explicitly; inside a loop the plot isn't printed, so ggsave's default last_plot() isn't updated
}


uralaltaic-smartpca-voronoi.jpg


Plotting population averages in SmartPCA output

The clustering uses hierarchical clustering with the complete linkage method (farthest neighbor clustering), based on only the first 6 eigenvectors, after the eigenvectors have been multiplied by the square roots of the eigenvalues.

Code:
library(tidyverse)
library(ggforce)
library(ggrepel)

f="uralaltaic"
t=read.table(paste0(f,".evec"),row.names=1)
t=t[,-ncol(t)]
eig=as.double(readLines(paste0(f,".eval")))[1:ncol(t)]

pct=paste0("PC",seq(length(eig))," (",sprintf("%.1f",100*eig/sum(eig)),"%)")

ind=read.table("path/to/v44.3_HO_public.ind",row.names=1)

names(t)=c(paste0("PC",1:ncol(t)))
rownames(t)=sub(".*:","",rownames(t))
t=data.frame(aggregate(t,list(sub("\\.(DG|SDG|SG)","",ind[rownames(t),2])),mean),row.names=1)

t$k=cutree(hclust(dist(t(t(t)*sqrt(eig))[,1:6])),k=16)

ggplot(t,aes(x=PC1,y=PC2))+
geom_point(aes(color=as.factor(k)),size=.5)+
ggforce::geom_mark_hull(aes(color=as.factor(k),fill=as.factor(k)),concavity=100,radius=unit(.2,"cm"),expand=unit(.2,"cm"),alpha=.2,size=.2)+
# geom_polygon(data=t%>%group_by(k)%>%slice(chull(PC1,PC2)),alpha=.2,aes(color=as.factor(k),fill=as.factor(k)),size=.25)+
ggrepel::geom_text_repel(aes(label=rownames(t),color=as.factor(k)),max.overlaps=Inf,force=5,size=2,box.padding=.0,min.segment.length=.1,segment.size=.2)+
# geom_text(aes(label=rownames(t),color=as.factor(k)),size=2,vjust=-.6)+
scale_x_continuous(breaks=seq(-2,2,.02),expand=expansion(mult=.10))+
scale_y_continuous(breaks=seq(-2,2,.02),expand=expansion(mult=.06))+
labs(x=pct[1],y=pct[2])+
scale_color_manual(values=hcl(head(seq(0,360,length.out=length(unique(t$k))+1),-1),95,50))+
scale_fill_manual(values=hcl(head(seq(0,360,length.out=length(unique(t$k))+1),-1),120,60))+
theme(
aspect.ratio=1,
axis.text.y=element_text(angle=90,vjust=1,hjust=.5),
axis.text=element_text(color="black",size=6),
axis.ticks.length=unit(0,"pt"),
axis.ticks=element_blank(),
axis.title=element_text(color="black",size=8),
legend.position="none",
panel.background=element_rect(fill="white"),
panel.border=element_rect(color="gray80",fill=NA,size=.4),
panel.grid.major=element_line(color="gray80",size=.2),
plot.background=element_rect(fill="white")
)

ggsave("a.png",width=6,height=6)


uralaltaic-smartpca.jpg


Making a heatmap of population averages in an ADMIXTURE run

Code:
library(pheatmap)
library(colorspace) # for hex
library(vegan) # for reorder.hclust

t=read.table("uralaltaic.3.ave",row.names=1,header=F)

# t=t[,c(1,3,2)]

t=data.frame(aggregate(t,list(sub("\\.(DG|SDG|SG|WGA)","",rownames(t))),mean),row.names=1)

pheatmap(
100*t,
filename="a.png",
cluster_cols=F,
show_colnames=F,
clustering_callback=function(...)reorder(hclust(dist(t)),(1:ncol(t))%*%t(t)),
legend=F,
treeheight_row=80,
cellwidth=16,
cellheight=16,
fontsize=10,
border_color=NA,
display_numbers=T,
number_format="%.0f",
fontsize_number=8,
number_color="black",
color=colorRampPalette(hex(HSV(c(210,210,130,60,40,20,0),c(0,.5,.5,.5,.5,.5,.5),1)))(256)
)

ural-altaic-admixture.jpg


Making a ternary diagram of an ADMIXTURE run with 3 components

Plot population averages with clustering based on a second K=8 ADMIXTURE run:

Code:
library(tidyverse)
library(ggforce)
library(ggrepel)

f="uralaltaic"
t=read.table(paste0(f,".3.ave"),row.names=1)

t=data.frame(aggregate(t,list(sub("\\.(DG|SDG)","",rownames(t))),mean),row.names=1)

tri=rbind(c(0,sqrt(3)/3),c(-.5,-sqrt(3)/6),c(.5,-sqrt(3)/6))
xy=as.data.frame(as.matrix(t)%*%tri)

grid=apply(rbind(c(1,2,3,2),c(1,3,2,3),c(2,1,3,1)),1,function(x)cbind(
seq(tri[x[1],1],tri[x[2],1],length.out=11),
seq(tri[x[1],2],tri[x[2],2],length.out=11),
seq(tri[x[3],1],tri[x[4],1],length.out=11),
seq(tri[x[3],2],tri[x[4],2],length.out=11)
)%>%as.data.frame)%>%bind_rows

k8=read.table(paste0(f,".8.ave"),row.names=1)
xy$k=as.factor(cutree(hclust(dist(k8[rownames(t),])),k=10))

expand=.04
xlim=c(-.5-expand,.5+expand)
ylim=c(-sqrt(3)/6-expand,sqrt(3)/3+expand)

ggplot(xy,aes(x=V1,y=V2))+
# geom_polygon(data=as.data.frame(tri),fill="gray95")+ # uncomment to use a gray background for the triangle
geom_segment(data=grid,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray85",size=.3)+
geom_mark_hull(aes(group=k,color=k,fill=k),concavity=100,radius=unit(.15,"cm"),expand=unit(.15,"cm"),alpha=.2,size=.2)+
geom_point(aes(color=k),size=.5)+
geom_text_repel(aes(label=rownames(xy),color=k),max.overlaps=Inf,force=5,size=2.3,box.padding=.05,segment.size=.3,min.segment.length=.15)+
coord_fixed(expand=F,xlim=xlim,ylim=ylim)+
theme(
axis.text=element_blank(),
axis.ticks=element_blank(),
axis.title=element_blank(),
legend.position="none",
panel.background=element_rect(fill="white"),
plot.margin=margin(0,0,0,0)
)

ggsave("a.png",width=7,height=7*diff(ylim)/diff(xlim))


Plot individual samples with Voronoi tessellation:


Code:
library(tidyverse)
library(ggforce)

t=read.table("uralaltaic.3")
rownames(t)=paste0(t[,2],":",t[,1])
t=t[,-c(1,2)]

tri=rbind(c(0,sqrt(3)/3),c(-.5,-sqrt(3)/6),c(.5,-sqrt(3)/6))
xy=as.data.frame(as.matrix(t)%*%tri)

grid=apply(rbind(c(1,2,3,2),c(1,3,2,3),c(2,1,3,1)),1,function(x)cbind(
seq(tri[x[1],1],tri[x[2],1],length.out=11),
seq(tri[x[1],2],tri[x[2],2],length.out=11),
seq(tri[x[3],1],tri[x[4],1],length.out=11),
seq(tri[x[3],2],tri[x[4],2],length.out=11)
)%>%as.data.frame)%>%bind_rows

pop=sub(":.*","",rownames(xy))
pop=sub("\\.(DG|SDG|SG|WGA)","",pop)
centers=aggregate(xy,by=list(pop),mean)
xy$pop=pop

set.seed(1234)
color=as.factor(sample(seq(1,length(unique(xy$pop)))))
col=rbind(c(60,80),c(25,95),c(30,70),c(70,50),c(60,100),c(20,50),c(15,40))
hues=max(ceiling(length(color)/nrow(col)),2)
pal1=as.vector(apply(col,1,function(x)hcl(seq(15,375,length=hues+1)[1:hues],x[1],x[2])))
pal2=as.vector(apply(col,1,function(x)hcl(seq(15,375,length=hues+1)[1:hues],ifelse(x[2]>50,.8*x[1],.2*x[1]),ifelse(x[2]>50,.3*x[2],100))))

# add a small random factor to the coordinates; without this geom_voronoi_tile can fail because of too many overlapping points
xy$V1=xy$V1+runif(nrow(xy))/1e5
xy$V2=xy$V2+runif(nrow(xy))/1e5

expand=.05
xlim=c(-.5-expand,.5+expand)
ylim=c(-sqrt(3)/6-expand,sqrt(3)/3+expand)

ggplot(xy,aes(x=V1,y=V2,group=-1L))+ # the group needs to be a constant for geom_voronoi_tile; -1L is a convention
# geom_polygon(data=as.data.frame(tri),fill="gray95")+
geom_segment(data=grid,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray85",size=.3)+
geom_voronoi_tile(aes(fill=color[as.factor(pop)],color=color[as.factor(pop)]),size=.07,max.radius=.025)+
geom_label(data=centers,aes(x=V1,y=V2,label=Group.1,color=color,fill=color),alpha=.7,size=2,label.r=unit(.07,"lines"),label.padding=unit(.07,"lines"),label.size=0)+
coord_fixed(expand=F,xlim=xlim,ylim=ylim)+
scale_fill_manual(values=pal1)+
scale_color_manual(values=pal2)+
theme(
axis.text=element_blank(),
axis.ticks=element_blank(),
axis.title=element_blank(),
legend.position="none",
panel.background=element_rect(fill="white"),
plot.margin=margin(0,0,0,0,"in")
)

ggsave("a.png",width=7,height=7*diff(ylim)/diff(xlim))

uralaltaic-ternary.jpg


Calculating FST distances with SmartPCA

When the `fstonly: YES` parameter is specified, a matrix of FST values is printed to STDOUT as integers, with the values multiplied by a thousand. `phylipoutname: fstfilename` saves an FST matrix to a file, but in the file the FST values have only three digits after the decimal point. The `fsthiprecision: YES` parameter causes the FST values printed to STDOUT to be multiplied by a million instead of a thousand, but it doesn't affect the contents of the `phylipoutname` file.

In order to return FST values, the sixth field of the fam file needs to contain integers that identify population groups: https://www.biostars.org/p/266511/. The commands below use integers starting from 10 as group identifiers, because the numbers 1, 2, and 9 have a special meaning (1 assigns the line as a case, 2 assigns it as a control, and 9 ignores it).
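Here is a sketch of the renumbering with a toy pick file of individual and population per line; the first population becomes group 10, the second 11, and so on (this condenses the two awk passes of the block below into one, and the file name is made up):

Code:
```shell
printf '%s\n' 'ind1 Finnish' 'ind2 Finnish' 'ind3 Saami' > grp.pick
# assign each population an integer starting from 10, to avoid the special values 1, 2, and 9
awk '!a[$2]++{i++}{print $1,i+9}' grp.pick
# → ind1 10
# → ind2 10
# → ind3 11
```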

If an FST run includes more than 100 populations, SmartPCA exits with an error unless you include a parameter like `maxpops: 1000`.

Code:
x=uralic
printf %s\\n Besermyan Enets Estonian Finnish Hungarian Karelian Mansi Mordovian Nganasan Saami.DG Selkup Udmurt Veps>$x.pop
sed 1d v44.3_HO_public.anno|sort -t$'\t' -rnk15|awk -F\\t '!a[$3]++{print$2,$8}'|awk 'NR==FNR{a[$0];next}$2 in a' $x.pop ->$x.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x
awk '!a[$2]++{i++}{print$1,i}' $x.pick|awk 'NR==FNR{a[$1]=$2;next}{$6=a[$2]+9}1' - $x.fam>$x.famtemp;mv $x.fam{temp,}
smartpca -p <(printf %s\\n genotypename:\ $x.bed snpname:\ $x.bim indivname:\ $x.fam fstonly:\ YES phylipoutname:\ $x.fst fsthiprecision:\ YES)|tee $x.smartpca
p=$(cut -d' ' -f2 $x.pick|awk '!a[$0]++');sed -n '/fst \*1000000/,/^$/p' $x.smartpca|sed 1,2d|sed \$d|tr -s ' ' ,|cut -d, -f3-|paste -d, <(echo "$p") -|cat <(printf %s\\n '' "$p"|paste -sd,) ->$x.fst

I don't know whether you're supposed to do LD pruning before calculating FST, but Kerminen et al. 2021 said this: "We calculated pairwise-FST between the reference groups (Fig 2) and the ancestor candidate groups (S9 Fig) using SmartPCA of EIGENSOFT package[7] (fstonly: YES, fsthiprecision: YES) and 56,661 LD-independent variants."

Making an FST heatmap

The default clustering method used by `pheatmap` is equivalent to `hclust(dist(t))`, but here the input is already a distance matrix, so you can use this instead: `clustering_callback=function(...)hclust(as.dist(t))`. The code below additionally uses `vegan::reorder.hclust` to reorder the branches of the heatmap.

Code:
library(pheatmap)
library(colorspace) # for hex
library(vegan) # for reorder.hclust

t=read.csv("uralic.fst",row.names=1)

# recalculate population averages with suffixes like `.DG` removed
# tav=function(x,y)data.frame(aggregate(x,list(y),mean),row.names=1)
# rn=sub("\\.(DG|SDG|SG|WGA)","",rownames(t))
# t=tav(t(tav(t,rn)),rn)

# rand=sample(nrow(t),12);t=t[rand,rand]

# find population pair with highest distance and calculate weights for sorting clusters
max=which(t==max(t))[1]
weight=t[,max%%nrow(t)]-t[,max%/%nrow(t)+1]

pheatmap (
t/1000,
filename="a.png",
legend=F,
clustering_callback=function(...)reorder(hclust(as.dist(t)),weight),
# clustering_callback=function(...)reorder(hclust(as.dist(t)),cmdscale(as.dist(t))[,1]), # sort by first dimension in MDS
# clustering_callback=function(...)reorder(hclust(as.dist(t)),rowMeans(t)), # sort by average distance to other populations
# clustering_callback=function(...)hclust(as.dist(t)),
cellwidth=18,
cellheight=18,
fontsize=9,
treeheight_row=80,
treeheight_col = 80,
border_color=NA,
display_numbers=T,
number_format="%.0f",
fontsize_number=9,
number_color="black",
colorRampPalette(hex(HSV(c(210,210,120,60,45,30,15,0),c(0,.5,.5,.5,.5,.5,.5,.5),1)))(256)
)
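`which(t==max(t))[1]` in the script above returns a column-major linear index into the matrix, and the `%%` and `%/%` expressions convert it back into the row and column of the most distant population pair. The general arithmetic can be sketched like this (a 1-based index k in a matrix with n rows; the values are chosen arbitrarily):

Code:
```shell
# convert a 1-based column-major linear index k into (row, column) for an n-row matrix;
# k=7 in a 3-row matrix is the 1st row of the 3rd column
awk -v k=7 -v n=3 'BEGIN{row=(k-1)%n+1;col=int((k-1)/n)+1;print row,col}'
# → 1 3
```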

smartpca-fst-ld-pruning.gif


Compiling EIGENSOFT on macOS

The original PLINK website hosted by Harvard is no longer updated, and its Mac binaries didn't work on my computer. However, the new maintainer's website has working Mac binaries: https://www.cog-genomics.org/plink/. The code in this thread uses PLINK 1.9, not PLINK 2.

There are Mac and GNU/Linux binaries for ADMIXTURE here: https://github.com/NovembreLab/admixture.

There are Mac binaries for an old version of EIGENSOFT here: https://github.com/chrchang/eigensoft. However, the old version of SmartPCA is missing options such as `lsqproject` and `autoshrink`, so I compiled a newer version of EIGENSOFT manually: https://www.hsph.harvard.edu/alkes-price/software/.

I first ran `brew install gsl homebrew/science/openblas` as suggested in the README. It failed until I ran `brew uninstall --ignore-dependencies suite-sparse`.

I initially got errors like this because the C code included implicit function declarations:

Code:
$ cd EIG-7.2.1/src
$ make CFLAGS=' -I/usr/local/opt/openblas/include -I/usr/local/opt/gsl/include' LDFLAGS='-L/usr/local/opt/openblas/lib -L/usr/local/opt/gsl/lib'
cc -I/usr/local/opt/openblas/include -I/usr/local/opt/gsl/include -I../include -I/usr/include/openblas -c -o convertf.o convertf.c
convertf.c:257:5: error: implicit declaration of function 'setfamilypopnames' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
setfamilypopnames (YES);
^
1 error generated.
make: *** [convertf.o] Error 1

I was able to compile everything successfully by adding function prototypes to the source files manually. For example I added this to `src/eigensrc/smartpca.c`:

Code:
void setid2pops (char *id2pops, Indiv **indivmarkers, int numindivs);
void kjg_fpca (int numeigs, int fastdim, int fastiter, double *evals, double *evecs);

However I later found an easier way to compile EIGENSOFT on Mac. I had earlier tried to run `make CC=gcc`, but it used `/usr/bin/gcc` which is actually `clang` on macOS. `brew install gcc` installs a real GCC to `/usr/local/bin/gcc-10`.

First I ran `cd EIG-7.2.1/src/nicksrc;make`, because without it I got an error like this:

Code:
gauss.c:1:11: fatal error: ranmath.h: No such file or directory
1 | #include "ranmath.h"
| ^~~~~~~~~~~
compilation terminated.
make[1]: *** [gauss.o] Error 1
make: *** [nicksrc/libnick.a] Error 2

I don't know why it worked, but I was then able to compile everything successfully by first running `make` with GCC and the `-k` option (`make -k CC=gcc-10`), and then again with Clang (`make`).

Short version:

Code:
# brew uninstall --ignore-dependencies suite-sparse
# brew install gcc gsl homebrew/science/openblas
wget github.com/DReichLab/EIG/archive/v7.2.1.tar.gz
tar -xf v7.2.1.tar.gz
cd EIG-7.2.1/src/nicksrc
make
cd ..
make -k CC=/usr/local/bin/gcc-10 CFLAGS='-I/usr/local/opt/openblas/include -I/usr/local/opt/gsl/include' LDFLAGS='-L/usr/local/opt/openblas/lib -L/usr/local/opt/gsl/lib'
make CFLAGS='-I/usr/local/opt/openblas/include -I/usr/local/opt/gsl/include' LDFLAGS='-L/usr/local/opt/openblas/lib -L/usr/local/opt/gsl/lib'

I uploaded Mac binaries for EIGENSOFT here: https://drive.google.com/file/d/1H8k...ew?usp=sharing.


 
Here's a series of K=3 ADMIXTURE runs visualized as ternary diagrams. The clustering is based on a distance matrix generated from the admixture percentages in a K=6 run. I included only samples with the suffix ".DG" or ".SDG". I removed duplicate samples based on the version ID. For pairs of samples with PI_HAT of .25 or higher, I excluded the sample whose name was alphabetically first. I included at most two samples per population, because most of the populations with the suffix ".DG" only include one or two samples, and I didn't want other populations to be overrepresented.
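The "at most two samples per population" step can be sketched with a one-line awk filter over a two-column file of individual and population (the file name and contents below are made up for illustration):

Code:
```shell
printf '%s\n' 'ind1 Saami' 'ind2 Saami' 'ind3 Saami' 'ind4 Mansi' > two.pick
# keep only the first two individuals seen from each population
awk '++a[$2]<=2' two.pick
# → ind1 Saami
# → ind2 Saami
# → ind4 Mansi
```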

In the first run, Maasai have 28% of the West Eurasian component, but only Capoids and Bambutids have 100% of the African component when rounded to the nearest integer. Iranians, Finns, Turks, and Adygei all have 6% of the East Eurasian component. Papuans have 72% of the East Eurasian component and 12% of the African component. Ust'-Ishim has 15% of the African component.

In the second run, I removed SSAs and Saharans (Saharawi and Mozabite). Saami and Maori are now outliers, each forming its own cluster with no other members. The bottom-right component is maximal in early farmers and Southern Europeans, but Northern Europeans additionally carry part of the American component. Mansi and Saami are clearly shifted towards Americans. The sample labeled Mixtec_2 is probably mixed with Europeans, because it has 25% of the West Eurasian component.

The third run doesn't include Australians, Papuans, Americans, or Eskimos. Now Mansi are about half European and half East Asian, Kusunda are about half East Asian and half South Asian, and Makrani are about half South Asian and half European.

The fourth run doesn't include East Asians or Southeast Asians (which I considered to include Mongols and Northeastern Chinese ethnic groups but not Uyghurs). Now Chukchi and Mansi have more of the East Eurasian component than the West Eurasian component, even though in the previous run they had more of the West Eurasian component. (There is only one Chukchi sample which is very western and not representative of unmixed Chukchi.) Iranians are about 29% South Asian and 71% European.

In the fifth run, I removed South Asians and MENAs. I kept Hazara and Tajiks because I considered them to be Central Asian. Now Bulgarians have about half of the Caucasian component and half of the European component. Finns have the second highest proportion of the European component after Loschbour. Altaians have about 22% of the Caucasian component and 78% of the Siberian component.

ternary-admixture-world.jpg

Below are also ternary plots for European ADMIXTURE runs with 3 components.

The first run includes basically all European populations in the 1240K+HO dataset, except I didn't include Kalmyks or Nenetses, and I didn't include populations with the suffix ".SG" or ".WGA". I included at most 16 samples per population. The first run is interesting, because Lithuanians and Basques are almost equally close to the top of the triangle, so the top component seems like a WHG-like component. Finns have about 20-30% of the Nogai component and about 70-80% of the top component. Mordvins have on average 13% of the Caucasian component but Finns have 7%. In qpAdm and Vahaduo models I have made of Uralic peoples, Mordvins have also received more CHG ancestry than Baltic Finns.

The second run doesn't include North Caucasians, Nogais, or Bashkirs. Now the samples that are intermediate between the Volga-Ural region and Southern Europe include not only Gagauzes but also Moldovans.

ternary-admixture-europe.jpg



The script below makes a stacked bar chart of population averages in an ADMIXTURE run.

You can use the program `pong` to make a graph with thin bars for each individual: https://link.springer.com/protocol/1...-0716-0199-0_4, https://github.com/ramachandran-lab/pong.

Code:
library(tidyverse)
library(colorspace)
library(reshape2)

t=read.table("euro.4.ave")

t=t[,c(1,4,3,5,2)]

t=aggregate(t[,-1],list(sub("\\.(DG|SDG)","",t[,1])),mean)
names(t)[1]="V1"

t=t[order(as.matrix(t[,-1])%*%seq(ncol(t)-1)^2),]

t2=reshape2::melt(t,id.vars="V1") # this uses melt because pivot_longer doesn't preserve the order of columns

lab=round(100*t2$value)
lab[lab==0]=""

ggplot(t2,aes(x=fct_rev(factor(V1,level=t$V1)),y=value,fill=variable))+
geom_bar(stat="identity",width=1,position=position_fill(reverse=T))+
geom_text(aes(label=lab),position=position_stack(vjust=.5,reverse=T),size=3.5)+
coord_flip()+
scale_x_discrete(expand=c(0,0))+
scale_y_discrete(expand=c(0,0))+
scale_fill_manual(values=colorspace::hex(HSV(c(300,210,45,20),.5,1)))+ # manual colors
# scale_fill_manual(values=colorspace::hex(HSV(head(seq(0,360,length.out=ncol(t)),-1),.5,1)))+ # automatic colors
theme(
axis.text=element_text(color="black",size=11),
axis.text.x=element_blank(),
axis.ticks=element_blank(),
axis.title=element_blank(),
legend.position="none"
)

ggsave("a.png",width=5,height=.2*nrow(t)+.3)

admixture-stacked.png


admixture-ternary-euro.png


This time I didn't exclude samples based on PI_HAT, because there was no pair of samples which had a PI_HAT value of .25 or above after LD pruning.

Code:
curl -LsO reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.tar;tar -xf v44.3_HO_public.tar
f=v44.3_HO_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
x=euro
printf %s\\n Abazin Adygei Adygei.SDG Albanian Avar Balkar Bashkir Basque Basque.SDG Belarusian Besermyan Bulgarian Chechen Chuvash Circassian Cretan.DG Croatian Czech Darginian English Estonian Finnish Finnish.DG French French.SDG Gagauz Greek Hungarian Icelandic Ingushian Italian_North Italian_South Jew_Ashkenazi Kabardinian Kaitag Kalmyk Karachai Karelian Kumyk Lak Lezgin Lezgin.DG Lithuanian Maltese Moldavian Mordovian Nogai_Astrakhan Nogai_Karachay_Cherkessia Nogai_Stavropol Norwegian Norwegian.DG Orcadian Orcadian.SDG Ossetian Polish.DG Romanian Russian Russian.SDG Russian_Archangelsk_Krasnoborsky Russian_Archangelsk_Leshukonsky Russian_Archangelsk_Pinezhsky Saami.DG Sardinian Scottish Sicilian Spanish Spanish_North Tabasaran Tatar_Kazan Tatar_Mishar Udmurt Ukrainian Ukrainian_North Veps>$x.pop
sed 1d v44.3_HO_public.anno|sort -t$'\t' -rnk15|awk -F\\t '!a[$3]++{print$2,$8}'|awk 'NR==FNR{a[$0];next}$2 in a' $x.pop ->$x.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x
plink --allow-no-sex --bfile $x --indep-pairwise 50 10 .05 --out $x;plink --bfile $x --extract $x.prune.in --make-bed --out $x.pruned
k=4;admixture -j4 -C .1 $x.pruned.bed $k
tav()(awk '{n[$1]++;for(i=2;i<=NF;i++)a[$1,i]+=$i}END{for(i in n){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i,j]/n[i]);print o}}' "FS=${1-$'\t'}")
paste -d' ' <(awk 'NR==FNR{a[$1]=$2;next}{print$2,a[$2]}' $x.pick $x.pruned.fam) $x.pruned.$k.Q>$x.$k;cut -d' ' -f2- $x.$k|tav \ >$x.$k.ave

You can get reasonable ADMIXTURE results even with ancient samples if you only include samples with a high SNP count:

admixture-stacked-ancient.png

admixture-ternary-before-5000-bp.png


In the run above, I selected samples with over 500,000 SNPs, latitude over 30, longitude over -30, and years BP over 5,000. I removed duplicate samples by selecting unique samples by the version ID in the third field of the anno file.


Code:
curl -sO https://reichdata.hms.harvard.edu/p...4/V44.3/SHARE/public.dir/v44.3_HO_public.anno
igno()(grep -Ev '\.REF|rel\.|fail\.|Ignore_|_dup|_contam|_lc|_father|_mother|_son|_daughter|_brother|_sister|_sibling|_twin|Neanderthal|Denisova|Vindija_light|Gorilla|Macaque|Marmoset|Orangutang|Primate_Chimp|hg19ref')
awk -F\\t '$15>5e5&&$11>30&&$12>-30&&$12<180&&$6>5000&&!a[$3]++{print$2,$8}' v44.3_HO_public.anno|igno
 
I now figured out how to combine a stacked bar chart with a dendrogram:

Code:
library(tidyverse)
library(ggdendro)
library(vegan)
library(colorspace)

t=read.csv("https://pastebin.com/raw/CAvFKS5L",row.names=1,header=F)
fst=read.csv("https://pastebin.com/raw/d87iA1NH",row.names=1)

t=t[,c(3,2,4,5,1,6)]
names(t)=paste0("V",1:ncol(t))

# do clustering based on FST
fst=fst[rownames(t),rownames(t)]
fst[fst<0]=0
maxfst=which(fst==max(fst))[1]
hc=reorder(hclust(as.dist(fst)),fst[,maxfst%%nrow(fst)]-fst[,maxfst%/%nrow(fst)+1])

# do clustering based on ADMIXTURE
# hc=hclust(dist(t),method="ward.D2")
# hc=reorder(hc,wts=-as.matrix(t)%*%seq(ncol(t))^2)

tree=ggdendro::dendro_data(as.dendrogram(hc),type="triangle")

p1=ggplot(ggdendro::segment(tree))+
geom_segment(aes(x=y,y=x,xend=yend,yend=xend),lineend="round",size=.5)+
scale_x_continuous(expand=expansion(mult=c(0,.01)))+
scale_y_continuous(limits=.5+c(0,nrow(t)),expand=c(0,0))+
theme(
axis.text=element_blank(),
axis.ticks=element_blank(),
axis.ticks.length=unit(0,"pt"),
axis.title=element_blank(),
panel.background=element_rect(fill="white"),
panel.grid=element_blank(),
plot.margin=margin(5,5,5,0)
)

t=t[hc$labels[hc$order],]
t2=data.frame(V1=rownames(t)[row(t)],V2=colnames(t)[col(t)],V3=unname(do.call(c,t)))
lab=round(100*t2$V3)
lab[lab==0]=""

p2=ggplot(t2,aes(x=factor(V1,level=rownames(t)),y=V3,fill=V2))+
geom_bar(stat="identity",width=1,position=position_fill(reverse=T))+
geom_text(aes(label=lab),position=position_stack(vjust=.5,reverse=T),size=3.5)+
coord_flip()+
scale_x_discrete(expand=c(0,0))+
scale_y_discrete(expand=c(0,0))+
scale_fill_manual(values=hex(HSV(c(40,180,90,140,270,330),.5,1)))+
# scale_fill_manual(values=colorspace::hex(HSV(head(seq(0,360,length.out=ncol(t)+1),-1),.5,1)))+
theme(
axis.text=element_text(color="black",size=11),
axis.text.x=element_blank(),
axis.ticks=element_blank(),
axis.title=element_blank(),
legend.position="none",
plot.margin=margin(5,0,5,5)
)

cowplot::plot_grid(p2,p1,rel_widths=c(1,.4))
ggsave("a.png",height=.25*nrow(t),width=7)

Below the clustering is based on an FST matrix generated by SmartPCA. The branches were sorted by first finding the pair of populations with the highest FST distance, which was Tofalars and Gagauzes, and then by arranging branches closer to Tofalars to the top and branches closer to Gagauzes to the bottom.

admixture-stacked-dendro.png


BTW you can use the code below to manually find outliers in a SmartPCA run. It finds samples whose position on any of the first 8 eigenvectors is 3 SDs or more away from the mean.

Code:
> t=read.table("smartpcarun.evec",row.names=1,skip=1)
> pop=read.table("v44.3_HO_public.ind",row.names=1)
> ind=sub(".*:","",rownames(t))
> lab=paste0(pop[ind,]$V3,":",ind)
> lab[which(apply(abs(scale(t[,1:8]))>=3,1,any))]
[1] "Altaian:ALT-876" "Gagauz:GAG-220"
[3] "Gagauz:GAG-226" "Shor_Khakassia:KHS-001"
[5] "Kazakh:KZH-1614" "Kazakh:KZH-1722"
[7] "Shor_Mountain:Shor-262" "Tubalar:Tuba21"
[9] "Tubalar:Tuba6" "Tubalar:Tuba24"
[11] "Tubalar:Tuba9" "Tubalar:Tuba2"
[13] "Turkish_Balikesir:Balikesir16790" "Turkish:Adana23108"
[15] "Turkish:Adana23136" "Salar:SL1"
[17] "Salar:SL2" "Salar:SL3"
[19] "Salar:SL6" "Salar:SL7"
[21] "Salar:SL9" "Salar:SL15"
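The `scale()` trick above can be sanity-checked on synthetic data: `scale()` converts each column to z-scores, and a row is flagged when any of its absolute z-scores reaches the threshold. A minimal sketch with made-up data:

```r
set.seed(1)
m=matrix(rnorm(800),ncol=8) # 100 samples x 8 eigenvectors
m[7,3]=20 # plant one extreme sample
# flags row 7 (plus any chance hits that happen to exceed 3 SDs)
which(apply(abs(scale(m))>=3,1,any))
```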
 
The script in this post plots the output of SmartPCA with convex hulls drawn around the samples from each population. It uses a color palette with three different saturation-brightness combinations, and it makes a separate plot for each consecutive pair of the first 8 PCs. It uses `ggforce::geom_mark_hull` to draw convex hulls with rounded corners, but you can change `concavity=100` to something like `concavity=3` to draw concave hulls instead. The plot area is always square, so you can tile multiple plots nicely.

Code:
library(tidyverse)
library(ggforce)
library(ggrepel)
library(colorspace)

f="smartpcarunprefix"
t=read.csv(paste0(f,".csv"),header=F,row.names=1)
names(t)=paste0("PC",1:ncol(t))
eig=head(as.double(readLines(paste0(f,".eval"))),ncol(t))
pct=paste0("PC",seq(length(eig))," (",sprintf("%.1f",100*eig/sum(eig)),"%)")

t=t[!apply(abs(scale(t[,1:8]))>3,1,any),] # remove outliers whose position is 3 or more SDs away from zero on one of the first eight PCs

ranges=apply(t,2,function(x)abs(max(x)-min(x)))

pop=sub(":.*","",rownames(t))
pop=sub("\\.(DG|SDG|SG|WGA)","",pop)
centers=aggregate(t,by=list(pop),mean)
t$pop=pop

set.seed(0)
color=as.factor(sample(seq(1,length(unique(t$pop)))))
col=rbind(c(.5,.9),c(.25,1),c(.3,.8))
hues=max(ceiling(length(color)/nrow(col)),2)
pal1=as.vector(apply(col,1,function(x)hex(HSV(seq(0,360,length=hues+1)[1:hues],x[1],x[2]))))
pal2=as.vector(apply(col,1,function(x)hex(HSV(rep(0,hues),0,ifelse(x[2]>=.7,0,1)))))

for(i in seq(1,7,2)){
p1=sym(paste0("PC",i))
p2=sym(paste0("PC",i+1))

maxrange=max(ranges[c(i,i+1)])
lims=sapply(c(i,i+1),function(x)mean(range(t[,x]))+c(-.52*maxrange,.52*maxrange))

ggplot(t,aes(!!p1,!!p2))+
ggforce::geom_mark_hull(aes(group=pop),color=pal1[color[as.factor(pop)]],fill=pal1[color[as.factor(pop)]],concavity=100,radius=unit(.15,"cm"),expand=unit(.15,"cm"),alpha=.2,size=.2)+
geom_point(aes(x=!!p1,y=!!p2),color=pal1[color[as.factor(t$pop)]],size=.3)+
geom_point(data=centers,aes(x=!!p1,y=!!p2),color=pal1[color],size=2)+
geom_label_repel(data=centers,aes(x=!!p1,y=!!p2,label=Group.1),color=pal2[color],fill=pal1[color],force=3,size=2.2,label.r=unit(.1,"lines"),label.padding=unit(.1,"lines"),label.size=0,max.overlaps=Inf,box.padding=.15,min.segment.length=0,segment.size=.4,segment.color=pal1[color],alpha=.8,direction="y")+
labs(x=pct[i],y=pct[i+1])+
coord_fixed(xlim=lims[,1],ylim=lims[,2],expand=F)+
scale_x_continuous(breaks=seq(-1,1,.05))+
scale_y_continuous(breaks=seq(-1,1,.05))+
theme(
axis.text.x=element_text(margin=margin(.2,0,0,0,"cm")),
axis.text.y=element_text(angle=90,vjust=1,hjust=.5,margin=margin(0,.2,0,0,"cm")),
axis.text=element_text(color="black",size=6),
axis.ticks.length=unit(-.13,"cm"),
axis.ticks=element_line(size=.2,color="black"),
axis.title=element_text(color="black",size=8),
legend.position="none",
panel.background=element_rect(fill="gray35"),
panel.border=element_rect(color="black",fill=NA,size=.4),
panel.grid=element_blank(),
plot.background=element_rect(fill="gray35")
)

f=paste0("pc",i,".png")
ggsave(f,width=7,height=7)
}


smartpca-gray-hulls.jpg
 
You can do a radial dendrogram based on an FST matrix like this. For a regular vertical dendrogram, just remove `type="fan"`. `font=1` disables using an italic font and `underscore=T` disables replacing underscores with spaces.

Code:
library(vegan) # for reorder.hclust
library(ape) # for plot.phylo

t=read.csv("https://pastebin.com/raw/d87iA1NH",row.names=1)
t[t<0]=0

hc=hclust(as.dist(t))

max=which(t==max(t))[1]
hc=reorder(hc,t[,max%%nrow(t)]-t[,max%/%nrow(t)+1])

png("a.png",w=1200,h=1200)
plot(as.phylo(hc),cex=1.5,font=1,underscore=T,type="fan")
dev.off()
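The `%%` and `%/%` arithmetic above recovers the row and column of the maximum from the linear index returned by `which()`, because R stores matrices column-major; and since an FST matrix is symmetric, the row index can itself be used as a column selector. A small sketch of the arithmetic (note it is off by one when the index lands in the last row, which rarely matters here):

```r
m=matrix(1:12,nrow=3) # column-major: the value 8 sits at row 2, column 3
i=which(m==8)[1] # linear index 8
i%%nrow(m) # 2, the row
i%/%nrow(m)+1 # 3, the column
```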

radial-dendrogram.png
 
The scripts in this post plot population averages in the output of SmartPCA, where each population is connected with a line to the three populations it has the lowest FST distance to. The first script draws hulls around the populations based on the FST matrix, and the second script sets the color of each population based on its FST distance to a specified population.

Code:
library(tidyverse)
library(cowplot)

f="smartpcarunprefix"
t=read.csv(paste0(f,".ave"),header=F,row.names=1)
names(t)=paste0("PC",1:ncol(t))
eig=head(as.double(readLines(paste0(f,".eval"))),ncol(t))
pct=paste0("PC",seq(length(eig))," (",sprintf("%.1f",100*eig/sum(eig)),"%)")

ranges=apply(t,2,function(x)abs(max(x)-min(x)))

fst=read.csv(paste0(f,".fst"),row.names=1)[rownames(t),rownames(t)]
t$k=cutree(hclust(as.dist(fst)),k=10)

set.seed(0)
color=as.factor(sample(seq(1,length(unique(t$k)))))
col=rbind(c(80,80),c(40,100),c(90,80))
hues=max(ceiling(length(color)/nrow(col)),10)
pal1=as.vector(apply(col,1,function(x)hcl(seq(0,360,length=hues+1)[1:hues],x[1],x[2])))

plots=lapply(seq(1,3,2),function(xpc){
seg0=lapply(1:3+1,function(i)apply(fst,1,function(x)unlist(t[names(sort(x))[i],c(xpc,xpc+1)],use.names=F))%>%t%>%cbind(t[,c(xpc,xpc+1)]))
seg=do.call(rbind,seg0)%>%setNames(paste0("V",1:4))

maxrange=max(ranges[c(xpc,xpc+1)])
lims=sapply(c(xpc,xpc+1),function(x)mean(range(t[,x]))+c(-.52*maxrange,.52*maxrange))

p1=sym(paste0("PC",xpc))
p2=sym(paste0("PC",xpc+1))

ggplot(t,aes(!!p1,!!p2))+
ggforce::geom_mark_hull(aes(group=k),color=pal1[color[as.factor(t$k)]],fill=pal1[color[as.factor(t$k)]],concavity=100,radius=unit(.15,"cm"),expand=unit(.15,"cm"),alpha=.2,size=.2)+
geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray70",size=.2)+
geom_point(aes(x=!!p1,y=!!p2),color=pal1[color[as.factor(t$k)]],size=1)+
geom_text(aes(x=!!p1,y=!!p2,label=rownames(t)),vjust=-.7,color=pal1[color[as.factor(t$k)]],size=2.2)+
labs(x=pct[xpc],y=pct[xpc+1])+
coord_fixed(xlim=lims[,1],ylim=lims[,2],expand=F)+
scale_x_continuous(breaks=seq(-1,1,.02))+
scale_y_continuous(breaks=seq(-1,1,.02))+
theme(
axis.text=element_text(color="black",size=6),
axis.ticks=element_blank(),
axis.ticks.length=unit(0,"pt"),
axis.text.y=element_text(angle=90,vjust=1,hjust=.5),
axis.title=element_text(color="black",size=8),
legend.position="none",
panel.background=element_rect(fill="gray40"),
panel.border=element_rect(color="gray30",fill=NA,size=.4),
panel.grid.major=element_line(color="gray33",size=.2),
panel.grid.minor=element_blank(),
plot.background=element_rect(fill="gray40",color=NA)
)
})

plot_grid(plotlist=plots)
ggsave("a.png",height=6,width=12)


smartpca-neighbor.jpg


Code:
library(tidyverse)
library(cowplot)

f="smartpcarunprefix"
t=read.csv(paste0(f,".ave"),header=F,row.names=1)
names(t)=paste0("PC",1:ncol(t))
eig=head(as.double(readLines(paste0(f,".eval"))),ncol(t))
pct=paste0("PC",seq(length(eig))," (",sprintf("%.1f",100*eig/sum(eig)),"%)")

ranges=apply(t,2,function(x)abs(max(x)-min(x)))

fst=read.csv(paste0(f,".fst"),row.names=1)[rownames(t),rownames(t)]
target="Mansi"
dist=fst[,target]/1e6

plots=lapply(seq(1,3,2),function(xpc){
seg0=lapply(1:3+1,function(i)apply(fst,1,function(x)unlist(t[names(sort(x))[i],c(xpc,xpc+1)],use.names=F))%>%t%>%cbind(t[,c(xpc,xpc+1)]))
seg=do.call(rbind,seg0)%>%setNames(paste0("V",1:4))

maxrange=max(ranges[c(xpc,xpc+1)])
lims=sapply(c(xpc,xpc+1),function(x)mean(range(t[,x]))+c(-.53*maxrange,.53*maxrange))

p1=sym(paste0("PC",xpc))
p2=sym(paste0("PC",xpc+1))

ggplot(t,aes(!!p1,!!p2))+
geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray70",size=.2)+
geom_point(aes(x=!!p1,y=!!p2,color=dist),size=.8)+
geom_text(aes(x=!!p1,y=!!p2,label=rownames(t),color=dist),vjust=-.7,size=2.2)+
labs(x=pct[xpc],y=pct[xpc+1])+
coord_fixed(xlim=lims[,1],ylim=lims[,2],expand=F)+
scale_x_continuous(breaks=seq(-1,1,.02))+
scale_y_continuous(breaks=seq(-1,1,.02))+
scale_color_gradientn(colors=hcl(c(240,220,200,170,120,70,30,0,320,280)+15,100,70),name=paste("FST distance to",target),breaks=seq(0,max(dist),.01),limits=c(0,max(dist)))+
guides(color=guide_colorbar(ticks=F))+
theme(
axis.text=element_text(color="black",size=6),
axis.ticks=element_line(size=.2,color="gray60"),
axis.ticks.length=unit(-.13,"cm"),
axis.text.x=element_text(margin=margin(.2,0,0,0,"cm")),
axis.text.y=element_text(angle=90,vjust=1,hjust=.5,margin=margin(0,.2,0,0,"cm")),
axis.title=element_text(color="black",size=8),
legend.direction="horizontal",
legend.justification="left",
legend.key.height=unit(.5,"cm"),
legend.key.width=unit(1,"cm"),
legend.margin=margin(0,0,-7,0),
legend.position=c(.02,.97),
legend.spacing.y=unit(.05,"cm"),
legend.text=element_text(size=6,vjust=.5),
legend.title=element_text(size=8,vjust=.7),
panel.background=element_rect(fill="white"),
panel.border=element_rect(color="gray60",fill=NA,size=.4),
panel.grid=element_blank()
)
})

plot_grid(plots[[1]],plots[[2]]+theme(legend.position="none"))
ggsave("a.png",height=6,width=12)


smartpca-neighbor2.jpg
 
The script below uses the `ComplexHeatmap` package to combine heatmaps for ADMIXTURE runs at multiple K values: https://jokergoo.github.io/ComplexHe...eference/book/.

The clustering is based on a single wide matrix formed by joining the Q matrices of all runs. The clustering tree is ordered based on a K=3 run, with populations that have a high proportion of the first component moved to the top and populations with a high proportion of the third component moved to the bottom.
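The tree ordering relies on vegan's `reorder.hclust`; base R's `reorder.dendrogram` behaves similarly and makes for a quick self-contained illustration of how the leaf order follows the supplied weights (the data here is made up):

```r
# six points in three well-separated pairs
m=matrix(c(0,1,10,11,20,21),dimnames=list(letters[1:6],NULL))
hc=hclust(dist(m))
d=reorder(as.dendrogram(hc),-m[,1]) # negative weights move large values first
labels(d) # leaves now come out ordered by the weights
```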

Code:
library(ComplexHeatmap)
library(circlize) # for colorRamp2
library(colorspace) # for hex
library(vegan) # for reorder.hclust (may be masked by seriation::reorder_hclust)

kvals=c(2,3,4,5,6,8)

# columnorder=lapply(kvals,seq)
columnorder=list(c(2,1),c(3,2,1),c(1,2,4,3),c(3,1,5,4,2),c(3,2,4,5,1,6),c(2,3,1,7,4,8,6,5))

mats=sapply(1:length(kvals),function(i)100*read.table(paste0("turk.",kvals[i],".ave"),row.names=1)[,columnorder[[i]]])
comb=do.call(cbind,mats)

png("a.png",w=5000,h=5000,res=144)

maps=sapply(kvals,function(k){
mat=as.matrix(mats[match(k,kvals)][[1]])
Heatmap(
mat,
show_heatmap_legend=F,
show_column_names=F,
show_row_names=F,
clustering_distance_rows="euclidean",
width=ncol(mat)*unit(30,"pt"),
height=nrow(mat)*unit(30,"pt"),
row_dend_width=unit(200,"pt"),
cluster_columns=F,
cluster_rows=reorder(hclust(dist(comb)),mats[[2]][,3]-mats[[2]][,1]),
column_title=paste0("K=",k),column_title_gp=gpar(fontsize=24),
right_annotation=rowAnnotation(text1=anno_text(gt_render(rownames(mat),padding=unit(c(2,2,2,2),"mm")),just="left",location=unit(0,"npc"),gp=gpar(fontsize=17))),
col=colorRamp2(seq(0,100,length.out=7),hex(HSV(c(210,210,130,60,40,20,0),c(0,rep(.5,6)),1))),
cell_fun=function(j,i,x,y,w,h,fill)grid.text(sprintf("%.0f",mat[i,j]),x,y,gp=gpar(fontsize=15))
)
})

draw(Reduce("+",maps))
dev.off()
system("mogrify -gravity center -trim -border 16 -bordercolor white a.png")



admixture-complexheatmap.jpg


I also modified the ternary diagram script I posted earlier so that it can now be used to create a square diagram of a K=4 run or a hexagonal diagram of a K=6 run. However, there is no longer a one-to-one correspondence between the coordinates in the plot and the proportions of the admixture components. For example, in the square diagram, a sample in the center of the plot can have either 25% of all four components or 50% of the top component and 50% of the bottom component.
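The projection itself is just a matrix product of the Q matrix with the corner coordinates, which is why the mapping is many-to-one. A quick check with the diamond corners used below:

```r
tri=rbind(c(0,1),c(1,0),c(0,-1),c(-1,0)) # diamond corners for K=4
q=rbind(c(.25,.25,.25,.25),c(.5,0,.5,0)) # two different compositions
q%*%tri # both rows project to the same point, (0,0)
```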

I also found a good way to use `ggrepel` to avoid overlapping labels: setting `point.size=0`, which repels the labels away from other labels only, not away from the points.

Code:
library(tidyverse)
library(ggforce)
library(ggrepel)

t=read.table("https://pastebin.com/raw/V5RjCpmk") # K=4
# t=read.table("https://pastebin.com/raw/1j20UVLD") # K=6

row.names(t)=paste0(t[,2],":",t[,1])
t=t[,-c(1,2)]

tri=rbind(c(0,1),c(1,0),c(0,-1),c(-1,0)) # diamond (for K=4)
# tri=rbind(c(1,1),c(1,-1),c(-1,-1),c(-1,1)) # square (for K=4)
# tri=rbind(c(.5,sqrt(3)/2),c(1,0),c(.5,-sqrt(3)/2),c(-.5,-sqrt(3)/2),c(-1,0),c(-.5,sqrt(3)/2)) # hexagon (for K=6)

xy=as.data.frame(as.matrix(t)%*%tri)

grid=apply(rbind(c(1,2,4,3),c(1,4,2,3),c(1,2,1,4),c(3,2,3,4),c(4,1,4,3),c(2,1,2,3)),1,function(x)cbind(
seq(tri[x[1],1],tri[x[2],1],length.out=11),
seq(tri[x[1],2],tri[x[2],2],length.out=11),
seq(tri[x[3],1],tri[x[4],1],length.out=11),
seq(tri[x[3],2],tri[x[4],2],length.out=11)
)%>%as.data.frame)%>%bind_rows

# grid=as.data.frame(rbind(cbind(tri,rbind(tri[-1,],tri[1,])),cbind(tri,matrix(rep(0,12),ncol=2)))) # for hexagon

pop=sub(":.*","",rownames(xy))
pop=sub("\\.(DG|SDG|SG|WGA)","",pop)
centers=aggregate(xy,by=list(pop),mean)
xy$pop=pop

set.seed(1234)
color=as.factor(sample(seq(1,length(unique(xy$pop)))))
col=rbind(c(60,80),c(25,95),c(30,70),c(70,50),c(60,100),c(20,50),c(15,40))
hues=max(ceiling(length(color)/nrow(col)),8)
pal1=as.vector(apply(col,1,function(x)hcl(seq(15,375,length=hues+1)[1:hues],x[1],x[2])))
pal2=as.vector(apply(col,1,function(x)hcl(seq(15,375,length=hues+1)[1:hues],ifelse(x[2]>50,.8*x[1],.2*x[1]),ifelse(x[2]>50,.3*x[2],100))))

# add a small random factor so geom_voronoi_tile won't fail because of too many overlapping points
xy$V1=xy$V1+runif(nrow(xy))/1e4
xy$V2=xy$V2+runif(nrow(xy))/1e4

ggplot(xy,aes(x=V1,y=V2,group=-1L))+
geom_segment(data=grid,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray90",size=.3)+
geom_voronoi_tile(aes(fill=color[as.factor(pop)],color=color[as.factor(pop)]),size=.07,max.radius=.05)+
# geom_point(data=centers,aes(x=V1,y=V2,color=color,fill=color),shape=21,size=2)+
# geom_label(data=centers,aes(x=V1,y=V2,label=Group.1,color=color,fill=color),alpha=.7,size=2,label.r=unit(.07,"lines"),label.padding=unit(.07,"lines"),label.size=0)+
geom_label_repel(data=centers,aes(x=V1,y=V2,label=Group.1,color=color,fill=color),size=2.2,label.r=unit(.1,"lines"),label.padding=unit(.1,"lines"),label.size=.1,max.overlaps=Inf,box.padding=0,point.size=0)+
coord_fixed(xlim=c(-1.08,1.08),ylim=c(-1.08,1.08),expand=F)+
scale_fill_manual(values=pal1)+
scale_color_manual(values=pal2)+
theme(
axis.text=element_blank(),
axis.ticks=element_blank(),
axis.title=element_blank(),
legend.position="none",
panel.background=element_rect(fill="white"),
plot.margin=margin(0,0,0,0,"cm")
)

ggsave("a.png",width=7,height=7)

admixture-voronoi-square-hexagon.jpg


This recreates the data I used in this post:

Code:
curl -LsO reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.tar;tar -xf v44.3_HO_public.tar
f=v44.3_HO_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
printf %s\\n Altaian Altaian_Chelkan Azeri Balkar Bashkir Chuvash Dolgan Gagauz Karachai Karakalpak Kazakh Kazakh_China Khakass Khakass_Kachin Kumyk Kyrgyz_China Kyrgyz_Kyrgyzstan Kyrgyz_Tajikistan Nogai Nogai_Astrakhan Nogai_Karachay_Cherkessia Nogai_Stavropol Salar Shor_Khakassia Shor_Mountain Tatar_Kazan Tatar_Mishar Tatar_Siberian Tatar_Siberian_Zabolotniye Todzin Tofalar Tubalar Turkish Turkish_Balikesir Turkmen Tuvinian Uyghur Uzbek Yakut>pop
x=turk;sed 1d v44.3_HO_public.anno|sort -t$'\t' -rnk15|awk -F\\t '!a[$3]++{print$2,$8}'|awk 'NR==FNR{a[$0];next}$2 in a' pop ->$x.temp.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.temp.pick v44.3_HO_public.fam) --make-bed --out $x.temp
plink --allow-no-sex --bfile $x.temp --genome --out $x
awk 'FNR>1&&$10>=.25{print$2<$4?$2:$4}' $x.genome|awk 'NR==FNR{a[$0];next}!($1 in a)' - $x.temp.pick>$x.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x
plink --allow-no-sex --bfile $x --indep-pairwise 50 10 .05 --out $x
plink --bfile $x --extract $x.prune.in --make-bed --out $x.pruned
tav()(awk '{n[$1]++;for(i=2;i<=NF;i++)a[$1,i]+=$i}END{for(i in n){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i,j]/n[i]);print o}}' "FS=${1-$'\t'}")
for k in 2 3 4 5 6 8;do admixture -j4 -C .1 $x.pruned.bed $k;paste -d' ' <(awk 'NR==FNR{a[$1]=$2;next}{print$2,a[$2]}' $x.pick $x.pruned.fam) $x.pruned
 
The scripts in this post visualize a series of ADMIXTURE runs as polygons where each corner of the polygon represents one admixture component.

The second script concatenates the matrices of all K values into a single wide matrix, which is then used to cluster the populations and to connect each population to its three closest neighbors.
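The neighbor segments work by sorting each population's row of the distance matrix and taking the coordinates of the i-th closest entry; the closest entry is always the population itself at distance 0, which just draws a harmless zero-length segment. A toy illustration (the names are made up):

```r
xy=data.frame(V1=c(0,1,3),V2=c(0,0,0),row.names=c("a","b","c"))
d=as.matrix(dist(xy)) # pairwise distances, labelled by row names
names(sort(d["a",]))[1:2] # "a" (itself), then "b" (its nearest neighbor)
```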

Individual samples, light color scheme:

Code:
library(tidyverse)
library(ggforce)
library(ggrepel)
library(magrittr) # for set_rownames

for(n in 3:8){
t=read.table(paste0("euro5.i.",n))%>%set_rownames(paste0(.[,2],":",.[,1]))%>%select(-c(1,2))

t=t[,list(c(2,1,3),c(4,3,2,1),c(2,5,4,3,1),c(4,1,2,3,6,5),c(1,7,5,3,2,6,4),c(2,3,7,4,1,8,5,6))[[n-2]]]

corners=sapply(c(sin,cos),function(x)head(x(seq(0,2,length.out=n+1)*pi),-1))
corners=corners*min(2/diff(apply(corners,2,range)))
corners[,2]=corners[,2]-mean(range(corners[,2]))

xy=as.data.frame(as.matrix(t)%*%corners)
grid=as.data.frame(rbind(cbind(corners,rbind(corners[-1,],corners[1,])),cbind(corners,matrix(apply(corners,2,mean),ncol=2,nrow=n,byrow=T))))

pop=rownames(xy)%>%sub(":.*","",.)%>%sub("\\.(DG|SDG|SG|WGA)","",.)
centers=aggregate(xy,by=list(pop),mean)
xy$pop=pop

set.seed(0)
color=as.factor(sample(seq(1,length(unique(xy$pop)))))
cl=rbind(c(60,80),c(25,95),c(30,70),c(70,50),c(60,100),c(20,50),c(15,40))
hues=max(ceiling(length(color)/nrow(cl)),2)
pal1=as.vector(apply(cl,1,function(x)hcl(head(seq(15,375,length=hues+1),-1),x[1],x[2])))
pal2=as.vector(apply(cl,1,function(x)hcl(head(seq(15,375,length=hues+1),-1),ifelse(x[2]>=60,.5*x[1],.1*x[1]),ifelse(x[2]>=60,.2*x[2],95))))

xy$V1=xy$V1+runif(nrow(xy))/1e3
xy$V2=xy$V2+runif(nrow(xy))/1e3

lims=c(-1,1)*1.08

ggplot(xy,aes(x=V1,y=V2))+
geom_segment(data=grid,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray85",size=.3)+
geom_voronoi_tile(aes(group=0,fill=color[as.factor(pop)],color=color[as.factor(pop)]),size=.07,max.radius=.055)+
geom_label_repel(data=centers,aes(x=V1,y=V2,label=Group.1,color=color,fill=color),max.overlaps=Inf,point.size=0,size=2.3,alpha=.8,label.r=unit(.1,"lines"),label.padding=unit(.1,"lines"),label.size=.1,box.padding=0,segment.size=.3)+
coord_fixed(xlim=lims,ylim=lims,expand=F)+
scale_fill_manual(values=pal1)+
scale_color_manual(values=pal2)+
theme(
axis.text=element_blank(),
axis.ticks=element_blank(),
axis.title=element_blank(),
legend.position="none",
panel.background=element_rect(fill="white")
)

ggsave(paste0(n,".png"),width=7,height=7)
}

polygon-euro.jpg


Population averages, dark color scheme:

Code:
library(tidyverse)
library(ggforce)
library(ggrepel)
library(magrittr) # for %<>% and set_rownames

for(k in 3:8){
t=read.table(paste0("hqhg19.",k,"a"),row.names=1)
rownames(t)%<>%sub("\\.(DG|SDG|SG|WGA)","",.)

t=t[,list(c(1,2,3),c(2,3,4,1),c(2,5,1,4,3),c(1,5,4,3,6,2),c(4,5,7,3,1,2,6),c(5,3,1,8,6,4,7,2))[[k-2]]]

corners=sapply(c(sin,cos),function(x)head(x(seq(0,2,length.out=k+1)*pi),-1))
corners=corners*min(2/diff(apply(corners,2,range)))
corners[,2]=corners[,2]-mean(range(corners[,2]))

xy=as.data.frame(as.matrix(t)%*%corners)
grid=as.data.frame(rbind(cbind(corners,rbind(corners[-1,],corners[1,])),cbind(corners,matrix(apply(corners,2,mean),ncol=2,nrow=k,byrow=T))))

joined=sapply(2:8,function(i)read.table(paste0("hqhg19.",i,"a"))[,-1])%>%do.call(cbind,.)%>%set_rownames(rownames(t))
dist=as.data.frame(as.matrix(dist(joined)))
seg=lapply(1:4,function(i)apply(dist,1,function(x)unlist(xy[names(sort(x))[i],],use.names=F))%>%t%>%cbind(xy))%>%do.call(rbind,.)%>%setNames(paste0("V",1:4))
xy$k=as.factor(cutree(hclust(dist(joined)),16))

set.seed(1488)
color=as.factor(sample(seq(length(unique(xy$k)))))
cl=rbind(c(50,90),c(100,80))
hues=max(ceiling(length(color)/nrow(cl)),8)
pal1=as.vector(apply(cl,1,function(x)hcl(head(seq(15,375,length=hues+1),-1),x[1],x[2])))

xy$V1=xy$V1+runif(nrow(xy))/1e3
xy$V2=xy$V2+runif(nrow(xy))/1e3

expand=c(.08,.02)

ggplot(xy,aes(x=V1,y=V2))+
geom_polygon(data=as.data.frame(corners),fill="gray40")+
geom_segment(data=grid,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray50",size=.5)+
geom_mark_hull(aes(group=k,color=k,fill=k),concavity=1000,radius=unit(.3,"cm"),expand=unit(.3,"cm"),alpha=.2,size=.15)+
geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray20",size=.3)+
geom_point(aes(color=k),size=.5)+
geom_text_repel(aes(label=rownames(xy),color=k),max.overlaps=Inf,force=3,force_pull=2,size=2.3,segment.size=.2,min.segment.length=.2,box.padding=.05)+
coord_fixed(xlim=(1+expand[1])*c(-1,1),ylim=(1+expand[2])*c(-1,1))+
scale_fill_manual(values=pal1)+
scale_color_manual(values=pal1)+
theme(
axis.text=element_blank(),
axis.ticks=element_blank(),
axis.title=element_blank(),
legend.position="none",
panel.background=element_rect(fill="gray30"),
panel.grid=element_blank(),
plot.background=element_rect(fill="gray30",color=NA,size=0),
plot.margin=margin(0,0,0,0)
)

ggsave(paste0("t/",k,".png"),width=7,height=7/(Reduce(`/`,1+expand)))
}


polygon-ancient.jpg
 
This joins the Q matrices of runs at different K values into a single wide matrix and uses it to cluster populations. The clustering tree is cut into 12 groups which are used to color-code the populations. (I also tried to do the clustering based on an FST matrix, but it made less sense, so I commented out the code for clustering based on FST.)

The branches of the clustering tree are reordered based on the pair of populations with the highest distance, which was Turkish_Balikesir and Nganasan in this case, so that branches closer to Turkish_Balikesir are moved to the top and branches closer to Nganasan are moved to the bottom.

Code:
library(tidyverse)
library(ggdendro)
library(vegan)
library(colorspace)
library(cowplot)
library(magrittr) # for set_rownames

t=read.table("https://pastebin.com/raw/FEwYnBNb",row.names=1) # population averages from ADMIXTURE with population names in first column

t=t[,c(3,2,5,1,4)] # reorder columns (change if your input does not have five columns)
names(t)=paste0("V",1:ncol(t))

# do clustering based on a combined matrix of admixture weights at different K values
joined=sapply(3:8,function(i)read.table(paste0("uralaltaic.",i,".ave"))[,-1])%>%do.call(cbind,.)%>%set_rownames(rownames(t))
hc=hclust(dist(joined))
hc=reorder(hc,-as.matrix(t)%*%seq(ncol(t))^2)
dist=as.matrix(dist(joined))
maxdist=which(dist==max(dist))[1]
hc=reorder(hc,dist[,maxdist%%nrow(dist)]-dist[,maxdist%/%nrow(dist)+1])

# fst=read.csv("https://pastebin.com/raw/ktMkDf24",row.names=1)[rownames(t),rownames(t)]
# fst[fst<0]=0
# maxfst=which(fst==max(fst))[1] # reorder branches based on distance to the pair of populations with the highest FST distance
# hc=reorder(hclust(as.dist(fst)),fst[,maxfst%%nrow(fst)]-fst[,maxfst%/%nrow(fst)+1])
# # hc=reorder(hclust(dist(t)),-as.matrix(t)%*%exp(seq(ncol(t)))) # reorder branches based on the order of the bars

k=as.factor(cutree(hc,12))[hc$labels[hc$order]]

tree=ggdendro::dendro_data(as.dendrogram(hc),type="triangle")

p1=ggplot(ggdendro::segment(tree))+
geom_segment(aes(x=y,y=x,xend=yend,yend=xend),size=.5,lineend="round",color="gray85")+ # `lineend="round"` draws corners properly when not using `type="triangle"`
scale_x_continuous(expand=expansion(mult=c(0,.01)))+ # the small expansion prevents cropping a few pixels from the right border of the tree
scale_y_continuous(limits=.5+c(0,nrow(t)),expand=c(0,0))+
theme(
axis.text=element_blank(),
axis.ticks=element_blank(),
axis.ticks.length=unit(0,"pt"), # remove extra space normally occupied by tick marks
axis.title=element_blank(),
panel.background=element_rect(fill="gray30"),
panel.grid=element_blank(),
plot.background=element_rect(fill="gray30",color=NA), # `color=NA` removes a thin white border around the plot
plot.margin=margin(5,5,5,0)
)

t=t[hc$labels[hc$order],]
t2=data.frame(V1=rownames(t)[row(t)],V2=colnames(t)[col(t)],V3=unname(do.call(c,t))) # an alternative to `pivot_longer` and `melt`
lab=round(100*t2$V3)
lab[lab<=1]=""

p2=ggplot(t2,aes(x=factor(V1,level=rownames(t)),y=V3,fill=V2))+
geom_bar(stat="identity",width=1,position=position_fill(reverse=T),size=.2,color="gray10")+
geom_text(aes(label=lab),position=position_stack(vjust=.5,reverse=T),size=3.5,color="gray10")+
coord_flip()+
scale_x_discrete(expand=c(0,0))+
scale_y_continuous(expand=c(0,0))+
scale_fill_manual(values=colorspace::hex(HSV(c(30,210,250,310,0),.4,.9)))+
theme(
axis.text=element_text(color=hex(HSV(seq(0,360,length.out=n_distinct(k)+1)%>%head(-1),.4,1))[k],size=11),
axis.text.x=element_blank(),
axis.ticks=element_blank(),
axis.title=element_blank(),
legend.position="none",
plot.background=element_rect(fill="gray30",color=NA),
panel.background=element_rect(fill="gray30"),
plot.margin=margin(5,0,5,5)
)


cowplot::plot_grid(p2,p1,rel_widths=c(1,.4))
ggsave("a.png",height=.27*nrow(t),width=7)

6acb3cl.jpg
 
David Bush told me about this Perl script which uses R to draw a circular bar chart of an ADMIXTURE run: https://github.com/QidiFeng/Ancestry...stryPainter.pl, https://www.sciencedirect.com/scienc...72022918304273.

I now figured out how to draw similar graphs using the Circlize package: https://jokergoo.github.io/circlize_book/book/.

Display two K values, do clustering based on a combined matrix of runs at different K values:

Code:
library(circlize)
library(vegan) # for reorder.hclust
library(dendextend) # for color_branches (may be masked by the package `seriation`)

f="uralaltaic.i"
kvals=c(3,5)
columnorder=list(c(1,2,3),c(1,4,5,3,2))

labelcolor=hcl(c(260,120,60,0,210,160,310,30)+15,60,70)
barcolor=list(hcl(c(310,120,210)+15,60,70),hcl(c(310,260,120,60,210)+15,60,70))

mats=sapply(kvals,function(x)read.table(paste0(f,".",x,"a"),r=1)[,columnorder[lapply(columnorder,length)==x][[1]]])

joined=do.call(cbind,sapply(Sys.glob(paste0(f,".[0-9]a")),function(x)read.table(x,r=1)))
hc=hclust(dist(joined))
hc=reorder(hc,mats[[1]][,3]-mats[[1]][,1])
rows=nrow(joined)

labels=hc$labels[hc$order]
cut=cutree(hc,8)
dend=color_branches(as.dendrogram(hc),k=length(unique(cut)),col=labelcolor[unique(cut[labels])])

circos.clear()
png("a.png",w=2500,h=2500,res=300)
circos.par(cell.padding=c(0,0,0,0))
circos.initialize(0,xlim=c(0,rows))

circos.track(ylim=c(0,1),track.height=.2,track.margin=c(0,0),bg.border=NA,
panel.fun=function(x,y)for(i in 1:rows)circos.text(i-.5,0,labels[i],adj=c(0,.5),facing="clockwise",niceFacing=T,cex=.65,col=labelcolor[cut[labels[i]]]))

for(i in length(mats):1)
circos.track(ylim=c(0,1),track.height=.2,track.margin=c(0,.01),bg.border=NA,
panel.fun=function(x,y)circos.barplot(as.matrix(mats[[i]])[hc$order,],-.5+1:rows,col=barcolor[[i]],border="gray20",bar_width=1,lwd=.3))

circos.track(ylim=c(0,attr(dend,"height")),track.height=.35,track.margin=c(0,0),bg.border=NA,
panel.fun=function(x,y)circos.dendrogram(dend))

circos.clear()
dev.off()

Display one K value, do clustering based on a single K value, show percentages inside bars:

Code:
library(circlize)
library(vegan) # for reorder.hclust (may be masked by the package `seriation`)
library(dendextend) # for color_branches

t=read.table(text="Kalmyk 0.119357 0.725057 0.000010 0.037803 0.117774
Kyrgyz_China 0.039367 0.512079 0.230150 0.095038 0.123366
Altaian_Chelkan 0.034095 0.000010 0.919478 0.000010 0.046407
Azeri 0.051638 0.004671 0.010727 0.902646 0.030318
Uzbek 0.102725 0.273261 0.001854 0.452126 0.170033
Salar 0.000010 0.539636 0.460334 0.000010 0.000010
Tatar_Kazan 0.113456 0.057026 0.000010 0.099336 0.730171
Tatar_Siberian 0.251376 0.221389 0.000010 0.077505 0.449721
Finnish 0.007214 0.000010 0.000010 0.015174 0.977592
Yakut 0.505434 0.473202 0.000010 0.002914 0.018440
Mansi 0.572791 0.000010 0.000010 0.000010 0.427179
Altaian 0.222424 0.335614 0.358801 0.032694 0.050468
Shor_Mountain 0.233984 0.000010 0.724596 0.000010 0.041400
Chuvash 0.180171 0.011056 0.000010 0.006462 0.802301
Enets 0.920409 0.000010 0.000010 0.000010 0.079561
Yukagir_Tundra 0.710359 0.289611 0.000010 0.000011 0.000010
Kyrgyz_Tajikistan 0.104000 0.563708 0.000010 0.125799 0.206483
Khakass_Kachin 0.254253 0.416760 0.174200 0.005262 0.149525
Tuvinian 0.448940 0.448899 0.000010 0.031803 0.070348
Besermyan 0.209841 0.001487 0.000010 0.000460 0.788202
Nogai_Astrakhan 0.062497 0.463590 0.000010 0.183203 0.290701
Hours 0.725173 0.257670 0.000010 0.005836 0.011312
Kazakh 0.067027 0.518213 0.087979 0.114550 0.212231
Tofalar 0.815599 0.110299 0.000010 0.009693 0.064398
Karakalpak 0.009983 0.316964 0.389103 0.158275 0.125676
Estonian 0.000010 0.000010 0.000010 0.004409 0.995561
Dolgan 0.694025 0.255361 0.000010 0.049624 0.000979
Tatar_Siberian_Zabolotniye 0.521637 0.020132 0.000010 0.000010 0.458212
Uyghur 0.043578 0.486742 0.000010 0.318983 0.150687
Udmurt 0.256391 0.000010 0.001010 0.000010 0.742579
Evenk_FarEast 0.241328 0.606202 0.000010 0.000010 0.152451
Selkup 0.804662 0.000010 0.000010 0.000010 0.195308
Kumyk 0.060751 0.000112 0.000010 0.823905 0.115222
Hungarian 0.000010 0.000010 0.000010 0.244311 0.755659
Tubalar 0.159517 0.009457 0.802778 0.000010 0.028238
Turkmen 0.123631 0.226543 0.000010 0.529793 0.120023
Karelian 0.012854 0.000010 0.000010 0.000010 0.987116
Kazakh_China 0.074285 0.573009 0.152931 0.069362 0.130412
Mongol 0.033174 0.847004 0.025135 0.005178 0.089509
Cycle 0.000010 0.995215 0.000010 0.000010 0.004755
Evenk_Transbaikal 0.611414 0.388556 0.000010 0.000010 0.000010
Nogai_Karachay_Cherkessia 0.119988 0.120774 0.000010 0.617261 0.141967
Veps 0.026887 0.000010 0.000010 0.000010 0.973083
Even 0.441349 0.278457 0.000010 0.015239 0.264946
Nganasan 0.999960 0.000010 0.000010 0.000010 0.000010
Bashkir 0.114088 0.056493 0.251488 0.030390 0.547542
Xibo 0.000010 0.985541 0.000010 0.000362 0.014077
Khakass 0.202707 0.171413 0.530905 0.007675 0.087300
Shor_Khakassia 0.258218 0.000010 0.694889 0.000010 0.046873
Nanai 0.105903 0.894067 0.000010 0.000010 0.000010
Buryat 0.064420 0.848458 0.017066 0.001696 0.068360
Yukagir_Forest 0.379416 0.096266 0.000010 0.003580 0.520728
Karachai 0.067138 0.004534 0.000010 0.798982 0.129336
Mordovian 0.022303 0.001193 0.000010 0.025251 0.951243
Turkish_Balikesir 0.092314 0.038550 0.000010 0.804964 0.064163
Turkish 0.040918 0.012255 0.000010 0.873179 0.073639
Kyrgyz_Kyrgyzstan 0.090129 0.607265 0.000010 0.122885 0.179711
Balkar 0.075115 0.000010 0.000010 0.829730 0.095136
Gagauz 0.000010 0.027887 0.015891 0.601619 0.354593
Nogai_Stavropol 0.070584 0.403817 0.000010 0.244701 0.280888
Negidal 0.248518 0.751452 0.000010 0.000010 0.000010
Tatar_Mishar 0.066112 0.037441 0.010377 0.138008 0.748062",row.names=1)

hc=hclust(dist(t))
hc=reorder(hc,-(t[,1]+t[,2]-t[,4]-2*t[,5]))

labelcolor=hcl(c(260,90,120,60,0,210,180,310)+15,60,70)
barcolor=hcl(c(310,260,120,60,210)+15,60,70)

labels=hc$labels[hc$order]
cut=cutree(hc,8)
dend=color_branches(as.dendrogram(hc),k=length(unique(cut)),col=labelcolor[unique(cut[labels])])
t=t[hc$labels[hc$order],]

circos.clear()
png("a.png",w=2500,h=2500,res=300)
circos.par(cell.padding=c(0,0,0,0),gap.degree=5,points.overflow.warning=F)
circos.initialize("a",xlim=c(0,nrow(t)))

circos.track(ylim=c(0,1),bg.border=NA,track.height=.28,track.margin=c(.01,0),
panel.fun=function(x,y)for(i in 1:nrow(t))circos.text(i-.5,0,labels[i],adj=c(0,.5),facing="clockwise",niceFacing=T,cex=.65,col=labelcolor[cut[labels[i]]]))

circos.track(ylim=c(0,1),track.margin=c(0,0),track.height=.35,bg.lty=0,panel.fun=function(x,y){
mat=as.matrix(t)
pos=1:nrow(mat)-.5
barwidth=1
for(i in 1:ncol(mat)){
seq1=rowSums(mat[,seq(i-1),drop=F])
seq2=rowSums(mat[,seq(i),drop=F])
circos.rect(pos-barwidth/2,if(i==1){0}else{seq1},pos+barwidth/2,seq2,col=barcolor[i],border="gray20",lwd=.1)
}
for(i in 1:ncol(mat)){
lab=round(100*t[,i])
lab[lab<=1]=""
seq1=rowSums(mat[,seq(i-1),drop=F])
seq2=rowSums(mat[,seq(i),drop=F])
circos.text(pos,if(i==1){seq1/2}else{seq1+(seq2-seq1)/2},labels=lab,col="gray10",cex=.4,facing="downward")
}
})

circos.track(ylim=c(0,attr(dend,"height")),bg.border=NA,track.margin=c(0,.0015),track.height=.35,panel.fun=function(x,y)circos.dendrogram(dend))

circos.clear()
dev.off()

circlize-admixture.jpg
 
The script below displays the FST distance to one population on the x axis and to another population on the y axis. The clustering is based on the full FST matrix. Each population is also connected by line segments to the two populations to which its FST distance is lowest.
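The segment-building step can be illustrated outside of R. Below is a minimal Python sketch (with a hypothetical toy distance matrix and made-up population names) of finding, for each population, the two populations at the smallest distance, which are the endpoints of the connecting segments:

```python
import numpy as np

# Toy FST-style distance matrix: symmetric, zero diagonal (hypothetical values).
pops = ["A", "B", "C", "D"]
d = np.array([
    [0.00, 0.01, 0.05, 0.09],
    [0.01, 0.00, 0.04, 0.08],
    [0.05, 0.04, 0.00, 0.03],
    [0.09, 0.08, 0.03, 0.00],
])

# For each population, take the two closest other populations.
segments = {}
for i, p in enumerate(pops):
    order = np.argsort(d[i])
    segments[p] = [pops[j] for j in order if j != i][:2]

print(segments)
# {'A': ['B', 'C'], 'B': ['A', 'C'], 'C': ['D', 'B'], 'D': ['C', 'B']}
```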

Code:
library(tidyverse)
library(ggforce) # for geom_mark_hull

t=read.csv("https://pastebin.com/raw/Ja3qPUq3",row.names=1,check.names=F)/1e6

tav=function(x,y)data.frame(aggregate(x,list(y),mean),row.names=1,check.names=F)
pops=sub("\\.(DG|SG|SDG|WGA)","",rownames(t))
t=tav(t(tav(t,pops)),pops)

igno=rownames(t)%in%c("Naxi","Chukchi","Samaritan","Itelmen","AA","Pima","Chukchi1","Atayal")
t=t[!igno,!igno]

xpop="Estonian"
ypop="Han"
xy=t[,c(xpop,ypop)]

seg=lapply(1:3,function(i)apply(t,1,function(x)unlist(xy[names(sort(x)[i]),],use.names=F))%>%t%>%cbind(xy))%>%do.call(rbind,.)%>%setNames(paste0("V",1:4))

expand=.004
lims=c(-expand,max(apply(xy,2,range))+expand)

names(xy)=c("x","y")
xy$k=as.factor(cutree(hclust(as.dist(t)),16))

lab=sub("^0","",sprintf("%.2f",seq(0,1,.01)))
lab[1]="0"

ggplot(xy,aes(x,y))+
geom_abline(linetype="dashed",color="gray80",size=.3)+
geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray20",size=.1)+
geom_mark_hull(aes(color=k,fill=k),concavity=1000,radius=unit(.15,"cm"),expand=unit(.15,"cm"),alpha=.2,size=.15)+
# geom_polygon(data=xy%>%group_by(k)%>%slice(chull(x,y)),alpha=.2,aes(color=k,fill=k),size=.3)+ # use convex hulls without rounded corners
geom_point(aes(color=k),size=.5)+
geom_text(aes(color=k),label=rownames(xy),size=2,vjust=-.7)+
# geom_text_repel(aes(color=k),label=rownames(xy),size=2,max.overlaps=Inf,force=2,segment.size=.2,min.segment.length=.2,box.padding=.05)+
coord_cartesian(xlim=lims,ylim=lims,expand=F)+
scale_x_continuous(breaks=seq(0,1,.01),labels=lab)+
scale_y_continuous(breaks=seq(0,1,.01),labels=lab)+
labs(x=paste("FST distance to",xpop),y=paste("FST distance to",ypop))+
theme(
axis.text=element_text(size=6),
axis.text.y=element_text(angle=90,vjust=1,hjust=.5),
axis.ticks=element_blank(),
axis.ticks.length=unit(0,"cm"),
axis.title=element_text(size=8),
legend.position="none",
panel.background=element_rect(fill="white"),
panel.border=element_rect(color="gray80",fill=NA,size=.6),
panel.grid.major=element_line(color="gray90",size=.2)
)

ggsave("a.png",width=7,height=7)

It is also possible to use the FST matrix to do MDS (which is basically a version of PCA that takes a distance matrix as an input):
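To see what `cmdscale` is doing, classical MDS can be sketched from scratch: square the pairwise distances, double-center the matrix, and eigendecompose. A toy Python example with three collinear points (hypothetical distances), where the recovered coordinates reproduce the input exactly:

```python
import numpy as np

# Classical MDS: recover coordinates (up to rotation/reflection) whose
# Euclidean distances match a given distance matrix.
D = np.array([[0., 1., 3.],
              [1., 0., 2.],
              [3., 2., 0.]])

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
B = -0.5 * J @ (D ** 2) @ J           # Gram matrix of the centered points
eigval, eigvec = np.linalg.eigh(B)    # eigenvalues in ascending order
order = np.argsort(eigval)[::-1]      # largest first
eigval, eigvec = eigval[order], eigvec[:, order]
X = eigvec[:, :1] * np.sqrt(np.maximum(eigval[:1], 0))  # 1-D is enough here

# Pairwise distances of the recovered coordinates match the input.
recovered = np.abs(X[:, 0][:, None] - X[:, 0][None, :])
print(np.allclose(recovered, D))  # True
```

The proportion of each eigenvalue in the eigenvalue sum is what the axis labels in the script report as the percentage per dimension.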

Code:
library(tidyverse)
library(ggforce) # for geom_mark_hull
library(colorspace) # for hex

t=read.csv("https://pastebin.com/raw/Ja3qPUq3",row.names=1,check.names=F)/1e6

igno=rownames(t)%in%c("Naxi.DG","Chukchi.DG","Samaritan.DG","AA","Pima","Chukchi1","Atayal");t=t[!igno,!igno]

tav=function(x,y)data.frame(aggregate(x,list(y),mean),row.names=1,check.names=F)
rn2=sub("\\.(DG|SG|SDG|WGA)","",rownames(t));t=tav(t(tav(t,rn2)),rn2)

ncol=20
mds=cmdscale(as.dist(t),k=ncol,eig=T)
p=as.data.frame(mds$points)
eig=head(mds$eig,ncol)
pct=paste0("MDS dimension ",1:ncol," (",sprintf("%.1f",100*eig/sum(eig)),"%)")

ranges=apply(p,2,function(x)abs(max(x)-min(x)))
nk=24
p$k=as.factor(cutree(hclust(as.dist(t)),nk))

pal1=hex(HSV(seq(0,360,length.out=nk+1)%>%head(-1),.55,1))
pal2=hex(HSV(seq(0,360,length.out=nk+1)%>%head(-1),.25,1))

for(i in seq(1,7,2)){
seg=lapply(1:4,function(j)apply(t,1,function(x)unlist(p[names(sort(x)[j]),c(i,i+1)],use.names=F))%>%t%>%cbind(p[,c(i,i+1)]))%>%do.call(rbind,.)%>%setNames(paste0("V",1:4))

p1=sym(paste0("V",i))
p2=sym(paste0("V",i+1))

maxrange=max(ranges[c(i,i+1)])
lims=sapply(c(i,i+1),function(x)mean(range(p[,x]))+c(-.52*maxrange,.52*maxrange))

ggplot(p,aes(!!p1,!!p2))+
geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray80",size=.15)+
geom_mark_hull(aes(group=k),color=pal1[p$k],fill=pal1[p$k],concavity=100,radius=unit(.15,"cm"),expand=unit(.15,"cm"),alpha=.2,size=.15)+
geom_point(aes(x=!!p1,y=!!p2),color=pal1[p$k],size=.3)+
geom_text(aes(x=!!p1,y=!!p2),label=rownames(p),color=pal2[p$k],size=2,vjust=-.7)+
labs(x=pct[i],y=pct[i+1])+
scale_x_continuous(breaks=seq(-1,1,.02))+
scale_y_continuous(breaks=seq(-1,1,.02))+
theme(
aspect.ratio=1,
axis.text.y=element_text(angle=90,vjust=1,hjust=.5),
axis.text=element_text(color="black",size=6),
axis.ticks.length=unit(0,"pt"),
axis.ticks=element_blank(),
axis.title=element_text(color="black",size=8),
legend.position="none",
panel.background=element_rect(fill="gray40"),
panel.border=element_rect(color="gray30",fill=NA,size=.4),
panel.grid.major=element_line(color="gray30",size=.2),
panel.grid.minor=element_blank(),
plot.background=element_rect(fill="gray40",color=NA)
)

ggsave(paste0(i,".png"),width=7,height=7)
}


fst-mds-xy.jpg


The commands for calculating FST I posted earlier listed population names in the wrong order, but here's a fixed version:

Code:
x=eurasiafst
igno()(grep -Ev '\.REF|rel\.|fail\.|Ignore_|_dup|_contam|_lc|_father|_mother|_son|_daughter|_brother|_sister|_sibling|_twin|Neanderthal|Denisova|Vindija_light|Gorilla|Macaque|Marmoset|Orangutan|Primate_Chimp|hg19ref')
igno<v44.3_HO_public.ind|grep -Ev '\.SG|\.WGA|_o'|awk '{print$1}'|awk -F\\t 'NR==FNR{a[$0];next}FNR>1&&$2 in a' - v44.3_HO_public.anno|sort -t$'\t' -rnk15|awk -F\\t '(!a[$3]++)&&$6==0&&$15>5e5&&$11>20{print$2,$8}'>$x.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x
awk '!a[$2]++{i++}{print$1,i}' <(sort -k2 $x.pick)|awk 'NR==FNR{a[$1]=$2;next}{$6=a[$2]+9}1' - $x.fam>$x.fam.temp;mv $x.fam{.temp,}
plink --allow-no-sex --bfile $x --indep-pairwise 50 10 .1 --out $x;plink --allow-no-sex --bfile $x --extract $x.prune.in --make-bed --out $x.p
smartpca -p <(printf %s\\n genotypename:\ $x.p.bed snpname:\ $x.p.bim indivname:\ $x.p.fam fstonly:\ YES fsthiprecision:\ YES maxpops:\ 10000)|tee $x.smartpca
p=$(awk 'NR==FNR{a[$1]=$2;next}{print a[$2]}' $x.{pick,fam}|awk '!a[$0]++')
sed -n '/fst \*1000000/,/^$/p' $x.smartpca|sed 1,2d|sed \$d|tr -s \ ,|cut -d, -f3-|paste -d, <(printf %s\\n "$p") -|cat <(printf %s\\n '' "$p"|paste -sd, -) ->$x.fst


The undocumented parameter `fsthiprecision: YES` causes the FST values printed to STDOUT to be multiplied by a million instead of a thousand. It does not affect the contents of the file specified with the `phylipoutname` parameter.


When SmartPCA is used to calculate FST distances, the sixth field of the fam file needs to contain population numbers: https://www.biostars.org/p/266511/. The commands above use integers starting from 10 as the population numbers, because the numbers 1, 2, and 9 have a special meaning (1 assigns the sample as a case, 2 assigns it as a control, and 9 ignores it).
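The renumbering logic (what the awk one-liner's `a[$2]+9` achieves) can be sketched in Python with hypothetical population labels:

```python
# Assign each population an integer code starting at 10, so the codes never
# collide with PLINK's special phenotype values 1 (case), 2 (control)
# and 9 (ignore). Labels below are made up.
pops = ["Yoruba", "Han", "Yoruba", "Mansi"]

codes = {}
for p in pops:
    codes.setdefault(p, len(codes) + 10)

print([codes[p] for p in pops])  # [10, 11, 10, 12]
```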
 
The set of commands below selects some ancient samples from the 1240K+HO dataset with over 300,000 SNPs. It then runs ADMIXTURE at K=2 and K=3, and it shows the percentage of the East Eurasian component in both runs.

Code:
printf %s\\n Armenia_EBA Azerbaijan_Caucasus_lowlands_LN Belgium_UP_GoyetQ116_1_published_all Canada_BigBar_5700BP.SG China_HMMH_MN China_NEastAsia_Coastal_EN China_NEastAsia_Inland_EN China_SEastAsia_Island_EN China_Shimao_LN China_Tianyuan China_Upper_YR_LN China_WLR_MN China_YR_LN China_YR_MN Croatia_N_Starcevo Czech_Baalberge Czech_Vestonice16 England_Mesolithic_all.SG England_N_all.SG Georgia_Kotias.SG Georgia_Satsurblia.SG Germany_EN_LBK Hungary_EN_HG_Koros Iran_BA1_ShahrISokhta Iran_C_TepeHissar Iran_ShahrISokhta_BA2_published Ireland_N.SG Italy_Mesolithic.SG Italy_North_Villabruna_HG Japan_HG_Jomon Kazakhstan_Botai.SG Kazakhstan_Botai_Eneolithic.SG Kazakhstan_Central_Steppe_EMBA.SG Kazakhstan_Dali_EBA Kazakhstan_Mereke_MBA_o2 Latvia_HG.SG Latvia_MN Latvia_MN_o2 Lithuania_EMN_Narva Luxembourg_Loschbour_published.DG Mongolia_Chalcolithic_Afanasievo_1 Mongolia_EBA_Chemurchek_2 Mongolia_EBA_Ulgii_1 Mongolia_East_N Mongolia_North_N Norway_Mesolithic.SG Norway_N_HG.SG Poland_BKG.SG Poland_TRB.SG Romania_IronGates_Mesolithic Russia_Afanasievo Russia_Afanasievo_published Russia_BA_Okunevo.SG Russia_DevilsCave_N.SG Russia_HG_Karelia Russia_HG_Sosnoviy Russia_HG_Tyumen Russia_Kolyma_M.SG Russia_Kostenki14 Russia_Lokomotiv_Eneolithic.SG Russia_MA1_HG.SG Russia_MBA_Poltavka Russia_MN_Boisman Russia_Potapovka_o1 Russia_Samara_EBA_Yamnaya Russia_Shamanka_Eneolithic.SG Russia_Siberia_Irkutsk_EBA Russia_Siberia_Lena_EBA Russia_Siberia_Lena_EN Russia_Siberia_Tenisei_EBA Russia_Siberia_UP Russia_Sidelkino_HG.SG Russia_Steppe_Maikop Russia_Sunghir1.SG Russia_Sunghir2.SG Russia_Sunghir3.SG Russia_Sunghir4.SG Russia_UstBelaya_Angara Russia_UstBelaya_Angara_published Russia_UstIda_LN.SG Russia_Ust_Ishim_HG_published.DG Russia_Yana_UP.SG Serbia_IronGates_Mesolithic Spain_C Spain_ElMiron Spain_HG.SG Spain_HG_published.SG Sweden_Ansarve_Megalithic.SG Sweden_Mesolithic.SG Sweden_Motala_HG.SG Switzerland_Bichon.SG Turkey_N_published Turkmenistan_C_Geoksyur Turkmenistan_EBA_Parkhai 
Turkmenistan_Gonur_BA_1 Turkmenistan_MBA_Parkhai USA_Ancient_Beringian.SG USA_Anzick_realigned.SG Ukraine_Mesolithic Ukraine_N Uzbekistan_Bustan_Eneolithic>pop
x=tophog8;awk -F\\t 'NR==FNR{a[$0];next}$8 in a&&$15>3e5&&!a[$3]++{print$2,$8}' pop v44.3_HO_public.anno>$x.pick
alias pli='plink --allow-no-sex'
pli --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x
pli --bfile $x --indep-pairwise 50 10 .05 --out $x;pli --bfile $x --extract $x.prune.in --make-bed --out $x.p
tav()(awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1,i]+=$i}}END{for(i in n){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i,j]/n[i]);print o}}' "FS=${1-$'\t'}")
for k in 2 3;do admixture -j4 -C .1 $x.p.bed $k;awk 'NR==FNR{a[$1]=$2;next}{print$2,a[$2]}' $x.{pick,fam}|paste -d' ' - $x.p.$k.Q>$x.$k;cut -d' ' -f2- $x.$k|tav \ >$x.${k}a;done
paste -d' ' $x.[23]a|awk '{printf"%f %s\n",$3+$5,$0}'|sort -n|awk '{printf"%.0f %.0f %s\n",100*$4,100*$6,$2}'

Output:

0 0 England_Mesolithic_all.SG
0 0 Hungary_EN_HG_Koros
0 0 Italy_North_Villabruna_HG
0 0 Latvia_HG.SG
0 0 Latvia_MN
0 0 Lithuania_EMN_Narva
0 0 Luxembourg_Loschbour_published.DG
0 0 Romania_IronGates_Mesolithic
0 0 Switzerland_Bichon.SG
0 0 Serbia_IronGates_Mesolithic
0 0 Italy_Mesolithic.SG
0 0 England_N_all.SG
0 0 Spain_C
1 0 Sweden_Ansarve_Megalithic.SG
1 0 Czech_Baalberge
1 0 Poland_TRB.SG
2 0 Germany_EN_LBK
1 2 Ukraine_N
3 0 Ireland_N.SG
0 3 Spain_HG_published.SG
0 3 Sweden_Motala_HG.SG
3 0 Turkey_N_published
3 0 Poland_BKG.SG
3 0 Croatia_N_Starcevo
1 3 Sweden_Mesolithic.SG
3 6 Spain_HG.SG
3 6 Norway_Mesolithic.SG
5 5 Ukraine_Mesolithic
4 6 Norway_N_HG.SG
13 0 Armenia_EBA
13 0 Azerbaijan_Caucasus_lowlands_LN
7 7 Latvia_MN_o2
14 3 Russia_Afanasievo_published
13 4 Russia_Samara_EBA_Yamnaya
14 4 Mongolia_Chalcolithic_Afanasievo_1
14 4 Russia_Afanasievo
9 11 Spain_ElMiron
19 1 Georgia_Satsurblia.SG
15 7 Russia_MBA_Poltavka
19 2 Iran_C_TepeHissar
19 3 Georgia_Kotias.SG
20 3 Turkmenistan_MBA_Parkhai
17 7 Kazakhstan_Mereke_MBA_o2
21 4 Uzbekistan_Bustan_Eneolithic
21 6 Turkmenistan_EBA_Parkhai
22 7 Turkmenistan_C_Geoksyur
23 7 Turkmenistan_Gonur_BA_1
16 15 Russia_HG_Karelia
24 8 Iran_BA1_ShahrISokhta
16 16 Russia_Sidelkino_HG.SG
19 16 Czech_Vestonice16
24 20 Russia_Sunghir1.SG
25 20 Russia_Sunghir2.SG
26 19 Russia_Steppe_Maikop
27 19 Russia_Potapovka_o1
26 21 Russia_Kostenki14
26 22 Russia_Sunghir4.SG
27 23 Russia_Sunghir3.SG
26 25 Belgium_UP_GoyetQ116_1_published_all
34 26 Mongolia_EBA_Chemurchek_2
33 28 Russia_MA1_HG.SG
35 31 Russia_HG_Sosnoviy
38 31 Kazakhstan_Dali_EBA
43 31 Iran_ShahrISokhta_BA2_published
41 37 Russia_HG_Tyumen
44 41 Kazakhstan_Botai.SG
49 45 Russia_Yana_UP.SG
50 45 Russia_Ust_Ishim_HG_published.DG
49 46 Kazakhstan_Botai_Eneolithic.SG
62 59 Russia_BA_Okunevo.SG
63 60 Kazakhstan_Central_Steppe_EMBA.SG
71 68 China_Tianyuan
77 75 Russia_Siberia_Tenisei_EBA
82 81 USA_Ancient_Beringian.SG
83 82 USA_Anzick_realigned.SG
84 82 Canada_BigBar_5700BP.SG
86 84 Japan_HG_Jomon
88 86 China_SEastAsia_Island_EN
89 88 Mongolia_EBA_Ulgii_1
90 90 Russia_Kolyma_M.SG
97 96 Russia_Siberia_Lena_EN
99 98 Russia_UstIda_LN.SG
99 99 Russia_Siberia_Lena_EBA
100 99 Russia_UstBelaya_Angara
100 100 China_HMMH_MN
100 100 China_NEastAsia_Coastal_EN
100 100 China_NEastAsia_Inland_EN
100 100 China_Shimao_LN
100 100 China_Upper_YR_LN
100 100 China_WLR_MN
100 100 China_YR_LN
100 100 China_YR_MN
100 100 Mongolia_East_N
100 100 Mongolia_North_N
100 100 Russia_DevilsCave_N.SG
100 100 Russia_Lokomotiv_Eneolithic.SG
100 100 Russia_MN_Boisman
100 100 Russia_Shamanka_Eneolithic.SG
100 100 Russia_Siberia_Irkutsk_EBA
100 100 Russia_UstBelaya_Angara_published
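The `tav` helper used above averages the numeric columns of all rows that share the same first field (the population label). The same grouping can be sketched in Python on a hypothetical toy Q matrix:

```python
# Group rows by population label and average the remaining columns,
# as the awk `tav` function does. Labels and proportions are made up.
rows = [
    ("Han", 0.0, 1.0),
    ("Han", 0.2, 0.8),
    ("Yoruba", 1.0, 0.0),
]

sums, counts = {}, {}
for label, *vals in rows:
    acc = sums.setdefault(label, [0.0] * len(vals))
    for i, v in enumerate(vals):
        acc[i] += v
    counts[label] = counts.get(label, 0) + 1

means = {k: [s / counts[k] for s in v] for k, v in sums.items()}
print(means)  # {'Han': [0.1, 0.9], 'Yoruba': [1.0, 0.0]}
```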


This combines the Q matrices of ADMIXTURE runs at different K values into a single wide matrix. It then uses michal3141's convex optimization script to model a selected population using other populations (https://github.com/michal3141/g25):

Code:
$ tav()(awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1,i]+=$i}}END{for(i in n){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i,j]/n[i]);print o}}' "FS=${1-$'\t'}")
$ x=world4;paste -d' ' <(awk 'NR==FNR{a[$1]=$2;next}{print a[$2]}' $x.{pick,fam}) $x.p.*.Q|tav ' '|tr \ ,|awk -F, 'NR==1{for(i=1;i<NF;i++)printf",";print""}1'>/tmp/a
$ curl -sO https://github.com/michal3141/g25/raw/main/mixture.py
$ pip3 install cvxpy
[...]
$ python3 mixture.py <(grep -v Mansi /tmp/a) <(head -n1 /tmp/a;grep Mansi, /tmp/a)

-------------- ANCESTRY BREAKDOWN: -------------
Enets ---> 44.464%
Saami.DG ---> 41.558%
Tatar_Siberian_Zabolotniye ---> 9.567%
Koryak ---> 2.165%
Eskimo_Sireniki.DG ---> 1.691%
Relli_2.DG ---> 0.282%
Kalash.SDG ---> 0.233%
Yadava.DG ---> 0.040%
------------------------------------------------
Fit error: 0.042235320141667455
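The fit that mixture.py performs is a constrained least-squares problem: non-negative source weights that sum to 1. The sketch below is not michal3141's implementation, just a rough illustration of the same idea using projected gradient descent on synthetic data:

```python
import numpy as np

# Find w >= 0 with sum(w) = 1 minimizing ||A w - b||, where columns of A
# are source populations and b is the target. Data here is synthetic.

def project_simplex(v):
    # Euclidean projection of v onto {w : w >= 0, sum(w) = 1}.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0)

rng = np.random.default_rng(0)
A = rng.random((50, 3))                  # 3 hypothetical source populations
w_true = np.array([0.6, 0.3, 0.1])
b = A @ w_true                           # target is an exact mixture

w = np.full(3, 1 / 3)                    # start from uniform weights
step = 1 / np.linalg.norm(A.T @ A, 2)    # 1/L for this quadratic objective
for _ in range(2000):
    w = project_simplex(w - step * A.T @ (A @ w - b))

print(np.round(w, 3))  # close to [0.6, 0.3, 0.1]
```

mixture.py solves the same kind of problem with cvxpy instead of a hand-rolled loop; the "Fit error" it reports is the residual norm of the best mixture found.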



End of copying threads. I will have to go back and format the code. Since I am so incompetent it took me 1 hr just to copy these posts, and I still couldn't figure out how to format the code like this:

xKwLgaF.png


Maybe Jovialis can guide me how to format the code that way?

Huge thanks to Nganasankhan for allowing me to share his code on Eupedia. I can only imagine how much time and skill he put into this.
 
I need help with running outgroup f3 statistics. Could you please guide me through that?
 
I need help with running outgroup f3 statistics. Could you please guide me through that?
I have not played with R and f-stats in a while. I think right now Jovialis or Eupator are the best people to ask.
 
Code:
> mypops = c('Mbuti.DG', 'Armenian.DG', 'Georgian.DG', 'Cretan.DG', 'Colchis_Hellenistic')
> extract_f2(prefix, my_f2_dir, pops = mypops, overwrite = TRUE, maxmiss = 1)
ℹ Reading allele frequencies from packedancestrymap files...
ℹ dosasv54merged.geno has 16657 samples and 1233013 SNPs
ℹ Calculating allele frequencies from 22 samples in 6 populations
ℹ Expected size of allele frequency data: 197 MB
1233k SNPs read...
✔ 1233013 SNPs read in total
! 1150639 SNPs remain after filtering. 973726 are polymorphic.
ℹ Allele frequency matrix for 1150639 SNPs and 6 populations is 138 MB
ℹ Computing pairwise f2 for all SNPs and population pairs requires 1657 MB RAM without splitting
ℹ Computing without splitting since 1657 < 8000 (maxmem)...
ℹ Data written to C:\Users\*****\Documents\my_f2_dir_dosas/
> f2_blocks = f2_from_precomp(my_f2_dir, pops = mypops, afprod = TRUE)
ℹ Reading precomputed data for 6 populations...
ℹ Reading ap data for pair 21 out of 21...
Warning message:
In read_f2(dir, pops, pops2, type = type, remove_na = remove_na,  :
 Discarding 1 block(s) due to missing values!
Discarded block(s): 538
> 
> pop1 = c('Mbuti.DG')
> pop2 = c('Cretan.DG')
> pop3 = c('Armenian.DG', 'Georgian.DG','Colchis_Hellenistic')
> qp3pop(f2_blocks, pop1, pop2, pop3)
# A tibble: 3 × 7
 pop1     pop2      pop3                  est       se     z     p
 <chr>    <chr>     <chr>               <dbl>    <dbl> <dbl> <dbl>
1 Mbuti.DG Cretan.DG Armenian.DG         0.184 0.000527  349.     0
2 Mbuti.DG Cretan.DG Colchis_Hellenistic 0.184 0.000573  321.     0
3 Mbuti.DG Cretan.DG Georgian.DG         0.184 0.000532  345.     0
 
Thanks man, appreciated. Can you please tell me how to turn this into a plot? I want to see which populations share the most drift with a certain population, like this:
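As a rough starting point, the qp3pop estimates can be ranked before plotting. A minimal Python sketch using the three values from the output above (any bar-chart tool can then draw `ranked` with the standard errors as error bars, highest drift on top):

```python
# Rank populations by shared drift (the f3 estimate). The values are the
# est/se columns from the qp3pop output above.
results = [
    ("Armenian.DG", 0.184, 0.000527),
    ("Colchis_Hellenistic", 0.184, 0.000573),
    ("Georgian.DG", 0.184, 0.000532),
]

ranked = sorted(results, key=lambda r: r[1], reverse=True)
for pop, est, se in ranked:
    print(f"{pop:25s} f3 = {est:.3f} +/- {se:.6f}")
```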
 

Attachments

  • Outgroup-f3-statistics-showing-shared-genetic-drift-between-Late-Antiquity-Syrians-and.png
