Eupedia Forums
Site NavigationEupedia Top > Eupedia Forum & Japan Forum
Results 1 to 14 of 14

Thread: R scripts for ADMIXTOOLS 2 [Nganasankhan from Anthrogenica]

  1. #1
    Regular Member Archetype0ne's Avatar
    Join Date
    11-06-18
    Posts
    715

    Y-DNA haplogroup
    J2b2-L283/FT29003

    Country: Albania



    R scripts for ADMIXTOOLS 2 [Nganasankhan from Anthrogenica]

    I saw a very interesting post by Nganasankhan on Anthrogenica. I asked him wether he would like to post this on Eupedia, or to get his permission to post it with credit in case he is not interested. He said he is already getting bored with genetics and had no interest joining another forum. But here are some of his posts. If any of you guys can run scripts/code, his published codes could be of great use for other populations/examples. The following post is straight copied from his thread with his permission.




    The scripts in this post use the 1240K+HO dataset, which can be downloaded from here: https://reich.hms.harvard.edu/allen-...cient-dna-data.

    This installs ADMIXTOOLS 2 (https://github.com/uqrmaie1/admixtools):

    Code:
    install.packages("devtools")
    devtools::install_github("uqrmaie1/admixtools")
    library("admixtools")

    When I tried to install `devtools`, it failed at first because there was an error in installing the dependency `gert` of its dependency `usethis`. By running `install.packages("gert")`, I saw that I needed to run `brew install libgit2`. Even then, `admixtools` failed to install because there was an error in installing its dependency `igraph`. By googling for the error, I saw that I needed to run `brew uninstall --ignore-dependencies suite-sparse`.


    f3 as percentages with min-max scaling

    This shows the amount of shared drift with Han when using Yoruba as an outgroup. The values are min-max scaled so that they are displayed on a scale where Villabruna is 0 and Nganasan is 100.

    Code:
    $ Printf% s \\ n Bashkir Besermyan Chuvash Enets Estonia_N_CombCeramic.SG Estonian.DG Finnish Finnish.DG Hungarian.DG Italy_North_Villabruna_HG Karelian Mansi Mari.SG Mordovian Nganasan Norway_Viking_o1.SG Russia_Bolshoy Russia_Chalmny_Varre Russia_Chalmny_Varre Russia_EHG Russia_MLBA_Krasnoyarsk_o Russia_Shamanka_Eneolithic.SG Saami.DG Selkup Tatar_Mishar Tatar_Siberian_Zabolotniye Todzin Tofalar Udmurt> pops
    $ R -e 'library(tidyverse);library(admixtools);f3=f3("pat h/to/v44.3_HO_public","Yoruba","Han",readLines("pops")) %>%arrange(est);est=f3$est-min(f3$est);est=est/max(est);cat(paste(round(100*est),f3$pop3),sep="\n ")'
    [...]
    0 Italy_North_Villabruna_HG
    2 Hungarian.DG
    6 Estonian.DG
    12 Finnish.DG
    12 Mordovian
    13 Finnish
    13 Russia_EHG
    14 Karelian
    20 Tatar_Mishar
    26 Besermyan
    27 Chuvash
    28 Mari.SG
    29 Russia_Chalmny_Varre
    30 Udmurt
    33 Saami.DG
    37 Russia_MLBA_Krasnoyarsk_o
    38 Bashkir
    40 Norway_Viking_o1.SG
    54 Russia_Bolshoy
    55 Tatar_Siberian_Zabolotniye
    57 Mansi
    70 Selkup
    80 Enets
    90 Tofalar
    94 hours
    99 Russia_Shamanka_Eneolithic.SG
    100 Nganasan

    If you get a low number of SNPs that remain after filtering, try to remove samples with a low SNP count. For example in a run before the run shown above, my SNP count was 12245. I ran a command like this to find samples with a low SNP count:

    Code:
    $ awk -F\\t 'NR==FNR{a[$0];next}$8 in a{print$15,$8,$2}' pops path/to/v44.3_HO_public.anno|sort -n|head -n8
    30001 Estonia_N_CombCeramic.SG MA974.SG
    58972 Estonia_N_CombCeramic.SG MA975.SG
    201719 Russia_Chalmny_Varre CHV002.A0101
    205900 Norway_Viking_o1.SG VK518.SG
    218757 Russia_Shamanka_Eneolithic.SG DA251.SG
    287462 Russia_Bolshoy BOO002.A0101
    289954 Russia_EHG UzOO77
    300800 Russia_Shamanka_Eneolithic.SG DA250.SG


    After I removed Estonia_N_CombCeramic.SG, my SNP count increased to 73054.


    f3 as a heatmap

    This applies min-max scaling to f3 values and displays the scaled values as percentages:

    Code:
    library(tidyverse)
    library(admixtools)
    library(pheatmap)
    library(colorspace)

    minmax=function(x)(x-min(x,na.rm=T))/(max(x,na.rm=T)-min(x,na.rm=T))

    p1="Yoruba.DG"
    p2=c("Nganasan","Russia_HG_Karelia","Norway_N_HG.S G","Turkey_N","Russia_HG_Tyumen","Han","Estonia_Co rdedWare","Russia_Bolshoy","Luxembourg_Loschbour", "Russia_Samara_EBA_Yamnaya","Georgia_Kotias.SG ")
    p3=c("Nganasan","Estonian.DG","Selkup","Hungarian. DG","Karelian","Mansi","Udmurt","Mordovian","Saami .DG","Besermyan","Finnish","Finland_Levanluhta","T ofalar","Russia_MLBA_Krasnoyarsk_o","Russia_Bolsho y","Enets","Russia_Chalmny_Varre","Norwegian")

    f3=f3("path/to/v44.3_HO_public",p1,p2,p3)

    t=f3%>%select(pop2,pop3,est)%>%pivot_wider(names_f rom=pop2,values_from=est)%>%data.frame(row.names=1 )

    # don't show the f3 values of populations with themselves in order to not stretch the color scale
    makena=names(which(sapply(rownames(t),function(x)a ny(names(t)==x))))
    lapply(makena,function(x)t[x,x]<<-NA)

    range=apply(t,2,function(x)paste(sub("^0","",sprin tf("%.3f",range(x,na.rm=T))),collapse="-"))
    names(t)=paste0(names(t)," (",range,")")

    t=minmax(t) # use the same scale for the whole table
    # t=apply(t,2,function(x)minmax(x)) # scale each column individually

    pheatmap (
    t*100,filename="a.png",
    # cluster_cols=F,cluster_rows=F,
    treeheight_row=70,treeheight_col=70,
    legend=F,cellwidth=16,cellheight=16,fontsize=8,bor der_color=NA,
    display_numbers=T,number_format="%.0f",fontsize_nu mber=7,number_color="black",
    colorRampPalette(colorspace::hex(HSV(c(210,170,120 ,60,40,20,0),.5,1)))(256)
    )


    Here min-max scaling was applied to the whole table on the left side and to each column individually on the right side:


    The script below displays the original f3 values without scaling. It also demonstrates how `vegan::reorder.hclust` can be used to reorder the branches of the heatmap.

    Code:
    library(tidyverse)
    library(admixtools)
    library(pheatmap)
    library(colorspace) # for hex
    library(vegan) # for reorder.hclust

    p1="Yoruba.DG"
    p2=c("Nganasan","Russia_HG_Karelia","Norway_N_HG.S G","Turkey_N","Russia_HG_Tyumen","Han","Estonia_Co rdedWare","Russia_Bolshoy","Luxembourg_Loschbour", "Russia_Samara_EBA_Yamnaya","Georgia_Kotias.SG","F innish","Finland_Levanluhta","Russia_Shamanka_Eneo lithic.SG")

    f3=f3("path/to/v44.3_HO_public",p1,p2,p2)

    t=f3%>%select(pop2,pop3,est)%>%pivot_wider(names_f rom=pop2,values_from=est)%>%data.frame(row.names=1 )

    diag(t)=NA # don't show the f3 values of populations with themselves in order to not stretch the color scale

    pheatmap (
    t,filename="a.png",
    clustering_callback=function(...){reorder(hclust(d ist(t)),wts=t[,"Han"]-t[,"Turkey_N"])},
    display_numbers=apply(t,1,function(x)sub("^0","",s printf("%.3f",x))),
    legend=F,border_color=NA,cellwidth=16,cellheight=1 6,fontsize=8,fontsize_number=7,number_color="black ",
    # breaks=seq(.160,.190,.03/256), # use fixed limits for color scale
    colorRampPalette(hex(HSV(c(210,170,120,60,40,20,0) ,.5,1)))(256)
    )

    Result:



    f3-outgroup nMonte

    This saves an f3 table as a CSV file so it can be used with Vahaduo or nMonte. It uses a shell wrapper for my simplified version of nMonte3.R.

    Code:
    $ printf% s \\ n Estonia_CordedWare Finland_Levanluhta Finnish Georgia_Kotias.SG Han Luxembourg_Loschbour Nganasan Norway_N_HG.SG Russia_Bolshoy Russia_HG_Karelia Russia_HG_Tyumen Russia_Samara_EBA_Yamnaya Turkey_Shamanka_EBA_Yamnaya Turkey_Shamanka_EBA
    $ R -e 'library(tidyverse);library(admixtools);pops=readL ines("pops");f3=f3("path/to/v44.3_HO_public","Yoruba.DG",pops,pops);f3%>%selec t(pop2,pop3,est)%>%pivot_wider(names_from=pop3,val ues_from=est)%>%mutate(across(where(is.double),rou nd,6))%>%write_csv("f3")'
    $ cat f3
    pop2,Estonia_CordedWare,Finland_Levanluhta,Finnish ,Georgia_Kotias.SG,Han,Luxembourg_Loschbour,Nganas an,Norway_N_HG.SG,Russia_Bolshoy,Russia_HG_Karelia ,Russia_HG_Tyumen,Russia_Samara_EBA_Yamnaya,Russia _Shamanka_Eneolithic.SG,Turkey_N
    Estonia_CordedWare,0.189574,0.048826,0.050368,0.04 6842,0.040376,0.05185,0.042084,0.049919,0.04784,0. 050933,0.047266,0.051533,0.040938,0.049494
    Finland_Levanluhta, 0.048826,0.148686,0.051402,0.045953,0.045448,0.052 605,0.048901,0.053542,0.052522,0.053938,0.054203,0 .051491,0.046715,0.048399
    Finnish,0.050368,0.051402,0.052422,0.047057,0.0423 04,0.053757,0.04443,0.052616,0.048967,0.053264,0.0 49454,0.050761,0.04272,0.0495
    Georgia_Kotias.SG,0.046842,0.045953,0.047057,0.185 421,0.0394,0.045323,0.041384,0.046218,0.044534,0.0 47698,0.044243,0.048927,0.040304,0.046263
    Han, 0.040376,0.045448,0.042304,0.0394,0.061217,0.04094 2,0.057035,0.041632,0.049274,0.043826,0.044467,0.0 40925,0.057394,0.038767
    Luxembourg_Loschbour,0.05185,0.052605,0.053757,0.0 45323,0.040942,0.188232,0.043033,0.062498,0.050054 ,0.057409,0.049533,0.051483,0.04226,0.049539
    Nganasan,0.042084,0.048901,0.04443,0.041384,0.0570 35,0.043033,0.070154,0.04448,0.054056,0.047517,0.0 47917,0.043077,0.059516,0.039822
    Norway_N_HG.SG,0.049919,0.053542,0.052616,0.046218 ,0.041632,0.062498,0.04448,0.189739,0.053043,0.062 429,0.053055,0.052822,0.043558,0.048115
    Russia_Bolshoy, 0.04784,0.052522,0.048967,0.044534,0.049274,0.0500 54,0.054056,0.053043,0.063846,0.057001,0.054654,0. 049657,0.051358,0.04458
    Russia_HG_Karelia,0.050933,0.053938,0.053264,0.047 698,0.043826,0.057409,0.047517,0.062429,0.057001,0 .170124,0.062403,0.055766,0.046348,0.048079
    Russia_HG_Tyumen,0.047266,0.054203,0.049454,0.0442 43,0.044467,0.049533,0.047917,0.053055,0.054654,0. 062403,0.18867,0.054148,0.046497,0.043883
    Russia_Samara_EBA_Yamnaya, 0.051533,0.051491,0.050761,0.048927,0.040925,0.051 483,0.043077,0.052822,0.049657,0.0557660,0.0.05414 ,05414
    Russia_Shamanka_Eneolithic.SG,0.040938,0.046715,0. 04272,0.040304,0.057394,0.04226,0.059516,0.043558, 0.051358,0.046348,0.046497,0.041908,0.064461,0.039 068
    Turkey_N,0.049494,0.048399,0.0495,0.046263,0.03876 7,0.049539,0.039822,0.048115,0.04458,0.048079,0.04 3883,0.047615,0.039068,0.055858
    $ nm()(grep -v ^, "$1">/tmp/nm1;grep -v ^, "$2">/tmp/nm2;Rscript -e 'source("~/bin/nmon.R");a=as.numeric(commandArgs(T));getMonte("/tmp/nm1","/tmp/nm2",Nbatch=a[1],Ncycles=a[2],pen=a[3])' "${3-500}" "${4-1000}" "${5-.001}")
    $ nm <(sed 1d f3|grep -v Finnish) <(sed 1d f3|grep Finnish)
    Target: Finnish (d=.0028)
    45.2 Russia_Samara_EBA_Yamnaya
    35.6 Turkey_N
    5.0 Russia_Bolshoy
    3.4 Nganasan
    3.0 Han
    2.6 Luxembourg_Loschbour
    1.4 Finland_Levanluhta
    1.4 Norway_N_HG.SG
    1.2 Russia_Shamanka_Eneolithic.SG
    0.8 Russia_HG_Karelia
    0.4 Estonia_CordedWare
    $ cat ~/bin/nmon.R
    getMonte=function(datafile,targetfile,Nbatch=500,N cycles=1000,pen=0){
    do_algorithm=function(selection,targ){
    mySel = as.matrix (selection, rownames.force = NA)
    myTg=as.matrix(targ,rownames.force=NA)
    dif2targ=sweep(mySel,2,myTg,"-")
    Ndata = nrow (dif2targ)
    kcol=ncol(dif2targ)
    rowLabels=rownames(dif2targ)
    matPop=matrix(NA_integer_,Nbatch,1,byrow=T)
    dumPop=matrix(NA_integer_,Nbatch,1,byrow=T)
    matAdmix=matrix(NA_real_,Nbatch,kcol,byrow=T)
    dumAdmix=matrix(NA_real_,Nbatch,kcol,byrow=T)
    matPop = sample (1: Data, Nbatch, replace = T)
    matAdmix=dif2targ[matPop,]
    colM1 = colMeans (matAdmix)
    eval1=(1+pen)*sum(colM1^2)
    for(c in 1:Ncycles){
    dumPop=sample(1:Ndata,Nbatch,replace=T)
    dumAdmix=dif2targ[dumPop,]
    for(b in 1:Nbatch){
    store=matAdmix[b,]
    matAdmix[b,]=dumAdmix[b,]
    colM2 = colMeans (matAdmix)
    eval2=sum(colM2^2)+pen*sum(matAdmix[b,]^2)
    if(eval2<=eval1){
    matPop[b]=dumPop[b]
    colM1=colM2
    eval1=eval2
    }else{matAdmix[b,]=store}
    }
    }
    fitted = t (colMeans (matAdmix) + myTg [1,])
    popl=unlist(lapply(strsplit(rowLabels[matPop],";"),function(x)x[1]))
    populations=factor(popl)
    return(list("estimated"=fitted,"pops"=populations) )
    }

    source=read.csv(datafile,header=F,row.names=1)
    target=read.csv(targetfile,header=F,row.names=1)

    for(i in 1:nrow(target)){
    row=target[i,]
    result=do_algorithm(source,row)

    if(i>=2)cat("\n")
    dist=sqrt(sum((result$estimated-row)^2))
    cat(paste0("Target: ",row.names(row)," (d=",sub("^0","",sprintf("%.4f",dist)),")\n"))

    tb=table(result$pops)
    tb=sort(tb,decreasing=T)
    tb=as.matrix(100*tb/Nbatch)
    cat(paste(sprintf("%.1f",tb),row.names(tb)),sep="\ n")
    }
    }

    qpGraph with predefined node list

    Code:
    t=read.table(text="R Yoruba.DG
    R A
    TO THE
    A AA
    OE
    O ONLY
    AND I
    E L1
    I WH
    I EH
    AA EH
    EH EH2
    EH EH1
    EH AAA
    AA AAA
    WH CCE
    WH Italy_North_Villabruna_HG
    CH YM
    CH CH1
    CH ON
    EH1 YM
    EH1 EH0
    AAA AAN
    CH1 EH0
    CH1 Georgia_Kotias.SG
    L1 Turkey_N_published
    L1 LB
    L1 YM1
    L1 CW
    EH2 LB
    EH2 CWC
    YM YM1
    YM CW
    EH0 Russia_HG_Karelia
    AAN Nganasan
    TO CCA
    LB Germany_EN_LBK
    LB CCW
    YM1 Russia_Samara_EBA_Yamnaya
    CW CCW
    CW Estonia_CordedWare
    CCW CWC
    CWC CCE
    CCE CCA
    CCA Finnish.DG")

    pop=c("Finnish.DG","Estonia_CordedWare","Germany_E N_LBK","Italy_North_Villabruna_HG","Nganasan","Rus sia_HG_Karelia","Russia_Samara_EBA_Yamnaya","Turke y_N_published","Yoruba.DG","Georgia_Kotias.SG")

    f2=f2_from_geno("path/to/v44.3_HO_public",pops=pop)
    gr=qpgraph(f2,t)

    plot_graph(gr$edges)
    ggsave("a.png",width=9,height=9)
    system("mogrify -trim -border 64 -bordercolor white a.png")

    plotly_graph(gr$edges) # view Plotly graph in browser

    In order to generate a black-and-white graph, run `write_dot(gr$edges)` and paste the output here: http://viz-js.com.

    qpAdm popdrop table as a stacked bar chart

    The `popdrop` table contains alternative models for each non-empty subset of the left populations. The script below selects all models from the `popdrop` table with zero negative weights (where the value of the `feasible` column is not false) and with at least two left populations (where `f4rank` is not 0). It then plots the models sorted by their p score.

    In qpAdm, the number of right populations needs to be greater than or equal to the number of left populations. Otherwise you'll get an error like this: `Error in qpadm_weights(f4_est, qinv, rnk = length(left) - 2, fudge = fudge: Mat::cols(): indices out of bounds or incorrectly used`.

    I'm still afraid of qpAdm because I have no idea how to choose the outgroups, but I just selected a subset of the outgroups that were used in the paper titled "The Genomic History of Southeastern Europe": https://anthrogenica.com/showthread....pains-of-qpAdm. When the genetic distance between the right populations to left populations is high like in my model, the p values also tend to be higher.

    Code:
    library(admixtools)
    library(tidyverse)
    library(reshape2) # for melt
    library(colorspace) # for hex
    library(cowplot) # for plot_grid

    left=c("Turkey_Boncuklu_N.SG","Estonia_CordedWare. SG","Latvia_HG","Russia_HG_Karelia","Finland_Levan luhta","Nganasan")
    right=c("Ethiopia_4500BP_published.SG","Mbuti.DG", "Russia_Ust_Ishim_HG_published.DG","Karitiana.DG", "Papuan.DG","ONG.SG")
    target="Mari.SG"

    qp=qpadm("path/to/v44.3_HO_public",left,right,target)

    t=qp$popdrop%>%filter(feasible==T&f4rank!=0)%>%arr ange(desc(p))%>%select(1,4,5,7:last_col(5))
    t2=melt(t[-c(2,3)],id.vars="pat") # this uses `melt` because `pivot_longer` doesn't preserve the order of columns

    lab=sub("e-0","e-",sub("^0","",sprintf(ifelse(t$p<.001,"%.0e","%.3f "),t$p)))
    lab=paste0(lab," (",ifelse(t$chisq<10,sub("^0","",sprintf("%.1f",t$ chisq)),round(t$chisq)),")")

    subtit=str_wrap(paste("Right:",paste(sort(right),c ollapse=", ")),width=58)

    p=ggplot(t2,aes(x=fct_rev(factor(pat,level=t$pat)) ,y=value,fill=variable))+
    geom_bar(stat="identity",width=1,position=position _fill(reverse=T))+
    geom_text(aes(label=round(100*value)),position=pos ition_stack(vjust=.5,reverse=T),size=3.5)+
    ggtitle(paste("Target:",target),subtitle=subtit)+
    coord_flip()+
    scale_x_discrete(expand=c(0,0),labels=rev(lab))+
    scale_y_discrete(expand=c(0,0))+ # remove gray margins on the left and right side of the plot
    scale_fill_manual("legend",values=hex(HSV(c(30,60, 180,210,270,330),.5,1)))+ # use manual colors
    # scale_fill_manual(values=colorspace::hex(HSV(head( seq(0,360,length.out=1+ncol(t)-3),-1),.5,1)))+ # use automatic colors
    guides(fill=guide_legend(ncol=2,byrow=F))+
    theme(
    axis.text.x=element_blank(),
    axis.text.y=element_text(size=11),
    axis.text=element_text(color="black"),
    axis.ticks=element_blank(),
    axis.title=element_blank(),
    legend.direction="horizontal",
    legend.key=element_rect(fill=NA), # remove gray border around color squares
    legend.margin=margin(-4,0,1,0),
    legend.text=element_text(size=11),
    legend.title=element_blank(),
    plot.subtitle=element_text(size=12),
    plot.title=element_text(size=16)
    )

    ggdraw(p)
    hei=c(.5+.2*(str_count(subtit,"\n")+1)+.25*nrow(t) ,.1+.23*ceiling(length(unique(t2[!is.na(t2$value),2]))/2))
    ggdraw(plot_grid(p+theme(legend.position="none"),g et_legend(p),ncol=1,rel_heights=hei))
    ggsave("a.png",width=5.5,height=sum(hei))

    In the image below, the number in parenthes after the p-value is chi-squared. Models with a large number of left populations tend to get a low chi-squared relative to the p-value. Models with only two left populations tend to get a high chi-squared relative to the p-value.



    Print qpAdm popdrop table as plain text

    Code:
    library(admixtools)
    library(tidyverse)

    left=c("Turkey_Boncuklu_N.SG","Russia_Samara_EBA_Y amnaya","Latvia_HG","Russia_HG_Karelia","Nganasan" ,"Finland_Levanluhta","Russia_Bolshoy")
    right=c("Russia_Ust_Ishim_HG_published.DG","Russia _HG_Tyumen","Belgium_UP_GoyetQ116_1_published_all" ,"Russia_Kostenki14","Iran_Wezmeh_N.SG","Georgia_K otias.SG","Turkey_Epipaleolithic","China_YR_LN")
    target="Finnish"

    qp=qpadm("path/to/v44.3_HO_public",left,right,target)

    t=qp$popdrop%>%filter(feasible==T&f4rank!=0)%>%arr ange(desc(p))
    p=apply(t%>%select(7:last_col(5)),1,function(x){y= round(100*sort(na.omit(x),decreasing=T));paste(pas te0(y,"% ",names(y)),collapse=" + ")})
    paste0(sub("^0","",sprintf(ifelse(t$p<.001,"%.0e", "%.3f"),t$p)),": ",p)%>%cat(sep="\n")

    Output:

    .630: 45% Russia_Samara_EBA_Yamnaya + 28% Turkey_Boncuklu_N.SG + 15% Latvia_HG + 11% Nganasan
    .335: 41% Finland_Levanluhta + 35% Russia_Samara_EBA_Yamnaya + 24% Turkey_Boncuklu_N %vanaya
    Russia_Samnaya_25_Boncuklu_N% % Turkey_Boncuklu_N.SG + 2% Nganasan
    .011: 38% Russia_Samara_EBA_Yamnaya + 34% Turkey_Boncuklu_N.SG + 23% Russia_Bolshoy + 5% Latvia_HG
    .010: 46% Russia_Samara_EBA_Yamnaya + 34%
    Turkey_NGoncuklukluk : 42% Russia_Samara_EBA_Yamnaya + 35% Turkey_Boncuklu_N.SG + 24% Russia_Bolshoy
    .004: 49% Russia_Samara_EBA_Yamnaya + 33% Turkey_Boncuklu_N.SG + 11% Nganasan + 7% Russia_HG_Karelia
    .002: 58% Russia_Samara_EBA_Yamnaya + 31% Turkey_Boncuklu_N.SG + 12% Nganasan
    7e-11: 50% Finland_Levanluhta + 35% Turkey_Boncuklu_N.SG + 16% Russia_HG_Karelia
    5e-12: 59% Finland_Levan_Bluhta + Latvia.SG
    7e-13: 52% Turkey_Boncuklu_N.SG + 39% Russia_HG_Karelia + Nganasan 9%
    4e-13: 69% Finland_Levanluhta + Turkey_Boncuklu_N.SG 31%
    2e-14: 52% Turkey_Boncuklu_N.SG + 27% Russia_HG_Karelia + Russia_Bolshoy 21%
    1e-19 : 46% Turkey_Boncuklu_N.SG + 28% Russia_Bolshoy + 26% Latvia_HG
    1e-23: 71% Finland_Levanluhta + 29% Russia_Samara_EBA_Yamnaya
    4e-24: 54% Turkey_Boncuklu_N.SG + 46% Russia_HG_Karelia% Turkey 2N
    -26_G Latvia_HG + 12% Nganasan
    7e-27: 63% Turkey_Boncuklu_N.SG + 37% Russia_Bolshoy
    8e-36: 55% Russia_Samara_EBA_Yamnaya + 36% Turkey_Boncuklu_N.SG + 9% Latvia_HG
    6e-38: 63% Russia_Samara_EBA_Yamnaya + 37% Turkey_NBoncuk
    5% .SG + Nganasan 15%
    4e-53: 56% Turkey_Boncuklu_N.SG + Latvia_HG 44%
    7e-61: 85% Russia_Samara_EBA_Yamnaya + 10% Nganasan + Latvia_HG 5%
    4e-62: 90% Russia_Samara_EBA_Yamnaya + Nganasan 10%
    3e-76: 98 % Russia_HG_Karelia + 2% Nganasan
    5e-78: 88% Russia_Samara_EBA_Yamnaya + 12% Russia_Bolshoy
    2e-116: 91% Latvia_HG + 9% Nganasan
    1e-128: 92% Latvia_HG + 8% Russia_Bolshoy

    plot_map

    The `plot_map` function uses Plotly to generate an interactive map for entries from an anno file. You can see the code of the function by running `plot_map`.

    Code:
    library(admixtools)
    library(tidyverse)

    t=as.data.frame(read_tsv(gsub("\u00a0"," ",read_file("path/to/v44.3_HO_public.anno")))) # remove non-breaking spaces so `as.numeric` won't fail
    t=t[,c(8,11,12,6)] # group, latitude, longitude, years BP
    t=t[!grepl("\\.REF|rel\\.|fail\\.|Ignore_|_dup|_contam |_lc|_father|_mother|_son|_daughter|_brother|_sist er|_sibling|_twin|Neanderthal|Denisova|Vindija_lig ht|Gorilla|Macaque|Marmoset|Orangutang|Primate_Chi mp|hg19ref|_o|_outlier",t[,1]),]
    t=t[t[,2]!=".."&t[,3]!=".."&t[,2]!="-"&t[,3]!="-",] # remove lines with missing lat or long field (can be `..` or `-`)
    t[t[,4]==0,4]=1 # change years BP 0 to 1 because entries with years BP 0 are not plotted when using a log scale (which is the default option)
    a=aggregate(apply(t[,-1],2,as.numeric),list(t[,1]),mean)
    a[,5]=a[,1]
    names(a)=c("iid","lat","lon","yearsbp","group")

    admixtools::plot_map(a)



    A second option is to first run this in a shell:

    Code:
    $ igno()(grep -Ev '\.REF|rel\.|fail\.|Ignore_|_dup|_contam|_lc|_fath er|_mother|_son|_daughter|_brother|_sister|_siblin g|_twin|Neanderthal|Denisova|Vindija_light|Gorilla |Macaque|Marmoset|Orangutang|Primate_Chimp|hg19ref ')
    $ tav()(awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1][i]+=$i}}END{for(i in a){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i][j]/n[i]);print o}}' "FS=${1-$'\t'}")
    $ awk -F\\t 'NR>1&&$11!=".."&&$11!="-"{print$8,$11,$12,$6}' path/to/v44.3_HO_public.anno|igno|grep -v _o|tav \ >map
    $ head -n2 map
    Ireland_EBA.SG 55.292132 -6.191685 3765.333333
    Germany_EN_LBK 49.674249 9.582716 7066.370370

    Then run this in R:

    Code:
    t=read.table("map");t[t[,4]==0,4]=1;t[,5]=t[,1];names(t)=c("iid","lat","lon","yearsbp","group");p lot_map(t)'

    A third option is to use Plotly directly:

    Code:
    library(plotly)

    a=read.table("map")
    label=ifelse(a[,4]==0,a[,1],paste0(a[,1],"\u00a0",round(a[,4])))
    pl=plot_ly(a,type="scattermapbox",mode="markers",x =a[,3],y=a[,2],text=a[,1],hovertext=label,hoverinfo="text",marker=list(size =8,color=log10(a[,4]),colorscale="Portland",colorbar=list(thickness=15 ,outlinewidth=0)))
    plotly::layout(pl,mapbox=list(style="stamen-terrain"),margin=list(t=0,r=0,b=0,l=0,pad=0))

    In order to display the names of populations in addition to dots, you can use `mode="markers+text"`. It is currently only supported by the mapbox map styles which require a free access token: https://community.plotly.com/t/scatt...-as-text/35065. The `satellite-streets` style also includes the names of cities but `satellite` doesn't.

    Code:
    library(plotly)

    a=read.table("map")
    label=ifelse(a[,4]==0,a[,1],paste0(a[,1],"\u00a0",round(a[,4])))
    pl=plot_ly(a,type="scattermapbox",mode="markers+te xt",x=a[,3],y=a[,2],text=label,textfont=list(color="white"),textposit ion="top",marker=list(color=log10(a[,4]),size=8,colorscale="Jet"))
    plotly::layout(pl,mapbox=list(style="satellite-streets",accesstoken="youraccesstoken"),margin=lis t(t=0,r=0,b=0,l=0,pad=0))




    FST or f2 heatmap

    The default clustering method used by `pheatmap` is equivalent to `hclust(dist(t))`. Here the input is already a distance matrix, so you can use this instead: `clustering_callback=function(...)hclust(as.dist(t ))`.

    Code:
    library(admixtools)
    library(pheatmap)
    library(vegan) # for reorder.hclust
    library(colorspace) # for hex

    pop=unlist(strsplit("Besermyan Enets Estonian Finnish Hungarian Karelian Mansi Mordovian Nganasan Saami.DG Selkup Udmurt"," "))

    f=fst("path/to/v44.3_HO_public",pop)
    # f=f2("path/to/v44.3_HO_public",pop)

    # convert FST or f2 pairs to square matrix
    df=as.data.frame(f[,-4])
    df2=rbind(df,setNames(df[,c(2,1,3)],names(df)))
    t=xtabs(df2[,3]~df2[,1]+df2[,2])

    weights=t["Hungarian",]-t["Nganasan",]
    # weights=rowMeans(t) # sort by average distance to other populations

    pheatmap (
    1000*t,filename="a.png",
    clustering_callback=function(...)reorder(hclust(as .dist(t)),weights),
    # clustering_callback=function(...)hclust(as.dist(t) ),
    legend=F,border_color=NA,cellwidth=18,cellheight=1 8,
    treeheight_row=80,treeheight_col=80,
    display_numbers=T,number_format="%.0f",number_colo r="black",
    colorRampPalette(hex(HSV(c(210,210,130,60,40,20,0) ,c(0,.5,.5,.5,.5,.5,.5),1)))(256)
    )

    Here the FST values are multiplied by 1,000 and not by 10,000 as usual:

    “Man cannot live without a permanent trust in something indestructible in himself, and at the same time that indestructible something as well as his trust in it may remain permanently concealed from him.”

    Franz Kafka

  2. #2
    Regular Member Archetype0ne's Avatar
    Join Date
    11-06-18
    Posts
    715

    Y-DNA haplogroup
    J2b2-L283/FT29003

    Country: Albania



    The script in this post generates qpAdm models for a set of target populations, selects the model with the highest p-value for each target population, and displays the models as a stacked bar chart.

    Code:
    library(admixtools)
    library(tidyverse)
    library(reshape2) # for melt
    library(colorspace) # for hex
    library(cowplot) # for plot_grid

    left=c("Turkey_N_published","Estonia_CordedWare.SG ","Latvia_HG","Russia_HG_Samara","Russia_HG_Tyumen ","Nganasan")
    right=c("Russia_Ust_Ishim_HG_published.DG","Kariti ana.DG","Papuan.DG","ONG.SG","Iran_Wezmeh_N.SG","B elgium_UP_GoyetQ116_1_published_all","Turkey_Epipa leolithic","Georgia_Kotias.SG","Japanese")
    target=c("Finnish.DG","Estonian.DG","Selkup","Hung arian.DG","Karelian","Mansi","Udmurt","Mordovian", "Saami.DG","Besermyan","Finnish","Enets","Mari.SG" ,"Finland_Saami_Modern.SG","Russia_Bolshoy","Norwa y_Viking_o1.SG","Russia_Chalmny_Varre")

    f2=f2_from_geno("path/to/v44.3_HO_public",pops=c(left,right,target))
    qp=sapply(target,function(x)qpadm(f2,left,right,x) )

    # select model with highest p-value
    qp2=apply(qp,2,function(x)x$popdrop%>%filter(feasi ble==T&f4rank!=0)%>%top_n(1,p)%>%select(5,7:last_c ol(5)))

    # select model with highest number of left populations; choose highest p-value when tied
    # qp2=apply(qp,2,function(x)x$popdrop%>%filter(feasi ble==T)%>%arrange(desc(f4rank),desc(p))%>%head(1)% >%select(5,7:last_col(5)))

    qp3=do.call("rbind",qp2)
    qp3$target=rownames(qp3)
    qp3=qp3[order(qp3[,"Nganasan"],na.last=F),]
    # nona=qp3;nona[is.na(nona)]=0;qp3=qp3[order(3*nona[,"Nganasan"]+nona[,"Russia_HG_Tyumen"]+.8*nona[,"Russia_HG_Samara"]),]
    qp4=melt(qp3[,-1],id.var="target")

    lab=sub("e-0","e-",sub("^0","",sprintf(ifelse(qp3$p<.001,"%.0e","%. 3f"),qp3$p)))
    capt=str_wrap(paste("Outgroups:",paste(sort(right) ,collapse=", ")),width=80)

    p=ggplot(qp4,aes(x=fct_rev(factor(target,level=qp3 $target)),y=value,fill=variable))+
    geom_bar(stat="identity",width=1,position=position _fill(reverse=T))+
    geom_text(aes(label=round(100*value)),position=pos ition_stack(vjust=.5,reverse=T),size=3.5)+
    labs(caption=capt)+
    coord_flip()+
    scale_x_discrete(expand=c(0,0),labels=rev(paste0(q p3$target," (",lab,")")))+
    scale_y_discrete(expand=c(0,0))+ # remove gray area on left and right side of bars
    scale_fill_manual(values=hex(HSV(c(30,60,180,210,1 20,330),.5,1)))+ # manual colors
    # scale_fill_manual(values=hex(HSV(head(seq(0,360,le ngth.out=length(unique(qp4$variable))+1),-1),.5,1)))+ # automatic colors
    guides(fill=guide_legend(ncol=3,byrow=F))+
    theme(
    axis.text=element_text(color="black",size=11),
    axis.text.x=element_blank(),
    axis.ticks=element_blank(),
    axis.title=element_blank(),
    legend.direction="horizontal",
    legend.key=element_rect(fill=NA), # remove gray border around color squares
    legend.spacing.y=unit(.03,"in"), # reduce vertical space between color squares
    legend.margin=margin(0,0,-4,0),
    legend.text=element_text(size=10),
    legend.title=element_blank(),
    plot.caption=element_text(size=10)
    )

    hei=c(.05+.25*ceiling(length(unique(qp4[!is.na(qp4$value),2]))/3),.25*nrow(qp3)+.05+.18*(str_count(capt,"\n")+1))
    ggdraw(plot_grid(get_legend(p),p+theme(legend.posi tion="none"),ncol=1,rel_heights=hei))
    ggsave("a.png",width=6,height=sum(hei))







    The script in this post calculates a matrix of f3 distances between each pair of specified populations. It displays the distance to one specified population on the x axis and to another specified population on the y axis, but the clustering is based on the full matrix. The distances are displayed as min-max scaled percentages.


    Code:
    library(admixtools)
    library(tidyverse)
    library(ggrepel)

    p1=c("Altaian","Altaian_Chelkan","Azeri","Balkar", "Bashkir","Besermyan","Buryat","Chuvash","Daur","D olgan","Enets","Estonian","Even","Evenk_FarEast"," Evenk_Transbaikal","Finnish","Gagauz","Hungarian", "Kalmyk","Karachai","Karakalpak","Karelian","Kazak h","Kazakh_China","Khakass","Khakass_Kachin","Kumy k","Kyrgyz_China","Kyrgyz_Kyrgyzstan","Kyrgyz_Taji kistan","Mansi","Mongol","Mordovian","Nanai","Negi dal","Nganasan","Nogai_Astrakhan","Nogai_Karachay_ Cherkessia","Nogai_Stavropol","Salar","Selkup","Sh or_Khakassia","Shor_Mountain","Tatar_Kazan","Tatar _Mishar","Tatar_Siberian","Tatar_Siberian_Zabolotn iye","Todzin","Tofalar","Tubalar","Turkish","Turki sh_Balikesir","Turkmen","Tuvinian","Udmurt","Uyghu r","Uzbek","Veps","Xibo","Yakut","Yukagir_Forest", "Yukagir_Tundra")
    p2=c("Korean","Yakut")
    p3="Khomani"

    f3=f3("path/to/v44.3_HO_public",c(p1,p2),c(p1,p2),p3)

    t=f3%>%select(pop1,pop2,est)%>%pivot_wider(names_f rom=pop1,values_from=est)%>%data.frame(row.names=1 )
    t2=t[,p2]

    t2=as.data.frame(apply(t2,2,function(x)100*(x-min(x))/(max(x)-min(x))))

    t2$k=cutree(hclust(dist(t)),12)
    names(t2)=c("x","y","k")

    ggplot(t2,aes(x,y))+
    geom_abline(linetype="dashed",color="gray80",size= .2)+
    geom_polygon(data=t2%>%group_by(k)%>%slice(chull(x ,y)),alpha=.2,aes(color=as.factor(k),fill=as.facto r(k)),size=.2)+
    geom_point(aes(color=as.factor(k)),size=.7)+
    geom_text_repel(aes(label=rownames(t2),color=as.fa ctor(k)),max.overlaps=Inf,force=2,size=2.2,segment .size=.2,min.segment.length=.2,box.padding=.05)+
    xlab(paste0("100 × minmax(f3(x, ",p2[1],", ",p3,"))"))+
    ylab(paste0("100 × minmax(f3(x, ",p2[2],", ",p3,"))"))+
    scale_x_continuous(breaks=seq(0,100,10))+
    scale_y_continuous(breaks=seq(0,100,10))+
    scale_color_manual(values=rainbow_hcl(n_distinct(t 2$k),100,50))+
    scale_fill_manual(values=rainbow_hcl(n_distinct(t2 $k),100,50))+
    theme(
    axis.text=element_text(color="black",size=6),


    axis.text.y=element_text(angle=90,vjust=1,hjust=.5 ),
    axis.ticks=element_blank(),
    axis.ticks.length=unit(0,"pt"),
    axis.title=element_text(size=8),
    legend.position="none",
    panel.background=element_rect(fill="white"),
    panel.border=element_rect(color="gray90",fill=NA,s ize=.4),
    panel.grid.major=element_line(color="gray90",size= .2)
    )

    ggsave("a.png",width=6,height=6)


  3. #3
    Regular Member Archetype0ne's Avatar
    Join Date
    11-06-18
    Posts
    715

    Y-DNA haplogroup
    J2b2-L283/FT29003

    Country: Albania



    This is from another thread by same author:


    Shell and R scripts for SmartPCA and ADMIXTURE


    Running SmartPCA and ADMIXTURE

    The series of shell commands below downloads the 1240K+HO dataset and converts it to the PLINK format. It removes duplicate samples by selecting unique samples by the version ID, which is the third column in the anno file. It uses `plink --genome` to calculate IBD and IBS values, and it removes related samples by excluding samples with a PI_HAT value of .25 or above after LD pruning. It then runs SmartPCA with outlier removal not disabled, because without outlier removal often dimensions after the first few dimensions are dominated by populations with outliers. It generates G25-style CSV files for the output of SmartPCA. Finally it runs ADMIXTURE with the outliers detected by SmartPCA removed.

    Code:
    curl -LsO reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.tar;tar -xf v44.3_HO_public.tar
    f=v44.3_HO_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)

    printf %s\\n Altaian Altaian_Chelkan Azeri Balkar Bashkir Besermyan Buryat Chuvash Daur Daur.DG Daur.SDG Dolgan Enets Estonian.DG Even.DG Evenk_FarEast Evenk_Transbaikal Finnish Finnish.DG Gagauz Kazakh Kazakh Hungarian Hungarian. Kyrgysh_China Kyrgyz_Kyrgyzstan Kyrgyz_kyrgyshstnkDG Kyrgyz_Tajikistan Kyrgyz_Tajikstan Mansi Mongol Mordovian Nanai Negidal Nganasan Nogai Nogai_Astrakhan Nogai_krcy_Cherkessia Nogai_Stavropol Oroqen.DG Oroqen.SDG Saami.DG Salar Selkup Shor_Khakassia Shor_mounten Tatar_Kazan Tatar_Mishar Tatar_saiberian Tatar_saiberian_Zabolotniye Todzin Tofalar Tubalar Turkish Turkish.DG Turkish_Balikesir Turkmen Tuvinian Udmurt Uyghur Uyghur.DG Uzbek Veps Xibo Xibo.DG Xibo.SDG Yakut Yakut.DG Yakut.SDG Yukagir_Forest Yukagir_Tundra>pop
    x=uralaltaic;sed 1d v44.3_HO_public.anno|sort -t$'\t' -rnk15|awk -F\\t '!a[$3]++{print$2,$8}'|awk 'NR==FNR{a[$0];next}$2 in a' pop ->$x.temp.pick
    plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.temp.pick v44.3_HO_public.fam) --indep-pairwise 50 10 .1 --make-bed --out $x.temp
    plink --allow-no-sex --bfile $x.temp --extract $x.temp.prune.in --genome --out $x
    awk 'FNR>1&&$10>=.25{print$2<$4?$2:$4}' $x.genome|awk 'NR==FNR{a[$0];next}!($1 in a)' - $x.temp.pick>$x.pick
    plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x

    smartpca -p <(printf %s\\n genotypename:\ $x.bed snpname:\ $x.bim indivname:\ $x.fam evecoutname:\ $x.evec evaloutname:\ $x.eval autoshrink:\ YES lsqproject:\ YES numoutevec:\ 10 numoutlieriter:\ 5)|tee $x.smartpca
    sed 1d $x.evec|tr : \ |awk 'NR==FNR{a[$1]=$2;next}{$1=a[$2]}1' $x.pick -|sed 's/ /:/'|sed 's/ [^ ]*$//'|tr ' ' ,>$x.csv
    tav()(awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1][i]+=$i}}END{for(i in a){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i][j]/n[i]);print o}}' "FS=${1-$'\t'}")
    sed 's/:[^,]*//' $x.csv|tav ,>$x.ave
    for y in ave csv;do awk -F, 'NR==FNR{a[NR]=$0;next}{printf"%s",$1;for(i=2;i<=NF;i++)printf", %.6f",$i*a[i-1]^.5;print""}' $x.$y $x.$y>$x.scaled;done

    plink --allow-no-sex --bfile $x --indep-pairwise 50 10 .1 --out $x
    plink --bfile $x --extract $x.prune.in --remove <(grep ^REMOVED\ outlier $x.smartpca|cut -d' ' -f3|cut -d: -f2|awk 'NR==FNR{a[$0];next}$2 in a' - $x.fam) --make-bed --out $x.pruned
    k=3
    admixture -j4 -C .1 $x.pruned.bed $k
    paste -d' ' <(awk 'NR==FNR{a[$1]=$2;next}{print$2,a[$2]}' $x.pick $x.pruned.fam) $x.pruned.$k.Q>$x.$k
    cut -d' ' -f2- $x.$k|tav \ >$x.$k.ave

    Making ADMIXTURE run faster

    ADMIXTURE runs in a single-threaded mode by default, but if your processor has 4 cores, you can add the option `-j4` to make ADMIXTURE run 4 times faster.

    With the option `-C .1`, the ADMIXTURE run typically finishes about 2-4 times faster. It terminates the run after the log likelihood increases by less than .1 between iterations (instead of the default .0001). When I tried the same run with and without the `-C .1` option, the average difference in admixture weights was only about .15%:

    Code:
    $ Rscript -e 'mean(abs(as.matrix(read.table("uralaltaic.pruned. 3.Q")-read.table("uralaltaic2.pruned.3.Q"))))'

    [1] 0.001502451
    Performing more aggressive LD pruning also makes ADMIXTURE run faster. An example in the ADMIXTURE manual uses a fairly aggressive R^2 threshold (.1) for the `--indep-pairwise` option of PLINK (http://dalexander.github.io/admixtur...ure-manual.pdf):

    ### 2.3 Do I need to thin the marker set for linkage disequilibrium?

    We tend to believe this is a good idea, since our model does not explicitly take LD into consideration, and since enormous data sets take more time to analyze. It is impossible to "remove" all LD, especially in recently-admixed populations, which have a high degree of "admixture LD". Two approaches to mitigating the effects of LD are to include markers that are separated from each other by a certain genetic distance, or to thin the markers according the observed sample correlation coefficients. The easiest way is the latter, using the `--indep-pairwise` option of PLINK. For example, if we start with a file `rawData.bed`, we could use the following commands to prune according to a correlation threshold and store the pruned dataset in `prunedData.bed`:

    > `% plink --bfile rawData --indep-pairwise 50 10 0.1`
    >
    > (output indicating number of SNPs targeted for inclusion/exclusion)
    >
    > `% plink --bfile rawData --extract plink.prune.in --make-bed --out prunedData`


    Specifically, the first command targets for removal each SNP that has an _R²_ value of greater than 0.1 with any other SNP within a 50-SNP sliding window (advanced by 10 SNPs each time). The second command copies the remaining (untargetted) SNPs to `prunedData.bed`.

    This approach is imperfect but seems to work well in practice. Please read our paper for more information.

    Sources of the samples

    Code:
    $ curl -sO https://reichdata.hms.harvard.edu/pu...HO_public.anno
    $ awk -F\\t 'NR==FNR{a[$0];next}$8 in a&&!a[$4,$8]++{print$8,$4}' pop v44.3_HO_public.anno|sort|awk '{a[$2]=a[$2]" "$1}END{for(i in a)print i":"a[i]}'|sort
    BergstromScience2020: Daur.SDG Oroqen.SDG Xibo.SDG Yakut.SDG
    FlegontovNature2019: Enets Nganasan Selkup
    Jeodaghntureachologyevolution20l9: Altaian Altaian_Chelkan Azeri Bashkir Besermyan Buryat Chuvash Evenk_FarEast Evenk_Transbaikal Gagauz Karachai Karakalpak Karelian Kazakh Khakass Khakass_Kachin Kyrgyz_Tajikistan Mongol Mordovian Nanai Negidal Nogai_Astrakhan Nogai_Stavropol Shor_Khakassia Shor_mounten Tatar_Kazan Tatar_Mishar Tatar_saiberian Tatar_saiberian_Zabolotniye Todzin Tubalar Tuvinian Udmurt Uzbek Veps
    LazaridisNature2014: Altaian Azeri.WGA Balkar Chuvash Dolgan Even Finnish Hungarian Kalmyk Kumyk Kyrgyz_Kyrgyzstan Mansi Mordovian Nganasan Nogai_Karachay_Cherkessia Selkup Tofalar Tubalar Turkish Turkish_Balikesir Turkmen Tuvinian Uzbek Yukagir_Forest Yukagir_Tundra
    MallickNature2016: Daur.DG Estonian.DG Even.DG Finnish.DG Hungarian.DG Kyrgyz_Kyrgyzstan.DG Oroqen.DG Saami.DG Turkish.DG Uyghur.DG Xibo.DG Yakut.DG
    PattersonGenetics2012: The Uyghur Xibo Yakut Cycle
    WangBioRxiv2020: Kazakh_China Kyrgyz_China Salar

    Plotting the output of SmartPCA with Voronoi tessellation

    The script below uses the `ggforce` package to plot the result of SmartPCA using Voronoi tessellation: https://ggforce.data-imaginist.com/r...om_delvor.html.

    When I tried to install `ggforce`, it failed at first because its dependency `units` failed to install. By running `install.packages("units")`, I saw that I needed to run `brew install udunits` (`libudunits2-dev` on Debian and `udunits2-devel` on RPM).

    Code:
    library(tidyverse)
    library(ggforce)

    n="uralaltaic"
    t=read.csv(paste0(n,".csv"),header=F,row.names=1)
    eig=head(as.double(readLines(paste0(n,".eval"))),n col(t))
    pct=paste0("PC",seq(length(eig))," (",sprintf("%.1f",100*eig/sum(eig)),"%)")

    names(t)=paste0("PC",1:ncol(t))
    ranges=apply(t,2,function(x)abs(max(x)-min(x)))

    pop=sub(":.*","",rownames(t))
    pop=sub("\\.(DG|SDG|SG|WGA)","",pop)
    centers=aggregate(t,by=list(pop),mean)
    t$pop=pop

    set.seed(1234)
    color=as.factor(sample(seq(1,length(unique(t$pop)) )))
    col=rbind(c(60,80),c(25,95),c(30,70),c(70,50),c(60 ,100),c(20,50),c(15,40))
    hues=max(ceiling(length(color)/nrow(col)),2)
    pal1=as.vector(apply(col,1,function(x)hcl(seq(15,3 75,length=hues+1)[1:hues],x[1],x[2])))
    pal2=as.vector(apply(col,1,function(x)hcl(seq(15,3 75,length=hues+1)[1:hues],ifelse(x[2]>50,.8*x[1],.2*x[1]),ifelse(x[2]>50,.3*x[2],100))))

    for(i in seq(1,7,2)){
    p1=sym(paste0("PC",i))
    p2=sym(paste0("PC",i+1))

    maxrange=max(ranges[c(i,i+1)])
    lims=sapply(c(i,i+1),function(x)mean(range(t[,x]))+c(-.535*maxrange,.535*maxrange))

    ggplot(t,aes(!!p1,!!p2,group=-1L))+ # the group needs to be constant for geom_voronoi_tile; -1L is a convention
    geom_voronoi_tile(aes(fill=color[as.factor(pop)],color=color[as.factor(pop)]),size=.07,max.radius=maxrange/40)+
    geom_label(data=centers,aes(x=!!p1,y=!!p2,label=Gr oup.1,color=color,fill=color),alpha=.7,size=2,labe l.r=unit(.08,"lines"),label.padding=unit(.08,"line s"),label.size=0)+
    coord_fixed(xlim=lims[,1],ylim=lims[,2],expand=F)+
    labs(x=pct[i],y=pct[i+1])+
    scale_x_continuous(breaks=seq(-1,1,.05))+
    scale_y_continuous(breaks=seq(-1,1,.05))+
    scale_fill_manual(values=pal1)+
    scale_color_manual(values=pal2)+
    theme(
    axis.text.y=element_text(angle=90,vjust=1,hjust=.5 ),
    axis.text=element_text(color="black",size=6),
    axis.ticks=element_blank(),
    axis.ticks.length=unit(0,"pt"),
    axis.title=element_text(color="black",size=8),
    legend.position="none",
    panel.background=element_rect(fill="white"),
    panel.border=element_rect(color="gray85",fill=NA,s ize=.4),
    panel.grid.major=element_line(color="gray85",size= .2)
    )

    ggsave(paste0(i,".png"),width=7,height=7)
    }




    Plotting population averages in SmartPCA output

    The clusting is based on hierarchical clustering using the complete linkage method (farthest neighbor clustering). The clustering is based on only the first 6 eigenvectors, where the eigenvectors have been multiplied by the square roots of the eigenvalues.

    Code:
    library(tidyverse)
    library(ggforce)
    library(ggrepel)

    f="uralaltaic"
    t=read.table(paste0(f,".evec"),row.names=1)
    t=t[,-ncol(t)]
    eig=as.double(readLines(paste0(f,".eval")))[1:ncol(t)]

    pct=paste0("PC",seq(length(eig))," (",sprintf("%.1f",100*eig/sum(eig)),"%)")

    ind=read.table("path/to/v44.3_HO_public.ind",row.names=1)

    names(t)=c(paste0("PC",1:ncol(t)))
    rownames(t)=sub(".*:","",rownames(t))
    t=data.frame(aggregate(t,list(sub("\\.(DG|SDG|SG)" ,"",ind[rownames(t),2])),mean),row.names=1)

    t$k=cutree(hclust(dist(t(t(t)*sqrt(eig))[,1:6])),k=16)

    ggplot(t,aes(x=PC1,y=PC2))+
    geom_point(aes(color=as.factor(k)),size=.5)+
    ggforce::geom_mark_hull(aes(color=as.factor(k),fil l=as.factor(k)),concavity=100,radius=unit(.2,"cm") ,expand=unit(.2,"cm"),alpha=.2,size=.2)+
    # geom_polygon(data=t%>%group_by(k)%>%slice(chull(PC 1,PC2)),alpha=.2,aes(color=as.factor(k),fill=as.fa ctor(k)),size=.25)+
    ggrepel::geom_text_repel(aes(label=rownames(t),col or=as.factor(k)),max.overlaps=Inf,force=5,size=2,b ox.padding=.0,min.segment.length=.1,segment.size=. 2)+
    # geom_text(aes(label=rownames(t),color=as.factor(k) ),size=2,vjust=-.6)+
    scale_x_continuous(breaks=seq(-2,2,.02),expand=expansion(mult=.10))+
    scale_y_continuous(breaks=seq(-2,2,.02),expand=expansion(mult=.06))+
    labs(x=pct[1],y=pct[2])+
    scale_color_manual(values=hcl(head(seq(0,360,lengt h.out=length(unique(t$k))+1),-1),95,50))+
    scale_fill_manual(values=hcl(head(seq(0,360,length .out=length(unique(t$k))+1),-1),120,60))+
    theme(
    aspect.ratio=1,
    axis.text.y=element_text(angle=90,vjust=1,hjust=.5 ),
    axis.text=element_text(color="black",size=6),
    axis.ticks.length=unit(0,"pt"),
    axis.ticks=element_blank(),
    axis.title=element_text(color="black",size=8),
    legend.position="none",
    panel.background=element_rect(fill="white"),
    panel.border=element_rect(color="gray80",fill=NA,s ize=.4),
    panel.grid.major=element_line(color="gray80",size= .2),
    plot.background=element_rect(fill="white")
    )

    ggsave("a.png",width=6,height=6)


    Making a heatmap of population averages in an ADMIXTURE run

    Code:
    library(pheatmap)
    library(colorspace) # for hex
    library(vegan) # for reorder.hclust

    t=read.table("uralaltaic.3.ave",row.names=1,header =F)

    # t=t[,c(1,3,2)]

    t=data.frame(aggregate(t,list(sub("\\.(DG|SDG|SG|W GA)","",rownames(t))),mean),row.names=1)

    pheatmap (
    100*t,
    filename="a.png",
    cluster_cols=F,
    show_colnames=F,
    clustering_callback=function(...)reorder(hclust(di st(t)),(1:ncol(t))%*%t(t)),
    legend=F,
    treeheight_row=80,
    cellwidth=16,
    cellheight=16,
    fontsize=10,
    border_color=NA,
    display_numbers=T,
    number_format="%.0f",
    fontsize_number=8,
    number_color="black",
    colorRampPalette(hex(HSV(c(210,210,130,60,40,20,0) ,c(0,.5,.5,.5,.5,.5,.5),1)))(256)
    )



    Making a ternary diagram of an ADMIXTURE run with 3 components

    Plot population averages with clustering based on a second K=8 ADMIXTURE run:

    Code:
    library(tidyverse)
    library(ggforce)
    library(ggrepel)

    f="uralaltaic"
    t=read.table(paste0(f,".3.ave"),row.names=1)

    t=data.frame(aggregate(t,list(sub("\\.(DG|SDG)","" ,rownames(t))),mean),row.names=1)

    tri=rbind(c(0,sqrt(3)/3),c(-.5,-sqrt(3)/6),c(.5,-sqrt(3)/6))
    xy=as.data.frame(as.matrix(t)%*%tri)

    grid=apply(rbind(c(1,2,3,2),c(1,3,2,3),c(2,1,3,1)) ,1,function(x)cbind(
    seq(tri[x[1],1],tri[x[2],1],length.out=11),
    seq(tri[x[1],2],tri[x[2],2],length.out=11),
    seq(tri[x[3],1],tri[x[4],1],length.out=11),
    seq(tri[x[3],2],tri[x[4],2],length.out=11)
    )%>%as.data.frame)%>%bind_rows

    k8=read.table(paste0(f,".8.ave"),row.names=1)
    xy$k=as.factor(cutree(hclust(dist(k8[rownames(t),])),k=10))

    expand=.04
    xlim=c(-.5-expand,.5+expand)
    ylim=c(-sqrt(3)/6-expand,sqrt(3)/3+expand)

    ggplot(xy,aes(x=V1,y=V2))+
    # geom_polygon(data=as.data.frame(tri),fill="gray95" )+ # uncomment to use a gray background for the triangle
    geom_segment(data=grid,aes(x=V1,y=V2,xend=V3,yend= V4),color="gray85",size=.3)+
    geom_mark_hull(aes(group=k,color=k,fill=k),concavi ty=100,radius=unit(.15,"cm"),expand=unit(.15,"cm") ,alpha=.2,size=.2)+
    geom_point(aes(color=k),size=.5)+
    geom_text_repel(aes(label=rownames(xy),color=k),ma x.overlaps=Inf,force=5,size=2.3,box.padding=.05,se gment.size=.3,min.segment.length=.15)+
    coord_fixed(expand=F,xlim=xlim,ylim=ylim)+
    theme(
    axis.text=element_blank(),
    axis.ticks=element_blank(),
    axis.title=element_blank(),
    legend.position="none",
    panel.background=element_rect(fill="white"),
    plot.margin=margin(0,0,0,0)
    )

    ggsave("a.png",width=7,height=7*diff(ylim)/diff(xlim))


    Plot individual samples with Voronoi tessellation:


    Code:
    library(tidyverse)
    library(ggforce)

    t=read.table("uralaltaic.3")
    rownames(t)=paste0(t[,2],":",t[,1])
    t=t[,-c(1,2)]

    tri=rbind(c(0,sqrt(3)/3),c(-.5,-sqrt(3)/6),c(.5,-sqrt(3)/6))
    xy=as.data.frame(as.matrix(t)%*%tri)

    grid=apply(rbind(c(1,2,3,2),c(1,3,2,3),c(2,1,3,1)) ,1,function(x)cbind(
    seq(tri[x[1],1],tri[x[2],1],length.out=11),
    seq(tri[x[1],2],tri[x[2],2],length.out=11),
    seq(tri[x[3],1],tri[x[4],1],length.out=11),
    seq(tri[x[3],2],tri[x[4],2],length.out=11)
    )%>%as.data.frame)%>%bind_rows

    pop=sub(":.*","",rownames(xy))
    pop=sub("\\.(DG|SDG|SG|WGA)","",pop)
    centers=aggregate(xy,by=list(pop),mean)
    xy$pop=pop

    set.seed(1234)
    color=as.factor(sample(seq(1,length(unique(xy$pop) ))))
    col=rbind(c(60,80),c(25,95),c(30,70),c(70,50),c(60 ,100),c(20,50),c(15,40))
    hues=max(ceiling(length(color)/nrow(col)),2)
    pal1=as.vector(apply(col,1,function(x)hcl(seq(15,3 75,length=hues+1)[1:hues],x[1],x[2])))
    pal2=as.vector(apply(col,1,function(x)hcl(seq(15,3 75,length=hues+1)[1:hues],ifelse(x[2]>50,.8*x[1],.2*x[1]),ifelse(x[2]>50,.3*x[2],100))))

    # add a small random factor to the coordinates; without this geom_voronoi_tile can fail because of too many overlapping points
    xy$V1=xy$V1+runif(nrow(xy))/1e5
    xy$V2=xy$V2+runif(nrow(xy))/1e5

    expand=.05
    xlim=c(-.5-expand,.5+expand)
    ylim=c(-sqrt(3)/6-expand,sqrt(3)/3+expand)

    ggplot(xy,aes(x=V1,y=V2,group=-1L))+ # the group needs to be a constant for geom_voronoi_tile; -1L is a convention
    # geom_polygon(data=as.data.frame(tri),fill="gray95" )+
    geom_segment(data=grid,aes(x=V1,y=V2,xend=V3,yend= V4),color="gray85",size=.3)+
    geom_voronoi_tile(aes(fill=color[as.factor(pop)],color=color[as.factor(pop)]),size=.07,max.radius=.025)+
    geom_label(data=centers,aes(x=V1,y=V2,label=Group. 1,color=color,fill=color),alpha=.7,size=2,label.r= unit(.07,"lines"),label.padding=unit(.07,"lines"), label.size=0)+
    coord_fixed(expand=F,xlim=xlim,ylim=ylim)+
    scale_fill_manual(values=pal1)+
    scale_color_manual(values=pal2)+
    theme(
    axis.text=element_blank(),
    axis.ticks=element_blank(),
    axis.title=element_blank(),
    legend.position="none",
    panel.background=element_rect(fill="white"),
    plot.margin=margin(0,0,0,0,"in")
    )

    ggsave("a.png",width=7,height=7*diff(ylim)/diff(xlim))



    Calculating FST distances with SmartPCA

    When the `fstonly: YES` parameter is specified, a matrix of FST values is printed to STDOUT as integers multiplied by thousand. `phylipoutname: fstfilename` saves an FST matrix to a file, but in the file the FST values have only three digits after the decimal point. The `fsthiprecision: YES` parameter causes the FST values that are printed to STDOUT to be multiplied by million instead of thousand, but it doesn't affect the contents of the `phylipoutname` file.

    In order to return FST values, the sixth field of the fam file needs to contain integers that identify population groups: https://www.biostars.org/p/266511/. The commands below use integers starting from 10 as group identifiers, because the numbers 1, 2, and 9 have a special meaning (1 assigns the line as a case, 2 assigns it as a control, and 9 ignores it).

    If an FST run includes more than 100 populations, SmartPCA exits with an error unless you include a parameter like `maxpops: 1000`.

    Code:
    x=uralic
    printf %s\\n Besermyan Enets Estonian Finnish Hungarian Karelian Mansi Mordovian Nganasan Saami.DG Selkup Udmurt Veps>$x.pop
    sed 1d v44.3_HO_public.anno|sort -t$'\t' -rnk15|awk -F\\t '!a[$3]++{print$2,$8}'|awk 'NR==FNR{a[$0];next}$2 in a' pop ->$x.pick
    plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x
    awk '!a[$2]++{i++}{print$1,i}' $x.pick|awk 'NR==FNR{a[$1]=$2;next}{$6=a[$2]+9}1' - $x.fam>$x.famtemp;mv $x.fam{temp,}
    smartpca -p <(printf %s\\n genotypename:\ $x.bed snpname:\ $x.bim indivname:\ $x.fam fstonly:\ YES phylipoutname:\ $x.fst fsthiprecision:\ YES)|tee $x.smartpca
    p=$(cut -d' ' -f2 $x.pick|awk '!a[$0]++');sed -n '/fst \*1000000/,/^$/p' $x.smartpca|sed 1,2d|sed \$d|tr -s ' ' ,|cut -d, -f3-|paste -d, <(echo "$p") -|cat <(printf %s\\n '' "$p"|paste -sd,) ->$x.fst

    I don't know if you're supposed to do LD pruning before calculating FST, because Kerminen et al. 2021 said this: "We calculated pairwise-FST between the reference groups (Fig 2) and the ancestor candidate groups (S9 Fig) using SmartPCA of EIGENSOFT package[7] (fstonly: YES, fsthiprecision: YES) and 56,661 LD-independent variants."

    Making an FST heatmap

    The default clustering method used by `pheatmap` is equivalent to `hclust(dist(t))`, but here the input is already a distance matrix, so you can use this instead: `clustering_callback=function(...)hclust(as.dist(t ))`. The code below additionally uses `vegan::reorder.hclust` to reorder the branches of the heatmap.

    Code:
    library(pheatmap)
    library(colorspace) # for hex
    library(vegan) # for reorder.hclust

    t=read.csv("uralic.fst",row.names=1)

    # recalculate population averages with suffixes like `.DG` removed
    # tav=function(x,y)data.frame(aggregate(x,list(y),me an),row.names=1)
    # rn=sub("\\.(DG|SDG|SG|WGA)","",rownames(t))
    # t=tav(t(tav(t,rn)),rn)

    # rand=sample(nrow(t),12);t=t[rand,rand]

    # find population pair with highest distance and calculate weights for sorting clusters
    max=which(t==max(t))[1]
    weight=t[,max%%nrow(t)]-t[,max%/%nrow(t)+1]

    pheatmap (
    t/1000,
    filename="a.png",
    legend=F,
    clustering_callback=function(...)reorder(hclust(as .dist(t)),weight),
    # clustering_callback=function(...)reorder(hclust(as .dist(t)),cmdscale(as.dist(t))[,1]), # sort by first dimension in MDS
    # clustering_callback=function(...)reorder(hclust(as .dist(t)),rowMeans(t)), # sort by average distance to other populations
    # clustering_callback=function(...)hclust(as.dist(t) ),
    cellwidth=18,
    cellheight=18,
    fontsize=9,
    treeheight_row=80,
    treeheight_col = 80,
    border_color=NA,
    display_numbers=T,
    number_format="%.0f",
    fontsize_number=9,
    number_color="black",
    colorRampPalette(hex(HSV(c(210,210,120,60,45,30,15 ,0),c(0,.5,.5,.5,.5,.5,.5,.5),1)))(256)
    )



    Compiling EIGENSOFT on macOS

    The original version of PLINK on the website hosted by Harvard is no longer updated, and the Mac binaries on the website by Harvard didn't work on my computer. However the website of the new maintainer had working Mac binaries: https://www.cog-genomics.org/plink/. The code in this thread uses PLINK 1.9 and not PLINK 2.

    There's Mac and GNU/Linux binaries for ADMIXTURE here: https://github.com/NovembreLab/admixture.

    There are Mac binaries for an old version of EIGENSOFT here: https://github.com/chrchang/eigensoft. However for example the old version of SmartPCA is missing the `lsqproject` and `autoshrink` options. Therefore I compiled a newer version of EIGENSOFT manually: https://www.hsph.harvard.edu/alkes-price/software/.

    I first ran `brew install gsl homebrew/science/openblas` as suggested in the README. It failed before I ran `brew uninstall --ignore-dependencies suite-sparse`.

    I initially got errors like this because the C code included implicit function declarations:

    Code:
    $ cd EIG-7.2.1/src
    $ make CFLAGS=' -I/usr/local/opt/openblas/include -I/usr/local/opt/gsl/include' LDFLAGS='-L/usr/local/opt/openblas/lib -L/usr/local/opt/gsl/lib'
    cc -I/usr/local/opt/openblas/include -I/usr/local/opt/gsl/include -I../include -I/usr/include/openblas -c -o convertf.o convertf.c
    convertf.c:257:5: error: implicit declaration of function 'setfamilypopnames' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
    setfamilypopnames (YES);
    ^
    1 error generated.
    make: *** [​convertf.o] Error 1

    I was able to compile everything successfully by adding function prototypes to the source files manually. For example I added this to `src/eigensrc/smartpca.c`:

    Code:
    void setid2pops (char *id2pops, Indiv **indivmarkers, int numindivs);
    void kjg_fpca (int numeigs, int fastdim, int fastiter, double *evals, double *evecs);

    However I later found an easier way to compile EIGENSOFT on Mac. I had earlier tried to run `make CC=gcc`, but it used `/usr/bin/gcc` which is actually `clang` on macOS. `brew install gcc` installs a real GCC to `/usr/local/bin/gcc-10`.

    First I ran `cd EIG-7.2.1/src/nicksrc;make`, because without it I got an error like this:

    Code:
    gauss.c:111 fatal error: ranmath.h: No such file or directory
    1 | #include "ranmath.h"
    | ^~~~~~~~~~~
    compilation terminated.
    make[1]: *** [​gauss.o] Error 1
    make: *** [nicksrc/libnick.a] Error 2

    I don't know why it worked, but I was then able to compile everything successfully by first running `make` by using GCC with the `-k` option (`make -k CC=gcc-10`) and then by using Clang (`make`).

    Short version:

    Code:
    # brew uninstall --ignore-dependencies suite-sparse
    # brew install gcc gsl homebrew/science/openblas
    wget github.com/DReichLab/EIG/archive/v7.2.1.tar.gz
    tar -xf EIG-7.2.1.tar.gz
    cd EIG-7.2.1/src/nicksrc
    make
    cd ..
    make -k CC=/usr/local/bin/gcc-10 CFLAGS='-I/usr/local/opt/openblas/include -I/usr/local/opt/gsl/include' LDFLAGS='-L/usr/local/opt/openblas/lib -L/usr/local/opt/gsl/lib'
    make CFLAGS='-I/usr/local/opt/openblas/include -I/usr/local/opt/gsl/include' LDFLAGS='-L/usr/local/opt/openblas/lib -L/usr/local/opt/gsl/lib'

    I uploaded Mac binaries for EIGENSOFT here: https://drive.google.com/file/d/1H8k...ew?usp=sharing.



  4. #4
    Regular Member Archetype0ne's Avatar
    Join Date
    11-06-18
    Posts
    715

    Y-DNA haplogroup
    J2b2-L283/FT29003

    Country: Albania



    Here's a series of K=3 ADMIXTURE runs visualized as ternary diagrams. The clustering is based on a distance matrix generated from the admixture percentages in a K=6 run. I included only samples with the suffix ".DG" or ".SDG". I removed duplicate samples based on the version ID. For pairs of samples with PI_HAT of .25 or higher, I excluded the sample whose name was alphabetically first. I included at most two samples per population, because most of the populations with the suffix ".DG" only include one or two samples, and I didn't want other populations to be overrepresented.

    In the first run, Maasai have 28% of the West Eurasian component, but only Capoids and Bambutids have 100% of the African component when rounded to the nearest integer. Iranians, Finns, Turks, and Adygei all have 6% of the East Eurasian component. Papuans have 72% of the East Eurasian component and 12% of the African component. Ust'-Ishim has 15% of the African component.

    In the second run, I removed SSAs and Saharans (Saharawi and Mozabite). Saami and Maaori are now outliers who are in their own cluster with no other member. The bottom right component is maximal in early farmers and Southern Europeans, but Northern Europeans additionally have a part of the American component. Mansi and Saami are clearly shifted towards Americans. The sample labeled Mixtec_2 is probably mixed with Europeans, because it has 25% of the West Eurasian component.

    The third run doesn't include Australians, Papuans, Americans, or Eskimos. Now Mansi are about half European and half East Asian, Kusunda are about half East Asian and half South Asian, and Makrani are about half South Asian and half European.

    The fourth run doesn't include East Asians or Southeast Asians (which I considered to include Mongols and Northeastern Chinese ethnic groups but not Uyghurs). Now Chukchi and Mansi have more of the East Eurasian component than the West Eurasian component, even though in the previous run they had more of the West Eurasian component. (There is only one Chukchi sample which is very western and not representative of unmixed Chukchi.) Iranians are about 29% South Asian and 71% European.

    In the fifth run, I removed South Asians and MENAs. I kept Hazara and Tajiks because I considered them to be Central Asian. Now Bulgarians have about half of the Caucasian component and half of the European component. Finns have the second highest proportion of the European component after Loschbour. Altaians have about 22% of the Caucasian component and 78% of the Siberian component.


    Below are also ternary plots for European ADMIXTURE runs with 3 components.

    The first run includes basically all European populations in the 1240K+HO dataset, except I didn't include Kalmyks or Nenetses, and I didn't include populations with the suffix ".SG" or ".WGA". I included at most 16 samples per population. The first run is interesting, because Lithuanians and Basques are almost equally close to the top of the triangle, so the top component seems like a WHG-like component. Finns have about 20-30% of the Nogai component and about 70-80% of the top component. Mordvins have on average 13% of the Caucasian component but Finns have 7%. In qpAdm and Vahaduo models I have made of Uralic peoples, Mordvins have also received more CHG ancestry than Baltic Finns.

    The second run doesn't include North Caucasians, Nogais, or Bashkirs. Now the samples that are intermediate between the Volga-Ural region and Southern Europe include not only Gagauzes but also Moldovans.




    The script below makes a stacked bar chart of population averages in an ADMIXTURE run.

    You can use the program `pong` to make a graph with thin bars for each individual: https://link.springer.com/protocol/1...-0716-0199-0_4, https://github.com/ramachandran-lab/pong.

    Code:
    library(tidyverse)
    library(colorspace)
    library(reshape2)

    t=read.table("euro.4.ave")

    t=t[,c(1,4,3,5,2)]

    t=aggregate(t[,-1],list(sub("\\.(DG|SDG)","",t[,1])),mean)
    names(t)[1]="V1"

    t=t[order(as.matrix(t[,-1])%*%seq(ncol(t)-1)^2),]

    t2=reshape2::melt(t,id.vars="V1") # this uses melt because pivot_longer doesn't preserve the order of columns

    lab=round(100*t2$value)
    lab[lab==0]=""

    ggplot(t2,aes(x=fct_rev(factor(V1,level=t$V1)),y=v alue,fill=variable))+
    geom_bar(stat="identity",width=1,position=position _fill(reverse=T))+
    geom_text(aes(label=lab),position=position_stack(v just=.5,reverse=T),size=3.5)+
    coord_flip()+
    scale_x_discrete(expand=c(0,0))+
    scale_y_discrete(expand=c(0,0))+
    scale_fill_manual(values=colorspace::hex(HSV(c(300 ,210,45,20),.5,1)))+ # manual colors
    # scale_fill_manual(values=colorspace::hex(HSV(head( seq(0,360,length.out=ncol(t)),-1),.5,1)))+ # automatic colors
    theme(
    axis.text=element_text(color="black",size=11),
    axis.text.x=element_blank(),
    axis.ticks=element_blank(),
    axis.title=element_blank(),
    legend.position="none"
    )

    ggsave("a.png",width=5,height=.2*nrow(t)+.3)





    This time I didn't exclude samples based on PI_HAT, because there was no pair of samples which had a PI_HAT value of .25 or above after LD pruning.

    Code:
    curl -LsO reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.tar;tar -xf v44.3_HO_public.tar
    f=v44.3_HO_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
    x = euro
    printf %s\\n Abazin Adygei Adygei.SDG Albanian Avar Balkar Bashkir Basque Basque.SDG Belarusian Besermyan Bulgarian Chechen Chuvash Circassian Cretan.DG Croatian Czech Darginian English Estonian Finnish Finnish.DG French French.SDG Gagauz Greek Hungarian Icelandic Ingushian Italian_North Italian_South Jew_Ashkenazi Kabardinian Kaitag Kalmyk Karachai Karelian Kumyk Lak Lezgin Lezgin.DG Lithuanian Maltese Moldavian Mordovian Nogai_Astrakhan Nogai_Karachay_Cherkessia Nogai_Stavropol Norwegian Norwegian.DG Orcadian Orcadian.SDG Ossetian Polish.DG Romanian Russian Russian.SDG Russian_Archangelsk_Krasnoborsky Russian_Archangelsk_Leshukonsky Russian_Archangelsk_Pinezhsky Saami.DG Sardinian Scottish Sicilian Spanish Spanish_North Tabasaran Tatar_Kazan Tatar_Mishar Udmurt Ukrainian Ukrainian_North Veps>$x.pop
    sed 1d v44.3_HO_public.anno|sort -t$'\t' -rnk15|awk -F\\t '!a[$3]++{print$2,$8}'|awk 'NR==FNR{a[$0];next}$2 in a' $x.pop ->$x.pick
    plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x
    plink --allow-no-sex --bfile $x --indep-pairwise 50 10 .05 --out $x;plink --bfile $x --extract $x.prune.in --make-bed --out $x.pruned
    k=4;admixture -j4 -C .1 $x.pruned.bed $k
    tav()(awk '{n[$1]++;for(i=2;i<=NF;i++)a[$1,i]+=$i}END{for(i in n){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i,j]/n[i]);print o}}' "FS=${1-$'\t'}")
    paste -d' ' <(awk 'NR==FNR{a[$1]=$2;next}{print$2,a[$2]}' $x.pick $x.pruned.fam) $x.pruned.$k.Q>$x.$k;cut -d' ' -f2- $x.$k|tav \ >$x.$k.ave

    You can get reasonable ADMIXTURE results even with ancient samples if you only include samples with a high SNP count:




    In the run above, I selected samples with over 500,000 SNPs, latitude over 30, longitude over -30, and years BP over 5,000. I removed duplicate samples by selecting unique samples by the version ID in the third field of the anno file.


    Code:
    curl -sO https://reichdata.hms.harvard.edu/pu...HO_public.anno
    igno()(grep -Ev '\.REF|rel\.|fail\.|Ignore_|_dup|_contam|_lc|_fath er|_mother|_son|_daughter|_brother|_sister|_siblin g|_twin|Neanderthal|Denisova|Vindija_light|Gorilla |Macaque|Marmoset|Orangutang|Primate_Chimp|hg19ref ')
    awk -F\\t '$15>5e5&&$11>30&&$12>-30&&$12<180&&$6>5000&&!a[$3]++{print$2,$8}' v44.3_HO_public.anno|igno

  5. #5
    Regular Member Archetype0ne's Avatar
    Join Date
    11-06-18
    Posts
    715

    Y-DNA haplogroup
    J2b2-L283/FT29003

    Country: Albania



    I now figured out how to combine a stacked bar chart with a dendrogram:

    Code:
    library(tidyverse)
    library(ggdendro)
    library(vegan)
    library(colorspace)

    t=read.csv("https://pastebin.com/raw/CAvFKS5L",row.names=1,header=F)
    fst=read.csv("https://pastebin.com/raw/d87iA1NH",row.names=1)

    t=t[,c(3,2,4,5,1,6)]
    names(t)=paste0("V",1:ncol(t))

    # do clustering based on FST
    fst=fst[rownames(t),rownames(t)]
    fst[fst<0]=0
    maxfst=which(fst==max(fst))[1]
    hc=reorder(hclust(as.dist(fst)),fst[,maxfst%%nrow(fst)]-fst[,maxfst%/%nrow(fst)+1])

    # do clustering based on ADMIXTURE
    # hc=hclust(dist(t),method="ward.D2")
    # hc=reorder(hc,wts=-as.matrix(t)%*%seq(ncol(t))^2)

    tree=ggdendro::dendro_data(as.dendrogram(hc),type= "triangle")

    p1=ggplot(ggdendro::segment(tree))+
    geom_segment(aes(x=y,y=x,xend=yend,yend=xend),line end="round",size=.5)+
    scale_x_continuous(expand=expansion(mult=c(0,.01)) )+
    scale_y_continuous(limits=.5+c(0,nrow(t)),expand=c (0,0))+
    theme(
    axis.text=element_blank(),
    axis.ticks=element_blank(),
    axis.ticks.length=unit(0,"pt"),
    axis.title=element_blank(),
    panel.background=element_rect(fill="white"),
    panel.grid=element_blank(),
    plot.margin=margin(5,5,5,0)
    )

    t=t[hc$labels[hc$order],]
    t2=data.frame(V1=rownames(t)[row(t)],V2=colnames(t)[col(t)],V3=unname(do.call(c,t)))
    lab=round(100*t2$V3)
    lab[lab==0]=""

    p2=ggplot(t2,aes(x=factor(V1,level=rownames(t)),y= V3,fill=V2))+
    geom_bar(stat="identity",width=1,position=position _fill(reverse=T))+
    geom_text(aes(label=lab),position=position_stack(v just=.5,reverse=T),size=3.5)+
    coord_flip()+
    scale_x_discrete(expand=c(0,0))+
    scale_y_discrete(expand=c(0,0))+
    scale_fill_manual(values=hex(HSV(c(40,180,90,140,2 70,330),.5,1)))+
    # scale_fill_manual(values=colorspace::hex(HSV(head( seq(0,360,length.out=ncol(t)+1),-1),.5,1)))+
    theme(
    axis.text=element_text(color="black",size=11),
    axis.text.x=element_blank(),
    axis.ticks=element_blank(),
    axis.title=element_blank(),
    legend.position="none",
    plot.margin=margin(5,0,5,5)
    )

    cowplot::plot_grid(p2,p1,rel_widths=c(1,.4))
    ggsave("a.png",height=.25*nrow(t),width=7)

    Below the clustering is based on an FST matrix generated by SmartPCA. The branches were sorted by first finding the pair of populations with the highest FST distance, which was Tofalars and Gagauzes, and then by arranging branches closer to Tofalars to the top and branches closer to Gagauzes to the bottom.



    BTW you can use the code below to manually find outliers in a SmartPCA run. It finds samples whose position on the first 8 eigenvectors is 3 SDs or more away from zero.

    Code:
    > t=read.table("smartpcarun.evec",row.names=1,skip=1 )
    > pop=read.table("v44.3_HO_public.ind",row.names=1)
    > ind=sub(".*:","",rownames(t))
    > lab=paste0(pop[ind,]$V3,":",ind)
    > lab[which(apply(abs(scale(t[,1:8]))>=3,1,any))]
    [1] "Altaian:ALT-876" "Gagauz:GAG-220"
    [3] "Gagauz:GAG-226" "Shor_Khakassia:KHS-001"
    [5] "Kazakh:KZH-1614" "Kazakh:KZH-1722"
    [7] "Shor_Mountain:Shor-262" "Tubalar:Tuba21"
    [9] "Tubalar:Tuba6" "Tubalar:Tuba24"
    [11] "Tubalar:Tuba9" "Tubalar:Tuba2"
    [13] "Turkish_Balikesir:Balikesir16790" "Turkish:Adana23108"
    [15] "Turkish:Adana23136" "Salar:SL1"
    [17] "Salar:SL2" "Salar:SL3"
    [19] "Salar:SL6" "Salar:SL7"
    [21] "Salar:SL9" "Salar:SL15"

  6. #6
    Regular Member Archetype0ne's Avatar
    Join Date
    11-06-18
    Posts
    715

    Y-DNA haplogroup
    J2b2-L283/FT29003

    Country: Albania



    The script in this post plots the output of SmartPCA with convex hulls drawn around the samples from each population. It uses a color palette with three different saturation-brightness combinations. It simultaneously makes plots for the first 8 PCs. It uses `ggforce::geom_mark_hull` to draw convex hulls with rounded corners, but you can change `concavity=100` to `concavity=3` to draw concave hulls instead. The plot area is always square so you can tile multiple plots nicely.

    Code:
    library(tidyverse)
    library(ggforce)
    library(ggrepel)
    library(colorspace)

    f="smartpcarunprefix"
    t=read.csv(paste0(f,".csv"),header=F,row.names=1)
    names(t)=paste0("PC",1:ncol(t))
    eig=head(as.double(readLines(paste0(f,".eval"))),n col(t))
    pct=paste0("PC",seq(length(eig))," (",sprintf("%.1f",100*eig/sum(eig)),"%)")

    t=t[-which(apply(abs(scale(t[,1:8]))>3,1,any)),] # remove outliers whose position is 3 or more SDs away from zero on one of the first eight PCs

    ranges=apply(t,2,function(x)abs(max(x)-min(x)))

    pop=sub(":.*","",rownames(t))
    pop=sub("\\.(DG|SDG|SG|WGA)","",pop)
    centers=aggregate(t,by=list(pop),mean)
    t$pop=pop

    set.seed(0)
    color=as.factor(sample(seq(1,length(unique(t$pop)) )))
    col = rbind (c (.5, .9), c (.25,1), c (.3, .8))
    hues=max(ceiling(length(color)/nrow(col)),2)
    pal1=as.vector(apply(col,1,function(x)hex(HSV(seq( 0,360,length=hues+1)[1:hues],x[1],x[2]))))
    pal2=as.vector(apply(col,1,function(x)hex(HSV(rep( 0,hues),0,ifelse(x[2]>=.7,0,1)))))

    for(i in seq(1,7,2)){
    p1=sym(paste0("PC",i))
    p2=sym(paste0("PC",i+1))

    maxrange=max(ranges[c(i,i+1)])
    lims=sapply(c(i,i+1),function(x)mean(range(t[,x]))+c(-.52*maxrange,.52*maxrange))

    ggplot(t,aes(!!p1,!!p2))+
    ggforce::geom_mark_hull(aes(group=pop),color=pal1[color[as.factor(pop)]],fill=pal1[color[as.factor(pop)]],concavity=100,radius=unit(.15,"cm"),expand=unit(. 15,"cm"),alpha=.2,size=.2)+
    geom_point(aes(x=!!p1,y=!!p2),color=pal1[color[as.factor(t$pop)]],size=.3)+
    geom_point(data=centers,aes(x=!!p1,y=!!p2),color=p al1[color],size=2)+
    geom_label_repel(data=centers,aes(x=!!p1,y=!!p2,la bel=Group.1),color=pal2[color],fill=pal1[color],force=3,size=2.2,label.r=unit(.1,"lines"),label.p adding=unit(.1,"lines"),label.size=0,max.overlaps= Inf,box.padding=.15,min.segment.length=0,segment.s ize=.4,segment.color=pal1[color],alpha=.8,direction="y")+
    labs(x=pct[i],y=pct[i+1])+
    coord_fixed(xlim=lims[,1],ylim=lims[,2],expand=F)+
    scale_x_continuous(breaks=seq(-1,1,.05))+
    scale_y_continuous(breaks=seq(-1,1,.05))+
    theme(
    axis.text.x=element_text(margin=margin(.2,0,0,0,"c m")),
    axis.text.y=element_text(angle=90,vjust=1,hjust=.5 ,margin=margin(0,.2,0,0,"cm")),
    axis.text=element_text(color="black",size=6),
    axis.ticks.length=unit(-.13,"cm"),
    axis.ticks=element_line(size=.2,color="black"),
    axis.title=element_text(color="black",size=8),
    legend.position="none",
    panel.background=element_rect(fill="gray35"),
    panel.border=element_rect(color="black",fill=NA,si ze=.4),
    panel.grid=element_blank(),
    plot.background=element_rect(fill="gray35")
    )

    f=paste0("pc",i,".png")
    ggsave(f,width=7,height=7)
    }



  7. #7
    Regular Member Archetype0ne's Avatar
    Join Date
    11-06-18
    Posts
    715

    Y-DNA haplogroup
    J2b2-L283/FT29003

    Country: Albania



    You can do a radial dendrogram based on an FST matrix like this. For a regular vertical dendrogram, just remove `type="fan"`. `font=1` disables using an italic font and `underscore=T` disables replacing underscores with spaces.

    Code:
    library(vegan) # for reorder.hclust
    library(ape) # for plot.phylo

    t=read.csv("https://pastebin.com/raw/d87iA1NH",row.names=1)
    t[t<0]=0

    hc=hclust(as.dist(t))

    max=which(t==max(t))[1]
    hc=reorder(hc,t[,max%%nrow(t)]-t[,max%/%nrow(t)+1])

    png("a.png",w=1200,h=1200)
    plot(as.phylo(hc),cex=1.5,font=1,underscore=T,type ="fan")
    dev.off()


  8. #8
    Regular Member Archetype0ne's Avatar
    Join Date
    11-06-18
    Posts
    715

    Y-DNA haplogroup
    J2b2-L283/FT29003

    Country: Albania



    The scripts in this post plot population averages in the output of SmartPCA, where each population is connected with a line to the three populations it has the lowest FST distance to. The first script draws hulls around the populations based on the FST matrix, and the second script sets the color of each population based on its FST distance to a specified population.

    Code:
    library(tidyverse)
    library(cowplot)

    f="smartpcarunprefix"
    t=read.csv(paste0(f,".ave"),header=F,row.names=1)
    names(t)=paste0("PC",1:ncol(t))
    eig=head(as.double(readLines(paste0(f,".eval"))),n col(t))
    pct=paste0("PC",seq(length(eig))," (",sprintf("%.1f",100*eig/sum(eig)),"%)")

    ranges=apply(t,2,function(x)abs(max(x)-min(x)))

    fst=read.csv(paste0(f,".fst"),row.names=1)[rownames(t),rownames(t)]
    t$k=cutree(hclust(as.dist(fst)),k=10)

    set.seed(0)
    color=as.factor(sample(seq(1,length(unique(t$k)))) )
    col = rbind (c (80,80), c (40,100), c (90,80))
    hues=max(ceiling(length(color)/nrow(col)),10)
    pal1=as.vector(apply(col,1,function(x)hcl(seq(0,36 0,length=hues+1)[1:hues],x[1],x[2])))

    plots=lapply(seq(1,3,2),function(xpc){
    seg0=lapply(1:3+1,function(i)apply(fst,1,function( x)unlist(t[names(sort(x)[i]),c(xpc,xpc+1)],use.names=F))%>%t%>%cbind(t[,c(xpc,xpc+1)]))
    seg=do.call(rbind,seg0)%>%setNames(paste0("V",1:4) )

    maxrange=max(ranges[c(xpc,xpc+1)])
    lims=sapply(c(xpc,xpc+1),function(x)mean(range(t[,x]))+c(-.52*maxrange,.52*maxrange))

    p1=sym(paste0("PC",xpc))
    p2=sym(paste0("PC",xpc+1))

    ggplot(t,aes(!!p1,!!p2))+
    ggforce::geom_mark_hull(aes(group=k),color=pal1[color[as.factor(t$k)]],fill=pal1[color[as.factor(t$k)]],concavity=100,radius=unit(.15,"cm"),expand=unit(. 15,"cm"),alpha=.2,size=.2)+
    geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V 4),color="gray70",size=.2)+
    geom_point(aes(x=!!p1,y=!!p2),color=pal1[color[as.factor(t$k)]],size=1)+
    geom_text(aes(x=!!p1,y=!!p2,label=rownames(t)),vju st=-.7,color=pal1[color[as.factor(t$k)]],size=2.2)+
    labs(x=pct[xpc],y=pct[xpc+1])+
    coord_fixed(xlim=lims[,1],ylim=lims[,2],expand=F)+
    scale_x_continuous(breaks=seq(-1,1,.02))+
    scale_y_continuous(breaks=seq(-1,1,.02))+
    theme(
    axis.text=element_text(color="black",size=6),
    axis.ticks=element_blank(),
    axis.ticks.length=unit(0,"pt"),
    axis.text.y=element_text(angle=90,vjust=1,hjust=.5 ),
    axis.title=element_text(color="black",size=8),
    legend.position="none",
    panel.background=element_rect(fill="gray40"),
    panel.border=element_rect(color="gray30",fill=NA,s ize=.4),
    panel.grid.major=element_line(color="gray33",size= .2),
    panel.grid.minor=element_blank(),
    plot.background=element_rect(fill="gray40",color=N A)
    )
    })

    plot_grid(plotlist=plots)
    ggsave("a.png",height=6,width=12)




    Code:
    library(tidyverse)
    library(cowplot)

    f="smartpcarunprefix"
    t=read.csv(paste0(f,".ave"),header=F,row.names=1)
    names(t)=paste0("PC",1:ncol(t))
    eig=head(as.double(readLines(paste0(f,".eval"))),n col(t))
    pct=paste0("PC",seq(length(eig))," (",sprintf("%.1f",100*eig/sum(eig)),"%)")

    ranges=apply(t,2,function(x)abs(max(x)-min(x)))

    fst=read.csv(paste0(f,".fst"),row.names=1)[rownames(t),rownames(t)]
    target="Mansi"
    dist=fst[,target]/1e6

    plots=lapply(seq(1,3,2),function(xpc){
    seg0=lapply(1:3+1,function(i)apply(fst,1,function( x)unlist(t[names(sort(x)[i]),c(xpc,xpc+1)],use.names=F))%>%t%>%cbind(t[,c(xpc,xpc+1)]))
    seg=do.call(rbind,seg0)%>%setNames(paste0("V",1:4) )

    maxrange=max(ranges[c(xpc,xpc+1)])
    lims=sapply(c(xpc,xpc+1),function(x)mean(range(t[,x]))+c(-.53*maxrange,.53*maxrange))

    p1=sym(paste0("PC",xpc))
    p2=sym(paste0("PC",xpc+1))

    ggplot(t,aes(!!p1,!!p2))+
    geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V 4),color="gray70",size=.2)+
    geom_point(aes(x=!!p1,y=!!p2,color=dist),size=.8)+
    geom_text(aes(x=!!p1,y=!!p2,label=rownames(t),colo r=dist),vjust=-.7,size=2.2)+
    labs(x=pct[xpc],y=pct[xpc+1])+
    coord_fixed(xlim=lims[,1],ylim=lims[,2],expand=F)+
    scale_x_continuous(breaks=seq(-1,1,.02))+
    scale_y_continuous(breaks=seq(-1,1,.02))+
    scale_color_gradientn(colors=hcl(c(240,220,200,170 ,120,70,30,0,320,280)+15,100,70),name=paste("FST distance to",target),breaks=seq(0,max(dist),.01),limits=c(0 ,max(dist)))+
    guides(color=guide_colorbar(ticks=F))+
    theme(
    axis.text=element_text(color="black",size=6),
    axis.ticks=element_line(size=.2,color="gray60"),
    axis.ticks.length=unit(-.13,"cm"),
    axis.text.x=element_text(margin=margin(.2,0,0,0,"c m")),
    axis.text.y=element_text(angle=90,vjust=1,hjust=.5 ,margin=margin(0,.2,0,0,"cm")),
    axis.title=element_text(color="black",size=8),
    legend.direction="horizontal",
    legend.justification="left",
    legend.key.height=unit(.5,"cm"),
    legend.key.width=unit(1,"cm"),
    legend.margin=margin(0,0,-7,0),
    legend.position=c(.02,.97),
    legend.spacing.y=unit(.05,"cm"),
    legend.text=element_text(size=6,vjust=.5),
    legend.title=element_text(size=8,vjust=.7),
    panel.background=element_rect(fill="white"),
    panel.border=element_rect(color="gray60",fill=NA,s ize=.4),
    panel.grid=element_blank()
    )
    })

    plot_grid(plots[[1]],plots[[2]]+theme(legend.position="none"))
    ggsave("a.png",height=6,width=12)



  9. #9
    Regular Member Archetype0ne's Avatar
    Join Date
    11-06-18
    Posts
    715

    Y-DNA haplogroup
    J2b2-L283/FT29003

    Country: Albania



    The script below uses the `ComplexHeatmap` package to combine heatmaps for ADMIXTURE runs at multiple K values: https://jokergoo.github.io/ComplexHe...eference/book/.

    The clustering is based on a matrix where the matrixes of each run have been joined into a single wide matrix. The clustering tree is ordered based on a K=3 run, where populations with a high proportion of the first component are moved to the top and populations with a high proportion of the third component are moved to the bottom.

    Code:
    library(ComplexHeatmap)
    library(circlize) # for colorRamp2
    library(colorspace) # for hex
    library(vegan) # for reorder.hclust (may be masked by seriation::reorder_hclust)

    quals = c (2,3,4,5,6,8)

    # columnorder=lapply(kvals,seq)
    columnorder=list(c(2,1),c(3,2,1),c(1,2,4,3),c(3,1, 5,4,2),c(3,2,4,5,1,6),c(2,3,1,7,4,8,6,5))

    mats=sapply(1:length(kvals),function(i)100*read.ta ble(paste0("turk.",kvals[i],".ave"),row.names=1)[,columnorder[[i]]])
    comb=do.call(cbind,mats)

    png("a.png",w=5000,h=5000,res=144)

    maps = sapply (kvals, function (k) {
    mat=as.matrix(mats[match(k,kvals)][[1]])
    Heatmap(
    mat,
    show_heatmap_legend=F,
    show_column_names=F,
    show_row_names=F,
    clustering_distance_rows="euclidean",
    width=ncol(mat)*unit(30,"pt"),
    height=nrow(mat)*unit(30,"pt"),
    row_dend_width=unit(200,"pt"),
    cluster_columns=F,
    cluster_rows=reorder(hclust(dist(comb)),mats[[2]][,3]-mats[[2]][,1]),
    column_title=paste0("K=",k),column_title_gp=gpar(f ontsize=24),
    right_annotation=rowAnnotation(text1=anno_text(gt_ render(rownames(mat),padding=unit(c(2,2,2,2),"mm") ),just="left",location=unit(0,"npc"),gp=gpar(fonts ize=17))),
    col=colorRamp2(seq(0,100,length.out=7),hex(HSV(c(2 10,210,130,60,40,20,0),c(0,rep(.5,6)),1))),
    cell_fun=function(j,i,x,y,w,h,fill)grid.text(sprin tf("%.0f",mat[i,j]),x,y,gp=gpar(fontsize=15))
    )
    })

    draw(Reduce("+",maps))
    dev.off()
    system("mogrify -gravity center -trim -border 16 -bordercolor white a.png")





    I also modified the ternary diagram script I posted earlier, so it can now be used to create a square diagram of a K=4 run or a hexagonal diagram of a K=6 run. However now there is no longer a one-to-one correspondence between the coordinates in the plot and the proportions of admixture components. For example in the square diagram, a sample in the center of the plot can now have either 25% of all four components or 50% of the top component and 50% of the bottom component.

    I now also found a good way to use `ggrepel` to avoid overlapping labels by setting `point.size=0` (which doesn't repel the labels away from points but only from other labels).

    Code:
    library(tidyverse)
    library(ggforce)
    library(ggrepel)

    t=read.table("https://pastebin.com/raw/V5RjCpmk") # K=4
    # t=read.table("https://pastebin.com/raw/1j20UVLD") # K=6

    row.names(t)=paste0(t[,2],":",t[,1])
    t=t[,-c(1,2)]

    tri=rbind(c(0,1),c(1,0),c(0,-1),c(-1,0)) # diamond (for K=4)
    # tri=rbind(c(1,1),c(1,-1),c(-1,-1),c(-1,1)) # square (for K=4)
    # tri=rbind(c(.5,sqrt(3)/2),c(1,0),c(.5,-sqrt(3)/2),c(-.5,-sqrt(3)/2),c(-1,0),c(-.5,sqrt(3)/2)) # hexagon (for K=6)

    xy=as.data.frame(as.matrix(t)%*%tri)

    grid=apply(rbind(c(1,2,4,3),c(1,4,2,3),c(1,2,1,4), c(3,2,3,4),c(4,1,4,3),c(2,1,2,3)),1,function(x)cbi nd(
    seq(tri[x[1],1],tri[x[2],1],length.out=11),
    seq(tri[x[1],2],tri[x[2],2],length.out=11),
    seq(tri[x[3],1],tri[x[4],1],length.out=11),
    seq(tri[x[3],2],tri[x[4],2],length.out=11)
    )%>%as.data.frame)%>%bind_rows

    # grid=as.data.frame(rbind(cbind(tri,rbind(tri[-1,],tri[1,])),cbind(tri,matrix(rep(0,12),ncol=2)))) # for hexagon

    pop=sub(":.*","",rownames(xy))
    pop=sub("\\.(DG|SDG|SG|WGA)","",pop)
    centers=aggregate(xy,by=list(pop),mean)
    xy$pop=pop

    set.seed(1234)
    color=as.factor(sample(seq(1,length(unique(xy$pop) ))))
    col=rbind(c(60,80),c(25,95),c(30,70),c(70,50),c(60 ,100),c(20,50),c(15,40))
    hues=max(ceiling(length(color)/nrow(col)),8)
    pal1=as.vector(apply(col,1,function(x)hcl(seq(15,3 75,length=hues+1)[1:hues],x[1],x[2])))
    pal2=as.vector(apply(col,1,function(x)hcl(seq(15,3 75,length=hues+1)[1:hues],ifelse(x[2]>50,.8*x[1],.2*x[1]),ifelse(x[2]>50,.3*x[2],100))))

    # add a small random factor so geom_voronoi_tile won't fail because of too many overlapping points
    xy$V1=xy$V1+runif(nrow(xy))/1e4
    xy$V2=xy$V2+runif(nrow(xy))/1e4

    ggplot(xy,aes(x=V1,y=V2,group=-1L))+
    geom_segment(data=grid,aes(x=V1,y=V2,xend=V3,yend= V4),color="gray90",size=.3)+
    geom_voronoi_tile(aes(fill=color[as.factor(pop)],color=color[as.factor(pop)]),size=.07,max.radius=.05)+
    # geom_point(data=centers,aes(x=V1,y=V2,color=color, fill=color),shape=21,size=2)+
    # geom_label(data=centers,aes(x=V1,y=V2,label=Group. 1,color=color,fill=color),alpha=.7,size=2,label.r= unit(.07,"lines"),label.padding=unit(.07,"lines"), label.size=0)+
    geom_label_repel(data=centers,aes(x=V1,y=V2,label= Group.1,color=color,fill=color),size=2.2,label.r=u nit(.1,"lines"),label.padding=unit(.1,"lines"),lab el.size=.1,max.overlaps=Inf,box.padding=0,point.si ze=0)+
    coord_fixed(xlim=c(-1.08,1.08),ylim=c(-1.08,1.08),expand=F)+
    scale_fill_manual(values=pal1)+
    scale_color_manual(values=pal2)+
    theme(
    axis.text=element_blank(),
    axis.ticks=element_blank(),
    axis.title=element_blank(),
    legend.position="none",
    panel.background=element_rect(fill="white"),
    plot.margin=margin(0,0,0,0,"cm")
    )

    ggsave("a.png",width=7,height=7)



    This recreates the data I used in this post:

    Code:
    curl -LsO reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.tar;tar -xf v44.3_HO_public.tar
    f=v44.3_HO_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
    Printfa% s \\ n Altaian Altaian_Chelkan Azeri Balkar Bashkir Chuvash Dolgan Gagauz Karachai Karakalpak Kazakh Kazakh_China Khakass Khakass_Kachin Kumyk Kyrgyz_China Kyrgyz_Kyrgyzstan Kyrgyz_Tajikistan Kyrgyz_Tajikstan Nogai Nogai_Astrakhan Nogai_krcy_Cherkessia Nogai_Stavropol Salar Shor_Khakassia Shor_mounten Tatar_Kazan Tatar_Mishar Tatar_saiberian Tatar_saiberian_Zabolotniye Todzin Tofalar Tubalar Turkish Turkish_Balikesir Turkmen Tuvinian Uyghur Uzbek Yakut> Pop
    x=turk;sed 1d v44.3_HO_public.anno|sort -t$'\t' -rnk15|awk -F\\t '!a[$3]++{print$2,$8}'|awk 'NR==FNR{a[$0];next}$2 in a' pop ->$x.temp.pick
    plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.temp.pick v44.3_HO_public.fam) --make-bed --out $x.temp
    plink --allow-no-sex --bfile $x.temp --genome --out $x
    awk 'FNR>1&&$10>=.25{print$2<$4?$2:$4}' $x.genome|awk 'NR==FNR{a[$0];next}!($1 in a)' - $x.temp.pick>$x.pick
    plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x
    plink --allow-no-sex --bfile $x --indep-pairwise 50 10 .05 --out $x
    plink --bfile $x --extract $x.prune.in --make-bed --out $x.pruned
    tav()(awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1][i]+=$i}}END{for(i in a){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i][j]/n[i]);print o}}' "FS=${1-$'\t'}")
    for k in 2 3 4 5 6 8;do admixture -j4 -C .1 $x.pruned.bed $k;paste -d' ' <(awk 'NR==FNR{a[$1]=$2;next}{print$2,a[$2]}' $x.pick $x.pruned.fam) $x.pruned

  10. #10
    Regular Member Archetype0ne's Avatar
    Join Date
    11-06-18
    Posts
    715

    Y-DNA haplogroup
    J2b2-L283/FT29003

    Country: Albania



    The scripts in this post visualize a series of ADMIXTURE runs as polygons where each corner of the polygon represents one admixture component.

    The second script concatenates the matrices of all K values into a single wide matrix, which is then used to cluster the populations and to connect each population to its three closest neighbors.

    Individual samples, light color scheme:

    Code:
    library(tidyverse)
    library(ggforce)
    library(ggrepel)

    for(n in 3:8){
    t=read.table(paste0("euro5.i.",n))%>%set_rownames( paste0(.[,2],":",.[,1]))%>%select(-c(1,2))

    t=t[,list(c(2,1,3),c(4,3,2,1),c(2,5,4,3,1),c(4,1,2,3,6 ,5),c(1,7,5,3,2,6,4),c(2,3,7,4,1,8,5,6))[[n-2]]]

    corners=sapply(c(sin,cos),function(x)head(x(seq(0, 2,length.out=n+1)*pi),-1))
    corners=corners*min(2/diff(apply(corners,2,range)))
    corners[,2]=corners[,2]-mean(range(corners[,2]))

    xy=as.data.frame(as.matrix(t)%*%corners)
    grid=as.data.frame(rbind(cbind(corners,rbind(corne rs[-1,],corners[1,])),cbind(corners,matrix(apply(corners,2,mean),ncol =2,nrow=n,byrow=T))))

    pop=rownames(xy)%>%sub(":.*","",.)%>%sub("\\.(DG|S DG|SG|WGA)","",.)
    centers=aggregate(xy,by=list(pop),mean)
    xy$pop=pop

    set.seed(0)
    color=as.factor(sample(seq(1,length(unique(xy$pop) ))))
    cl=rbind(c(60,80),c(25,95),c(30,70),c(70,50),c(60, 100),c(20,50),c(15,40))
    hues=max(ceiling(length(color)/nrow(cl)),2)
    pal1=as.vector(apply(cl,1,function(x)hcl(head(seq( 15,375,length=hues+1),-1),x[1],x[2])))
    pal2=as.vector(apply(cl,1,function(x)hcl(head(seq( 15,375,length=hues+1),-1),ifelse(x[2]>=60,.5*x[1],.1*x[1]),ifelse(x[2]>=60,.2*x[2],95))))

    xy$V1=xy$V1+runif(nrow(xy))/1e3
    xy$V2=xy$V2+runif(nrow(xy))/1e3

    lims=c(-1,1)*1.08

    ggplot(xy,aes(x=V1,y=V2))+
    geom_segment(data=grid,aes(x=V1,y=V2,xend=V3,yend= V4),color="gray85",size=.3)+
    geom_voronoi_tile(aes(group=0,fill=color[as.factor(pop)],color=color[as.factor(pop)]),size=.07,max.radius=.055)+
    geom_label_repel(data=centers,aes(x=V1,y=V2,label= Group.1,color=color,fill=color),max.overlaps=Inf,p oint.size=0,size=2.3,alpha=.8,label.r=unit(.1,"lin es"),label.padding=unit(.1,"lines"),label.size=.1, box.padding=0,segment.size=.3)+
    coord_fixed(xlim=lims,ylim=lims,expand=F)+
    scale_fill_manual(values=pal1)+
    scale_color_manual(values=pal2)+
    theme(
    axis.text=element_blank(),
    axis.ticks=element_blank(),
    axis.title=element_blank(),
    legend.position="none",
    panel.background=element_rect(fill="white")
    )

    ggsave(paste0(n,".png"),width=7,height=7)
    }



    Population averages, dark color scheme:

    Code:
    library(tidyverse)
    library(ggforce)
    library(ggrepel)

    for(k in 3:8){
    t=read.table(paste0("hqhg19.",k,"a"),row.names=1)
    rownames(t)%<>%sub("\\.(DG|SDG|SG|WGA)","",.)

    t=t[,list(c(1,2,3),c(2,3,4,1),c(2,5,1,4,3),c(1,5,4,3,6 ,2),c(4,5,7,3,1,2,6),c(5,3,1,8,6,4,7,2))[[k-2]]]

    corners=sapply(c(sin,cos),function(x)head(x(seq(0, 2,length.out=k+1)*pi),-1))
    corners=corners*min(2/diff(apply(corners,2,range)))
    corners[,2]=corners[,2]-mean(range(corners[,2]))

    xy=as.data.frame(as.matrix(t)%*%corners)
    grid=as.data.frame(rbind(cbind(corners,rbind(corne rs[-1,],corners[1,])),cbind(corners,matrix(apply(corners,2,mean),ncol =2,nrow=k,byrow=T))))

    joined=sapply(2:8,function(i)read.table(paste0("hq hg19.",i,"a"))[,-1])%>%do.call(cbind,.)%>%set_rownames(rownames(t))
    dist=as.data.frame(as.matrix(dist(joined)))
    seg=lapply(1:4,function(i)apply(dist,1,function(x) unlist(xy[names(sort(x)[i]),],use.names=F))%>%t%>%cbind(xy))%>%do.call(rbind,.) %>%setNames(paste0("V",1:4))
    xy$k=as.factor(cutree(hclust(dist(joined)),16))

    set.seed(1488)
    color=as.factor(sample(seq(length(unique(xy$k)))))
    cl=rbind(c(50,90),c(100,80))
    hues=max(ceiling(length(color)/nrow(cl)),8)
    pal1=as.vector(apply(cl,1,function(x)hcl(head(seq( 15,375,length=hues+1),-1),x[1],x[2])))

    xy$V1=xy$V1+runif(nrow(xy))/1e3
    xy$V2=xy$V2+runif(nrow(xy))/1e3

    expand=c(.08,.02)

    ggplot(xy,aes(x=V1,y=V2))+
    geom_polygon(data=as.data.frame(corners),fill="gra y40")+
    geom_segment(data=grid,aes(x=V1,y=V2,xend=V3,yend= V4),color="gray50",size=.5)+
    geom_mark_hull(aes(group=k,color=k,fill=k),concavi ty=1000,radius=unit(.3,"cm"),expand=unit(.3,"cm"), alpha=.2,size=.15)+
    geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V 4),color="gray20",size=.3)+
    geom_point(aes(color=k),size=.5)+
    geom_text_repel(aes(label=rownames(xy),color=k),ma x.overlaps=Inf,force=3,force_pull=2,size=2.3,segme nt.size=.2,min.segment.length=.2,box.padding=.05)+
    coord_fixed(xlim=(1+expand[1])*c(-1,1),ylim=(1+expand[2])*c(-1,1))+
    scale_fill_manual(values=pal1)+
    scale_color_manual(values=pal1)+
    theme(
    axis.text=element_blank(),
    axis.ticks=element_blank(),
    axis.title=element_blank(),
    legend.position="none",
    panel.background=element_rect(fill="gray30"),
    panel.grid=element_blank(),
    plot.background=element_rect(fill="gray30",color=N A,size=0),
    plot.margin=margin(0,0,0,0)
    )

    ggsave(paste0("t/",k,".png"),width=7,height=7/(Reduce(`/`,1+expand)))
    }



  11. #11
    Regular Member Archetype0ne's Avatar
    Join Date
    11-06-18
    Posts
    715

    Y-DNA haplogroup
    J2b2-L283/FT29003

    Country: Albania



    This joins the Q matrices of runs at different K values into a single wide matrix and uses it to cluster populations. The clustering tree is cut into 12 groups which are used to color-code the populations. (I also tried to do the clustering based on an FST matrix, but it made less sense, so I commented out the code for clustering based on FST.)

    The branches of the clustering tree are reordered based on the pair of populations with the highest distance, which was Turkish_Balikesir and Nganasan in this case, so that branches closer to Turkish_Balikesir are moved to the top and branches closer to Nganasan are moved to the bottom.

    Code:
    library(tidyverse)
    library(ggdendro)
    library(vegan)
    library(colorspace)
    library(cowplot)

    t=read.table("https://pastebin.com/raw/FEwYnBNb",row.names=1) # population averages from ADMIXTURE with population names in first column

    t=t[,c(3,2,5,1,4)] # reorder columns (change if your input does not have five columns)
    names(t)=paste0("V",1:ncol(t))

    # do clustering based on a combined matrix of admixture weights at different K values
    joined=sapply(3:8,function(i)read.table(paste0("ur alaltaic.",i,".ave"))[,-1])%>%do.call(cbind,.)%>%set_rownames(rownames(t))
    hc=hclust(dist(joined))
    hc=reorder(hc,-as.matrix(t)%*%seq(ncol(t))^2)
    dist=as.matrix(dist(joined))
    maxdist=which(dist==max(dist))[1]
    hc=reorder(hc,dist[,maxdist%%nrow(dist)]-dist[,maxdist%/%nrow(dist)+1])

    # fst=read.csv("https://pastebin.com/raw/ktMkDf24",row.names=1)[rownames(t),rownames(t)]
    # fst[fst<0]=0
    # maxfst=which(fst==max(fst))[1] # reorder branches based on distance to the pair of populations with the highest FST distance
    # hc=reorder(hclust(as.dist(fst)),fst[,maxfst%%nrow(fst)]-fst[,maxfst%/%nrow(fst)+1])
    # # hc=reorder(hclust(dist(t)),-as.matrix(t)%*%exp(seq(ncol(t)))) # reorder branches based on the order of the bars

    k=as.factor(cutree(hc,12))[hc$labels[hc$order]]

    tree=ggdendro::dendro_data(as.dendrogram(hc),type= "triangle")

    p1=ggplot(ggdendro::segment(tree))+
    geom_segment(aes(x=y,y=x,xend=yend,yend=xend),size =.5,lineend="round",color="gray85")+ # `lineend="round"` draws corners properly when not using `type="triangle"`
    scale_x_continuous(expand=expansion(mult=c(0,.01)) )+ # the small expansion prevents cropping a few pixels from the right border of the tree
    scale_y_continuous(limits=.5+c(0,nrow(t)),expand=c (0,0))+
    theme(
    axis.text=element_blank(),
    axis.ticks=element_blank(),
    axis.ticks.length=unit(0,"pt"), # remove extra space normally occupied by tick marks
    axis.title=element_blank(),
    panel.background=element_rect(fill="gray30"),
    panel.grid=element_blank(),
    plot.background=element_rect(fill="gray30",color=N A), # `color=NA` removes a thin white border around the plot
    plot.margin=margin(5,5,5,0)
    )

    t=t[hc$labels[hc$order],]
    t2=data.frame(V1=rownames(t)[row(t)],V2=colnames(t)[col(t)],V3=unname(do.call(c,t))) # an alternative to `pivot_longer` and `melt`
    lab=round(100*t2$V3)
    lab[lab<=1]=""

    p2=ggplot(t2,aes(x=factor(V1,level=rownames(t)),y= V3,fill=V2))+
    geom_bar(stat="identity",width=1,position=position _fill(reverse=T),size=.2,color="gray10")+
    geom_text(aes(label=lab),position=position_stack(v just=.5,reverse=T),size=3.5,color="gray10")+
    coord_flip()+
    scale_x_discrete(expand=c(0,0))+
    scale_y_discrete(expand=c(0,0))+
    scale_fill_manual(values=colorspace::hex(HSV(c(30, 210,250,310,0),.4,.9)))+
    theme(
    axis.text=element_text(color=hex(HSV(seq(0,360,len gth.out=n_distinct(k)+1)%>%head(-1),.4,1))[k],size=11),
    axis.text.x=element_blank(),
    axis.ticks=element_blank(),
    axis.title=element_blank(),
    legend.position="none",
    plot.background=element_rect(fill="gray30",color=N A),
    panel.background=element_rect(fill="gray30"),
    plot.margin=margin(5,0,5,5)
    )


    cowplot::plot_grid(p2,p1,rel_widths=c(1,.4))
    ggsave("a.png",height=.27*nrow(t),width=7)


  12. #12
    Regular Member Archetype0ne's Avatar
    Join Date
    11-06-18
    Posts
    715

    Y-DNA haplogroup
    J2b2-L283/FT29003

    Country: Albania



    David Bush told me about this Perl script which uses R to draw a circular bar chart of an ADMIXTURE run: https://github.com/QidiFeng/Ancestry...stryPainter.pl, https://www.sciencedirect.com/scienc...72022918304273.

    I now figured out how to draw similar graphs using the Circlize package: https://jokergoo.github.io/circlize_book/book/.

    Display two K values, do clustering based on a combined matrix of runs at different K values:

    Code:
    library(circlize)
    library(vegan) # for reorder.hclust
    library(dendextend) # for color_branches (may be masked by the package `seriation`)

    f="uralaltaic.i"
    quals = c (3.5)
    columnorder=list(c(1,2,3),c(1,4,5,3,2))

    labelcolor=hcl(c(260,120,60,0,210,160,310,30)+15,6 0,70)
    barcolor=list(hcl(c(310,120,210)+15,60,70),hcl(c(3 10,260,120,60,210)+15,60,70))

    mats=sapply(kvals,function(x)read.table(paste0(f," .",x,"a"),r=1)[,columnorder[lapply(columnorder,length)==x][[1]]])

    joined=do.call(cbind,sapply(Sys.glob(paste0(f,".[0-9]a")),function(x)read.table(x,r=1)))
    hc=hclust(dist(joined))
    hc=reorder(hc,mats[[1]][,3]-mats[[1]][,1])
    rows=nrow(joined)

    labels=hc$labels[hc$order]
    cut=cutree(hc,8)
    dend=color_branches(as.dendrogram(hc),k=length(uni que(cut)),col=labelcolor[unique(cut[labels])])

    circos.clear()
    png("a.png",w=2500,h=2500,res=300)
    circos.par(cell.padding=c(0,0,0,0))
    circos.initialize(0,xlim=c(0,rows))

    circos.track(ylim=c(0,1),track.height=.2,track.mar gin=c(0,0),bg.border=NA,
    panel.fun=function(x,y)for(i in 1:rows)circos.text(i-.5,0,labels[i],adj=c(0,.5),facing="clockwise",niceFacing=T,cex=. 65,col=labelcolor[cut[labels[i]]]))

    for(i in length(mats):1)
    circos.track(ylim=c(0,1),track.height=.2,track.mar gin=c(0,.01),bg.border=NA,
    panel.fun=function(x,y)circos.barplot(as.matrix(ma ts[[i]])[hc$order,],-.5+1:rows,col=barcolor[[i]],border="gray20",bar_width=1,lwd=.3))

    circos.track(ylim=c(0,attr(dend,"height")),track.h eight=.35,track.margin=c(0,0),bg.border=NA,
    panel.fun=function(x,y)circos.dendrogram(dend))

    circos.clear()
    dev.off()

    Display one K value, do clustering based on a single K value, show percentages inside bars:

    Code:
    library(circlize)
    library(vegan) # for reorder.hclust (may be masked by the package `seriation`)
    library(dendextend) # for color_branches

    t=read.table(text="Kalmyk 0.119357 0.725057 0.000010 0.037803 0.117774
    Kyrgyz_China 0.039367 0.512079 0.230150 0.095038 0.123366
    Altaian_Chelkan 0.034095 0.000010 0.919478 0.000010 0.046407
    Azeri 0.051638 0.004671 0.010727 0.902646 0.030318
    Uzbek 0.102725 0.273261 0.001854 0.452126 0.170033
    Salar 0.000010 0.539636 0.460334 0.000010 0.000010
    Tatar_Kazan 0.113456 0.057026 0.000010 0.099336 0.730171
    Tatar_Siberian 0.251376 0.221389 0.000010 0.077505 0.449721
    Finnish 0.007214 0.000010 0.000010 0.015174 0.977592
    Yakut 0.505434 0.473202 0.000010 0.002914 0.018440
    Mansi 0.572791 0.000010 0.000010 0.000010 0.427179
    Altaian 0.222424 0.335614 0.358801 0.032694 0.050468
    Shor_Mountain 0.233984 0.000010 0.724596 0.000010 0.041400
    Chuvash 0.180171 0.011056 0.000010 0.006462 0.802301
    Enets 0.920409 0.000010 0.000010 0.000010 0.079561
    Yukagir_Tundra 0.710359 0.289611 0.000010 0.000011 0.000010
    Kyrgyz_Tajikistan 0.104000 0.563708 0.000010 0.125799 0.206483
    Khakass_Kachin 0.254253 0.416760 0.174200 0.005262 0.149525
    Tuvinian 0.448940 0.448899 0.000010 0.031803 0.070348
    Besermyan 0.209841 0.001487 0.000010 0.000460 0.788202
    Nogai_Astrakhan 0.062497 0.463590 0.000010 0.183203 0.290701
    Hours 0.725173 0.257670 0.000010 0.005836 0.011312
    Kazakh 0.067027 0.518213 0.087979 0.114550 0.212231
    Tofalar 0.815599 0.110299 0.000010 0.009693 0.064398
    Karakalpak 0.009983 0.316964 0.389103 0.158275 0.125676
    Estonian 0.000010 0.000010 0.000010 0.004409 0.995561
    Dolgan 0.694025 0.255361 0.000010 0.049624 0.000979
    Tatar_Siberian_Zabolotniye 0.521637 0.020132 0.000010 0.000010 0.458212
    Uyghur 0.043578 0.486742 0.000010 0.318983 0.150687
    Udmurt 0.256391 0.000010 0.001010 0.000010 0.742579
    Evenk_FarEast 0.241328 0.606202 0.000010 0.000010 0.152451
    Selkup 0.804662 0.000010 0.000010 0.000010 0.195308
    Kumyk 0.060751 0.000112 0.000010 0.823905 0.115222
    Hungarian 0.000010 0.000010 0.000010 0.244311 0.755659
    Tubalar 0.159517 0.009457 0.802778 0.000010 0.028238
    Turkmen 0.123631 0.226543 0.000010 0.529793 0.120023
    Karelian 0.012854 0.000010 0.000010 0.000010 0.987116
    Kazakh_China 0.074285 0.573009 0.152931 0.069362 0.130412
    Mongol 0.033174 0.847004 0.025135 0.005178 0.089509
    Cycle 0.000010 0.995215 0.000010 0.000010 0.004755
    Evenk_Transbaikal 0.611414 0.388556 0.000010 0.000010 0.000010
    Nogai_Karachay_Cherkessia 0.119988 0.120774 0.000010 0.617261 0.141967
    Veps 0.026887 0.000010 0.000010 0.000010 0.973083
    Even 0.441349 0.278457 0.000010 0.015239 0.264946
    Nganasan 0.999960 0.000010 0.000010 0.000010 0.000010
    Bashkir 0.114088 0.056493 0.251488 0.030390 0.547542
    Xibo 0.000010 0.985541 0.000010 0.000362 0.014077
    Khakass 0.202707 0.171413 0.530905 0.007675 0.087300
    Shor_Khakassia 0.258218 0.000010 0.694889 0.000010 0.046873
    Nanai 0.105903 0.894067 0.000010 0.000010 0.000010
    Buryat 0.064420 0.848458 0.017066 0.001696 0.068360
    Yukagir_Forest 0.379416 0.096266 0.000010 0.003580 0.520728
    Karachai 0.067138 0.004534 0.000010 0.798982 0.129336
    Mordovian 0.022303 0.001193 0.000010 0.025251 0.951243
    Turkish_Balikesir 0.092314 0.038550 0.000010 0.804964 0.064163
    Turkish 0.040918 0.012255 0.000010 0.873179 0.073639
    Kyrgyz_Kyrgyzstan 0.090129 0.607265 0.000010 0.122885 0.179711
    Balkar 0.075115 0.000010 0.000010 0.829730 0.095136
    Gagauz 0.000010 0.027887 0.015891 0.601619 0.354593
    Nogai_Stavropol 0.070584 0.403817 0.000010 0.244701 0.280888
    Negidal 0.248518 0.751452 0.000010 0.000010 0.000010
    Tatar_Mishar 0.066112 0.037441 0.010377 0.138008 0.748062",row.names=1)

    hc=hclust(dist(t))
    hc=reorder(hc,-(t[,1]+t[,2]-t[,4]-2*t[,5]))

    labelcolor=hcl(c(260,90,120,60,0,210,180,310)+15,6 0,70)
    barcolor=hcl(c(310,260,120,60,210)+15,60,70)

    labels=hc$labels[hc$order]
    cut=cutree(hc,8)
    dend=color_branches(as.dendrogram(hc),k=length(uni que(cut)),col=labelcolor[unique(cut[labels])])
    t=t[hc$labels[hc$order],]

    circos.clear()
    png("a.png",w=2500,h=2500,res=300)
    circos.par(cell.padding=c(0,0,0,0),gap.degree=5,po ints.overflow.warning=F)
    circos.initialize("a",xlim=c(0,nrow(t)))

    circos.track(ylim=c(0,1),bg.border=NA,track.height =.28,track.margin=c(.01,0),
    panel.fun=function(x,y)for(i in 1:nrow(t))circos.text(i-.5,0,labels[i],adj=c(0,.5),facing="clockwise",niceFacing=T,cex=. 65,col=labelcolor[cut[labels[i]]]))

    circos.track(ylim=c(0,1),track.margin=c(0,0),track .height=.35,bg.lty=0,panel.fun=function(x,y){
    mat=as.matrix(t)
    pos=1:nrow(mat)-.5
    barwidth=1
    for(i in 1:ncol(mat)){
    seq1=rowSums(mat[,seq(i-1),drop=F])
    seq2=rowSums(mat[,seq(i),drop=F])
    circos.rect(pos-barwidth/2,if(i==1){0}else{seq1},pos+barwidth/2,seq2,col=barcolor[i],border="gray20",lwd=.1)
    }
    for(i in 1:ncol(mat)){
    lab=round(100*t[,i])
    lab[lab<=1]=""
    seq1=rowSums(mat[,seq(i-1),drop=F])
    seq2=rowSums(mat[,seq(i),drop=F])
    circos.text(pos,if(i==1){seq1/2}else{seq1+(seq2-seq1)/2},labels=lab,col="gray10",cex=.4,facing="downward ")
    }
    })

    circos.track(ylim=c(0,attr(dend,"height")),bg.bord er=NA,track.margin=c(0,.0015),track.height=.35,pan el.fun=function(x,y)circos.dendrogram(dend))

    circos.clear()
    dev.off()


  13. #13
    Regular Member Archetype0ne's Avatar
    Join Date
    11-06-18
    Posts
    715

    Y-DNA haplogroup
    J2b2-L283/FT29003

    Country: Albania



    The script below displays the FST distance to one population on the x axis and to another population on the y axis. The clustering is based on the full FST matrix. Each population is also connected with a line to the two populations it has the lowest FST distance to.

    Code:
    library(tidyverse)
    library(ggforce) # for geom_mark_hull

    t=read.csv("https://pastebin.com/raw/Ja3qPUq3",row.names=1,check.names=F)/1e6

    tav=function(x,y)data.frame(aggregate(x,list(y),me an),row.names=1,check.names=F)
    pops=sub("\\.(DG|SG|SDG|WGA)","",rownames(t))
    t=tav(t(tav(t,pops)),pops)

    igno=rownames(t)%in%c("Naxi","Chukchi","Samaritan" ,"Itelmen","AA","Pima","Chukchi1","Atayal")
    t=t[!igno,!igno]

    xpop="Estonian"
    ypop="Han"
    xy=t[,c(xpop,ypop)]

    seg=lapply(1:3,function(i)apply(t,1,function(x)unl ist(xy[names(sort(x)[i]),],use.names=F))%>%t%>%cbind(xy))%>%do.call(rbind,.) %>%setNames(paste0("V",1:4))

    expand=.004
    lims=c(-expand,max(apply(xy,2,range))+expand)

    names(xy)=c("x","y")
    xy$k=as.factor(cutree(hclust(as.dist(t)),16))

    lab=sub("^0","",sprintf("%.2f",seq(0,1,.01)))
    lab[1]="0"

    ggplot(xy,aes(x,y))+
    geom_abline(linetype="dashed",color="gray80",size= .3)+
    geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V 4),color="gray20",size=.1)+
    geom_mark_hull(aes(color=k,fill=k),concavity=1000, radius=unit(.15,"cm"),expand=unit(.15,"cm"),alpha= .2,size=.15)+
    # geom_polygon(data=xy%>%group_by(k)%>%slice(chull(x ,y)),alpha=.2,aes(color=k,fill=k),size=.3)+ # use convex hulls without rounded corners
    geom_point(aes(color=k),size=.5)+
    geom_text(aes(color=k),label=rownames(xy),size=2,v just=-.7)+
    # geom_text_repel(aes(color=k),label=rownames(xy),si ze=2,max.overlaps=Inf,force=2,segment.size=.2,min. segment.length=.2,box.padding=.05)+
    coord_cartesian(xlim=lims,ylim=lims,expand=F)+
    scale_x_continuous(breaks=seq(0,1,.01),labels=lab) +
    scale_y_continuous(breaks=seq(0,1,.01),labels=lab) +
    labs(x=paste("FST distance to",xpop),y=paste("FST distance to",ypop))+
    theme(
    axis.text=element_text(size=6),
    axis.text.y=element_text(angle=90,vjust=1,hjust=.5 ),
    axis.ticks=element_blank(),
    axis.ticks.length=unit(0,"cm"),
    axis.title=element_text(size=8),
    legend.position="none",
    panel.background=element_rect(fill="white"),
    panel.border=element_rect(color="gray80",fill=NA,s ize=.6),
    panel.grid.major=element_line(color="gray90",size= .2)
    )

    ggsave("a.png",width=7,height=7)

    It is also possible to use the FST matrix to do MDS (which is basically a version of PCA that takes a distance matrix as an input):

    Code:
    library(tidyverse)
    library(ggforce) # for geom_mark_hull
    library(colorspace) # for hex

    t=read.csv("https://pastebin.com/raw/Ja3qPUq3",row.names=1,check.names=F)/1e6

    igno=rownames(t)%in%c("Naxi.DG","Chukchi.DG","Sama ritan.DG","AA","Pima","Chukchi1","Atayal");t=t[!igno,!igno]

    tav=function(x,y)data.frame(aggregate(x,list(y),me an),row.names=1,check.names=F)
    rn2=sub("\\.(DG|SG|SDG|WGA)","",rownames(t));t=tav (t(tav(t,rn2)),rn2)

    ncol=20
    mds=cmdscale(as.dist(t),k=ncol,eig=T)
    p=as.data.frame(mds$points)
    eig=head(mds$eig,ncol)
    pct=paste0("MDS dimension ",1:ncol," (",sprintf("%.1f",100*eig/sum(eig)),"%)")

    ranges=apply(p,2,function(x)abs(max(x)-min(x)))
    nk=24
    p$k=as.factor(cutree(hclust(as.dist(t)),nk))

    pal1=hex(HSV(seq(0,360,length.out=nk+1)%>%head(-1),.55,1))
    pal2=hex(HSV(seq(0,360,length.out=nk+1)%>%head(-1),.25,1))

    for(i in seq(1,7,2)){
    seg=lapply(1:4,function(j)apply(t,1,function(x)unl ist(p[names(sort(x)[j]),c(i,i+1)],use.names=F))%>%t%>%cbind(p[,c(i,i+1)]))%>%do.call(rbind,.)%>%setNames(paste0("V",1:4))

    p1=sym(paste0("V",i))
    p2=sym(paste0("V",i+1))

    maxrange=max(ranges[c(i,i+1)])
    lims=sapply(c(i,i+1),function(x)mean(range(t[,x]))+c(-.52*maxrange,.52*maxrange))

    ggplot(p,aes(!!p1,!!p2))+
    geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V 4),color="gray80",size=.15)+
    geom_mark_hull(aes(group=k),color=pal1[p$k],fill=pal1[p$k],concavity=100,radius=unit(.15,"cm"),expand=unit(. 15,"cm"),alpha=.2,size=.15)+
    geom_point(aes(x=!!p1,y=!!p2),color=pal1[p$k],size=.3)+
    geom_text(aes(x=!!p1,y=!!p2),label=rownames(p),col or=pal2[p$k],size=2,vjust=-.7)+
    labs(x=pct[i],y=pct[i+1])+
    scale_x_continuous(breaks=seq(-1,1,.02))+
    scale_y_continuous(breaks=seq(-1,1,.02))+
    theme(
    aspect.ratio=1,
    axis.text.y=element_text(angle=90,vjust=1,hjust=.5 ),
    axis.text=element_text(color="black",size=6),
    axis.ticks.length=unit(0,"pt"),
    axis.ticks=element_blank(),
    axis.title=element_text(color="black",size=8),
    legend.position="none",
    panel.background=element_rect(fill="gray40"),
    panel.border=element_rect(color="gray30",fill=NA,s ize=.4),
    panel.grid.major=element_line(color="gray30",size= .2),
    panel.grid.minor=element_blank(),
    plot.background=element_rect(fill="gray40",color=N A)
    )

    ggsave(paste0(i,".png"),width=7,height=7)
    }




    The commands for calculating FST I posted earlier listed population names in the wrong order, but here's a fixed version:

    Code:
    x=eurasiafst
    igno()(grep -Ev '\.REF|rel\.|fail\.|Ignore_|_dup|_contam|_lc|_fath er|_mother|_son|_daughter|_brother|_sister|_siblin g|_twin|Neanderthal|Denisova|Vindija_light|Gorilla |Macaque|Marmoset|Orangutan|Primate_Chimp|hg19ref' )
    igno<v44.3_HO_public.ind|grep -Ev '\.SG|\.WGA|_o'|awk '{print$1}'|awk -F\\t 'NR==FNR{a[$0];next}FNR>1&&$2 in a' - v44.3_HO_public.anno|sort -t$'\t' -rnk15|awk -F\\t '(!a[$3]++)&&$6==0&&$15>5e5&&$11>20{print$2,$8}'>$x.pick
    plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x
    awk '!a[$2]++{i++}{print$1,i}' <(sort -k2 $x.pick)|awk 'NR==FNR{a[$1]=$2;next}{$6=a[$2]+9}1' - $x.fam>$x.fam.temp;mv $x.fam{.temp,}
    plink --allow-no-sex --bfile $x --indep-pairwise 50 10 .1 --out $x;plink --allow-no-sex --bfile $x --extract $x.prune.in --make-bed --out $x.p
    smartpca -p <(printf %s\\n genotypename:\ $x.p.bed snpname:\ $x.p.bim indivname:\ $x.p.fam fstonly:\ YES fsthiprecision:\ YES maxpops:\ 10000)|tee $x.smartpca
    p=$(awk 'NR==FNR{a[$1]=$2;next}{print a[$2]}' $x.{pick,fam}|awk '!a[$0]++')
    sed -n '/fst \*1000000/,/^$/p' $x.smartpca|sed 1,2d|sed \$d|tr -s \ ,|cut -d, -f3-|paste -d, <(printf %s\\n "$p") -|cat <(printf %s\\n '' "$p"|paste -sd, -) ->$x.fst


    The undocumented parameter `fsthiprecision: YES` causes the FST values that are printed to STDOUT to be multiplied by million instead of thousand. It does not affect the contents of the file specified using a `phylipoutname` parameter.


    When SmartPCA is used to calculate FST distances, the sixth field of the fam file needs to contain population numbers: https://www.biostars.org/p/266511/. The commands above use integers starting from 10 as the population numbers, because the numbers 1, 2, and 9 have a special meaning (1 assigns the sample as a case, 2 assigns it as a control, and 9 ignores it).

  14. #14
    Regular Member Archetype0ne's Avatar
    Join Date
    11-06-18
    Posts
    715

    Y-DNA haplogroup
    J2b2-L283/FT29003

    Country: Albania



    The set of commands below selects some ancient samples from the 1240K+HO dataset with over 300,000 SNPs. It then runs ADMIXTURE at K=2 and K=3, and it shows the percentage of the East Eurasian component in both runs.

    Code:
    printf %s\\n Armenia_EBA Azerbaijan_Caucasus_lowlands_LN Belgium_UP_GoyetQ116_1_published_all Canada_BigBar_5700BP.SG China_HMMH_MN China_NEastAsia_Coastal_EN China_NEastAsia_Inland_EN China_SEastAsia_Island_EN China_Shimao_LN China_Tianyuan China_Upper_YR_LN China_WLR_MN China_YR_LN China_YR_MN Croatia_N_Starcevo Czech_Baalberge Czech_Vestonice16 England_Mesolithic_all.SG England_N_all.SG Georgia_Kotias.SG Georgia_Satsurblia.SG Germany_EN_LBK Hungary_EN_HG_Koros Iran_BA1_ShahrISokhta Iran_C_TepeHissar Iran_ShahrISokhta_BA2_published Ireland_N.SG Italy_Mesolithic.SG Italy_North_Villabruna_HG Japan_HG_Jomon Kazakhstan_Botai.SG Kazakhstan_Botai_Eneolithic.SG Kazakhstan_Central_Steppe_EMBA.SG Kazakhstan_Dali_EBA Kazakhstan_Mereke_MBA_o2 Latvia_HG.SG Latvia_MN Latvia_MN_o2 Lithuania_EMN_Narva Luxembourg_Loschbour_published.DG Mongolia_Chalcolithic_Afanasievo_1 Mongolia_EBA_Chemurchek_2 Mongolia_EBA_Ulgii_1 Mongolia_East_N Mongolia_North_N Norway_Mesolithic.SG Norway_N_HG.SG Poland_BKG.SG Poland_TRB.SG Romania_IronGates_Mesolithic Russia_Afanasievo Russia_Afanasievo_published Russia_BA_Okunevo.SG Russia_DevilsCave_N.SG Russia_HG_Karelia Russia_HG_Sosnoviy Russia_HG_Tyumen Russia_Kolyma_M.SG Russia_Kostenki14 Russia_Lokomotiv_Eneolithic.SG Russia_MA1_HG.SG Russia_MBA_Poltavka Russia_MN_Boisman Russia_Potapovka_o1 Russia_Samara_EBA_Yamnaya Russia_Shamanka_Eneolithic.SG Russia_Siberia_Irkutsk_EBA Russia_Siberia_Lena_EBA Russia_Siberia_Lena_EN Russia_Siberia_Tenisei_EBA Russia_Siberia_UP Russia_Sidelkino_HG.SG Russia_Steppe_Maikop Russia_Sunghir1.SG Russia_Sunghir2.SG Russia_Sunghir3.SG Russia_Sunghir4.SG Russia_UstBelaya_Angara Russia_UstBelaya_Angara_published Russia_UstIda_LN.SG Russia_Ust_Ishim_HG_published.DG Russia_Yana_UP.SG Serbia_IronGates_Mesolithic Spain_C Spain_ElMiron Spain_HG.SG Spain_HG_published.SG Sweden_Ansarve_Megalithic.SG Sweden_Mesolithic.SG Sweden_Motala_HG.SG Switzerland_Bichon.SG Turkey_N_published Turkmenistan_C_Geoksyur Turkmenistan_EBA_Parkhai Turkmenistan_Gonur_BA_1 Turkmenistan_MBA_Parkhai USA_Ancient_Beringian.SG USA_Anzick_realigned.SG Ukraine_Mesolithic Ukraine_N Uzbekistan_Bustan_Eneolithic>popSG Turkey_N_published Turkmenistan_C_Geoksyur Turkmenistan_EBA_Parkhai Turkmenistan_Gonur_BA_1 Turkmenistan_MBA_Parkhai USA_Ancient_Beringian.SG USA_Anzick_realigned.SG Ukraine_Mesolithic Ukraine_N Uzbekistan_Bustan_Eneolithic>popSG Turkey_N_published Turkmenistan_C_Geoksyur Turkmenistan_EBA_Parkhai Turkmenistan_Gonur_BA_1 Turkmenistan_MBA_Parkhai USA_Ancient_Beringian.SG USA_Anzick_realigned.SG Ukraine_Mesolithic Ukraine_N Uzbekistan_Bustan_Eneolithic>pop
    x=tophog8;awk -F\\t 'NR==FNR{a[$0];next}$8 in a&&$15>3e5&&!a[$3]++{print$2,$8}' pop v44.3_HO_public.anno>$x.pick
    alias pli='plink --allow-no-sex'
    pli --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x
    pli --bfile $x --indep-pairwise 50 10 .05 --out $x;pli --bfile $x --extract $x.prune.in --make-bed --out $x.p
    tav()(awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1,i]+=$i}}END{for(i in n){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i,j]/n[i]);print o}}' "FS=${1-$'\t'}")
    for k in 2 3;do admixture -j4 -C .1 $x.p.bed $k;awk 'NR==FNR{a[$1]=$2;next}{print$2,a[$2]}' $x.{pick,fam}|paste -d' ' - $x.p.$k.Q>$x.$k;cut -d' ' -f2- $x.$k|tav \ >$x.${k}a;done
    paste -d' ' $x.[23]a|awk '{printf"%f %s\n",$3+$5,$0}'|sort -n|awk '{printf"%.0f %.0f %s\n",100*$4,100*$6,$2}'

    Output:

    0 0 England_Mesolithic_all.SG
    0 0 Hungary_EN_HG_Koros
    0 0 Italy_North_Villabruna_HG
    0 0 Latvia_HG.SG
    0 0 Latvia_MN
    0 0 Lithuania_EMN_Narva
    0 0 Luxembourg_Loschbour_published.DG
    0 0 Romania_IronGates_Mesolithic
    0 0 Switzerland_Bichon.SG
    0 0 Serbia_IronGates_Mesolithic
    0 0 Italy_Mesolithic.SG
    0 0 England_N_all.SG
    0 0 Spain_C
    1 0 Sweden_Ansarve_Megalithic.SG
    1 0 Czech_Baalberge
    1 0 Poland_TRB.SG
    2 0 Germany_EN_LBK
    1 2 Ukraine_N
    3 0 Ireland_N.SG
    0 3 Spain_HG_published.SG
    0 3 Sweden_Motala_HG.SG
    3 0 Turkey_N_published
    3 0 Poland_BKG.SG
    3 0 Croatia_N_Starcevo
    1 3 Sweden_Mesolithic.SG
    3 6 Spain_HG.SG
    3 6 Norway_Mesolithic.SG
    5 5 Ukraine_Mesolithic
    4 6 Norway_N_HG.SG
    13 0 Armenia_EBA
    13 0 Azerbaijan_Caucasus_lowlands_LN
    7 7 Latvia_MN_o2
    14 3 Russia_Afanasievo_published
    13 4 Russia_Samara_EBA_Yamnaya
    14 4 Mongolia_Chalcolithic_Afanasievo_1
    14 4 Russia_Afanasievo
    9 11 Spain_ElMiron
    19 1 Georgia_Satsurblia.SG
    15 7 Russia_MBA_Poltavka
    19 2 Iran_C_TepeHissar
    19 3 Georgia_Kotias.SG
    20 3 Turkmenistan_MBA_Parkhai
    17 7 Kazakhstan_Mereke_MBA_o2
    21 4 Uzbekistan_Bustan_Eneolithic
    21 6 Turkmenistan_EBA_Parkhai
    22 7 Turkmenistan_C_Geoksyur
    23 7 Turkmenistan_Gonur_BA_1
    16 15 Russia_HG_Karelia
    24 8 Iran_BA1_ShahrISokhta
    16 16 Russia_Sidelkino_HG.SG
    19 16 Czech_Vestonice16
    24 20 Russia_Sunghir1.SG
    25 20 Russia_Sunghir2.SG
    26 19 Russia_Steppe_Maikop
    27 19 Russia_Potapovka_o1
    26 21 Russia_Kostenki14
    26 22 Russia_Sunghir4.SG
    27 23 Russia_Sunghir3.SG
    26 25 Belgium_UP_GoyetQ116_1_published_all
    34 26 Mongolia_EBA_Chemurchek_2
    33 28 Russia_MA1_HG.SG
    35 31 Russia_HG_Sosnoviy
    38 31 Kazakhstan_Dali_EBA
    43 31 Iran_ShahrISokhta_BA2_published
    41 37 Russia_HG_Tyumen
    44 41 Kazakhstan_Botai.SG
    Russia_Yana_UP.SG 49 45
    50 45 Russia_Ust_Ishim_HG_published.DG
    49 Kazakhstan_Botai_Eneolithic.SG 46
    62 59 Russia_BA_Okunevo.SG
    63 Kazakhstan_Central_Steppe_EMBA.SG 60
    71 68 China_Tianyuan
    77 Russia_Siberia_Tenisei_EBA 75
    82 81 USA_Ancient_Beringian.SG
    83 82 USA_Anzick_realigned.SG
    84 Canada_BigBar_5700BP.SG 82
    86 84 Japan_HG_Jomon
    88 86 China_SEastAsia_Island_EN
    89 88 Mongolia_EBA_Ulgii_1
    90 90 Russia_Kolyma_M.SG
    97 96 Russia_Siberia_Lena_EN
    99 98 Russia_UstIda_LN.SG
    99 99 Russia_Siberia_Lena_EBA
    100 99 Russia_UstBelaya_Angara
    100 100 China_HMMC_Nast
    China
    China_NEastAsia_Inland_EN 100 100
    100 100 China_Shimao_LN
    100 100 China_Upper_YR_LN
    100 100 China_WLR_MN
    100 100 China_YR_LN
    100 100 China_YR_MN
    100 100 Mongolia_East_N
    100 100 Mongolia_North_N
    100 100 Russia_DevilsCave_N.SG
    100 100 Russia_Lokomotiv_Eneolithic.SG
    100 100 Russia_MN_Boisman
    100 100 Russia_Shamanka_Eneolithic.SG
    100 100 Russia_Siberia_Irkutsk_EBA
    100 100 Russia_UstBelaya_Angara_published



    This combines the Q matrices of ADMIXTURE runs at different K values into a single wide matrix. It then uses michal3141's convex optimization script to model a selected population using other populations (https://github.com/michal3141/g25):

    Code:
    $ tav()(awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1,i]+=$i}}END{for(n in a){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i,j]/n[i]);print o}}' "FS=${1-$'\t'}")
    $ x=world4;paste -d' ' <(awk 'NR==FNR{a[$1]=$2;next}{print a[$2]}' $x.{pick,fam}) $x.p.*.Q|tav ' '|tr \ ,|awk -F, 'NR==1{for(i=1;i<NF;i++)printf",";print""}1'>/tmp/a
    $ curl -sO https://github.com/michal3141/g25/raw/main/mixture.py
    $ pip3 install cvxpy
    [...]
    $ python3 mixture.py <(grep -v Mansi /tmp/a) <(head -n1 /tmp/a;grep Mansi, /tmp/a)

    -------------- ANCESTRY BREAKDOWN: -------------
    Enets ---> 44.464%
    Saami.DG ---> 41.558%
    Tatar_Siberian_Zabolotniye ---> 9.567%
    Koryak ---> 2.165%
    Eskimo_Sireniki.DG ---> 1.691%
    Relli_2.DG ---> 0.282%
    Kalash.SDG ---> 0.233%
    Yadava.DG ---> 0.040%
    ------------------------------------------------
    Fit error: 0.042235320141667455




    End of copying threads
    . I will have to go back and format the code. Since I am so incompetent it took me 1 hr to just copy these posts, and still couldn't figure out how to format the code like this:



    Maybe Jovialis can guide me how to format the code that way?

    Huge thanks to Nganasankhan for allowing me to share his code on Eupedia. I can only imagine how much time and skill he put into this.

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •