Admixtools Using Admixtools2 to model admixture

Tautalus

I've been using Admixtools2 qpAdm since last week to model my admixture.

I started with the basics, EEF, Steppe and WHG percentages.
Code:
target = 'Tautalus'
left= c('Turkey_Barcin_LN.SG','Russia_Samara_EBA_Yamnaya','Luxembourg_Loschbour.DG')
right = c('Mbuti.DG', 'Ethiopia_4500BP.SG', 'Russia_Ust_Ishim.DG', 'Czech_Vestonice16', 'Belgium_UP_GoyetQ116_1', 'Russia_Kostenki14.SG', 'Russia_AfontovaGora3', 'Italy_North_Villabruna_HG', 'Han.DG', 'Papuan.DG', 'Karitiana.DG', 'Georgia_Satsurblia.SG', 'Iran_GanjDareh_N', 'Turkey_Epipaleolithic', 'Jordan_PPNB', 'Russia_Karelia_HG.SG', 'Russia_Steppe_Eneolithic', 'Czech_CordedWare', 'Armenia_LBA.SG', 'ONG.SG')
results = qpadm(prefix, left, right, target, allsnps = TRUE)
results$weights
results$popdrop

These are the results. The p-value is decent.
9drCbCM.png

MFeakgh.png

They are similar to what I get with G25.
IIaWMcZ.png

So far it looks good.
 
The p-value is actually failing. This might be counterintuitive, but in qpAdm you are looking for a p-value above 0.05.
 
You are right.
pchisq(q=67.7, df=17, lower.tail=FALSE) gives 5.34725e-08, which is roughly 0.00000005.
So it's way off. Back to the drawing board.
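
For anyone following along, here is a minimal sketch of that check in base R. The chi-square value and degrees of freedom are the ones quoted above; the 0.05 cutoff is the rule of thumb mentioned in this thread, and the variable names are only illustrative.

```r
# Upper-tail probability of the chi-square statistic reported by qpadm.
chisq <- 67.7   # chi-square value quoted in the post above
dof   <- 17     # degrees of freedom quoted in the post above

p_value <- pchisq(q = chisq, df = dof, lower.tail = FALSE)
print(p_value)

# Rule of thumb used in this thread: a model with p below 0.05 is rejected.
model_passes <- p_value > 0.05
print(model_passes)
```

A large chi-square relative to the degrees of freedom gives a tiny upper-tail probability, which is why this model fails.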
 
Try not to mix differently sequenced samples on the left (.DG, .SG, noUDG, etc.). At least that is the advice I got when I first started playing with these.
Are you using 23andme raw data for yourself?
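
A quick way to spot mixed data types is to look at the label suffixes. This is only a sketch in base R using the left populations from the first post; treating labels without a dot as one separate group (tagged "none" here) is an assumption made for illustration.

```r
# Source (left) populations from the first post.
left <- c('Turkey_Barcin_LN.SG', 'Russia_Samara_EBA_Yamnaya', 'Luxembourg_Loschbour.DG')

# Take whatever follows the last dot as the data-type suffix;
# labels without a dot are tagged "none" (an assumption for this sketch).
suffix <- ifelse(grepl('\\.', left), sub('.*\\.', '', left), 'none')

# Tabulate the data types and flag the set if more than one is present.
print(table(suffix))
mixed <- length(unique(suffix)) > 1
print(mixed)
```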
 
Thanks, I'll try that.
Yes, I merged my 23andme raw data with the data from the Reich Lab.

I'm happy to see more people getting involved in qpAdm. Best of luck to you, Tautalus!
Thanks Jovialis.
 
So how do you use qpAdm? Do you need your raw DNA file, or are G25 coordinates enough?
 
qpAdm models a target population as a mixture of left (source) populations, given a set of right (reference) populations.
Choosing the reference populations can significantly impact the results of the qpAdm analysis, so it is important to make this selection well. So I made some adjustments to the model according to the following qpAdm assumptions and requirements.

The fundamental assumptions of qpAdm are:
1) “There are no gene flows between lineages unique to candidate source populations (post their divergence from the actual admixing populations) and the reference populations”, that is, no gene flow occurs between the source and reference populations after the source population splits from the true lineage that took part in the admixture event.
2) “There are no gene flows from the fully formed target lineage to reference populations.”

It is crucial to select reference populations that are genetically distinct and not directly ancestral to the target population, to avoid biasing the admixture estimates. Ideally, as said before, the reference populations should have no gene flow connecting them to the private lineages of the candidate source populations (after their divergence from the true admixing populations).
Likewise, there should be no gene flows from the fully formed target lineage to reference populations.

The qpAdm method also requires differential relatedness, that is, at least one population in the reference set must be differentially related to those in the source set, which “means that at least some reference populations must be more closely related to some source populations than to others.”

Code:
target = 'Tautalus'
left= c('Germany_EN_LBK_Stuttgart.DG','Russia_Samara_EBA_Yamnaya','Luxembourg_Loschbour.DG')
right = c('Mbuti.DG', 'Ethiopia_4500BP.SG', 'Han.DG', 'Papuan.DG', 'Karitiana.DG', 'Georgia_Satsurblia.SG', 'Iran_GanjDareh_N', 'Jordan_PPNB','Russia_Kostenki14.SG','Russia_Ust_Ishim.DG','Armenia_LBA.SG', 'ONG.SG')
results = qpadm(prefix, left, right, target, allsnps = TRUE)
results$weights
results$popdrop

The p-value is better, but this is not a definitive model. There is a lot to learn and improve; it is a work in progress.

H79SWu2.png


bqMC527.png


Once again similar values to what I get with G25.
ZpAcOim.png
 
Try out my tail.
right = c('Cameroon_SMA', 'Czech_Vestonice16', 'Belgium_UP_GoyetQ116_1', 'Russia_West_Siberia_HG', 'Serbia_IronGates_Mesolithic', 'Karitiana.DG', 'Papuan.DG', 'Iran_GanjDareh_N', 'Russia_Boisman_MN', 'Romania_C_Bodrogkeresztur', 'Croatia_MLBA', 'Netherlands_EIA', 'Russia_Samara_EBA_Yamnaya', 'Czech_CordedWare', 'Lithuania_EMN_Narva', 'Turkey_Arslantepe_LateC', 'Israel_C', 'Iraq_PPNA', 'ONG.SG')

You might want to change Narva EMN to Denmark LN, Iron Gates Mesolithic to Villabruna WHG, and Croatia_MLBA to Spain IA. These adjustments would make sense for Italians.
 
For outgroups I would recommend trying to replicate what established studies on the population you are analyzing have done as a baseline. They have a specific methodology for selecting them.

Your model looks statistically robust. Good job. As long as the p-value is between 0.05 and 1, the model is plausible. The only way to verify a more definitive model would be to use other metrics that support it. The most important one, ancIBD, is currently unavailable to us.
 
I do not have Russia_West_Siberia_HG in the Reich data. It must have been renamed.
Is it Russia_Siberia_Angara_EN, Russia_Siberia_Lena_EN or Russia_Siberia_UP?
I had to remove Russia_Samara_EBA_Yamnaya from the reference populations because it was already one of the source populations.
Before the modifications the p-value was 2% (0.02), with slight variations in the percentages of EEF and WHG.
These reference populations are still not ideal for me.
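
A population cannot sit on both sides of the model, so it is worth checking for overlap before running qpadm. Below is a small base R sketch, assuming the left set from my earlier post and the suggested tail as posted; the names overlap and right_clean are only illustrative.

```r
# Left (source) populations from the earlier post.
left <- c('Germany_EN_LBK_Stuttgart.DG', 'Russia_Samara_EBA_Yamnaya',
          'Luxembourg_Loschbour.DG')

# Suggested right (reference) populations, as posted.
right <- c('Cameroon_SMA', 'Czech_Vestonice16', 'Belgium_UP_GoyetQ116_1',
           'Russia_West_Siberia_HG', 'Serbia_IronGates_Mesolithic',
           'Karitiana.DG', 'Papuan.DG', 'Iran_GanjDareh_N', 'Russia_Boisman_MN',
           'Romania_C_Bodrogkeresztur', 'Croatia_MLBA', 'Netherlands_EIA',
           'Russia_Samara_EBA_Yamnaya', 'Czech_CordedWare', 'Lithuania_EMN_Narva',
           'Turkey_Arslantepe_LateC', 'Israel_C', 'Iraq_PPNA', 'ONG.SG')

# Populations that appear on both sides.
overlap <- intersect(left, right)
print(overlap)

# Drop the overlapping populations from the right set.
right_clean <- setdiff(right, left)
print(length(right_clean))
```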

dZTH3bq.png


After the modifications (Narva EMN to Denmark LN, Iron Gates Mesolithic to Villabruna WHG, and Croatia_MLBA to Spain IA) there were more slight variations in the percentages of EEF and WHG, and the p-value dropped to 0.03% (0.0003). Even worse.
39kGltn.png


It's a matter of trial and error until I find a suitable set of reference populations.

Right now I'm in the Valley of Despair, any advice on how to get out is welcome. :)
 
I forgot about that. I borrowed that naming from the Albanian paper; it is made up of two samples, I5766 and I1960.

When I try out different substitutes I usually do them one at a time to see if the fit worsens or degenerates. Also, either remove Romania_C or replace it with France CA.

I personally don't agree with your left. It's probably never going to work. You should try Dutch Beaker plus Iberian Neolithic/Chalcolithic as a model for your ancestry.
 
I'm still at the bottom of the slope imo, but I'll never reach the top, since I have no plans on becoming a geneticist that produces peer-reviewed papers.
 
I agree with you. I'm moving on and will test other source populations, like the ones you suggested.

Neither do I. I don't want to get to the top of the Plateau, I just want to go up the slope a little and get enough enlightenment to create reasonably good, confirmed models, as you said.
 
Indeed, believe it or not, but genetics is just one of my many hobbies. Professionally, I'm involved in technology.

Unless you're working to cure cancer or start the next 23andme, being a geneticist doesn't really pay. Though some can make a respectable living. I treat it like my art hobby, I am fairly adept, I'm passionate about it, but I would never pursue it professionally.

That being said, I do think it is one of the most interesting topics, because it is a crossroads with many other topics I like, i.e. history and archaeology.

Being a historian or an archaeologist pays even less than being a geneticist.

People who want to make a lot of money should become a biomedical, aerospace, chemical, nuclear, or petroleum engineer, a surgeon, an investment banker, or a corporate lawyer.
 
Nowadays I don't have many hobbies, mainly due to lack of time, but I always try to make time for genetics.

It's fascinating stuff, it involves history and archaeology, as you said, topics that are dear to me. And it is more than just history, it directly involves each one of us, personally.
These are events that shaped us, that explain who we are and how we got here, to paraphrase David Reich's book.

Like you, I have no intention of pursuing this topic professionally. My motivation is simple: to understand who I am and where I came from, and to understand the history of my people, how it was formed, and how it intertwines with the history of the European peoples.

It has become and will continue to be my favorite subject, outside of my professional life.
 
Pretty much sums up me pursuing this as well as my study of history in my free time and understanding who I am and how it fits within history.
Edit: I'll add too that I enjoy looking at other people's results.
 
The trickiest part of all this is the merging of our own data with the data from the Reich Lab.

There are several ways of doing it. Here is one of the fastest and simplest.

It consists of converting your raw data from 23andMe format to BED format, then to GENO (Eigenstrat) format, and finally merging it with the Reich data.

All these instructions assume that the programs and packages needed to run the commands in DOS or in WSL are already installed.
The file names can be whatever you want.

1) Convert the raw data in 23andMe format to a bed file (DOS session)
plink --allow-no-sex --alleleACGT --23file 23andMe.txt --make-bed --out outfile

This will produce 3 essential files: a bed, a bim and a fam file. In the fam file you can replace the -9 with 1.
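
If you prefer not to edit the fam file by hand, the replacement can be scripted. This is only a sketch in base R: it works on a temporary stand-in file with made-up sample IDs, and assumes the usual six-column fam layout with the -9 in the last column. Point fam_path at your own outfile.fam to use it for real.

```r
# Stand-in for outfile.fam: one sample, six whitespace-separated columns
# (family ID, individual ID, father, mother, sex, phenotype). IDs are made up.
fam_path <- tempfile(fileext = '.fam')
writeLines('FAM001 ID001 0 0 0 -9', fam_path)

# Read everything as text so IDs are not altered by type conversion.
fam <- read.table(fam_path, header = FALSE, colClasses = 'character')

# Replace -9 with 1 in the sixth column.
fam[fam[[6]] == '-9', 6] <- '1'

# Write the file back without quotes or header.
write.table(fam, fam_path, quote = FALSE, row.names = FALSE, col.names = FALSE)
print(readLines(fam_path))
```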

2) Convert the bed file to geno (Eigenstrat format) (WSL session)
You need a parameter file, with whatever name you want. I named mine par.BED2GENO.par. Its contents are:

genotypename: outfile.bed
snpname: outfile.bim
indivname: outfile.fam
outputformat: EIGENSTRAT
genotypeoutname: outfile.geno
snpoutname: outfile.snp
indivoutname: outfile.ind

Once the parameter file is ready, run the command: convertf -p par.BED2GENO.par
This will produce 3 files: a geno, an ind and a snp file. In the ind file you can replace “Control” with your own name or alias.
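
The relabelling of the ind file can also be scripted. Again this is just a base R sketch on a temporary stand-in file, assuming the usual three-column ind layout (sample ID, sex, population label); the sample ID is made up and the alias is the one used as target in this thread.

```r
# Stand-in for outfile.ind: sample ID, sex, population label. ID is made up.
ind_path <- tempfile(fileext = '.ind')
writeLines('ID001 U Control', ind_path)

# Read everything as text so that a sex value of F or T is not read as logical.
ind <- read.table(ind_path, header = FALSE, colClasses = 'character')

# Replace the default label with the alias used as target in qpadm.
ind[ind[[3]] == 'Control', 3] <- 'Tautalus'

write.table(ind, ind_path, quote = FALSE, row.names = FALSE, col.names = FALSE)
print(readLines(ind_path))
```

The label written here is the name you later pass as target.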

3) Merge your data with the Reich data (WSL session)
You need another parameter file. I named mine par.MERGEGENO.par. Its contents are:

geno1: outfile.geno
snp1: outfile.snp
ind1: outfile.ind
geno2: v54.1.p1_1240K_public.geno
snp2: v54.1.p1_1240K_public.snp
ind2: v54.1.p1_1240K_public.ind
genooutfilename: merged.geno
snpoutfilename: merged.snp
indoutfilename: merged.ind
outputformat: EIGENSTRAT
docheck: YES
hashcheck: YES
strandcheck: YES

Then run the command: mergeit -p par.MERGEGENO.par

According to the documentation, mergeit merges two data sets into a third, which has the union of the individuals and the intersection of the SNPs of the first two. This means that the final merged file keeps only the SNPs that exist in both files; all the remaining SNPs are discarded. The result is a merged file that is smaller in number of SNPs, though not in file size, than the original Reich data, and it has all the info you need to model your admixture.
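
To picture what "union of individuals, intersection of SNPs" means, here is a toy illustration in base R. It does not call mergeit; the SNP and sample names are made up, and only the set logic is shown.

```r
# Toy SNP lists: the first stands for a 23andMe file, the second for the Reich data.
snps_mine  <- c('rs1', 'rs2', 'rs3', 'rs5')
snps_reich <- c('rs2', 'rs3', 'rs4', 'rs5', 'rs6')

# Toy individual lists.
inds_mine  <- c('Tautalus')
inds_reich <- c('SampleA', 'SampleB')

# Only SNPs present in both inputs survive the merge.
merged_snps <- intersect(snps_mine, snps_reich)

# Every individual from both inputs is kept.
merged_inds <- union(inds_mine, inds_reich)

print(merged_snps)
print(merged_inds)
```

In this toy case rs1, rs4 and rs6 are dropped because each exists in only one of the two inputs.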

I compared the test results from this file with those from a merged file containing all of Reich's data plus my data, and they are identical.
The merge is the step that takes the longest, between half an hour and an hour depending on the computer.

And that's it. After this process the merged files are ready to be used by qpadm. You now have two datasets: the original Reich data, to test all the populations, and your merged files, to test your own admixture.
 