Admixtools qpAdm_binary_evaluation_script.py function


Here's a script function that gives a binary PASS / FAIL evaluation for p-values.

A p-value in the 0.05-1 range yields a "PASS" score, the range of biological relevance.

Anything below 0.05 receives a "FAIL" score.


[screenshot of the script function]
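Since the script itself is only shown in the screenshot, here is a minimal sketch of what the PASS / FAIL check amounts to (function and variable names are illustrative, not necessarily those used in the actual script):

Code:
def evaluate_p_value(p):
    """Binary PASS/FAIL evaluation of a qpAdm p-value.

    PASS -> p in the 0.05-1 range (model treated as plausible)
    FAIL -> p below 0.05 (model rejected at the usual cutoff)
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    return "PASS" if p >= 0.05 else "FAIL"


print(evaluate_p_value(0.372))  # PASS
print(evaluate_p_value(0.012))  # FAIL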


I created a tool to help me decipher what the p-values mean according to the standards of modern archaeogenomics. I had ChatGPT analyze the Harney et al. supplement on qpAdm, which is the de facto handbook, and then had it help me build a test based on those standards as a Python script. It is now a standalone script function I use to help me understand p-values, by asking the AI to run the scoring system against a p-value output from Admixtools.
Scoring System Explanation:
- A p-value of 0.05 receives the highest score of 100, as this is the conventional threshold for statistical significance and suggests a high confidence in the biological relevance of the observed result.
- p-values between 0 and 0.05 increase linearly in score from 90 to 100, indicating increasing confidence in the biological relevance as the p-value decreases.
- p-values between 0.05 and 0.1 decrease linearly in score from 100 to 90, indicating decreasing confidence in the biological relevance as the p-value increases above 0.05.
- p-values between 0.1 and 1 decrease linearly in score from 90 to 0, suggesting progressively lower confidence in the biological relevance.
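Translated into code, the scoring curve above works out to this piecewise-linear function (my own rendering of the bullet points, not the exact script):

Code:
def p_value_score(p):
    """Piecewise-linear score for a p-value, following the description above.

    The score is 90 at p = 0, peaks at 100 at p = 0.05, drops back to 90
    at p = 0.1, and falls linearly to 0 at p = 1.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p <= 0.05:
        return 90 + 200 * p              # 90 at p=0 -> 100 at p=0.05
    if p <= 0.1:
        return 100 - 200 * (p - 0.05)    # 100 at p=0.05 -> 90 at p=0.1
    return 100 * (1 - p)                 # 90 at p=0.1 -> 0 at p=1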

Biological Relevance Context:
The test is specifically designed to determine if a particular parameter falls within a biologically relevant range. A p-value provides insight into how confidently we can assert that the observed result is biologically relevant and not due to random chance alone. Lower p-values indicate stronger evidence against the null hypothesis and hence a higher confidence in the biological relevance of the observed result.

Disclaimers:
1. A p-value does not provide information about the practical or clinical significance.
2. A low p-value in a large sample might indicate statistical but not practical significance.
3. Multiple testing without correction can lead to misleading p-values.
4. Extremely low p-values in certain contexts might be too good to be true.
5. The scoring system is a guideline for interpreting p-values and should be used in conjunction with broader study design and objectives.

Source: Harney, E. et al. (2021). Assessing the performance of qpAdm: a statistical tool for studying population admixture. Genetics.
 
This is the problem with LLMs. They sometimes churn out output that seems right, because such output is correct in most cases, but for the specific query at hand it is completely wrong.

When it comes to the p-value, assessing it depends a lot on how the experiment is constructed; that is the main factor in how the null hypothesis gets falsified.

So... while in most academic papers one reads, p < 0.05 implies significance at the 95% confidence level, for qpAdm this is not the case.

Let's have a look at the really good paper you use as a source.

"Distribution of p-values. qpAdm outputs a p-value that is used to determine whether a specific model of population history can be considered plausible. Models are rejected, or regarded as implausible, when the p-value is below the chosen significance cutoff (typically 0.05)."

"When an appropriate admixture model is suggested, qpAdm calculates P-values that follow a uniform distribution, suggesting that a cut-off value of 0.05 will result in the acceptance of a correct model in 95% of cases." Harney et al. 2021

- This means that for a model to be accepted in qpAdm at the 95% confidence level, you are looking for p > 0.05, not p < 0.05.

Bottom line: no script is needed for a one-line if/then statement.
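For example (assuming p holds the tail probability reported by qpAdm):

Code:
p = 0.372  # tail probability from the qpAdm output (example value)
verdict = "PASS" if p >= 0.05 else "FAIL"  # plausible at the usual 0.05 cutoff
print(verdict)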
Hope this helps.

PS: This confused me as much as it did GPT when I first got into qpAdm, as I thought a p-value of < 0.05 was what I was looking for, due to what I had encountered in other fields.
 

What he said: you want a p-value higher than 5%.
 
So basically, anything between 0.05-1 is acceptable, and 1 would be ideal.
Pretty much. A higher p-value in this case would mean greater statistical significance.
 
The study actually cautions against that. Here's where they talk about it:

"While the overall distributions of P-values differ between optimal and nonoptimal qpAdm models, we note that for individual replicates the most optimal model is not necessarily assigned the highest P-value. We find that the P-value associated with the best model (sources 5 and 9) produces the highest P-value in only 48% of cases (Supplementary Table S5), when the standard reference set is used (13, 12, 10, 7, and 0). In frequentist methods such as qpAdm, P-values below the nominal significance level are judged wrong enough to be rejected, but P-values do not represent probabilities of models being correct. As Figure 2 shows, qpAdm is fairly conservative in rejecting models. For example, the model which posits populations 4 and 9 as sources may be considered wrong because population 4 is more closely related to source population 0 than it is to the target population 14. Still, P-values under this model are almost uniformly distributed (Figure 2B) and for a given data set the P-value for this model could easily be larger than the P-value for the correct model (Figure 2A). In contrast, models that diverge strongly from the truth are always rejected, as when populations 11 and 9 are used as sources (Figure 2F). Therefore, in cases where multiple models are assigned plausible P-values (i.e., P ≥ 0.05), we caution that P-value ranking (i.e., selecting the model that is assigned the highest P-value) should not be used to identify the best model. Methods for distinguishing between multiple models will be discussed further in the section on comparing admixture models."
 
What I am gathering is that a p-value within the 0.05-1 range is essentially a binary "yes, it is plausible", versus "no, it is not" if it falls outside that range. And that other metrics, like se, z, dof, and chisq, are more indicative of the viability of the model?
 
If one were to make a test, it should start with the p-value as a binary plausible vs. not plausible check that awards 65 points, or a D-.

That way it serves as the necessary baseline for passing or failing, while the other metrics (se, z, dof, and chisq) would be crucial for bringing it up from the D range to an A+.
 
Might as well remove the grades and just keep the binary accept/fail metric. There are so many variables at play that any comparative analysis of quality is not feasible: sample quality, SNP count, sequencing method, and many more I can't think of, all of which affect p and everything else. Meaning that assigning grades for the purpose of comparing two viable runs, based on the output of the run alone, is not enough.

I think the paper you shared is pretty much the state of the art of where this exercise is at, and that is a pass/fail threshold, then assessing the models from a historical perspective. Just take everything I say with a grain of salt; I am a total pleb at this stuff and only got into it to analyze my own data.
 

 
Archetype is correct.

Standard errors are where it's at, after you pass with your p-value.
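For example (my own illustration, not a rule from Harney et al.), one quick sanity check is whether each estimated mixture proportion is positive and sits more than roughly two standard errors away from zero:

Code:
def weights_look_sane(weights, std_errors, z_cutoff=2.0):
    """Rough sanity check on qpAdm mixture proportions.

    Returns True when every weight lies in (0, 1) and is more than
    z_cutoff standard errors away from zero. This is an illustrative
    heuristic, not an official ADMIXTOOLS criterion.
    """
    for w, se in zip(weights, std_errors):
        if not 0.0 < w < 1.0:
            return False
        if se <= 0 or w / se < z_cutoff:
            return False
    return True


# Example: a two-way model with weights 0.62 / 0.38 and SEs of 0.05 each
print(weights_look_sane([0.62, 0.38], [0.05, 0.05]))  # True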
 
