Comparing Ancient Greek populations to modern Greeks and Italians

Jovialis · Aug 13, 2023

@Palermo:

ITS2.HO M Ignore_Italian_South_1d.rel.HO
ITS4.HO F Italian_South.HO
ITS5.HO F Italian_South.HO
ITS7.HO M Italian_South.HO
BEL57.HO M Italian_South.HO

This is what the Reich Lab lists as Southern Italian in the .ind file.

But one could edit the labels with free software such as Visual Studio Code to examine the samples individually. Basically, qpAdm pools all samples that share the same label, so it is possible to rearrange them. I noticed that many of the aDNA labels differ, sometimes misleadingly or inaccurately, from the nomenclature used in the studies the samples originate from. For example, the Central Italian Neolithic samples are listed as Italian_South_N, which is bizarre.
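The relabelling can also be scripted instead of done by hand. Here is a minimal Python sketch, assuming the .ind file has three whitespace-separated columns (sample ID, sex, population label) as in the lines quoted above. The function names and the `<label>_<sample ID>` naming scheme are my own illustrative choices, not part of ADMIXTOOLS.

```python
def individual_label(label, sample_id):
    """Build a per-sample label, e.g. Italian_South.HO + ITS7.HO -> Italian_South_ITS7.HO."""
    base, dot, _suffix = label.rpartition(".")
    if not dot:  # label has no ".XX" suffix, so keep it whole
        base = label
    return f"{base}_{sample_id}"


def relabel_samples(ind_lines, sample_ids):
    """Return a copy of the .ind lines where each chosen sample has its own label.

    Assumes three whitespace-separated columns: sample ID, sex, label.
    Lines that do not match are passed through unchanged.
    """
    wanted = set(sample_ids)
    out = []
    for line in ind_lines:
        fields = line.split()
        if len(fields) == 3 and fields[0] in wanted:
            sample_id, sex, label = fields
            out.append(f"{sample_id} {sex} {individual_label(label, sample_id)}")
        else:
            out.append(line)
    return out


ind_lines = [
    "ITS2.HO M Ignore_Italian_South_1d.rel.HO",
    "ITS4.HO F Italian_South.HO",
    "ITS5.HO F Italian_South.HO",
    "ITS7.HO M Italian_South.HO",
    "BEL57.HO M Italian_South.HO",
]

relabelled = relabel_samples(ind_lines, ["ITS7.HO"])
for line in relabelled:
    print(line)
```

Because samples are grouped by label, a sample with a unique label would then be treated as its own population.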

Jovialis · Aug 13, 2023

These are the Sicilian samples:

Jovialis · Aug 13, 2023

Here is an analysis of all the genetic cohorts from the Imperial era compared to Italian_South_ITS7.HO

Code:

> results$weights
# A tibble: 4 × 5
  target                left                                  weight    se      z
  <chr>                 <chr>                                  <dbl> <dbl>  <dbl>
1 Italian_South_ITS7.HO C7_Italy_Imperial_oCentralEuropean.SG -0.628 1.39  -0.452
2 Italian_South_ITS7.HO C6_Italy_Imperial_Central_Med.SG       3.82  4.97   0.769
3 Italian_South_ITS7.HO C5_Italy_Imperial_East_Med.SG         -2.49  3.16  -0.786
4 Italian_South_ITS7.HO C4_Italy_Imperial_Near_East.SG         0.293 0.994  0.295
> results$popdrop
# A tibble: 15 × 15
   pat      wt   dof  chisq        p f4rank C7_Italy_Imperial_oCentral…¹ C6_Italy_Imperial_Ce…² C5_Italy_Imperial_Ea…³ C4_Italy_Imperial_Ne…⁴ feasible best 
   <chr> <dbl> <dbl>  <dbl>    <dbl>  <dbl>                        <dbl>                  <dbl>                  <dbl>                  <dbl> <lgl>    <lgl>
 1 0000      0    17   7.95 9.68e- 1      3                      -0.628                    3.82                 -2.49                   0.293 FALSE    NA   
 2 0001      1    18   8.90 9.62e- 1      2                      -1.32                     6.35                 -4.03                  NA     FALSE    TRUE 
 3 0010      1    18  17.7  4.73e- 1      2                     -56.5                    139.                   NA                    -81.9   FALSE    TRUE 
 4 0100      1    18  27.1  7.65e- 2      2                       0.736                   NA                    -1.69                   1.95  FALSE    TRUE 
 5 1000      1    18  12.2  8.38e- 1      2                      NA                        2.07                 -2.03                   0.958 FALSE    TRUE 
 6 0011      2    19  68.0  1.99e- 7      1                      -0.0961                   1.10                 NA                     NA     FALSE    NA   
 7 0101      2    19 110.   9.22e-15      1                       0.354                   NA                     0.646                 NA     TRUE     NA   
 8 0110      2    19  62.9  1.34e- 6      1                       0.433                   NA                    NA                      0.567 TRUE     NA   
 9 1001      2    19  38.2  5.53e- 3      1                      NA                        1.86                 -0.864                 NA     FALSE    NA   
10 1010      2    19  62.9  1.31e- 6      1                      NA                        1.12                 NA                     -0.121 FALSE    NA   
11 1100      2    19  69.6  1.06e- 7      1                      NA                       NA                     2.70                  -1.70  FALSE    NA   
12 0111      3    20 196.   6.28e-31      0                       1                       NA                    NA                     NA     TRUE     NA   
13 1011      3    20  77.4  1.07e- 8      0                      NA                        1                    NA                     NA     TRUE     NA   
14 1101      3    20 239.   1.89e-39      0                      NA                       NA                     1                     NA     TRUE     NA   
15 1110      3    20 396.   1.45e-71      0                      NA                       NA                    NA                      1     TRUE     NA   
# ℹ abbreviated names: ¹C7_Italy_Imperial_oCentralEuropean.SG, ²C6_Italy_Imperial_Central_Med.SG, ³C5_Italy_Imperial_East_Med.SG,
#   ⁴C4_Italy_Imperial_Near_East.SG
# ℹ 3 more variables: dofdiff <dbl>, chisqdiff <dbl>, p_nested <dbl>
> 
>

ChatGPT:

This output appears to be from a statistical analysis, most likely related to genetic ancestry or population genetics. The analysis seems to be focused on determining the contributions of various ancient or historical populations (i.e., sources) to the present-day genetic makeup of "Italian_South_ITS7.HO" (likely a modern Southern Italian population or individual sample). Here's what I interpret from the provided output:

1. **results$weights**:
- This table shows the estimated weight (contribution) of each ancient source to the target (Italian_South_ITS7.HO).
- "weight" indicates the estimated percentage contribution from each source, though it's worth noting that some weights are negative, which typically isn't meaningful in a genetic context and might indicate issues with the model or the populations being tested.
- "se" is the standard error associated with each weight, which provides an idea of the uncertainty around the weight estimates.
- "z" is a z-score, calculated as the weight divided by the standard error. This gives an idea of how many standard deviations the estimate is from zero, which can be used to assess significance.

2. **results$popdrop**:
- This table seems to be from a model testing different combinations of source populations to see how well they fit the target population's genetic data.
- "pat" indicates which source populations are dropped from the model (1 for dropped, 0 for kept); in the table, every position marked 1 has NA in the matching weight column.
- "wt" is the number of sources dropped from the model.
- "dof" stands for degrees of freedom, which is a measure used in statistical tests.
- "chisq" is the chi-squared statistic, which measures the goodness of fit of the model. A higher value usually indicates a worse fit.
- "p" is the p-value associated with the chi-squared statistic. In qpAdm, a small p-value means the model is rejected; a model is usually considered plausible only when p is above a threshold such as 0.05.
- "f4rank" is likely a rank statistic based on the f4-statistic, commonly used in population genetics.
- The columns with abbreviated names show the estimated contributions (weights) of each source population for that particular model.
- "feasible" indicates if the model is feasible (i.e., if the weights make sense and there are no issues like negative contributions).
- "best" might be indicating the best-fitting models based on some criteria.

**Interpretation**:

From the data:
- The first ancient source (C7_Italy_Imperial_oCentralEuropean.SG) has a negative weight, which is not typically meaningful in this context.
- The second source (C6_Italy_Imperial_Central_Med.SG) has the highest positive weight, suggesting it might be the most significant contributor to the modern Italian_South_ITS7.HO population.
- Several models with different combinations of sources have been tested. Some are feasible, while others aren't.
- The models with the lowest p-values (especially those close to zero) are rejected, so those combinations of sources fit the Italian_South_ITS7.HO data poorly; the models with high p-values here are all marked infeasible.

In summary, this analysis aims to determine which ancient or historical populations contributed to the present-day genetic makeup of a Southern Italian population or individual. The results suggest that the "C6_Italy_Imperial_Central_Med.SG" population might be the primary contributor, but more context (and possibly additional analyses) is needed for a definitive conclusion.
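One way to sanity-check how the `pat` column should be read is to compare it with the NA entries in the popdrop table itself. The sketch below copies four rows from the output above (NA written as `None`) and assumes the pattern characters follow the same order as the weight columns. The helper name `dropped_sources` is hypothetical.

```python
SOURCES = [
    "C7_Italy_Imperial_oCentralEuropean.SG",
    "C6_Italy_Imperial_Central_Med.SG",
    "C5_Italy_Imperial_East_Med.SG",
    "C4_Italy_Imperial_Near_East.SG",
]

# (pat, wt, weights in column order), copied from the popdrop table; None stands for NA.
ROWS = [
    ("0000", 0, (-0.628, 3.82, -2.49, 0.293)),
    ("0001", 1, (-1.32, 6.35, -4.03, None)),
    ("0101", 2, (0.354, None, 0.646, None)),
    ("0111", 3, (1, None, None, None)),
]


def dropped_sources(pat):
    """Sources whose position in the pattern is marked '1'."""
    return [name for flag, name in zip(pat, SOURCES) if flag == "1"]


for pat, wt, weights in ROWS:
    missing = [name for name, w in zip(SOURCES, weights) if w is None]
    # In these rows the '1' positions coincide with the NA weights,
    # and wt equals the number of '1' characters.
    print(pat, wt, dropped_sources(pat) == missing, pat.count("1") == wt)
```

In these rows, a 1 lines up with a source that has no weight, which suggests that 1 marks a source left out of that sub-model and that `wt` counts the sources left out.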

eupator · Aug 13, 2023

Jovialis said:
So I've been playing around in ADMIXTOOLS2

Nice work!

Looking forward to further discourse as more people get the hang of it.

Jovialis · Aug 13, 2023

eupator said:
Nice work!

Looking forward to further discourse as more people get the hang of it.

Thank you so much for providing the tutorial, it was very helpful! I think more people should use this as well. It is even more convenient and fun to use than Vahaduo.

I also used ChatGPT to help resolve some issues; it has really proven to be a powerful tool. It helped me clean up some scripts and quickly identify errors, and I like how it helps me understand what the results mean.

Idontknowwhatimdoing · Aug 13, 2023

Jovialis said:
Here is an analysis of all the genetic cohorts from the Imperial era compared to Italian_South_ITS7.HO

Nothing is meaningful here because the standard errors are so high. Ignore models with standard errors that high.
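To put numbers on that, here is a small Python sketch using the four rows of `results$weights` from the quoted output. It recomputes z as weight / se and builds a rough interval of weight ± 1.96 × se. The 1.96 multiplier is my own assumption (a normal approximation), not something qpAdm reports.

```python
# (source, weight, se, z as printed), copied from results$weights.
WEIGHTS = [
    ("C7_Italy_Imperial_oCentralEuropean.SG", -0.628, 1.39, -0.452),
    ("C6_Italy_Imperial_Central_Med.SG", 3.82, 4.97, 0.769),
    ("C5_Italy_Imperial_East_Med.SG", -2.49, 3.16, -0.786),
    ("C4_Italy_Imperial_Near_East.SG", 0.293, 0.994, 0.295),
]


def summarise(weight, se, multiplier=1.96):
    """Return (z, low, high) for one weight; the multiplier is an assumed normal cutoff."""
    z = weight / se
    return z, weight - multiplier * se, weight + multiplier * se


for name, weight, se, z_printed in WEIGHTS:
    z, low, high = summarise(weight, se)
    spans_zero = low < 0 < high
    print(f"{name}: z={z:.3f} interval=({low:.2f}, {high:.2f}) spans zero: {spans_zero}")
```

Every interval straddles zero and every |z| is below 1, so none of the four weights can be told apart from zero in this run.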

Archetype0ne · Aug 13, 2023

Well done getting the tool running, Jov! It's quite a feat on its own, considering the entry barrier (trust me, it gave me a headache or two).
Now it's all about running countless models and figuring out how to build the tails. As IDKWIDoing said, you want the SE < 0.05. Also, negative weights are probably not a good sign for the model. Rule of thumb when using non-basic components that might share some drift (i.e. BA/IA samples instead of EHG, CHG, EEF): two-component models are the best you can hope for, three-component models can yield successful runs with a lot of effort and a bit of luck, and four-component models are almost impossible.

But yeah, building the tail is a skill all on its own. There is a paper that explains how to build a good one depending on the hypothesis you are testing: https://www.biorxiv.org/content/biorxiv/early/2020/04/10/2020.04.09.032664/DC1/embed/media-1.pdf (pages 16-17, very informative on the topic).

Idontknowwhatimdoing · Aug 13, 2023

Jovialis said:
Certainly,

I relabeled them according to this:

https://www.eupedia.com/forum/threa...lators)/page14?p=625783&viewfull=1#post625783

The C6 samples seem all over the place; one is even 50% Levantine + Germanic-like and coincidentally plots with Italians. It's not a real population cluster. These "studies" seem to simplify everything into irrelevant PCAs and mislead people. C6 is just a bunch of mixed samples that coincidentally plot with modern South Italians, while others are Greek + Italic, like modern South Italians.

Idontknowwhatimdoing · Aug 13, 2023

I made a better Greek Imperial outlier; the previous one had Italic-Germanic and other mixes.

Jovialis · Aug 13, 2023

Those are cohorts characterized by haplotype sharing; that is also how the study distinguishes Eastern Med from Central Med.

The qpAdm models seem to prefer Central Med over East Med as the highest weight when comparing all of them.

But also, isn't the model with the 50% Levant/Germanic a G25 model with 0.25x? I thought that was the method that just provides models based on coincidence on the PCA?

Idontknowwhatimdoing · Aug 13, 2023

Jovialis said:
Those are cohorts characterized by haplotype sharing; that is also how the study distinguishes Eastern Med from Central Med.

The qpAdm models seem to prefer Central Med over East Med as the highest weight when comparing all of them.

But also, isn't the model with the 50% Levant/Germanic a G25 model with 0.25x? I thought that was the method that just provides models based on coincidence on the PCA?

Whatever the study did was simplified, irrelevant stuff. These people have no proper critical thinking when it comes to population genetics.

0.25x means distance control, I think. It increases the distance cap so the model doesn't overfit.

Jovialis · Aug 13, 2023

I have to disagree. I think it is significant that they share haplotypes and form a cluster based on that, and that they can be modeled as France_Bell_Beaker + Minoan Lasithi. I see the fit improve slightly with Morocco_EN when modeling Italian_South.HO, contributing up to around 2%. But the model fails when the Sidon MBA samples are included. I will run it later and show you.

Edit: actually, it seems it doesn't improve the arguments against the null hypothesis; leaving Morocco_EN out is optimal in this model. Thank god for ChatGPT!

Jovialis · Aug 13, 2023

Jovialis · Aug 13, 2023

Sicilians with the same model.

Jovialis · Aug 13, 2023

Idontknowwhatimdoing · Aug 13, 2023

Jovialis said:

You are not supposed to rank by p-value in qpAdm, keep that in mind. All of these models are rejected, which means they need something else. The p-value here is 0.0000000114, when the p-value required for a plausible model is 0.05.
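The same kind of check can be run on the popdrop table posted earlier in the thread (the output for the model discussed here is not shown, so this uses the earlier Italian_South_ITS7.HO run instead). A minimal Python sketch, assuming the 0.05 cutoff mentioned above and additionally requiring the `feasible` flag, which is my own choice of filter:

```python
# (pat, p, feasible), copied from the results$popdrop table posted earlier.
POPDROP = [
    ("0000", 9.68e-1, False),
    ("0001", 9.62e-1, False),
    ("0010", 4.73e-1, False),
    ("0100", 7.65e-2, False),
    ("1000", 8.38e-1, False),
    ("0011", 1.99e-7, False),
    ("0101", 9.22e-15, True),
    ("0110", 1.34e-6, True),
    ("1001", 5.53e-3, False),
    ("1010", 1.31e-6, False),
    ("1100", 1.06e-7, False),
    ("0111", 6.28e-31, True),
    ("1011", 1.07e-8, True),
    ("1101", 1.89e-39, True),
    ("1110", 1.45e-71, True),
]

P_CUTOFF = 0.05  # threshold quoted in the post above

# Models that are not rejected at the cutoff.
not_rejected = [pat for pat, p, _feasible in POPDROP if p >= P_CUTOFF]

# Models that are both not rejected and flagged feasible.
plausible = [pat for pat, p, feasible in POPDROP if p >= P_CUTOFF and feasible]

print("not rejected:", not_rejected)
print("not rejected and feasible:", plausible)
```

In that table, the sub-models that pass the p-value cutoff are all flagged infeasible, and the feasible ones are all rejected, so nothing survives both filters.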

Jovialis · Aug 13, 2023

Yes, I think that is right; to get a better score, something else would be needed. Nevertheless, I think this is a good baseline to work from.

Jovialis · Aug 13, 2023

According to the explanation of the score, even though it is not 0.05, the score indicates that it rejects the null hypothesis, meaning it rejects the argument against using the model. It just doesn't prove the model definitively, but it is a possibility.

Jovialis · Aug 13, 2023

Idontknowwhatimdoing said:
You are not supposed to rank by p-value in qpAdm, keep that in mind. All of these models are rejected, which means they need something else. The p-value here is 0.0000000114, when the p-value required for a plausible model is 0.05.

The algorithm seems to rank them automatically with the f4rank column.

Jovialis · Aug 13, 2023

Archetype0ne said:
Well done getting the tool running, Jov! It's quite a feat on its own, considering the entry barrier (trust me, it gave me a headache or two).
Now it's all about running countless models and figuring out how to build the tails. As IDKWIDoing said, you want the SE < 0.05. Also, negative weights are probably not a good sign for the model. Rule of thumb when using non-basic components that might share some drift (i.e. BA/IA samples instead of EHG, CHG, EEF): two-component models are the best you can hope for, three-component models can yield successful runs with a lot of effort and a bit of luck, and four-component models are almost impossible.

But yeah, building the tail is a skill all on its own. There is a paper that explains how to build a good one depending on the hypothesis you are testing: https://www.biorxiv.org/content/biorxiv/early/2020/04/10/2020.04.09.032664/DC1/embed/media-1.pdf (pages 16-17, very informative on the topic).

Thanks! Indeed, getting it set up was a bit tricky, but after a while I became more comfortable using RStudio.

I also have to say that ChatGPT, particularly the Code Interpreter, has been super helpful in fixing my prompts and explaining errors. I look forward to delving into this further.
