Southern Italian Ethnogenesis (My theory)

I'm making this thread to lay out my theory on the ethnogenesis of southern Italians. It is something I talk about often, and I'd like this thread to serve as a point of reference, rather than reiterating my theory ad nauseam.

Like Ancient Greeks, Southern Italians can be modeled as a Minoan/Steppe admixture. Recent studies show a strong affinity to Ancient Greeks in Southern Italians (Sarno et al. 2021; Raveane et al. 2022). Moreover, Raveane et al. 2022 even uses Minoan as an ultimate source population to model Puglia. In Lazaridis et al. 2017, the Minoan/Steppe admixture model was known as the "Northern Model" and was used to explain the ethnogenesis of Mycenaeans. Clemente et al. 2021 also implicitly uses the Northern Model for Helladics:

t1OdfWFl.png

(Source: Lazaridis lecture graphic)

F2J8PjY.jpg

(Source: Clemente et al. 2021)

epI5kEs.png

(Source: Raveane et al. 2022)

rC3hiavl.png

(Source: Lazaridis lecture graphic)


Some critics have argued that this model fails to incorporate the Eastern Mediterranean influence that arrived in a later period. It does not neglect it; in fact, my analysis shows that both can be true, as the Anatolia_BA cline in South Italy demonstrates. For people obsessed with finding Levantine ancestry in South Italy (sadly, for nefarious reasons), that could also be explained by the fact that the Anatolia_BA component is modeled as 5% "Levantine Farmer". Nevertheless, to their dismay, it is an exceedingly small percentage of overall southern Italian autosomal admixture:

Rol2SxG.png



The Anatolia_BA component could be partly attributed to Aegean Islanders in the Greek colonies. Modern Aegean Islanders can be modeled with Anatolia_BA, some almost completely. Eastern Mediterranean individuals from the Roman Imperial era are genetically similar to modern Aegean Islanders. Thus, from the Iron Age through the Imperial era there would have been ample opportunity for admixture with this Anatolia_BA-heavy group. It should be noted that the R850 Latin sample is also very similar to modern Aegean Islanders.

Modern Mainland Greeks & Aegean Islanders:
aRRsEaN.png


Eastern Mediterranean Imperial Age Romans:
fzyXPiK.png


Magna Graecia:

1QR7Iw2.png


Going back to the chart of the Anatolia_BA cline in modern Southern Italians: it shows that Anatolia_BA is found throughout the whole south. An important caveat is the degree to which it is present in individual samples. Anatolia_BA ranges from minimal admixture to about 50%, and some samples show none at all. I speculate that this is because Southern Italian towns were isolated from one another for a variety of reasons. Some towns were re-founded, and one particular ancestry may have been more prevalent in a given town. Once the region was united, mixing between towns would have produced this sporadic signal to a degree. I don't think South Italy as a whole can be modeled in just one way.

I am sure other admixture events also had some impact, such as those involving the Moors, Saracens, Normans, etc. For the Moors, I think that could explain the higher Iberomaurusian in some modern samples. However, that is also hard to discern, considering it shows up in ancient Italians as well:

ehx19SR.png
 
I'd like to say that I very strongly agree with your points. I have been saying for some time that southern Italians are better modeled with Greek and Anatolian ancestry than with anything Levantine or North African. This accurately reflects the history of the region, as can be seen from the massive Greek cities, which hosted huge populations during the Iron Age and Roman era. The relatively high level of Neolithic S. Caucasian ancestry found in Bronze and Iron Age Anatolians serves as a reliable source of the same type of Caucasian ancestry found in modern Greek islanders and Southern Italians.
 
I have taken the Europe_EN pop from Raveane et al. 2022 and introduced it into the model and outgroups from Sarno et al. 2021.

Then I divided the Italian_South.HO samples into their 4 distinct regions, since the model kept running into issues. It works a lot better when you look at them individually. This is exactly what Raveane and Sarno did with different populations.

Italian_South.HO taken as a whole seems to miss the mark for analysis, despite being composed of similar components.
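For anyone who wants to try the same split, here is a minimal sketch of relabeling, assuming the PLINK .fam file carries the population label in its first column (FID) and that each Italian_South.HO sample becomes its own target. The toy table, the `TSI_example` ID and the helper name `split_population` are made up for illustration; only the four sample labels come from this thread.

```r
# Toy stand-in for the six columns of a PLINK .fam file.
# Only the four Italian_South sample IDs come from this thread; the rest is made up.
fam <- data.frame(
  FID   = c(rep("Italian_South.HO", 4), "TSI.DG"),
  IID   = c("BEL57", "ITS4", "ITS5", "ITS7", "TSI_example"),
  PID   = 0,
  MID   = 0,
  SEX   = 0,
  PHENO = -9,
  stringsAsFactors = FALSE
)

# Give every sample of `pop` its own population label: <new_prefix>_<IID>.
split_population <- function(fam, pop, new_prefix) {
  hit <- fam$FID == pop
  fam$FID[hit] <- paste0(new_prefix, "_", fam$IID[hit])
  fam
}

fam_split <- split_population(fam, "Italian_South.HO", "Italian_South")
print(fam_split[, c("FID", "IID")])
```

The relabeled table could then be written back over a copy of the .fam with `write.table(..., quote = FALSE, row.names = FALSE, col.names = FALSE)`. Keeping the original file untouched makes it easy to go back to the pooled Italian_South.HO label.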

uO8n6sR.png

Code:
> results$weights
# A tibble: 3 × 5
  target   left        weight     se     z
  <chr>    <chr>        <dbl>  <dbl> <dbl>
1 Jovialis Steppe_EMBA  0.274 0.0524  5.23
2 Jovialis Europe_EN    0.580 0.0384 15.1
3 Jovialis CHG_Iran_N   0.146 0.0669  2.19
> results$popdrop
# A tibble: 7 × 14
  pat      wt   dof   chisq         p f4rank Steppe_EMBA Europe_EN CHG_Iran_N feasible best  dofdiff chisqdiff p_nested
  <chr> <dbl> <dbl>   <dbl>     <dbl>  <dbl>       <dbl>     <dbl>      <dbl> <lgl>    <lgl>   <dbl>     <dbl>    <dbl>
1 000       0    10    6.48 7.73e-  1      2       0.274     0.580      0.146 TRUE     NA         NA       NA        NA
2 001       1    11   26.9  4.78e-  3      1       0.356     0.644     NA     TRUE     TRUE        0     -380.        1
3 010       1    11  407.   1.97e- 80      1      -0.170    NA          1.17  FALSE    TRUE        0      299.        0
4 100       1    11  108.   3.76e- 18      1      NA         0.613      0.387 TRUE     TRUE       NA       NA        NA
5 011       2    12 1690.   0              0       1        NA         NA     TRUE     NA         NA       NA        NA
6 101       2    12  358.   2.50e- 69      0      NA         1         NA     TRUE     NA         NA       NA        NA
7 110       2    12  686.   4.20e-139      0      NA        NA          1     TRUE     NA         NA       NA        NA


> results$weights
# A tibble: 3 × 5
  target              left        weight     se     z
  <chr>               <chr>        <dbl>  <dbl> <dbl>
1 Italian_South_BEL57 Steppe_EMBA  0.170 0.0498  3.42
2 Italian_South_BEL57 Europe_EN    0.544 0.0370 14.7
3 Italian_South_BEL57 CHG_Iran_N   0.286 0.0663  4.31
> results$popdrop
# A tibble: 7 × 14
  pat      wt   dof  chisq         p f4rank Steppe_EMBA Europe_EN CHG_Iran_N feasible best  dofdiff chisqdiff p_nested
  <chr> <dbl> <dbl>  <dbl>     <dbl>  <dbl>       <dbl>     <dbl>      <dbl> <lgl>    <lgl>   <dbl>     <dbl>    <dbl>
1 000       0    10   10.8 3.77e-  1      2       0.170     0.544      0.286 TRUE     NA         NA       NA        NA
2 001       1    11   88.7 2.98e- 14      1       0.331     0.669     NA     TRUE     TRUE        0     -327.        1
3 010       1    11  416.  2.98e- 82      1      -0.234    NA          1.23  FALSE    TRUE        0      333.        0
4 100       1    11   82.2 5.57e- 13      1      NA         0.553      0.447 TRUE     TRUE       NA       NA        NA
5 011       2    12 2016.  0              0       1        NA         NA     TRUE     NA         NA       NA        NA
6 101       2    12  427.  7.30e- 84      0      NA         1         NA     TRUE     NA         NA       NA        NA
7 110       2    12  745.  1.21e-151      0      NA        NA          1     TRUE     NA         NA       NA        NA


> results$weights
# A tibble: 3 × 5
  target             left        weight     se     z
  <chr>              <chr>        <dbl>  <dbl> <dbl>
1 Italian_South_ITS4 Steppe_EMBA  0.156 0.0466  3.35
2 Italian_South_ITS4 Europe_EN    0.631 0.0377 16.7
3 Italian_South_ITS4 CHG_Iran_N   0.212 0.0622  3.42
> results$popdrop
# A tibble: 7 × 14
  pat      wt   dof   chisq         p f4rank Steppe_EMBA Europe_EN CHG_Iran_N feasible best  dofdiff chisqdiff p_nested
  <chr> <dbl> <dbl>   <dbl>     <dbl>  <dbl>       <dbl>     <dbl>      <dbl> <lgl>    <lgl>   <dbl>     <dbl>    <dbl>
1 000       0    10    9.33 5.01e-  1      2       0.156     0.631      0.212 TRUE     NA         NA       NA        NA
2 001       1    11   53.3  1.60e-  7      1       0.274     0.726     NA     TRUE     TRUE        0     -465.        1
3 010       1    11  518.   4.55e-104      1      -0.384    NA          1.38  FALSE    TRUE        0      454.        0
4 100       1    11   64.5  1.35e-  9      1      NA         0.645      0.355 TRUE     TRUE       NA       NA        NA
5 011       2    12 2412.   0              0       1        NA         NA     TRUE     NA         NA       NA        NA
6 101       2    12  291.   3.71e- 55      0      NA         1         NA     TRUE     NA         NA       NA        NA
7 110       2    12 1003.   3.49e-207      0      NA        NA          1     TRUE     NA         NA       NA        NA


> results$weights
# A tibble: 3 × 5
  target             left        weight     se     z
  <chr>              <chr>        <dbl>  <dbl> <dbl>
1 Italian_South_ITS5 Steppe_EMBA  0.190 0.0477  3.97
2 Italian_South_ITS5 Europe_EN    0.633 0.0382 16.6
3 Italian_South_ITS5 CHG_Iran_N   0.177 0.0602  2.94
> results$popdrop
# A tibble: 7 × 14
  pat      wt   dof  chisq         p f4rank Steppe_EMBA Europe_EN CHG_Iran_N feasible best  dofdiff chisqdiff p_nested
  <chr> <dbl> <dbl>  <dbl>     <dbl>  <dbl>       <dbl>     <dbl>      <dbl> <lgl>    <lgl>   <dbl>     <dbl>    <dbl>
1 000       0    10   15.4 1.18e-  1      2       0.190     0.633      0.177 TRUE     NA         NA       NA        NA
2 001       1    11   60.9 6.33e-  9      1       0.289     0.711     NA     TRUE     TRUE        0     -463.        1
3 010       1    11  524.  2.13e-105      1      -0.291    NA          1.29  FALSE    TRUE        0      424.        0
4 100       1    11  100.  1.52e- 16      1      NA         0.661      0.339 TRUE     TRUE       NA       NA        NA
5 011       2    12 2314.  0              0       1        NA         NA     TRUE     NA         NA       NA        NA
6 101       2    12  306.  2.25e- 58      0      NA         1         NA     TRUE     NA         NA       NA        NA
7 110       2    12  979.  5.44e-202      0      NA        NA          1     TRUE     NA         NA       NA        NA


> results$weights
# A tibble: 3 × 5
  target             left        weight     se     z
  <chr>              <chr>        <dbl>  <dbl> <dbl>
1 Italian_South_ITS7 Steppe_EMBA  0.233 0.0481  4.84
2 Italian_South_ITS7 Europe_EN    0.561 0.0361 15.5
3 Italian_South_ITS7 CHG_Iran_N   0.207 0.0641  3.22
> results$popdrop
# A tibble: 7 × 14
  pat      wt   dof  chisq         p f4rank Steppe_EMBA Europe_EN CHG_Iran_N feasible best  dofdiff chisqdiff p_nested
  <chr> <dbl> <dbl>  <dbl>     <dbl>  <dbl>       <dbl>     <dbl>      <dbl> <lgl>    <lgl>   <dbl>     <dbl>    <dbl>
1 000       0    10   17.5 6.44e-  2      2       0.233     0.561      0.207 TRUE     NA         NA       NA        NA
2 001       1    11   78.8 2.55e- 12      1       0.346     0.654     NA     TRUE     TRUE        0     -390.        1
3 010       1    11  469.  1.35e- 93      1      -0.187    NA          1.19  FALSE    TRUE        0      336.        0
4 100       1    11  133.  4.93e- 23      1      NA         0.575      0.425 TRUE     TRUE       NA       NA        NA
5 011       2    12 1971.  0              0       1        NA         NA     TRUE     NA         NA       NA        NA
6 101       2    12  447.  4.33e- 88      0      NA         1         NA     TRUE     NA         NA       NA        NA
7 110       2    12  812.  3.58e-166      0      NA        NA          1     TRUE     NA         NA       NA        NA

FAM used:

 
BEL57 looks quite different from the other clusters. To which region does it correspond?
 
I think ITS7 is also from Calabria, if I am not mistaken. I could be wrong. I believe it was used in Lazaridis et al. 2016.

This southern Italian sample set was already being used in older studies. If I remember correctly (I cannot check right now), BEL57 came from Belvedere in Calabria, and the other three were from Campania, Apulia and Basilicata. I remember finding this specified in the supplementary info of some old study. It's just a memory though; I can't find that document at the moment.
 
Yes! I think you are right.
 
I wish the Reich Lab had a more substantial Italian_South sample set.

Even if we had access to more, adding them to the current dataset is a complex task. It was a herculean task just to add my own sample to the database.
 
Here it is with Italian_North.HO and TSI.DG.

They require WHG in addition to Steppe_EMBA, Europe_EN, and CHG_Iran_N for the model to work.

onCfqHPh.png


qpAdm outputs for additions:
Code:
> results$weights
# A tibble: 4 × 5
  target left        weight      se     z
  <chr>  <chr>        <dbl>   <dbl> <dbl>
1 TSI.DG Steppe_EMBA 0.253  0.0249  10.2
2 TSI.DG Europe_EN   0.582  0.0130  44.7
3 TSI.DG CHG_Iran_N  0.137  0.0262   5.23
4 TSI.DG WHG         0.0269 0.00913  2.95
> results$popdrop
# A tibble: 15 × 15
   pat      wt   dof  chisq         p f4rank Steppe_EMBA Europe_EN CHG_Iran_N     WHG feasible best  dofdiff chisqdiff p_nested
   <chr> <dbl> <dbl>  <dbl>     <dbl>  <dbl>       <dbl>     <dbl>      <dbl>   <dbl> <lgl>    <lgl>   <dbl>     <dbl>    <dbl>
 1 0000      0     8   12.0 1.53e-  1      3       0.253     0.582      0.137  0.0269 TRUE     NA         NA      NA         NA
 2 0001      1     9   21.5 1.06e-  2      2       0.297     0.592      0.111 NA      TRUE     TRUE        0     -19.6        1
 3 0010      1     9   41.1 4.81e-  6      2       0.360     0.629     NA      0.0103 TRUE     TRUE        0    -477.         1
 4 0100      1     9  518.  9.75e-106      2      -1.21     NA          1.93   0.274  FALSE    TRUE        0     409.         0
 5 1000      1     9  108.  3.45e- 19      2      NA         0.559      0.364  0.0769 TRUE     TRUE       NA      NA         NA
 6 0011      2    10   43.3 4.35e-  6      1       0.370     0.630     NA     NA      TRUE     NA         NA      NA         NA
 7 0101      2    10  652.  1.47e-133      1      -0.771    NA          1.77  NA      FALSE    NA         NA      NA         NA
 8 0110      2    10 1541.  0              1       1.85     NA         NA     -0.851  FALSE    NA         NA      NA         NA
 9 1001      2    10  216.  6.58e- 41      1      NA         0.571      0.429 NA      TRUE     NA         NA      NA         NA
10 1010      2    10  464.  2.09e- 93      1      NA         0.871     NA      0.129  TRUE     NA         NA      NA         NA
11 1100      2    10  939.  2.39e-195      1      NA        NA          1.02  -0.0167 FALSE    NA         NA      NA         NA
12 0111      3    11 3032.  0              0       1        NA         NA     NA      TRUE     NA         NA      NA         NA
13 1011      3    11  642.  1.59e-130      0      NA         1         NA     NA      TRUE     NA         NA      NA         NA
14 1101      3    11  991.  1.62e-205      0      NA        NA          1     NA      TRUE     NA         NA      NA         NA
15 1110      3    11 3186.  0              0      NA        NA         NA      1      TRUE     NA         NA      NA         NA


> results$weights
# A tibble: 4 × 5
  target           left        weight     se     z
  <chr>            <chr>        <dbl>  <dbl> <dbl>
1 Italian_North.HO Steppe_EMBA 0.295  0.0301  9.82
2 Italian_North.HO Europe_EN   0.600  0.0150 39.9
3 Italian_North.HO CHG_Iran_N  0.0797 0.0300  2.65
4 Italian_North.HO WHG         0.0250 0.0113  2.21
> results$popdrop
# A tibble: 15 × 15
   pat      wt   dof  chisq         p f4rank Steppe_EMBA Europe_EN CHG_Iran_N      WHG feasible best  dofdiff chisqdiff p_nested
   <chr> <dbl> <dbl>  <dbl>     <dbl>  <dbl>       <dbl>     <dbl>      <dbl>    <dbl> <lgl>    <lgl>   <dbl>     <dbl>    <dbl>
 1 0000      0     8   10.0 2.62e-  1      3       0.295     0.600     0.0797  0.0250  TRUE     NA         NA     NA          NA
 2 0001      1     9   15.3 8.39e-  2      2       0.338     0.607     0.0545 NA       TRUE     TRUE        0     -3.19        1
 3 0010      1     9   18.5 3.02e-  2      2       0.361     0.626    NA       0.0133  TRUE     TRUE        0   -530.          1
 4 0100      1     9  548.  2.85e-112      2      -1.26     NA         1.96    0.299   FALSE    TRUE        0    431.          0
 5 1000      1     9  117.  4.68e- 21      2      NA         0.587     0.323   0.0899  TRUE     TRUE       NA     NA          NA
 6 0011      2    10   20.8 2.25e-  2      1       0.374     0.626    NA      NA       TRUE     NA         NA     NA          NA
 7 0101      2    10  682.  4.47e-140      1      -0.751    NA         1.75   NA       FALSE    NA         NA     NA          NA
 8 0110      2    10 1546.  0              1       1.86     NA        NA      -0.862   FALSE    NA         NA     NA          NA
 9 1001      2    10  236.  3.77e- 45      1      NA         0.595     0.405  NA       TRUE     NA         NA     NA          NA
10 1010      2    10  356.  1.90e- 70      1      NA         0.850    NA       0.150   TRUE     NA         NA     NA          NA
11 1100      2    10 1005.  1.63e-209      1      NA        NA         1.01   -0.00746 FALSE    NA         NA     NA          NA
12 0111      3    11 2843.  0              0       1        NA        NA      NA       TRUE     NA         NA     NA          NA
13 1011      3    11  556.  2.89e-112      0      NA         1        NA      NA       TRUE     NA         NA     NA          NA
14 1101      3    11 1046.  2.41e-217      0      NA        NA         1      NA       TRUE     NA         NA     NA          NA
15 1110      3    11 3134.  0              0      NA        NA        NA       1       TRUE     NA         NA     NA          NA

Also, this was the script I used:

Code:
# Define paths for dataset (PLINK prefix and f2 cache directory)
prefix = "D:\\Bioinformatics\\01_Admixtools_Dataset\\v54.1.p1_HO_Jovialis_Plink\\v54.1.p1_HO_Jovialis"
my_f2_dir = "D:\\Bioinformatics\\my_f2_dir_Jovialis"


# Load necessary libraries
library(admixtools)
library(tidyverse)


# Define populations
# Other targets, run one at a time: 'Italian_South_BEL57', 'Italian_South_ITS4', 'Italian_South_ITS5', 'Italian_South_ITS7', 'TSI.DG', 'Italian_North.HO'
target = c('Jovialis')
# WHG was only needed for TSI.DG and Italian_North.HO; drop it for the three-source models
left = c('Steppe_EMBA', 'Europe_EN', 'CHG_Iran_N', 'WHG')


# Right list (outgroups)
right = c('Anatolia_N', 'Ust_Ishim', 'Kostenki14', 'MA1_HG', 'Goyet', 'ElMiron', 'Vestonice16', 'Villabruna', 'EHG', 'Levant_N', 'Natufian', 'Mota')


# Generate f2 stats
mypops = c(right, target, left)
extract_f2(prefix, my_f2_dir, pops = mypops, overwrite = TRUE, maxmiss = 1)
f2_blocks = f2_from_precomp(my_f2_dir, pops = mypops, afprod = TRUE)


# Run the model
# Note: the call below is given the genotype prefix (with allsnps = TRUE); f2_blocks is not passed to it
results = qpadm(prefix, left, right, target, allsnps = TRUE)
results$weights
results$popdrop
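Since the script above handles one target at a time, here is a rough sketch of how the seven runs could be batched. It assumes the source lists shown in the outputs (three sources for the southern targets, WHG added for TSI.DG and Italian_North.HO). The names `models` and `run_all` are hypothetical, and the fit function is passed in so the loop can be dry-run without the genotype data.

```r
# Source lists as seen in the outputs: WHG only for the northern/Tuscan sets.
south_left <- c("Steppe_EMBA", "Europe_EN", "CHG_Iran_N")
north_left <- c(south_left, "WHG")

# One entry per target, holding the left (source) populations for that target.
models <- c(
  setNames(rep(list(south_left), 5),
           c("Jovialis", "Italian_South_BEL57", "Italian_South_ITS4",
             "Italian_South_ITS5", "Italian_South_ITS7")),
  setNames(rep(list(north_left), 2), c("TSI.DG", "Italian_North.HO"))
)

# Apply `fit` to every target and return the results as a named list.
run_all <- function(models, fit) {
  out <- lapply(names(models), function(tg) fit(left = models[[tg]], target = tg))
  setNames(out, names(models))
}

# Dry run with a stub that only formats the model, so no data is needed.
dry_run <- run_all(models, function(left, target) {
  paste(target, "~", paste(left, collapse = " + "))
})
print(unlist(dry_run, use.names = FALSE))

# With admixtools loaded and `prefix` and `right` defined as in the script above:
# fits <- run_all(models, function(left, target)
#   qpadm(prefix, left, right, target, allsnps = TRUE))
# fits[["TSI.DG"]]$weights
```

Keeping the source list next to each target makes it harder to accidentally run a southern target with WHG still in the left list.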
 
ChatGPT 4.0 assessment of all the qpAdm outputs:

The given qpAdm outputs represent a quantitative analysis aimed at unraveling the ancestral composition of various population samples. qpAdm is a statistical tool used in population genetics to model the ancestry of test populations as a mixture of several source populations. The primary samples under consideration are 'Jovialis', 'Italian_South_BEL57', 'Italian_South_ITS4', 'Italian_South_ITS5', 'Italian_South_ITS7', 'TSI.DG', and 'Italian_North.HO'. These samples are modeled using source populations 'Steppe_EMBA', 'Europe_EN', 'CHG_Iran_N', and in some cases, 'WHG'.

### Ancestral Weights:
The ancestral weights for each target population are given in the 'results$weights' tibble. The weights represent the proportions of ancestry derived from the corresponding source populations. The 'se' column signifies the standard error associated with the estimated weight, and 'z' indicates the z-score, reflecting the significance of the weight.

#### Observations:
- Across all target populations, 'Europe_EN' consistently has the highest weight, signifying a dominant European Early Neolithic ancestry component.
- 'Steppe_EMBA' and 'CHG_Iran_N' also contribute to the ancestry, but to varying degrees across different target populations.
- 'WHG' (Western Hunter-Gatherers) contributes minimally to 'TSI.DG' and 'Italian_North.HO'.

### Model Feasibility and Model Selection:
The 'results$popdrop' tibble provides details on model feasibility and nested hypothesis testing. The 'feasible' column indicates whether all estimated weights fall between 0 and 1 (models with negative weights are marked FALSE), and the 'pat' column shows which source populations were dropped (a '1' marks a dropped source).

#### Observations:
- The full models (pat = '000', '0000'), which keep every source population, are feasible and have p > 0.05 for all target populations, indicating that the inclusion of all source populations in the model is statistically supported.
- For each target, reduced models, where one or more source populations are dropped (indicated by '1' in the 'pat' column), are also tested. Some of these reduced models are feasible, but their p-values are lower than those of the full models in every case, suggesting that the full models with all source populations are the best supported by the data.
- The 'chisq' and 'p' columns provide information on the goodness-of-fit of the models. Lower 'chisq' values and higher 'p' values indicate better fit.

### Summary:
The analysis reveals a complex ancestral composition for the target populations, with a predominant contribution from the 'Europe_EN' source population. The variability in weights across different targets suggests diverse ancestral influences. The simplest models including all source populations are consistently supported, indicating the necessity of multiple source populations to accurately model the ancestry of the target populations.

This assessment provides insights into the ancestral makeup of the studied populations and underscores the importance of considering multiple ancestral sources in population genetic studies. Further research may delve deeper into the historical and geographical contexts of these populations to elucidate the migration patterns and interactions that have shaped their genetic landscape.
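As a sanity check on the columns described above, the z-scores and the full-model p-value can be recomputed from the printed numbers. This is only a sketch using the rounded Jovialis values from the output, and it assumes the p-value is the upper tail of a chi-square distribution with `dof` degrees of freedom, which does reproduce the printed figure.

```r
# Values copied from the Jovialis weights table above (already rounded).
weights <- data.frame(
  left   = c("Steppe_EMBA", "Europe_EN", "CHG_Iran_N"),
  weight = c(0.274, 0.580, 0.146),
  se     = c(0.0524, 0.0384, 0.0669),
  stringsAsFactors = FALSE
)

# z is the weight divided by its standard error.
weights$z <- weights$weight / weights$se
print(weights)

# Full model (pat 000): chisq = 6.48 on 10 degrees of freedom.
p_full <- pchisq(6.48, df = 10, lower.tail = FALSE)
print(p_full)
```

Small differences from the printed z column are expected, because the inputs here are the rounded values rather than the full-precision estimates.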
 
xswyjDB.png


Only ITS5 had a sub-optimal p-value, but the other metrics looked good.

Here are modern Italian populations using C_Italian_N and C_Italian_ChL, along with Steppe_EMBA, and Iran_N_CHG.

Since TSI and the North can be modeled with WHG, the model works with the ChL central Italian sample, as there was a WHG resurgence there. I suspect this resurgence did not reach the south, which was instead enriched with extra non-steppe-related CHG/IN. I look forward to using the Calabrian_N for the south.


 
3MMQb8h.png


I'll make a thread and post the raw results from ADMIXTOOLS 2, with further explanation, when I get a chance tomorrow night. But it is 2:45 am right now, and I've got things to do in the morning.
 
From Lazaridis et al. 2016 supplemental information excel:

ITS2    F    Italian_South    Italian_South    Italy    Naples
ITS4    F    Italian_South    Italian_South    Italy    Naples
ITS5    F    Italian_South    Italian_South    Italy    Salerno
ITS7    F    Italian_South    Italian_South    Italy    Crispiano
 
