1. Elevated modern frequency does not equal place of origin
Hundreds of times I have read people claiming that a haplogroup must have originated where it is most common today. It is an assumption that even professional geneticists make, and one that is nevertheless often mistaken.
One famous example is Y-haplogroup R1b. Until recently most people, amateurs and professionals alike, thought that it must be native to western Europe because that is where it is found at the highest frequencies. The Genographic Project still hasn't changed its description of it. It reads: "30,000 years ago, a descendant of the clan making its way into Europe gave rise to marker M343, then defining marker of this haplogroup. These people dominated the human expansion into Europe, the Cro-Magnons."
There are plenty of other examples. Sometimes the place of highest frequency does coincide with the region of origin. This is usually true of subclades that have developed in an isolated region, or of relatively recent mutations. The first rule is: the older the haplogroup, the less likely its place of origin is to coincide with the place of highest frequency.
Y-haplogroup Q is found mostly in Siberia (Altai region) and among Native Americans. Judging from percentages alone it would be easy to jump to the wrong conclusion that it originated in the pre-Columbian Americas. This is admittedly a caricature of an example, as everybody knows that the Americas were the last continents settled by humans. But the mistake made by National Geographic and plenty of others regarding R1b is just the same. That's why it is vital to look at the age of subclades and identify where the oldest version is found. In this case, Q*, the oldest form of Q, is found in Central Asia and the Middle East. This is unfortunately too wide an area to pinpoint a place of origin.
This leads to my second rule: Paleolithic people were not sedentary but moved all the time, making it pointless to try to define a small geographic region as a Paleolithic haplogroup's place of origin.
It is often in isolated regions with low population density that older versions of haplogroups survive. Apart from Q*, Central Asia is a great region for preserving haplogroups that have disappeared almost everywhere else (e.g. O*, P*, K*). The Caucasus is another good example. Its high mountains have isolated ethnic groups from one another for millennia. Unsurprisingly it is almost the only region where haplogroup F, a haplogroup that originated some 50,000 years ago, can still be found at high, and indeed sometimes very high, frequencies. Isolation and the near absence of population inflow from the outside are one factor, but the small size of mountain populations is another: fewer men means fewer new mutations arising to define new haplogroups. The fact that F is found almost exclusively in the Caucasus nowadays does not mean that it originated there. It just survived there in its oldest form, but evolved elsewhere. 90% of the world population descends from F*. That's why it is important not to confuse modern distribution with place of origin. F almost certainly did not originate in the Caucasus, otherwise it would have remained stuck there rather than spreading to the whole world.
It is commonplace to refer to the southern Arabian peninsula as the place of origin of J1. After all, over 70% of men in Yemen belong to that haplogroup. But things are never as simple as they look at first sight. We have to ask: why did this haplogroup become dominant in that region, and not another one also found there? It could descend from a small group of original settlers who moved into a previously uninhabited region. If the other haplogroups represent later immigrants, the first haplogroup present would have remained the dominant one, unless the new migrants came in huge numbers or killed the original inhabitants.
But how can we know if J1 in Yemen arrived first in an empty place or if it replaced indigenous haplogroups? R1b did replace most of the older haplogroups in western Europe and so did O in East Asia (the aboriginal East Asians belonging to C and D).
Even if Yemen was uninhabited before J1, or all the older lineages became extinct, how can we know if the people who arrived were already J1 or were J* that later developed into J1 in Yemen, then re-expanded northward? To answer this question geneticists will usually analyse the genetic diversity of various regions. Theoretically, the place where J1 originated would be the one where the most subclades and the highest STR variance can be observed. As we will now see, the theory always seems simpler than it actually is in practice.
2. Genetic diversity does not equal place of origin
Another common mistake is to think that a haplogroup's place of origin corresponds to the area where it has the greatest genetic diversity (e.g. microsatellite diversity, number of subclades). The reasoning goes that if region A has 10 different subclades for a haplogroup, with a coalescence age (TMRCA) going back 15,000 years, but region B has only 2 subclades and a TMRCA going back only 8,000 years, then region A is more likely to be the place of origin. The concept is attractive, but unfortunately too simple. It doesn't take into account two essential factors: 1) the population size of a region and 2) the region's history (invasions, migrations, genocides).
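To make "microsatellite diversity" concrete, here is a minimal Python sketch of one common way to put a number on it: the variance in STR repeat counts, averaged over loci. The function name `mean_str_variance` and the haplotypes are invented for illustration; they are toy values, not real data from any region. A higher figure only says a sample is more diverse, not why, which is the point of this section.

```python
from statistics import pvariance


def mean_str_variance(haplotypes):
    """Average, over STR loci, of the variance in repeat counts.

    `haplotypes` is a list of tuples, one tuple per sampled man,
    each holding the repeat count at every locus in the same order.
    """
    loci = list(zip(*haplotypes))  # transpose: one tuple of counts per locus
    return sum(pvariance(locus) for locus in loci) / len(loci)


# Invented toy samples: four men per region, three hypothetical STR loci.
region_a = [(13, 24, 14), (14, 23, 15), (12, 25, 14), (13, 24, 16)]
region_b = [(13, 24, 14), (13, 24, 14), (14, 24, 14), (13, 24, 14)]

var_a = mean_str_variance(region_a)
var_b = mean_str_variance(region_b)
print(f"region A: {var_a:.4f}, region B: {var_b:.4f}")
```

With these made-up samples region A comes out as the more diverse one. Taken alone, that result cannot distinguish "older in A" from "larger population in A" or "A received several waves of migrants".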
First, one has to consider the historical and present population size of the regions studied. All else being equal, new mutations, and therefore new genetic diversity, arise 100 times more often in a population of one million individuals than in one of 10,000. Take 200 men belonging to the same Y-haplogroup and divide them into two groups, one in an unfriendly environment with little food (e.g. Siberia) and the other in a pleasant climate favourable to agriculture (e.g. India). Let's say that after a thousand years the first group has 1,000 descendants carrying the same haplogroup, while the second has 100,000 descendants. Assume that population growth has been constant, with no major war, famine or epidemic causing a population bottleneck in between. In this theoretical example (because wars, famines and epidemics do happen) it is easy to see why the second group should have a much greater genetic diversity than the first one after one thousand years, although they both descend from the same original lineage!
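The arithmetic behind this thought experiment can be sketched in a few lines of Python. The sketch assumes a generation length of 25 years (so 40 generations in a thousand years), smooth geometric growth, and an arbitrary per-transmission mutation probability `MU`. None of these figures come from the text above; they are placeholders, and the ratios do not depend on the value of `MU`.

```python
def expected_new_mutations(population, mu):
    """Expected number of new mutations in one generation of `population` men."""
    return population * mu


def cumulative_transmissions(start, end, generations):
    """Total father-to-son transmissions while a lineage grows
    geometrically from `start` to `end` men over `generations` generations."""
    growth = (end / start) ** (1 / generations)
    return sum(start * growth ** g for g in range(1, generations + 1))


MU = 0.002        # hypothetical mutation probability per transmission
GENERATIONS = 40  # 1,000 years at an assumed 25 years per generation

# One million men versus 10,000 men in a single generation.
size_ratio = expected_new_mutations(1_000_000, MU) / expected_new_mutations(10_000, MU)

# 200 founders split into two groups of 100 that grow at different rates.
harsh = cumulative_transmissions(100, 1_000, GENERATIONS)
fertile = cumulative_transmissions(100, 100_000, GENERATIONS)
opportunity_ratio = fertile / harsh

print(f"per-generation ratio: {size_ratio:.0f}")
print(f"expected mutations, harsh region: {harsh * MU:.1f}")
print(f"expected mutations, fertile region: {fertile * MU:.1f}")
print(f"fertile / harsh: {opportunity_ratio:.1f}")
```

Under these assumptions the fertile-land group accumulates several dozen times more opportunities for new mutations than the other group, even though both started from the same 100 men on the same date.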
This is probably what happened with such haplogroups as R1a1a (Y-DNA) and U2 (mtDNA) in India, as opposed to their likely place of origin in the Eurasian steppe. The same thing happened with haplogroup O, which most probably originated in Central Asia, but gained great diversity in the much more fertile lands of East and South-East Asia.
The second fundamental point that should never be overlooked is regional history. How often was a region invaded? Was it settled just once, like Iceland or Polynesia, or was it constantly overrun by nomadic neighbours, like the Balkans, the Middle East, or northern China? Did people migrate en masse to other places, like the Germanic tribes in the 5th century, or was it a place where people came to settle, like Italy or Anatolia? Did invaders massacre the locals that they saw as inferior, or did they just take over power as new rulers of a well-established kingdom or empire whose culture they regarded as superior to their own? Did some kind of apartheid happen between more developed newcomers and more primitive indigenous people, as was probably the case in Europe between Mesolithic hunter-gatherers and Neolithic farmers and herders?
These are all essential questions that one should ask when studying population genetics. Unfortunately they are typically the elements least considered by professional geneticists, who tend to have a very poor background in history and archaeology.
Regions that nomads saw as attractive to plunder, conquer or resettle in will undoubtedly have inherited some of these invaders' haplogroups. In Europe and the Middle East the most advanced societies from the Neolithic until the Renaissance (c. 8500 BCE to 1500 CE, so a period of 10,000 years) were found between Mesopotamia and the Balkans (plus Italy from the heyday of the Roman Empire). This region also maintained the largest populations in the biggest cities outside India and China.
Furthermore, the biggest reservoir for nomadic incursions was just across the Caucasus: the huge Eurasian steppe, ranging from the Danube estuary to Central Asia and Mongolia. Mesopotamia has the world's longest recorded history, and it is but a succession of invasions by steppe peoples, be they Indo-European, Mongolian or Turkic. It is therefore unsurprising that a high level of genetic diversity from steppe haplogroups (such as R1a1a, and probably also R1b1b) should be found between Mesopotamia and the Balkans. Note that Egypt was better preserved from these invasions thanks to its distance and geographic isolation.
Some geneticists have argued for the Middle East or the Balkans as the place of origin of R1a1a or R1b1b based on the genetic diversity found in those regions (the Balkans for R1a1a and the Levant or Mesopotamia for R1b1b). This is however likely to be just the result of millennia of steppe invasions. The same phenomenon can be observed in India/Pakistan with R1a1a. How better to explain the great genetic diversity of R1a1a in such distant places as the Balkans and the Indian subcontinent than as a result of waves of migrations by different steppe peoples from the Bronze Age onwards? Add to this that the migrants would have had more offspring in the newly conquered fertile lands than in their native steppes, and that explains it all.