How many samples are statistically significant?



JAK2
25-11-10, 17:06
I am a newcomer to this topic and, as a former MD, I am not really convinced that 400 to 2,400 samples can serve as a reference for a true result...
In a field as complex as population genetics, what does a result obtained from 2,450 samples mean for a country of 80 million, as I just read for Germany in the forum's results tables(!), let alone the far smaller samples, down to a few hundred, used for many other places and groups?
I have read criticism of attitudes regarded as political agendas and of the misuse of proportions (treating as a definitive majority a result that concerns only 45% of an ethnic group... what about all the other people who don't fit in the box?).
Rapid progress shows that some peremptory results are in fact controversial...
Who can give me more information along these lines?
Thanks a lot
Warmest regards
:disappointed:

Maciamo
03-10-11, 11:59
Sorry for the very delayed reply. I sometimes miss some threads.

The number of samples required to obtain statistically significant results depends chiefly on two factors :

1) the size of the population tested. Obviously the sample size for Luxembourg can be smaller than for Germany, and Germany smaller than China.

2) the heterogeneity of the population tested. Some modern populations have grown very fast over the course of the last few centuries, while others have grown more steadily over the ages. I pointed out in a thread 3 years ago (http://www.eupedia.com/forum/showthread.php?25061-Historical-populations-of-Europe-changing-proportions) that in the early 19th century, Belgium was twice as populous as the Netherlands, while today the latter has a population 60% bigger than Belgium's. Within Belgium, Wallonia used to be more populous than Flanders a few centuries ago. Flanders is now nearly twice as populous due to much faster growth in the 20th century.

In 1350, France had a population of 20 million, about a third of today's. If we deduct all the people with foreign surnames in France (immigration of the last few centuries), we see that the French population has only grown 2.5-fold in the last 660 years or so, which is very little. In comparison, in 1350 Britain had a population of roughly 4 million (3m in England), Poland 2 million and Russia 8 million. These countries' populations have grown approximately 15- to 20-fold. Italy had 10 million and Spain 7 million; each experienced about a 6-fold increase.

So it's only natural that the genetic diversity should be higher in countries like France, Belgium, Italy and Spain than in northern or eastern Europe. In fact, the size of the historical population since the Middle Ages is fairly well reflected by the diversity of surnames (http://www.eupedia.com/europe/european_family_names.shtml). Italy, France and Belgium have the highest number of surnames per capita in Europe, while Scandinavia, the British Isles and most Slavic countries have among the lowest.

The second factor is the most important, yet also the most overlooked.

So what is the minimum sample size necessary to be relevant? In northern and eastern Europe, where the medieval population density was much lower than in the former Roman Empire, I would say that 50 samples per million inhabitants (today) already gives a pretty good idea. This means 3000 samples for Britain, 2000 for Poland, or 250 for Denmark or Finland. Countries like Ireland clearly have more than enough Y-DNA samples to give quite an accurate picture. For countries like France, Belgium, Italy or Greece, 250 samples per million inhabitants are necessary, and they need to be selected carefully to cover every region, as there are often major disparities even between small adjacent regions (e.g. Cantabria vs Basque country, or Crete vs Peloponnese, or Auvergne vs Rhône-Alpes). In other words, Belgium and Greece would need 2500 samples, France and Italy 15,000 samples.

Spain and Portugal are a bit different because a large part of the medieval (Muslim and Jewish) population was expelled in the 15th century, and the modern population therefore grew from a smaller portion of the medieval population, which explains why the surname diversity is also lower. I would place them in an intermediate category, along with Germany, and estimate that 100 samples per million inhabitants is representative enough (so 1000 samples for Portugal, 4000 for Spain, and 8000 for Germany).
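
The per-million rule in this post reduces to simple arithmetic. Here is a minimal Python sketch of it, assuming round populations back-derived from the totals quoted above (e.g. 3000 / 50 = 60 million for Britain) rather than census figures. The category labels and the function name target_samples are made up for illustration.

```python
# Samples per million inhabitants for each category, as proposed in the post.
RATE_PER_MILLION = {"low": 50, "intermediate": 100, "high": 250}

# (population in millions, category). The populations are ASSUMPTIONS:
# round numbers back-derived from the post's totals, not census data.
COUNTRIES = {
    "Britain": (60, "low"),
    "Poland": (40, "low"),
    "Denmark": (5, "low"),
    "Portugal": (10, "intermediate"),
    "Spain": (40, "intermediate"),
    "Germany": (80, "intermediate"),
    "Belgium": (10, "high"),
    "Greece": (10, "high"),
    "France": (60, "high"),
    "Italy": (60, "high"),
}


def target_samples(population_millions, category):
    """Return the suggested sample count for one country."""
    return round(population_millions * RATE_PER_MILLION[category])


for name, (population, category) in COUNTRIES.items():
    needed = target_samples(population, category)
    print(f"{name:9s} {category:12s} {needed:6d} samples")
```

Swapping in real census populations would shift the totals somewhat (Spain, for instance, would come out above 4000), which is why the figures in the post are best read as orders of magnitude.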

MarTyro
03-10-11, 17:16
minimum sample size necessary to be relevant
France and Italy 15,000 samples
8000 for Germany
4000 for Spain
3000 samples for Britain
Belgium and Greece 2500 samples
2000 for Poland
1000 samples for Portugal
250 for Denmark or Finland
Interesting. I guess France, Italy and also Germany and Spain are still far from reaching that size. I would also add Switzerland and Austria as important Alpine refugia, the Balkans as an important old melting pot, and Hungary/Czechia/Slovakia as indicators of some immigration. The haplogroup definitions and nomenclature also seem important to me: older studies can cause problems. So we must expect some news (subclades, haplogroup enclaves, etc.). A central haplogroup distribution database should be built (maybe as an EU project?), where every scientist and interested researcher could run their own calculations; but is there really none?

realdealt
15-09-13, 10:36
So you are saying a sample size of 1000 for Portugal and 4000 for Spain......hmmm.....I have a map of R1b-M269 distribution at IberianRoots based on 1608 samples for Portugal and 2032 for Spain (3640 in total across Iberia). I believe another key factor in sampling bias is the geographic distribution of the samples taken, not just heterogeneity (as I understand it). The samples are heterogeneous because of all the other haplogroups they are grouped with, but they also need a fairly good geographic distribution for the map to be "accurate".

What if I have 11000 samples from lineages originating in Iberia but now scattered primarily across the Americas? Wouldn't this be regarded as a decent unbiased sample set?
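
To illustrate why geographic distribution matters independently of the total count, here is a small Python sketch comparing a pooled frequency with a population-weighted one. Every number in it is hypothetical (three invented regions with invented population shares, sample sizes and carrier counts), and the function names are made up for the example.

```python
# Hypothetical regions: (share of national population, samples taken, carriers found).
# All values are invented for illustration; they describe no real dataset.
REGIONS = {
    "north": (0.2, 800, 560),   # heavily sampled, haplogroup frequent
    "centre": (0.5, 150, 75),   # most of the population, few samples
    "south": (0.3, 50, 15),     # barely sampled
}


def pooled_frequency(regions):
    """Naive estimate: all samples thrown together, ignoring where they came from."""
    total_samples = sum(n for _, n, _ in regions.values())
    total_carriers = sum(k for _, _, k in regions.values())
    return total_carriers / total_samples


def weighted_frequency(regions):
    """Each region's frequency weighted by its share of the population."""
    return sum(share * k / n for share, n, k in regions.values())


print(f"pooled:   {pooled_frequency(REGIONS):.2f}")    # prints 0.65
print(f"weighted: {weighted_frequency(REGIONS):.2f}")  # prints 0.48
```

In this made-up case the same 1000 samples give 65% or 48% depending on whether the over-sampled region is allowed to dominate, so a large total does not by itself make a map "accurate".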

Yaan
15-09-13, 10:46
Everything below 500 is a joke. So at least 500 :)

JAK2
16-09-13, 09:22
I thank you all for your very interesting answers, which confirm my ideas about the lack of seriousness of most results in this topic...
"Grandma's tales", as the Jews say of fanciful stories...!
Jak2

MOESAN
27-09-13, 21:22
I agree the samples are still a bit small for such big countries, but we already have some outlines that do not seem to be the pure effect of chance when we know the history...
some regions in Europe are pretty well sampled, others (mainly in France, I think) are very scarcely sampled -
with Maciamo we have had the opportunity to see some initially surprising %s become less dubious (for instance in Denmark, Iceland, Norway and Sweden, I think) - we also saw some surprising new results about Italy showing that we need more data
on the whole, the very dominant HGs do not show big variations - but sample size is very important for minor HGs - (I have made mistakes and built unfounded theories on tiny numbers before, but they were bets! and it helps to keep my brain at work) -
for autosomals I think we don't need such big samples

have a good evening
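
The point about minor haplogroups can be illustrated with the usual normal-approximation margin of error for a proportion. This is a rough Python sketch, assuming a simple random sample and a 95% confidence level (z = 1.96). The frequencies 0.50 and 0.02 are hypothetical, and the approximation itself becomes unreliable for rare haplogroups in small samples.

```python
import math

Z_95 = 1.96  # z-score for a 95% confidence level


def margin_of_error(freq, n, z=Z_95):
    """Half-width of the normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(freq * (1.0 - freq) / n)


def relative_margin(freq, n, z=Z_95):
    """Margin of error expressed as a fraction of the frequency itself."""
    return margin_of_error(freq, n, z) / freq


# Hypothetical frequencies: one dominant and one minor haplogroup.
for label, freq in (("dominant HG", 0.50), ("minor HG", 0.02)):
    for n in (100, 500, 2500):
        moe = margin_of_error(freq, n)
        rel = relative_margin(freq, n)
        print(f"{label:11s} n={n:5d}  {freq:.0%} +/- {moe:.1%}  (relative: {rel:.0%})")
```

With these made-up frequencies, 500 samples pin a 50% haplogroup down to within about 4 percentage points, while the margin on a 2% haplogroup is more than half of its own value.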