Data quality control

Bgtrak

Hello guys,

The latest data for the SouthernArc publications is now available. The zipped dataset has 5940 individuals, both previously known and new.
Of these, 778 individuals are new.

Because of missing data, I had to do the full quality control on the provided data:

plink --memory 12000 --threads 2 --bfile SouthernArc_Public --autosome --geno 0.4 --mind 0.4 --maf 0.05 --nonfounders --allow-no-sex --recode --out

After QC, 2150 individuals were left (both previously known and new). Of these 2150, only 336 are new. The other 442 new individuals failed the --mind 0.4 filter, i.e. they have more than 40% missing genotypes (a call rate below 60%).
 
Many will never make it onto the YFull Y-tree, methinks (although, AFAIK, the Reich Lab used yfull.com services to determine the haplogroups). I downloaded one sample's .fastq.gz and .bam files: roughly 123 MB and 54 MB respectively (!) ...
I tried a couple of times to upload samples from another study on my own dime, got rejected by YFull and gave up. The mtDNA FASTA might be fine, though (I managed to upload one result from a study successfully).
 
How QC works in PLINK

Missingness per SNP: --geno value

Missingness per individual: --mind value

Minor allele frequency: --maf value

Some explanation of QC: genomicsbootcamp.github.io/book/genotype-data-quality-control.html
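To make the three filters concrete, here is a minimal Python sketch of the statistics they are based on, computed on a made-up toy genotype matrix (rows are individuals, columns are SNPs, `None` is a missing call). The function names and the toy data are my own illustration, not anything from PLINK or from the dataset. Each statistic is computed on the full matrix; PLINK applies its filters in sequence, so real counts can differ from a naive independent calculation.

```python
from typing import List, Optional

# 0, 1 or 2 copies of the alternate allele; None marks a missing call.
Genotype = Optional[int]
Matrix = List[List[Genotype]]


def snp_missingness(matrix: Matrix, snp: int) -> float:
    """Fraction of individuals with no call at this SNP (what --geno thresholds)."""
    column = [row[snp] for row in matrix]
    return sum(g is None for g in column) / len(column)


def sample_missingness(row: List[Genotype]) -> float:
    """Fraction of SNPs with no call in this individual (what --mind thresholds)."""
    return sum(g is None for g in row) / len(row)


def minor_allele_freq(matrix: Matrix, snp: int) -> float:
    """Frequency of the rarer allele among called genotypes (what --maf thresholds)."""
    called = [row[snp] for row in matrix if row[snp] is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


# Toy data: 4 individuals x 4 SNPs (hypothetical values).
toy: Matrix = [
    [0, 1, None, 2],
    [0, None, None, 2],
    [1, 1, None, 2],
    [None, 2, 0, 2],
]

if __name__ == "__main__":
    for j in range(4):
        print("SNP", j, snp_missingness(toy, j), minor_allele_freq(toy, j))
    for i, row in enumerate(toy):
        print("individual", i, sample_missingness(row))
```

With thresholds of 0.4 / 0.4 / 0.05 as in the command above, the third SNP (75% missing) would fail --geno, the second individual (50% missing) would fail --mind, and the fourth SNP (monomorphic) would fail --maf.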
 
The data quality check on the latest dataset published by Lazaridis et al. shows a lot of missingness, both per SNP and per individual.
Only 336 of the new individuals have less than 40% missingness.
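If anyone wants to check the per-individual numbers themselves, one way is to run PLINK with `--missing`, which writes an `.imiss` report with an `F_MISS` column (fraction of missing genotypes per individual), and then count who is at or below the threshold. A rough Python sketch follows; the sample IDs and values are invented for illustration, and the function name is hypothetical.

```python
from typing import Iterable, List


def passing_individuals(imiss_lines: Iterable[str], max_missing: float = 0.4) -> List[str]:
    """Return the IIDs whose F_MISS is at or below max_missing.

    Expects the whitespace-delimited layout of a PLINK .imiss report,
    with a header row containing at least IID and F_MISS.
    """
    header = None
    iid_col = fmiss_col = -1
    passing: List[str] = []
    for line in imiss_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if header is None:
            header = fields
            iid_col = header.index("IID")
            fmiss_col = header.index("F_MISS")
            continue
        if float(fields[fmiss_col]) <= max_missing:
            passing.append(fields[iid_col])
    return passing


# Made-up example in .imiss layout (not real samples from the dataset).
example = """ FID  IID MISS_PHENO N_MISS N_GENO F_MISS
  F1   S1          Y    100   1000    0.1
  F2   S2          Y    700   1000    0.7
  F3   S3          Y    400   1000    0.4
""".splitlines()

if __name__ == "__main__":
    print(passing_individuals(example))
```

On a real run you would pass the lines of the `.imiss` file instead of the `example` list.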
 
Has anyone else run quality control on the newly published data?
What I am saying is that only about 340 of the new individuals have less than 40% missingness. The other 438 or so were not fully genotyped or do not have enough coverage. Unfortunately, the newly published data is already merged with the old data, so if you want only the new samples you have to extract them from the full dataset.
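For the extraction step, PLINK's `--keep` flag takes a text file with one family ID and individual ID per line. A minimal Python sketch for building such a file from the `.fam` file is below; it assumes you already have a list of the new sample IDs (for example from the paper's supplement), and the function name, the IDs and the output file name mentioned afterwards are all hypothetical.

```python
from typing import Iterable, List


def keep_lines(fam_lines: Iterable[str], new_ids: Iterable[str]) -> List[str]:
    """Build 'FID<TAB>IID' lines for every .fam row whose IID is in new_ids.

    A .fam row is whitespace-delimited: FID IID father mother sex phenotype.
    Output order follows the .fam file, not the order of new_ids.
    """
    wanted = set(new_ids)
    out: List[str] = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[1] in wanted:
            out.append(f"{fields[0]}\t{fields[1]}")
    return out


# Made-up .fam rows and ID list, for illustration only.
fam_example = [
    "F1 S1 0 0 1 -9",
    "F2 S2 0 0 2 -9",
    "F3 S3 0 0 0 -9",
]
new_sample_ids = ["S3", "S1"]

if __name__ == "__main__":
    print("\n".join(keep_lines(fam_example, new_sample_ids)))
```

Write the result to a file (say new_samples.txt) and pass it to PLINK with `--keep new_samples.txt` alongside `--bfile SouthernArc_Public` to get a dataset with only the new individuals.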
Another issue for me: the PCA provided in the main publication is not good. They again used "projection" of the ancient samples onto modern variation, which is not a reliable method.

(C) Principal components analysis of ancient individuals projected on modern West Eurasian variation. Country names are represented by three-letter International Standards Organization (ISO) codes.

This kind of projection does not give the same results as a full, by-the-book PCA calculation.
 
rafc, on AG, concerning the E-V13 samples:

A small overview of my checks on the samples:

Boyanovo:
I18792: V13, very bad sample

Rozovo:
I19500: PH1246>BY14151>BY14160 (no coverage on BY14150)

Kapitan Andreevo:
I20185: could be an early side branch of S3003; he has 2 SNPs with one positive read, but many more with many negative reads
I19495: no positives on V13 or below, extremely bad sample
I20180: E-CTS5856
I19490: V13-
I20181: E-CTS5856
I20183: likely E-FGC68911
I19494: negative for 2 V13 equivalents, so must be L618*, but very bad sample

Svilengrad:
I19487: no positives on V13 or below, extremely bad sample

Križ Brdovečki (Croatia):
I5724: Y16721 (but only one read)

Isar Marvinci (N Macedonia):
I10166: no positives on V13 or below, extremely bad sample

Iznik (Turkey):
I8366: E-PH1246
I8367: E-PH1246

Some observations:
-Incredible that the Reich Lab still produces such bad results in 2022. We've had studies this year whose samples were on average 10x as good as the better samples here. Many of these samples have hardly any coverage.
-Now that scientists have discovered you can write a simple script to make accurate calls, this becomes a lot less fun; the calls in the annex of the paper were already very good.
-A lot of diversity in Kapitan Andreevo. Happy to find some S7461 which has a close connection to the Eastern Balkans.
 
