Writing a genome analyzer

Expredel · Oct 30, 2017

I'm writing a genome analyzer similar to DIYDodecad just for fun. The main problem is finding reference genomes.

I came across Otzi's at this location: https://www.ebi.ac.uk/ena/data/view/PRJEB2830

These files are several gigabytes each. Is Otzi's genome available in a version that is compatible with the 23andme files?

Is there anywhere I can get my hands on publicly available 23andme, livingDNA, AncestryDNA, and other genome files?

I figured one interesting thing to add would be for the program to tell people their Y and mt DNA results for the ancestryDNA genome file.

So I have the following questions.

1. Any place to get 23andme/livingDNA/ancestryDNA genome files?
2. Do the rsID and position number of a SNP always stay the same from one genome file to the next?
3. Are there any obvious things that genome analyzers are currently not doing?

If anyone wants to email me their genome file(s) to work with feel free to send me a PM.

Lukas · Oct 30, 2017

Otzi is not the best genome to start with. It is too damaged.

Expredel · Oct 30, 2017

Thanks, I will keep that in mind.

I have the header of the 23andme genome file, so hopefully I can use that to Google for other files. Is anyone willing to share the headers for AncestryDNA and livingDNA?

Expredel · Oct 31, 2017

For anyone interested in this topic I'll update my progress. I found the personal genomes project which offers several hundred personal genomes for download.

https://my.pgp-hms.org/public_genetic_data

A 23andme file from 2014 contains 1766 Y SNPs, an AncestryDNA V2 file from 2017 contains 1692 Y SNPs.

Only 374 positions are present in both files. So far it's very tedious to match an rsid/position to an actual Y haplogroup.
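A minimal sketch of how that overlap could be counted, written in Python rather than the shell scripts used for the actual work. It assumes a 23andme-style layout (`rsid, chromosome, position, genotype`, with `#` comment lines) and an AncestryDNA-style layout (`rsid, chromosome, position, allele1, allele2`, with a column-name row and the Y chromosome coded as `24`). Both layouts are assumptions that should be checked against the real files, and the function names are made up for illustration.

```python
def y_positions(lines, y_labels=("Y", "24")):
    """Collect Y-chromosome positions from raw-data lines.

    Assumed columns: rsid, chromosome, position, then genotype column(s).
    """
    positions = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split()
        if len(fields) < 4 or not fields[2].isdigit():
            continue  # skip the column-name row and malformed lines
        if fields[1] in y_labels:
            positions.add(int(fields[2]))
    return positions


def shared_positions(lines_a, lines_b):
    """Y positions that appear in both files."""
    return y_positions(lines_a) & y_positions(lines_b)
```

Positions are only comparable when both files are reported against the same reference build, so a low overlap can also mean the builds differ.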

https://docs.google.com/spreadsheets/d/1jE0w48zwP3H2XV-2FBL3UBic84xQz-qBmdAl2-wJXOY is extremely useful except that it doesn't list subclades.

https://isogg.org/tree/OLDISOGG_YDNA_SNP_Index.html does list subclades.

The next step is identifying these 374 SNPs, I'll probably write a script that grabs the matches from the ISOGG table and orders them by position.
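The lookup described here could look roughly like the sketch below. It assumes the ISOGG table has already been reduced to `(position, haplogroup)` pairs, which is a hypothetical simplified export and not the layout of the ISOGG page itself. `match_isogg` is an illustrative name.

```python
def match_isogg(shared, isogg_rows):
    """Look up shared positions in an ISOGG-style table.

    shared: iterable of integer positions found in the genome files.
    isogg_rows: iterable of (position, haplogroup) pairs (assumed export).
    Returns (matches, unlisted): matches is a list of (position, haplogroup)
    ordered by position, unlisted is the sorted positions with no entry.
    """
    by_position = {}
    for position, haplogroup in isogg_rows:
        # One position can carry more than one table entry.
        by_position.setdefault(int(position), []).append(haplogroup)

    matches = []
    unlisted = []
    for position in sorted(shared):
        if position in by_position:
            for haplogroup in by_position[position]:
                matches.append((position, haplogroup))
        else:
            unlisted.append(position)
    return matches, unlisted
```

Returning the unlisted positions separately makes it easy to see how many SNPs the table does not cover.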

davef · Oct 31, 2017

Cool, you're into coding? I'm willing to bet you're using python, the best language for mathy stuff (numpy does a lot) outside of Mathematica (never used Mathematica).

Expredel · Oct 31, 2017

That wasn't all that hard. 52 of the 374 SNPs mentioned above are not listed by ISOGG. The program now has access to the whole ISOGG table, however, so no tedious manual research is needed.

Here are the positive matches for one of the ancestryDNA genomes I downloaded.

Code:

Y R1b1a1a2
Y A1
Y BT
Y P~
Y R1b1a1a2a1a2c1g2a1
Y R1b
Y R1b1a1a2
Y P~
Y N1c1a1a1a2~
Y IJK
Y BT
Y R1b1a1a2
Y BT
Y F
Y IJK
Y R
Y R1b1a1a2
Y P
Y R1
Y A1
Y R1
Y R1
Y R1
Y P~
Y Removed from R
Y R1b1
Y P
Y R
Y R1b1a1a2
Y A1
Y R1b1a1a2
Y R1b1a1a2
Y R1b1a
Y R1b1a1a2
Y P~
Y P1
Y R1b1a1a2
Y F
Y F
Y R1b1a1a2a1
Y GHIJK
Y F
Y R1
Y K
Y F
Y P1
Y P1
Y R1b1
Y P~
Y BT
Y P~
Y R1b1a1a2
Y Removed from R
Y CF
Y F
Y R1b1a1a2a1a2c1g2a1a
Y P1
Y A1b1b2~
Y I1
Y Removed from P
Y P~
Y CT
Y P~
Y P1
Y R1
Y K
Y P1~
Y P~
Y F
Y R
Y R1
Y P~
Y R1b1a1a2a1a2c
Y P~
Y R1b1a1a
Y K2b
Y F
Y P~
Y A1
Y P~
Y R
Y P~
Y R1
Y R1b1a
Y R
Y F
Y Removed from BT
Y A1a
Y F
Y R
Y F
Y P~
Y Removed from R
Y F
Y P1
Y F
Y P1
Y BT
Y P~
Y R1b1a1a
Y R1b1a1a
Y R1
Y R1b1a1a2a1a
Y P1
Y A1
Y R1b1a1a2
Y R
Y P1
Y R
Y R1b1a1a2
Y R1b1a1a2a1a
Y R1b1a1a2
Y R1b1a1a
Y F
Y R1b1a1a
Y P~
Y R1b1a1a
Y R1b1a1a2
Y A1b
Y P~
Y R1b1a1a2a1a
Y R1b1
Y P1
Y R1b1a1a
Y P~
Y P1
Y F
Y R1b1a1a2
Y K
Y R1
Y R1b1a
Y R1b1a1a2
Y BT
Y R
Y R1b1a1a
Y R1b1a
Y P~
Y P~
Y F
Y A1b
Y BT
Y CT
Y K
Y R1b1a1a2
Y R1b1a1a
Y P1~
Y R
Y BT
Y P1
Y O (Notes)
Y F
Y Removed from BT
Y R1b1a1a2
Y P~
Y P1
Y P1
Y BT
Y P~
Y R1b1a1a2
Y CT
Y R1
Y R1b1a
Y R
Y F
Y BT
Y R1b1a1a2
Y R
Y P~
Y P1~
Y R1b1a1a
Y A1b
Y R1b1

Looks like the guy is R1b1a1a2a1a2c1g2a1a. Couple of false positives mixed in like N1c1a1a1a2.

Now the fun task of having the software figure that out all by itself.
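One possible heuristic for that, as a sketch only: treat a positive call as more believable the more of its ancestors are also positive, and use "is a proper prefix of the name" as a stand-in for "is an ancestor". That shortcut only works within a single lettered branch in ISOGG long-form names (it knows nothing about BT, CT, or F being upstream of R), so it is an assumption and not a real tree walk. `best_haplogroup` is a hypothetical name.

```python
def best_haplogroup(calls):
    """Pick the positive call with the most supporting ancestor calls.

    calls: iterable of positive haplogroup names in ISOGG long form.
    """
    names = set()
    for call in calls:
        name = call.strip().rstrip("~")
        if name.startswith("Removed") or "(" in name:
            continue  # skip "Removed from R" and "O (Notes)" style entries
        if name:
            names.add(name)

    def support(name):
        # Number of other positive calls that are proper prefixes of name.
        return sum(1 for other in names
                   if other != name and name.startswith(other))

    # Most ancestors first; ties are broken by the longer (deeper) name.
    return max(names, key=lambda n: (support(n), len(n)), default=None)
```

An isolated call such as N1c1a1a1a2~ has no positive ancestors in the readout above, so it loses to the long R1b chain.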

Expredel · Oct 31, 2017

davef said:
Cool, you're into coding? I'm willing to bet you're using python, the best language for mathy stuff (numpy does a lot) outside of Mathematica (never used Mathematica).

I'm mostly shell scripting and doing all this with about 50 lines of code. It works well until you run into something complicated that you can't hack your way out of.

Expredel · Oct 31, 2017

I figured I'd sort the results alphabetically. It does appear AncestryDNA v1 tests 886 Y SNPs while v2 tests 1692 Y SNPs. Eupedia might want to report this correctly on the dna_project_faq page.
Sorted results are a little easier to digest. Here's an R1a AncestryDNA v2 readout.

Code:

Y A1
Y A1a
Y A1b
Y A1b1b2~
Y BT
Y CF
Y CT
Y F
Y GHIJK
Y I1
Y IJK
Y K
Y K2b
Y N1c1a1a1a2~
Y O (Notes)
Y P
Y P1
Y P1~
Y P~
Y R
Y R1
Y R1b
Y R1b1
Y R1b1a
Y R1b1a1a
Y R1b1a1a2
Y R1b1a1a2a1
Y R1b1a1a2a1a
Y R1b1a1a2a1a2c
Y R1b1a1a2a1a2c1g2a1
Y R1b1a1a2a1a2c1g2a1a
Y Removed from BT
Y Removed from P
Y Removed from R

The deletions are too poorly documented to be of much use. This guy is R1a and I saw N1c1a1a1a2 show up for the R1b guy as well, so this is obviously an error in the ISOGG table. It makes one wonder how reliable the rest of their data is. I'll add the printout for the R1b guy in case anyone wants to compare and contact ISOGG about it. The ISOGG SNP list also zero-pads SNP positions inconsistently. It should either zero-pad everything or nothing.
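The padding problem can be sidestepped by comparing positions as integers instead of strings. A minimal sketch, where `normalize_position` is a hypothetical helper and the sample values in the usage note are made up:

```python
def normalize_position(raw):
    """Convert a position cell to an integer, ignoring zero padding.

    Returns None for cells that are not plain digits.
    """
    text = raw.strip()
    if not text.isdigit():
        return None
    return int(text)
```

With this, `"02887824"` and `"2887824"` compare equal, while an empty or non-numeric cell is reported as `None` instead of silently failing to match.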

Code:

Y A1
Y A1a
Y A1b
Y A1b1b2~
Y BT
Y CF
Y CT
Y F
Y GHIJK
Y I1
Y IJK
Y K
Y K2b
Y N1c1a1a1a2~
Y O (Notes)
Y P
Y P1
Y P1~
Y P~
Y R
Y R1
Y R1b
Y R1b1
Y R1b1a
Y R1b1a1a
Y R1b1a1a2
Y R1b1a1a2a1
Y R1b1a1a2a1a
Y R1b1a1a2a1a2c
Y R1b1a1a2a1a2c1g2a1
Y R1b1a1a2a1a2c1g2a1a
Y Removed from BT
Y Removed from P
Y Removed from R

I'll provide position numbers of false positives if anyone is interested.
Determining the correct Y haplogroup from this data is going to be slightly tedious, so I'll save that for later.
The next step is checking the data against the YFull SNP list. Hopefully that will narrow down the 674 positions not listed on ISOGG.

Expredel · Nov 1, 2017

I'm adding a 23andme readout for an R1b. This is kind of interesting.

Code:

Y A1b1a1
Y A1b1b2
Y A1b1b2a1a
Y A1b1b2b1
Y B1
Y B2b1
Y BT
Y CF
Y CT
Y F
Y G (Notes)
Y H1b1a
Y H1b1b
Y I
Y K
Y M3
Y O (Notes)
Y P
Y P1
Y R
Y R1
Y R1b
Y R1b1
Y R1b1a
Y R1b1a1
Y R1b1a1a
Y R1b1a1a2
Y R1b1a1a2a
Y R1b1a1a2a1
Y R1b1a1a2a1a2c

Unless ISOGG is missing a lot of SNPs, it appears AncestryDNA is the better Y haplogroup test.

AncestryDNA v2 got 845 negative matches and 172 positive matches. 23andme got 548 negative matches and 100 positive matches.
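For reference, a positive or negative match could be decided along these lines. The sketch assumes the mutation is available as an `ancestral->derived` string such as `G->A`, and that the genotype is one letter (23andme style) or the same letter twice (AncestryDNA style). Both are assumptions, strand flips are not handled, and the function names are illustrative.

```python
from collections import Counter


def classify(genotype, mutation):
    """Classify one Y SNP call as 'positive', 'negative', or 'unknown'."""
    if "->" not in mutation:
        return "unknown"  # indels and free-text entries are not handled
    ancestral, derived = (part.strip().upper()
                          for part in mutation.split("->", 1))
    alleles = set(genotype.strip().upper())
    if len(alleles) != 1:
        return "unknown"  # empty, or two different alleles on a haploid locus
    allele = alleles.pop()
    if allele == derived:
        return "positive"
    if allele == ancestral:
        return "negative"
    return "unknown"  # no-call ("-") or an allele matching neither state


def tally(calls):
    """Count outcomes for an iterable of (genotype, mutation) pairs."""
    return Counter(classify(genotype, mutation) for genotype, mutation in calls)
```

Keeping an explicit "unknown" bucket separates no-calls and unexpected alleles from true negatives, which matters when comparing chips by their match counts.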

For Europeans AncestryDNA has far better results than 23andme, the only problem appears to be parsing out the results.

Expredel · Nov 1, 2017

The yfull SNP index appears to be completely useless for 23andme and AncestryDNA.

21 of 1691 SNPs are listed for AncestryDNA and 14 of 2129 SNPs for 23andme.

https://www.yfull.com/snp-list/ lists 41331 usable SNPs. If there is another big SNP list feel free to let me know.

Another area of interest is https://www.eupedia.com/genetics/medical_dna_test.shtml though it's tedious to build a reference database as Eupedia doesn't provide the information in a machine-readable table.

Expredel · Nov 2, 2017

I think I ran into a 23andme V5 genome, so I figured I'd share the readout. It looks like it's testing 3557 Y-DNA SNPs.

Code:

Y A0-T
Y A1
Y A1b
Y A1b1b2b1
Y B2b1
Y BT
Y C1b1a1a1a1
Y CF
Y CT
Y D1b2a
Y F
Y GHIJK
Y HIJK
Y I
Y I1
Y I1a2
Y I1a3a2
Y I2a2b2b
Y IJ
Y IJK
Y N1c2b2
Y O1b1a1
Y O2a2b1a1a2a1a
Y Q1a2b2a
Y Q1b1a1a1a
Y R1b1a1a2a1a2b3b
Y R1b1a1a2a1a2c1b1b1a3a1
Y R1b1a1a2a1a2c1e2b1a1

v5 seems to be an improvement over v4.

I've tried comparing a few genomes, and the hundreds of Y SNPs not listed on ISOGG don't appear to be matching anything. They're either non-European or experimental.

V4 and V5 only have 436 SNPs in common, so my guess is they are random experimental SNPs.

Edit: There are actually 3733 SNPs in v5. There must be some inconsistencies in their data file tripping up my parser.
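One way to track down such inconsistencies is to have the parser report every line it cannot handle instead of dropping it silently. The sketch below assumes the same `rsid, chromosome, position, genotype` layout as a typical 23andme export; the function name and the choice of what counts as malformed are assumptions.

```python
def count_y_snps(lines, y_labels=("Y", "24")):
    """Count distinct Y positions and collect lines that failed to parse.

    Returns (count, rejected) where rejected is a list of
    (line_number, text) pairs for lines with missing or non-numeric fields.
    """
    positions = set()
    rejected = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text or text.startswith("#"):
            continue  # blank lines and comment headers are expected
        fields = text.split()
        if len(fields) < 4 or not fields[2].isdigit():
            rejected.append((number, text))
            continue
        if fields[1].upper() in y_labels:
            positions.add(int(fields[2]))
    return len(positions), rejected
```

Printing the rejected list for a file whose count looks off shows which lines the parser is tripping over.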

Edit: This is a pretty peculiar readout. Looks like the guy is IJ* ?

Expredel · Nov 2, 2017

Figured to compare this same 23andme V5 data file against yfull with the following results:

Code:

 I (7642823)
 I1 (6677619)
 I1 (6681479)
 I1 (8108722)
Q Q-Y2225 (6949449)

Expredel · Nov 4, 2017

Looked at a raw data file from Genes for Good v1.2. It appears they give you a free DNA test if you fill out a bunch of health surveys using their Facebook app.

Code:

A1b1b2
B2b1
BT
CT
F
IJK
P
P1
R
R1b1a1a2
R1b1a1a2a1a2c

Their SNP selection is small, but there are no false positives.
