Eupedia Forums

Thread: Writing a genome analyzer

  1. #1
    Regular Member Achievements:
    1000 Experience PointsVeteran

    Join Date
    24-02-15
    Posts
    243
    Points
    2,727
    Level
    14
    Points: 2,727, Level: 14
    Level completed: 93%, Points required for next Level: 23
    Overall activity: 11.0%


    Country: United States




    Writing a genome analyzer

    I'm writing a genome analyzer similar to DIYDodecad just for fun. The main problem is finding reference genomes.

    I came across Otzi's at this location: https://www.ebi.ac.uk/ena/data/view/PRJEB2830

    These files are several gigabytes in size. Is the Otzi genome available in a version that is compatible with the 23andme files?

    Is there anywhere I can get publicly available 23andme, LivingDNA, AncestryDNA, and other genome files?

    One interesting feature to add would be for the program to tell people their Y-DNA and mtDNA results from an AncestryDNA genome file.

    So I have the following questions.

    1. Any place to get 23andme/livingDNA/ancestryDNA genome files?
    2. Are the RSID number and position number always the same in a genome file?
    3. Are there any obvious things that genome analyzers are currently not doing?

    If anyone wants to email me their genome file(s) to work with feel free to send me a PM.

  2. #2
    Regular Member Achievements:
    Three Friends1000 Experience Points1 year registered
    mlukas's Avatar
    Join Date
    23-04-17
    Posts
    222
    Points
    4,558
    Level
    19
    Points: 4,558, Level: 19
    Level completed: 77%, Points required for next Level: 92
    Overall activity: 0%


    Country: Poland



    Otzi is not the best genome to start with. His DNA is too damaged.

  3. #3
    Thanks, I will keep that in mind.

    I have the header of the 23andme genome file; hopefully I can use that to Google for other files. Is anyone willing to share the headers for AncestryDNA and LivingDNA files?

  4. #4
    For anyone interested in this topic, I'll post updates on my progress. I found the Personal Genome Project, which offers several hundred personal genomes for download.

    https://my.pgp-hms.org/public_genetic_data

    A 23andme file from 2014 contains 1766 Y SNPs; an AncestryDNA V2 file from 2017 contains 1692 Y SNPs.

    Only 374 positions appear in both files. So far it's very tedious to match an rsid/position to an actual Y haplogroup.
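
    A minimal sketch of that comparison, in Python rather than shell. It assumes both files are tab-separated with "#" comment lines, that 23andme rows are rsid, chromosome, position, genotype with the Y chromosome labelled "Y", and that AncestryDNA rows are rsid, chromosome, position, allele1, allele2 with the Y chromosome labelled "24". The function name y_positions and the sample rows are made up for illustration, and positions are only comparable if both files use the same reference build.

```python
import io

# Chromosome labels assumed to mean "Y": 23andme is assumed to write "Y",
# AncestryDNA is assumed to write "24".
Y_LABELS = {"Y", "24"}

def y_positions(lines):
    """Return the set of Y-chromosome positions found in a raw data file."""
    positions = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if len(fields) < 4 or not fields[2].isdecimal():
            continue  # skip the column-header row and malformed lines
        if fields[1] in Y_LABELS:
            positions.add(int(fields[2]))
    return positions

# Tiny made-up samples standing in for real files.
file_23andme = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\tY\t2655180\tG\n"
    "rs0000002\tY\t2657176\tA\n"
    "rs0000003\t1\t82154\tAA\n"
)
file_ancestry = io.StringIO(
    "#AncestryDNA raw data download\n"
    "rsid\tchromosome\tposition\tallele1\tallele2\n"
    "rs0000001\t24\t2655180\tG\tG\n"
    "rs0000004\t24\t2658000\tC\tC\n"
)

a = y_positions(file_23andme)
b = y_positions(file_ancestry)
shared = a & b
print(len(a), len(b), len(shared))  # Y SNPs in each file, then the overlap
```

    With real files you would pass open(path) instead of the StringIO objects. Matching on position rather than rsid sidesteps SNPs that carry different identifiers in the two files.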

    https://docs.google.com/spreadsheets...-qBmdAl2-wJXOY is extremely useful except that it doesn't list subclades.

    https://isogg.org/tree/OLDISOGG_YDNA_SNP_Index.html does list subclades.

    The next step is identifying these 374 SNPs. I'll probably write a script that grabs the matches from the ISOGG table and orders them by position.
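
    Such a script could look roughly like this in Python. The row layout (SNP name, haplogroup, position) is an assumption about how the ISOGG table might be exported, the rows are placeholders rather than real ISOGG entries, and build_index and lookup are hypothetical helper names.

```python
# Hypothetical rows in the shape (snp_name, haplogroup, position).
# The values are placeholders, not real ISOGG entries.
isogg_rows = [
    ("SNP_A", "R1b", "2655180"),
    ("SNP_B", "F", "2657176"),
    ("SNP_C", "I1", "9000000"),
]

def build_index(rows):
    """Map integer position -> list of (snp_name, haplogroup)."""
    index = {}
    for name, haplogroup, position in rows:
        index.setdefault(int(position), []).append((name, haplogroup))
    return index

def lookup(positions, index):
    """Split positions into table matches (ordered by position) and unlisted ones."""
    listed, unlisted = [], []
    for pos in sorted(positions):
        if pos in index:
            listed.extend((pos, name, hg) for name, hg in index[pos])
        else:
            unlisted.append(pos)
    return listed, unlisted

index = build_index(isogg_rows)
listed, unlisted = lookup({2657176, 2655180, 1234567}, index)
for row in listed:
    print(*row)
print("not listed:", unlisted)
```

    Keeping a list per position allows for several SNP names sharing one position, and the unlisted list makes it easy to count how many shared SNPs the table does not cover.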

  5. #5
    Princess Achievements:
    Overdrive10000 Experience PointsVeteranThree Friends
    davef's Avatar
    Join Date
    19-06-16
    Location
    New York
    Posts
    2,240
    Points
    12,089
    Level
    33
    Points: 12,089, Level: 33
    Level completed: 20%, Points required for next Level: 561
    Overall activity: 11.0%


    Ethnic group
    Italian,Irish,Jewish
    Country: USA - New York



    Cool, you're into coding? I'm willing to bet you're using python, the best language for mathy stuff (numpy does a lot) outside of Mathematica (never used Mathematica).

  6. #6
    That wasn't all that hard. 52 of the 374 SNPs mentioned above are not listed by ISOGG. The program now has access to the whole ISOGG table, however, so no tedious manual research is needed.

    Here are the positive matches for one of the ancestryDNA genomes I downloaded.

    Code:
    Y R1b1a1a2
    Y A1
    Y BT
    Y P~
    Y R1b1a1a2a1a2c1g2a1
    Y R1b
    Y R1b1a1a2
    Y P~
    Y N1c1a1a1a2~
    Y IJK
    Y BT
    Y R1b1a1a2
    Y BT
    Y F
    Y IJK
    Y R
    Y R1b1a1a2
    Y P
    Y R1
    Y A1
    Y R1
    Y R1
    Y R1
    Y P~
    Y Removed from R
    Y R1b1
    Y P
    Y R
    Y R1b1a1a2
    Y A1
    Y R1b1a1a2
    Y R1b1a1a2
    Y R1b1a
    Y R1b1a1a2
    Y P~
    Y P1
    Y R1b1a1a2
    Y F
    Y F
    Y R1b1a1a2a1
    Y GHIJK
    Y F
    Y R1
    Y K
    Y F
    Y P1
    Y P1
    Y R1b1
    Y P~
    Y BT
    Y P~
    Y R1b1a1a2
    Y Removed from R
    Y CF
    Y F
    Y R1b1a1a2a1a2c1g2a1a
    Y P1
    Y A1b1b2~
    Y I1
    Y Removed from P
    Y P~
    Y CT
    Y P~
    Y P1
    Y R1
    Y K
    Y P1~
    Y P~
    Y F
    Y R
    Y R1
    Y P~
    Y R1b1a1a2a1a2c
    Y P~
    Y R1b1a1a
    Y K2b
    Y F
    Y P~
    Y A1
    Y P~
    Y R
    Y P~
    Y R1
    Y R1b1a
    Y R
    Y F
    Y Removed from BT
    Y A1a
    Y F
    Y R
    Y F
    Y P~
    Y Removed from R
    Y F
    Y P1
    Y F
    Y P1
    Y BT
    Y P~
    Y R1b1a1a
    Y R1b1a1a
    Y R1
    Y R1b1a1a2a1a
    Y P1
    Y A1
    Y R1b1a1a2
    Y R
    Y P1
    Y R
    Y R1b1a1a2
    Y R1b1a1a2a1a
    Y R1b1a1a2
    Y R1b1a1a
    Y F
    Y R1b1a1a
    Y P~
    Y R1b1a1a
    Y R1b1a1a2
    Y A1b
    Y P~
    Y R1b1a1a2a1a
    Y R1b1
    Y P1
    Y R1b1a1a
    Y P~
    Y P1
    Y F
    Y R1b1a1a2
    Y K
    Y R1
    Y R1b1a
    Y R1b1a1a2
    Y BT
    Y R
    Y R1b1a1a
    Y R1b1a
    Y P~
    Y P~
    Y F
    Y A1b
    Y BT
    Y CT
    Y K
    Y R1b1a1a2
    Y R1b1a1a
    Y P1~
    Y R
    Y BT
    Y P1
    Y O (Notes)
    Y F
    Y Removed from BT
    Y R1b1a1a2
    Y P~
    Y P1
    Y P1
    Y BT
    Y P~
    Y R1b1a1a2
    Y CT
    Y R1
    Y R1b1a
    Y R
    Y F
    Y BT
    Y R1b1a1a2
    Y R
    Y P~
    Y P1~
    Y R1b1a1a
    Y A1b
    Y R1b1
    It looks like the guy is R1b1a1a2a1a2c1g2a1a. There are a couple of false positives mixed in, such as N1c1a1a1a2.

    Now the fun task of having the software figure that out all by itself.
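
    One possible heuristic for that, sketched in Python: score every reported haplogroup by how many other reported names are a string prefix of it, and pick the best-supported one. This is only an assumption about how it could be done, not a tree-aware method. Prefix matching only works within one lettered branch (it knows nothing about P being upstream of R), and best_haplogroup is a hypothetical name. The input is the distinct set of calls from the readout above.

```python
def normalize(call):
    """Strip whitespace and the trailing '~' marker that some names carry."""
    return call.strip().rstrip("~")

def is_named_clade(name):
    """Ignore table annotations such as 'Removed from R' or 'O (Notes)'."""
    return bool(name) and " " not in name

def best_haplogroup(calls):
    """Pick the call supported by the most observed upstream names.

    Support = number of distinct observed names that are a string prefix
    of the candidate (the candidate itself included). Ties go to the
    longer, more specific name.
    """
    names = {normalize(c) for c in calls}
    names = {n for n in names if is_named_clade(n)}

    def support(candidate):
        return sum(1 for n in names if candidate.startswith(n))

    return max(names, key=lambda n: (support(n), len(n)))

calls = [
    "A1", "A1a", "A1b", "A1b1b2~", "BT", "CF", "CT", "F", "GHIJK", "I1",
    "IJK", "K", "K2b", "N1c1a1a1a2~", "O (Notes)", "P", "P1", "P1~", "P~",
    "R", "R1", "R1b", "R1b1", "R1b1a", "R1b1a1a", "R1b1a1a2", "R1b1a1a2a1",
    "R1b1a1a2a1a", "R1b1a1a2a1a2c", "R1b1a1a2a1a2c1g2a1",
    "R1b1a1a2a1a2c1g2a1a", "Removed from BT", "Removed from P",
    "Removed from R",
]
result = best_haplogroup(calls)
print(result)  # R1b1a1a2a1a2c1g2a1a
```

    Isolated calls like N1c1a1a1a2 or I1 score low because nothing upstream of them was reported, which is why they lose to the long R1b chain.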

  7. #7
    Quote (originally posted by davef):
    Cool, you're into coding? I'm willing to bet you're using python, the best language for mathy stuff (numpy does a lot) outside of Mathematica (never used Mathematica).

    I'm mostly shell scripting and doing all this with about 50 lines of code. It works well until you run into something complicated that you can't hack your way out of.

  8. #8
    I decided to sort the results alphabetically. It appears that AncestryDNA v1 tests 886 SNPs while v2 tests 1692 SNPs. Eupedia might want to correct this on the dna_project_faq page.
    Sorted results are a little easier to digest. Here's an R1a AncestryDNA v2 readout.
    Code:
    Y A1
    Y A1a
    Y A1b
    Y A1b1b2~
    Y BT
    Y CF
    Y CT
    Y F
    Y GHIJK
    Y I1
    Y IJK
    Y K
    Y K2b
    Y N1c1a1a1a2~
    Y O (Notes)
    Y P
    Y P1
    Y P1~
    Y P~
    Y R
    Y R1
    Y R1b
    Y R1b1
    Y R1b1a
    Y R1b1a1a
    Y R1b1a1a2
    Y R1b1a1a2a1
    Y R1b1a1a2a1a
    Y R1b1a1a2a1a2c
    Y R1b1a1a2a1a2c1g2a1
    Y R1b1a1a2a1a2c1g2a1a
    Y Removed from BT
    Y Removed from P
    Y Removed from R
    The "Removed from" entries are too poorly documented to be of much use. This guy is R1a, and N1c1a1a1a2 showed up for the R1b guy as well, so this looks like an error in the ISOGG table. It makes one wonder how reliable the rest of their data is. I'll add the printout for the R1b guy in case anyone wants to compare and contact ISOGG about it. The ISOGG SNP list also zero-pads SNP positions inconsistently; it should either zero-pad everything or nothing.
    Code:
    Y A1
    Y A1a
    Y A1b
    Y A1b1b2~
    Y BT
    Y CF
    Y CT
    Y F
    Y GHIJK
    Y I1
    Y IJK
    Y K
    Y K2b
    Y N1c1a1a1a2~
    Y O (Notes)
    Y P
    Y P1
    Y P1~
    Y P~
    Y R
    Y R1
    Y R1b
    Y R1b1
    Y R1b1a
    Y R1b1a1a
    Y R1b1a1a2
    Y R1b1a1a2a1
    Y R1b1a1a2a1a
    Y R1b1a1a2a1a2c
    Y R1b1a1a2a1a2c1g2a1
    Y R1b1a1a2a1a2c1g2a1a
    Y Removed from BT
    Y Removed from P
    Y Removed from R
    I'll provide position numbers of false positives if anyone is interested.
    Determining the correct Y haplogroup from this data is going to be slightly tedious, so I'll save that for later.
    The next step is checking the data against the YFull SNP list. Hopefully that will narrow down the 674 positions not listed on ISOGG.
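
    The inconsistent zero-padding mentioned above only matters when positions are compared as strings. A small Python sketch that compares them as integers instead (normalize_position is a hypothetical helper name, and the values are made up):

```python
def normalize_position(raw):
    """Convert '02655180' or ' 2655180 ' to 2655180; return None if not numeric."""
    text = raw.strip()
    return int(text) if text.isdecimal() else None

# Padded and unpadded spellings of the same position compare equal as integers.
padded = normalize_position("02655180")
plain = normalize_position("2655180")
print(padded == plain)
```

    In a shell pipeline the equivalent would be stripping leading zeros from the position column before joining or grepping.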
    Last edited by Expredel; 07-11-17 at 06:20.

  9. #9
    I'm adding a 23andme readout for an R1b sample. This is kind of interesting.

    Code:
    Y A1b1a1
    Y A1b1b2
    Y A1b1b2a1a
    Y A1b1b2b1
    Y B1
    Y B2b1
    Y BT
    Y CF
    Y CT
    Y F
    Y G (Notes)
    Y H1b1a
    Y H1b1b
    Y I
    Y K
    Y M3
    Y O (Notes)
    Y P
    Y P1
    Y R
    Y R1
    Y R1b
    Y R1b1
    Y R1b1a
    Y R1b1a1
    Y R1b1a1a
    Y R1b1a1a2
    Y R1b1a1a2a
    Y R1b1a1a2a1
    Y R1b1a1a2a1a2c
    Unless ISOGG is missing a lot of SNPs, it appears AncestryDNA is the better Y haplogroup test.

    AncestryDNA v2 got 845 negative matches and 172 positive matches. 23andme got 548 negative matches and 100 positive matches.

    For Europeans, AncestryDNA gives far better results than 23andme; the only problem appears to be parsing the results.
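
    For readers wondering how a positive or negative match can be decided, here is a Python sketch under stated assumptions: the SNP table is assumed to give each mutation as ancestral->derived (for example "G->A"), and the genotype is either a single base or the same base twice. The thread does not say exactly how the counts above were produced, so classify is a hypothetical illustration, not the method used here.

```python
from collections import Counter

def classify(genotype, mutation):
    """Return 'positive', 'negative' or 'unknown' for one Y SNP.

    mutation is assumed to look like 'G->A' (ancestral->derived).
    Anything else, such as an indel note, is treated as unknown.
    """
    ancestral, sep, derived = mutation.upper().partition("->")
    ancestral, derived = ancestral.strip(), derived.strip()
    if not sep or len(ancestral) != 1 or len(derived) != 1:
        return "unknown"
    alleles = set(genotype.strip().upper())
    if alleles == {derived}:
        return "positive"   # sample carries the derived allele
    if alleles == {ancestral}:
        return "negative"   # sample carries the ancestral allele
    return "unknown"        # no-call, mixed call, or unexpected base

# Made-up (genotype, mutation) pairs for illustration.
calls = [("A", "G->A"), ("GG", "G->A"), ("--", "C->T"), ("T", "del")]
tally = Counter(classify(g, m) for g, m in calls)
print(tally["positive"], tally["negative"], tally["unknown"])
```

    Counting the unknown bucket separately is useful because no-calls and unexpected bases would otherwise inflate the negative count.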
    Last edited by Expredel; 07-11-17 at 06:19.

  10. #10
    The YFull SNP index appears to be nearly useless for 23andme and AncestryDNA.

    Only 21 of 1691 SNPs are listed for AncestryDNA and 14 of 2129 SNPs for 23andme.
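
    Expressed as percentages, using only the counts given above (coverage is a hypothetical helper name):

```python
def coverage(listed, total):
    """Percentage of a kit's Y SNPs that appear in the SNP list."""
    return 100.0 * listed / total

ancestry_pct = coverage(21, 1691)
ttam_pct = coverage(14, 2129)
print(f"AncestryDNA: {ancestry_pct:.2f}%")  # 1.24%
print(f"23andme: {ttam_pct:.2f}%")          # 0.66%
```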

    https://www.yfull.com/snp-list/ lists 41331 usable SNPs. If there is another big SNP list feel free to let me know.

    Another area of interest is https://www.eupedia.com/genetics/medical_dna_test.shtml, though it's tedious to build a reference database because Eupedia doesn't provide the information in a machine-readable table.

  11. #11
    I think I ran into a 23andme v5 genome, so I figured I'd share the readout. It looks like it tests 3557 Y-DNA SNPs.

    Code:
    Y A0-T
    Y A1
    Y A1b
    Y A1b1b2b1
    Y B2b1
    Y BT
    Y C1b1a1a1a1
    Y CF
    Y CT
    Y D1b2a
    Y F
    Y GHIJK
    Y HIJK
    Y I
    Y I1
    Y I1a2
    Y I1a3a2
    Y I2a2b2b
    Y IJ
    Y IJK
    Y N1c2b2
    Y O1b1a1
    Y O2a2b1a1a2a1a
    Y Q1a2b2a
    Y Q1b1a1a1a
    Y R1b1a1a2a1a2b3b
    Y R1b1a1a2a1a2c1b1b1a3a1
    Y R1b1a1a2a1a2c1e2b1a1
    v5 seems to be an improvement over v4.

    I've tried comparing a few genomes, and the hundreds of Y SNPs not listed on ISOGG don't appear to be matching anything. They're either non-European or experimental.

    v4 and v5 have only 436 SNPs in common, so my guess is they are random experimental SNPs.

    Edit: There are actually 3733 SNPs in v5; some inconsistencies in their data file must have been tripping up my parser.
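
    One way to track down that kind of miscount is to have the parser report the lines it rejects instead of dropping them silently. A Python sketch, assuming rows of rsid, chromosome, position, genotype; the sample lines (Windows line endings, a row separated by spaces, a row with a missing position) are invented for illustration and are not taken from a real v5 file.

```python
def parse_with_report(lines):
    """Parse rsid/chromosome/position/genotype rows and collect rejected lines."""
    rows, rejected = [], []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if not text.strip() or text.startswith("#"):
            continue  # blank line or comment header
        fields = text.split()  # tolerate tabs as well as runs of spaces
        if len(fields) >= 4 and fields[2].isdecimal():
            rows.append((fields[0], fields[1], int(fields[2]), "".join(fields[3:])))
        else:
            rejected.append((number, text))
    return rows, rejected

sample = [
    "# rsid\tchromosome\tposition\tgenotype\r\n",
    "rs1\tY\t100\tA\r\n",
    "i500\tY  200 G\r\n",   # space-separated row
    "rs3\tY\t\tC\r\n",      # position missing
]
rows, rejected = parse_with_report(sample)
print(len(rows), "parsed;", len(rejected), "rejected")
for number, text in rejected:
    print("line", number, repr(text))
```

    Comparing the parsed count plus the rejected count against the number of non-comment lines in the file shows quickly whether the parser or the file is at fault.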

    Edit: This is a pretty peculiar readout. Looks like the guy is IJ* ?
    Last edited by Expredel; 07-11-17 at 06:17.

  12. #12
    I compared this same 23andme v5 data file against the YFull list, with the following results:

    Code:
     I (7642823)
     I1 (6677619)
     I1 (6681479)
     I1 (8108722)
    Q Q-Y2225 (6949449)
    Last edited by Expredel; 07-11-17 at 06:15.

  13. #13
    I looked at a raw data file from Genes for Good v1.2. It appears they give you a free DNA test if you fill out a bunch of health surveys using their Facebook app.

    Code:
    A1b1b2
    B2b1
    BT
    CT
    F
    IJK
    P
    P1
    R
    R1b1a1a2
    R1b1a1a2a1a2c
    Their SNP selection is small, but there are no false positives.
    Last edited by Expredel; 07-11-17 at 06:11.
