Utilizing AI for WGS30x FASTQ-Plink conversion & AADR Merge for Downstream Analysis in Admixtools2

This is how I did it,

I aligned the FASTQ to HG19 with BWA Creating the SAM file

Then I created the BAM file which I then sorted, indexed, and marked for duplicates

I then used WGSextract to create the BAI and 23andme_V3.txt file.

I then used Plink 1.9 to convert it to a VCF.

From the VCF I create the Plink files.

Finally, I converted it with Eigensoft to Eigenstrat, and merged with the Reich Lab data base.

I will finally proceed with my downstream analysis in Admixtools.
 
THseZG1.png


Accurate results!

R1P83nw.png
 
I discovered that mergit has no way of retaining the wholeness of the DATA sets in merging, so it removes SNPs from the data set.

However, I am on the path to obtaining the correct merged file in Plink:

ChatGPT summary of my project thus far:

Alright, let's review:

1. **Initial Objective**: You aimed to merge your personal genome (Jovialis) with the Reich Lab dataset while prioritizing the integrity of the Reich Lab data.

2. **Data Format Issues**: Initially, the Reich Lab data was in EIGENSTRAT format, which led to some issues when trying to directly merge it with PLINK binary format. So, we converted the EIGENSTRAT data to PLINK binary format using the `convertf` tool from the Eigensoft package.

3. **Merging Issues**: Upon attempting to merge, there were discrepancies due to variants that didn't match in terms of alleles. This is a typical challenge in genetic analyses when combining data from different sources.

4. **Flipping**: To address this, we flipped the alleles for your personal genome using the `--flip` command in PLINK.

5. **Exclusion of Discrepancies**: Despite the flipping, there were still some discrepancies (variants with three or more alleles). We then excluded these problematic variants from your personal genome data to ensure a smoother merge.

6. **Final Merge**: We merged the refined Jovialis data with the Reich Lab data. The command also accounted for individuals with ambiguous sex to prevent any issues.

The current status seems to be in the middle of the three-pass merging process, which PLINK uses for large datasets. This will combine the two datasets, excluding the previously identified problematic SNPs.

Once the merge is complete, you will have a combined dataset in PLINK binary format (`.bed`, `.bim`, `.fam` files) with the Reich Lab data intact and your personal genome appended.

If the merging process completes without errors, you can then proceed with any downstream analyses you're interested in, such as principal component analysis (PCA), association analyses, or others.
 
I want to make prompts for AI I can use to easily convert the data once new dataset updates come out.

Right now I am still trying to figure out the merging process to combine in plink format. I was actually incorrect to say I had gotten it, considering you have to apply some flags and do some re-ordering of the 23andme file.

Code:
**Complete Process: Filtering and Reordering 23andMe Data**


*Objective*: To filter and reorder your 23andMe data, ensuring it contains only SNPs common with the Reich Lab dataset and is ordered correctly for compatibility with downstream analyses.


**Step 1: Initial Setup**


- Ensure you have the following files ready:
  - `Reich_Lab_PLINK_binary.bim`: The Reich Lab dataset in PLINK format.
  - `23andMe_Raw_Data.txt`: Your 23andMe raw data.
  - `23andMe_SNP_List.txt`: Your list of SNP IDs.


**Step 2: SNP Reordering**


1. **Create an Index for Your SNP Data**:
   ```bash
   awk 'NR==FNR {a[$1]=NR; next} $2 in a {print a[$2]"\t"$0}' Reich_Lab_PLINK_binary.bim 23andMe_Raw_Data.txt > Jovialis_Indexed_SNPs.txt
   ```
   - Explanation: This command creates an index of your 23andMe SNP data, including the desired order based on the Reich Lab dataset.


2. **Sort Indexed SNPs by Order**:
   ```bash
   sort -k1,1n Jovialis_Indexed_SNPs.txt | cut -f2- > Jovialis_Reordered_SNPs.txt
   ```
   - Explanation: This command sorts your indexed SNPs based on the order specified in the index, ensuring they match the Reich Lab dataset's order.


**Step 3: SNP Filtering**


3. **SNP Filtering**:
   ```bash
   awk 'NR==FNR{a[$1]; next} $2 in a' 23andMe_SNP_List.txt Jovialis_Reordered_SNPs.txt > Jovialis_Common_SNPs.txt
   ```
   - Explanation: This command filters your reordered SNP data to retain only the SNPs that are common with the Reich Lab dataset.


**Step 4: Verification (Optional)**


4. **Verify the Common and Reordered SNP Data** (Optional):
   ```bash
   head -n 10 Jovialis_Common_SNPs.txt
   ```
   - Explanation: To inspect the first 10 lines of the common and reordered SNP data and ensure that the SNPs are in the correct order.


Following these steps ensures that you filter and reorder your 23andMe data correctly to match the order in the Reich Lab dataset while retaining all common SNPs. If you encounter any issues or have additional questions, please let me know.
 
Last edited:
Now I can confirm that I have truly been able to merge my WGS30x sample correctly into the Reich Dataset which I will now explore in Admixtools:
Code:
597573 variants and 20504 people pass filters and QC.
Among remaining phenotypes, 0 are cases and 20504 are controls.


The variants are exactly as they should be. Along with the inclusion of my sample to the total amount. I can confirm the quality of my analysis will be coherent.

I figured it out, the reason why I was running into errors was because I had to convert the Reich Lab plink with alleleACGT. This is cruial if you want to merge your 23andme Plink files into the Reich Lab plink files:

Code:
plink \
--bfile "/mnt/d/Bioinformatics/v54.1.p1_HO_public_Plink/Reich_Lab_PLINK_binary" \
--alleleACGT \
--make-bed \
--out "/mnt/d/Bioinformatics/01_Experimental_Output/Reich_Lab_ACGT"
 
STILL despite making sure all of the SNPs are intact, it seems like there is an unexpected and significant discrepancy with the plink version results and the original unmodified eigenstrant results in Admixtools.

I tested it with the initial Plink conversion of the Reich Lab results, and it seems like the origin of the issue lies in how I converted it to plink. So it wasn't a result of the merge, but rather the conversion from eigenstrat to plink for the Reich lab results.

I need figure this out, and make sure the results are at least 99.5% consistent between the converted Reich Lab plink (and later the merged version with my sample) and the original Eigenstrat files.
 
23andme_V3.txt file.

I then used Plink 1.9 to convert it to a VCF.

From the VCF I create the Plink files.
You don't need to convert to VCF to have PLINK files.
Just use this command in PLINK on 23andme_txt file.

plink --23file your_raw.txt --chr 1-22 --make-bed --out newplinkfile

Flag --chr 1-22 is optional but you can have error regarding raw created from WGS about Y, X, MT rows and you don't need them in qpadm analyses.
 
1. **Starting with FASTQ files**:
- The raw sequencing data from sequencing platforms, typically in pairs (paired-end sequencing) for Illumina data.

2. **Reference Genome Preparation**:
- Obtain a reference genome (like hg19).
- Index the reference genome for alignment.
3. **Alignment with BWA**:
- Map the sequencing reads from the FASTQ files to the reference genome.
- This produces a SAM (Sequence Alignment/Map) file.
4. **SAM to BAM Conversion**:
- Convert the SAM file to a more compact binary format, BAM, using tools like Samtools.
- Sort and index the BAM file.
- Mark duplicates (typically using Picard or Samtools).

You can do everything in https://usegalaxy.org/ online:) Just create free account there. It is especially useful for fastq academic files hosted on ENA when there are no bams only fastq files. But you can also upload there your wgs file. But I show first how to convert ancient samples from ENA.

Let's start with latest paper on Pannonia and North Italian samples https://www.ebi.ac.uk/ena/browser/view/PRJNA811958

Check such columns in "Columns selection" to see fastq files and their size, optionally you can other too.
Screenshot-2023-09-10-at-18-38-33-ENA-Browser.png

Click on file name to be moved to usegalaxy.org. Before remember to make account thereand log in to have access to all tools.


Wait to have your file on the right on the green background, so now you can play with it. In search field on the left type "BWA Mem".
Screenshot-2023-09-10-at-18-45-20-Galaxy.png

And select what is on the image (sometimes there is more option for human genome, just check HG19, shoudl work most of the time). What is below don't change. Then you must click run and wait.

And download created bam and bai files. Then you can processs them in WGS Extract to have raws.
Bez-nazwy-1.png


I always try to select fastq which are one file per sample, not two files, but if no other option and there are only like that, you can choose in BWA paired not single end reads. But usually I have worse results with them.
There is also replacement tooll fro BWA-MEM named Bowtie on usegalaxy but is more complicated for newbie and results are nearly identical so I prefer BWA-MEM.
 
You don't need to convert to VCF to have PLINK files.
Just use this command in PLINK on 23andme_txt file.

plink --23file your_raw.txt --chr 1-22 --make-bed --out newplinkfile

Flag --chr 1-22 is optional but you can have error regarding raw created from WGS about Y, X, MT rows and you don't need them in qpadm analyses.
Thanks! I will use that next time I need to convert a 23andme file.

I actually used this method with a script, and the results said it passed all the QCs it prompts you for.

 
It's good numbers.

Gz.
 
You don’t seem to have any WHG. The standard error is higher than the weight.
 
4H3GeHQ.png


Haven't verified the outputs but the proportions look accurate according to studies.

Code:
> results$weights
# A tibble: 4 × 5
  target   left        weight     se      z
  <chr>    <chr>        <dbl>  <dbl>  <dbl>
1 Jovialis Steppe_EMBA 0.237  0.0744  3.18
2 Jovialis Anatolia_N  0.561  0.0442 12.7 
3 Jovialis WHG         0.0230 0.0337  0.683
4 Jovialis CHG_Iran_N  0.179  0.0815  2.20
> results$popdrop
# A tibble: 15 × 15
   pat      wt   dof  chisq         p f4rank Steppe_EMBA Anatolia_N      WHG CHG_Iran_N feasible best  dofdiff chisqdiff p_nested
   <chr> <dbl> <dbl>  <dbl>     <dbl>  <dbl>       <dbl>      <dbl>    <dbl>      <dbl> <lgl>    <lgl>   <dbl>     <dbl>    <dbl>
 1 0000      0     9   13.1 1.59e-  1      3       0.237      0.561  0.0230       0.179 TRUE     NA         NA      NA         NA

> results$weights
# A tibble: 4 × 5
  target           left        weight     se     z
  <chr>            <chr>        <dbl>  <dbl> <dbl>
1 Italian_North.HO Steppe_EMBA 0.302  0.0252 11.9
2 Italian_North.HO Anatolia_N  0.593  0.0132 45.1
3 Italian_North.HO WHG         0.0460 0.0120  3.83
4 Italian_North.HO CHG_Iran_N  0.0593 0.0264  2.24
> results$popdrop
# A tibble: 15 × 15
   pat      wt   dof  chisq         p f4rank Steppe_EMBA Anatolia_N     WHG CHG_Iran_N feasible best  dofdiff chisqdiff p_nested
   <chr> <dbl> <dbl>  <dbl>     <dbl>  <dbl>       <dbl>      <dbl>   <dbl>      <dbl> <lgl>    <lgl>   <dbl>     <dbl>    <dbl>
 1 0000      0     9   79.7 1.88e- 13      3       0.302      0.593  0.0460     0.0593 TRUE     NA         NA      NA         NA

> results$weights
# A tibble: 4 × 5
  target left        weight      se     z
  <chr>  <chr>        <dbl>   <dbl> <dbl>
1 TSI.SG Steppe_EMBA 0.256  0.0226  11.3
2 TSI.SG Anatolia_N  0.577  0.0130  44.3
3 TSI.SG WHG         0.0431 0.00866  4.97
4 TSI.SG CHG_Iran_N  0.124  0.0276   4.51
> results$popdrop
# A tibble: 15 × 15
   pat      wt   dof  chisq         p f4rank Steppe_EMBA Anatolia_N     WHG CHG_Iran_N feasible best  dofdiff chisqdiff p_nested
   <chr> <dbl> <dbl>  <dbl>     <dbl>  <dbl>       <dbl>      <dbl>   <dbl>      <dbl> <lgl>    <lgl>   <dbl>     <dbl>    <dbl>
 1 0000      0     9   96.2 9.03e- 17      3       0.256      0.577  0.0431      0.124 TRUE     NA         NA      NA         NA

  target           left        weight     se      z
  <chr>            <chr>        <dbl>  <dbl>  <dbl>
1 Italian_South.HO Steppe_EMBA 0.161  0.0356  4.51
2 Italian_South.HO Anatolia_N  0.571  0.0232 24.6 
3 Italian_South.HO WHG         0.0100 0.0182  0.552
4 Italian_South.HO CHG_Iran_N  0.258  0.0407  6.34
> results$popdrop
# A tibble: 15 × 15
   pat      wt   dof  chisq         p f4rank Steppe_EMBA Anatolia_N      WHG CHG_Iran_N feasible best  dofdiff chisqdiff p_nested
   <chr> <dbl> <dbl>  <dbl>     <dbl>  <dbl>       <dbl>      <dbl>    <dbl>      <dbl> <lgl>    <lgl>   <dbl>     <dbl>    <dbl>
 1 0000      0     9   30.2 4.01e-  4      3       0.161      0.571  0.0100       0.258 TRUE     NA         NA      NA         NA


> results$weights
# A tibble: 4 × 5
  target      left        weight     se     z
  <chr>       <chr>        <dbl>  <dbl> <dbl>
1 Sicilian.HO Steppe_EMBA 0.128  0.0345  3.72
2 Sicilian.HO Anatolia_N  0.551  0.0198 27.8
3 Sicilian.HO WHG         0.0385 0.0140  2.75
4 Sicilian.HO CHG_Iran_N  0.283  0.0403  7.02
> results$popdrop
# A tibble: 15 × 15
   pat      wt   dof  chisq         p f4rank Steppe_EMBA Anatolia_N      WHG CHG_Iran_N feasible best  dofdiff chisqdiff p_nested
   <chr> <dbl> <dbl>  <dbl>     <dbl>  <dbl>       <dbl>      <dbl>    <dbl>      <dbl> <lgl>    <lgl>   <dbl>     <dbl>    <dbl>
 1 0000      0     9   97.4 5.40e- 17      3       0.128      0.551  0.0385       0.283 TRUE     NA         NA       NA        NA

> results$weights
# A tibble: 4 × 5
  target       left        weight     se     z
  <chr>        <chr>        <dbl>  <dbl> <dbl>
1 Sardinian.HO Steppe_EMBA 0.0964 0.0259  3.72
2 Sardinian.HO Anatolia_N  0.779  0.0163 47.8
3 Sardinian.HO WHG         0.0782 0.0129  6.08
4 Sardinian.HO CHG_Iran_N  0.0463 0.0307  1.51
> results$popdrop
# A tibble: 15 × 15
   pat      wt   dof chisq         p f4rank Steppe_EMBA Anatolia_N      WHG CHG_Iran_N feasible best  dofdiff chisqdiff p_nested
   <chr> <dbl> <dbl> <dbl>     <dbl>  <dbl>       <dbl>      <dbl>    <dbl>      <dbl> <lgl>    <lgl>   <dbl>     <dbl>    <dbl>
 1 0000      0     9  100. 1.38e- 17      3      0.0964      0.779   0.0782     0.0463 TRUE     NA         NA      NA         NA
 
This is common in the South:

bT9HqmH.jpg
The model looks even better when I move WHG from the right group to the left group.
> results$weights
# A tibble: 3 × 5
target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Jovialis Steppe_EMBA 0.330 0.0544 6.07
2 Jovialis Anatolia_N 0.583 0.0443 13.1
3 Jovialis CHG_Iran_N 0.0871 0.0740 1.18
> results$popdrop
# A tibble: 7 × 14
pat wt dof chisq p f4rank Steppe_EMBA Anatolia_N CHG_Iran_N feasible best dofdiff chisqdiff p_nested
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <dbl> <dbl> <dbl>
1 000 0 11 20.4 3.98e- 2 2 0.330 0.583 0.0871 TRUE NA NA NA NA
2 001 1 12 68.2 7.00e- 10 1 0.372 0.628 NA TRUE TRUE 0 -378. 1
3 010 1 12 446. 6.46e- 88 1 -0.100 NA 1.10 FALSE TRUE 0 240. 0
4 100 1 12 206. 1.68e- 37 1 NA 0.563 0.437 TRUE TRUE NA NA NA
5 011 2 13 1436. 3.32e-299 0 1 NA NA TRUE NA NA NA NA
6 101 2 13 533. 1.51e-105 0 NA 1 NA TRUE NA NA NA NA
7 110 2 13 565. 1.78e-112 0 NA NA 1 TRUE NA NA NA NA

M8OmTYF.png


I created the Statistical Robustness test as a measure of success for the results.
 
I am creating a script in python with ChatGPT code interpreter to create a tool that measures the overall p-value output from qpAdm in Admixtools.

I had it also dispense an explanation of the scoring system, the basis for the criteria of the test. I am going to script functions for more evaluation tools for the other scores, and make them all into one test that will be weighted and have an overall quality score.

I will script this tool in python and use it to verify my output results from the qpAdm to have AI provide an overall quality score for the model from 1-100%.

Example of the output results:

Here's the evaluation based on the provided p-value \(0.0794\):

- **Given p-value:** \(0.0794\)
- **Score:** Approximately \(87.21\) (out of 100)

**Scoring System Explanation:**
- A p-value of 0.05 receives the highest score of 100, as this is the conventional threshold for statistical significance and suggests a high confidence in the biological relevance of the observed result.
- p-values between 0 and 0.05 increase linearly in score from 90 to 100, indicating increasing confidence in the biological relevance as the p-value decreases.
- p-values between 0.05 and 0.1 decrease linearly in score from 100 to 90, indicating decreasing confidence in the biological relevance as the p-value increases above 0.05.
- p-values between 0.1 and 1 decrease linearly in score from 90 to 0, suggesting progressively lower confidence in the biological relevance.

**Biological Relevance Context:**
The test is specifically designed to determine if a particular parameter falls within a biologically relevant range. A p-value provides insight into how confidently we can assert that the observed result is biologically relevant and not due to random chance alone. Lower p-values indicate stronger evidence against the null hypothesis and hence a higher confidence in the biological relevance of the observed result.

**Disclaimers:**
1. A p-value does not provide information about the practical or clinical significance.
2. A low p-value in a large sample might indicate statistical but not practical significance.
3. Multiple testing without correction can lead to misleading p-values.
4. Extremely low p-values in certain contexts might be too good to be true.
5. The scoring system is a guideline for interpreting p-values and should be used in conjunction with broader study design and objectives.

Would you like any further assistance or have any other questions?
 
Last edited:

This thread has been viewed 8268 times.

Back
Top