Utilizing AI for WGS 30x FASTQ-to-PLINK Conversion & AADR Merge for Downstream Analysis in ADMIXTOOLS 2

If you want some feedback: you are wasting your time.

The highest SNP count you'll get with the Reich data is 1.24 million SNPs. You can hit that number easily if you extract a raw file from your BAM file.

Then you can convert it with plink.
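
For reference, with PLINK 1.9 that conversion is a one-liner (a sketch; the raw-file name and sample IDs here are placeholders):

    # convert a 23andMe-style raw text file straight to PLINK binary format
    plink --23file mygenome_raw.txt FAM001 ID001 --make-bed --out mygenome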
 
So now, instead of trimming all 15 bases, the AI recommends I trim just the initial two, since that may stabilize the rest.

Hopefully that will rectify the Per Base Sequence Content issue and bring the WARN to a PASS.
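
The trim itself is quick to set up, e.g. with fastp on the paired-end files (a sketch; the filenames are placeholders, and fastp is my assumption for the trimmer):

    # remove the first 2 bases from read 1 and read 2 of every pair
    fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
          -o sample_R1.trim.fastq.gz -O sample_R2.trim.fastq.gz \
          --trim_front1 2 --trim_front2 2
    # if the WARN persists, the same flags take 10, or 15 for the aggressive trim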

To address the Per Sequence GC Content, I will need a tool like CorrectGC, but I will do that afterwards.

More computational time ahead. But I don't find it discouraging; in fact, finding solutions to these complex issues is intriguing to me.
 

The problem with extracting from the BAM is that the CRAM and BAM files provided by Nebula are aligned to HG38.

I had converted the VCF to PLINK format and ran into the same issue.

I looked into converting to HG19 post-FASTQ, but it would reduce the accuracy.
 
WGSExtract can produce a raw DNA file with all SNPs from hg38.

But if you want to use the VCF, you can use DNA Kit Studio, which will convert it to a 23andMe-style txt file that plink can read.
 
This exercise is also interesting to me for the sake of knowing how to do it and what goes into it.
 

I do have my raw data in combined format from when I ran it through WGSExtract a couple of years ago. Later, I'll convert it to PLINK format and check it out. Right now, my PC is in the process of trimming the two bases from the paired-end FASTQ files.

I will still endeavor to see this exercise through to completion. But it would be interesting to compare the two.
 
On ENA, many files are available only in FASTQ format, so this is also a learning exercise for me to finally figure out how to process them. I see it would require a "Monstrous" tier PC to produce multiple conversions in a reasonable amount of time:

[Screenshot: Logical Increments "Monstrous" tier PC build]


https://www.logicalincrements.com/
 
This is also interesting to me as a way to explore the uses of artificial intelligence. I find it impressive that it could read the data from an uploaded text file, accurately replicate what FastQC shows, and provide analysis.

[Screenshot: ChatGPT's analysis of the FastQC report]

I started with the less aggressive trim of only the first 2 bases, as it may fix the frequency issue. If not, I'll have to be more aggressive and do 10, and if need be, all 15.
 
[Screenshot: FastQC Per Sequence GC Content graph]

Since I am a human being, the bimodal distribution in the Per Sequence GC Content graph is not due to a genuine biological difference.

Rather, it is likely due to contamination (such as bacteria) or some technical issue.

At any rate, once I resolve the Per Base Sequence Content issue, I will filter out reads that fall outside the 20%-70% GC range.
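
As far as I know, fastp itself has no dedicated GC-range filter, so that step may end up going through BBDuk instead, which does have one (a sketch; the filenames are placeholders):

    # keep only read pairs whose GC fraction is between 0.20 and 0.70
    bbduk.sh in1=sample_R1.trim.fastq.gz in2=sample_R2.trim.fastq.gz \
             out1=sample_R1.gc.fastq.gz out2=sample_R2.gc.fastq.gz \
             mingc=0.2 maxgc=0.7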


CorrectGC is meant to be used post-alignment, to eliminate any GC bias relative to the HG19 reference genome.
 
It seems the best path forward will be to proceed to alignment without trimming the FASTQ files.

Removing the first two bases creates a new issue in Sequence Length Distribution. Therefore, at least at this stage, the warning cannot be fixed without creating issues elsewhere, i.e., non-uniform sequence lengths.

After some investigating, it appears the warnings are due to the high sensitivity of FastQC, which flags common processing artifacts; they should not impact downstream analysis, especially ancestry analysis with ADMIXTOOLS.

Overall, the quality of the Nebula paired-end FASTQ files is high.

At any rate, I am going to change gears and process my raw data to PLINK format, because I'm tired of watching outputs generate endlessly.

Nevertheless, after I produce the PLINK files from the raw data, I'll resume this endeavor and go straight to aligning the FASTQs to HG19 with BWA.
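
For reference, the alignment itself should come down to two commands (a sketch; assumes the hg19 FASTA is already downloaded, and the thread count is a placeholder):

    # index the reference once (slow for a whole genome, but only needed once)
    bwa index hg19.fa
    # align the paired-end reads and write the SAM
    bwa mem -t 8 hg19.fa sample_R1.fastq.gz sample_R2.fastq.gz > aligned_hg19.sam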

I decided to terminate the process, because I first need to run FastQC and perform all of the data cleaning before alignment with the other tools.

Nevertheless, at least now I know I can manage the process as far as creating the SAM.

I know that for the purpose of ancestry analysis the difference would have been minimal (on the order of 0.1%), and most of the really important quality controls come post-alignment. However, I strive for accuracy, and it would really bug me going forward.

Thus I will begin again.

I would have been finished by now had I not decided to do this. :)
 
The ancient ENA FTP FASTQ samples are treated differently than the modern samples. They are usually short-read and could be contaminated with other bugs or creatures, so they need to be filtered to ensure you only get human genetic material. I am in the Caribbean right now; when I get back in a few days I will explain it further :)
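
Presumably the core of that filtering is along these lines: map everything against the human reference and keep only the reads that align (a sketch with bwa and samtools; the filenames are placeholders, and real ancient-DNA pipelines add more steps than this):

    # map the short reads, then drop everything that does not align to human
    bwa mem -t 8 hg19.fa ancient_sample.fastq.gz | samtools view -b -F 4 - > human_only.bam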
 
Nice! Enjoy your trip!
 
You know I just don't want to accept anything less than the best that can be achieved.

I've looked at a few videos on YouTube, and when a Per Base Sequence Content check is in WARN, it looks just like mine. It is indeed caused by a primer used in the sequencing process. In those examples, it looks like they aggressively removed all 15 bases; then again, their reads were also longer than 100 bases, something like 150. I wonder if any trimming at all will cause the new WARN I am seeing in Sequence Length Distribution. I don't mind trial and error, but it would take me a whole day just to test it, and I don't feel like gambling my time on that at the moment. I will test it one day when I can just leave my computer running at home while I'm out doing something else IRL.
 
Helpful fact:

If you upload your raw data txt file to ChatGPT 4.0 Code Interpreter, you can instruct it to make the PED and MAP files used by PLINK, and download them straight from the chat.
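
And once you have the PED/MAP pair, pulling it into PLINK binary format is a single command (a sketch; the file prefix is a placeholder):

    # read mygenome.ped + mygenome.map, write mygenome.bed/.bim/.fam
    plink --file mygenome --make-bed --out mygenome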
 
It is literally this easy to convert raw data to PLINK files with ChatGPT:

[Screenshot: ChatGPT Code Interpreter performing the PLINK conversion]
 
I think I am running into a PLINK-to-EIGENSTRAT conversion issue because the versions are incompatible.

I have PLINK 1.9 installed.

I had EIGENSOFT 5, but I think that was the issue, so I removed it.

Now I am going to install EIGENSOFT 6 instead, and hopefully that will fix the issue.
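
For reference, the conversion itself runs through EIGENSOFT's convertf with a small parameter file, something like this (the file names are placeholders):

    # par.PED.EIGENSTRAT -- parameter file for convertf
    genotypename:    mygenome.bed
    snpname:         mygenome.bim
    indivname:       mygenome.fam
    outputformat:    EIGENSTRAT
    genotypeoutname: mygenome.geno
    snpoutname:      mygenome.snp
    indivoutname:    mygenome.ind

Then run: convertf -p par.PED.EIGENSTRAT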

Separately, I was able to produce PLINK files of my genome in HG19 coordinates from the initial PLINK files generated from the raw data, thanks to LiftOver.
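
Roughly, that lift looks like this (a sketch; assumes the UCSC liftOver binary and the hg38ToHg19 chain file, with placeholder filenames):

    # turn the .bim positions into UCSC BED (0-based start, 1-based end)
    awk '{print "chr"$1"\t"$4-1"\t"$4"\t"$2}' mygenome_hg38.bim > hg38_snps.bed
    # lift to hg19; SNPs that fail land in unmapped.bed
    liftOver hg38_snps.bed hg38ToHg19.over.chain.gz hg19_snps.bed unmapped.bed

After that, the new positions go back into the PLINK files (e.g. via --update-map) and the unlifted SNPs get dropped with --exclude.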
 
I really do not recommend using GPT-3.5; the model is garbage. It is probably also why most people think AI is not that accurate. I was only using it because Plus is capped. GPT-4 is far better suited, but you still need to watch it.
 
I attempted to use LiftOver to convert my HG38-aligned VCF to HG19. However, a significant portion of SNPs were deemed incompatible and were discarded in the process.
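
For anyone trying the same thing: one standard route for VCF liftover is CrossMap, which takes the chain file plus the target reference FASTA (a sketch; the filenames are the generic UCSC ones, not my local paths):

    # lift an hg38 VCF to hg19; failed variants go to output_hg19.vcf.unmap
    CrossMap.py vcf hg38ToHg19.over.chain.gz input_hg38.vcf.gz hg19.fa output_hg19.vcf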


Given these challenges, I've decided to revert to my initial plan of directly producing an HG19-aligned dataset. Despite two warnings from the FastQC analysis, I've verified that my FASTQ files are of exceptionally high quality, potentially surpassing the quality of many other FASTQ samples that are typically processed. Furthermore, trimming attempts have led to additional quality issues in other areas. Therefore, I believe proceeding with the current FASTQ files and focusing on quality control post-alignment is the best approach.

I should have my SAM complete by Tuesday, perhaps.
 