Utilizing AI for WGS30x FASTQ-Plink conversion & AADR Merge for Downstream Analysis in Admixtools2

Sam complete.

Total time to process: 77.7 hours
gWyak4w.png
 
BAM yield from SAM was 117 GBs. I started the process yesterday at 11:25 am, and it ended sometime early this morning, while I was sleeping.

Here are computation estimates for time on the remaining steps, but right now I'm going to give my PC a break:

TBTZceB.png
 
1-3 hours, what a joke. Sorting a 117 GB BAM is taking me until around 6 pm tomorrow, and I started 11:30 AM this morning.

Basically what happens is samtools creates hundreds of temp files (413 in my case), and then merges them all into a sorted BAM.

The AI is good for coding prompts, but it can mess up on estimating time occasionally.
 
Last edited:
Also, monitoring the progress of the file in Powershell is the only way I can tell how the process is coming along. In the Ubuntu terminal, nothing outputs after initiating the command; which is normal.

Code:
$sizeBytes = (Get-Item "D:\UbuntuJovialisHome\Jovialis_sorted.bam").length; $sizeGB = [math]::Round($sizeBytes / 1GB, 2); Write-Output "File size: $sizeGB GB"
 
J0iseBt.png


Marked duplicates which I will filter out at the VCF processing phase. Thus far the quality control results have been phenomenal! 95 out of 100 in overall quality. That is based on assessing all of the QC reports I have produced thus far. I still need to run samtools stats for Aligned Reads QC, but I will be traveling today, and will have to wait to process that as it can take many hours.
 
If you want some feedback, you are wasting your time.

The highest SNP count possible you'll get with the Reich data is 1.24mil SNPs. You can hit that number easy if you extract a raw file from your BAM file.

Then you can convert with plink.
I decided to table my endeavor since the VCF phase is too complicated for me to figure out at the moment. My QC reports show there is very high quality coverage, but since I am not sure what parameters to set in the prompt, I ended up generating a 350 GB mpileup! I still have it on an external hard drive, but this is too large to work with.

I think removing the indels would be conducive to creating a smaller file, but because the quality of the data is pretty high, I am not sure what should be cut. Which is the issue since it makes the mpileup a monster. I could set it to be very stringient, but I don't want to risk losing low-quality but legit data.

Thus, I decided to take the sorted bam and run it through WGSExtract to generate a raw data which I will put into 23andme_V5 format.

Nevertheless this endeavor was necessary for me, at least from FASTQ to BAM since I needed to re-algin the Genome reference to HG19 to be compatible with the Reich Lab data.

Prior I tried to Liftover the HG38 formated raw data to H19, but only a fraction of the SNPs were able to be converted. Thus, it was necessary for me to re-algin in HG19. Ergo, it was indeed a fruitful endeavor.

Once the raw data is created, I'll convert it over to plink.
 
Last edited:
I decided to table my endeavor since the VCF phase is too complicated for me to figure out at the moment. My QC reports show there is very high quality coverage, but since I am not sure what parameters to set in the prompt, I ended up generating a 350 GB mpileup! I still have it on an external hard drive, but this is too large to work with.

I think removing the indels would be conducive to creating a smaller file, but because the quality of the data is pretty high, I am not sure what should be cut. Which is the issue since it makes the mpileup a monster. I could set it to be very stringient, but I don't want to risk losing low-quality but legit data.

Thus, I decided to take the sorted bam and run it through WGSExtract to generate a raw data which I will put into 23andme_V5 format.

Nevertheless this endeavor was necessary for me, at least from FASTQ to BAM since I needed to re-algin the Genome reference to HG19 to be compatible with the Reich Lab data.

Prior I tried to Liftover the HG38 formated raw data to H19, but only a fraction of the SNPs were able to be converted. Thus, it was necessary for me to re-algin in HG19. Ergo, it was indeed a fruitful endeavor.

Once the raw data is created, I'll convert it over to plink.


23ame V3 format has the highest coverage, you should use that or the all_SNPs option, the latter has some glitches I've noticed so v3 is the best (imo).
 
23ame V3 format has the highest coverage, you should use that or the all_SNPs option, the latter has some glitches I've noticed so v3 is the best (imo).

I keep running into this weird issue when trying to create the PED and MAP files in Plink 1.9:

Error: Chromosomes in Jovialis_sorted__CombinedKit_All_SNPs_RECOMMENDED.txt
are out of order.


I tried it also with V3, but I get the same issue. Not sure what the problem is, because I simply ran it through WGSextract.

Any advice on how to fix it? I can confirm it is correctly aligned to HG19 format. The raw data file works fine in calculators like MTA for example.
 
Last edited:
I keep running into this weird issue when trying to create the PED and MAP files in Plink 1.9:

Error: Chromosomes in Jovialis_sorted__CombinedKit_All_SNPs_RECOMMENDED.txt
are out of order.


I tried it also with V3, but I get the same issue. Not sure what the problem is, because I simply ran it through WGSextract.

Any advice on how to fix it? I can confirm it is correctly aligned to HG19 format.
try ... assuming that WGSExtract liftover 19 to 37 (if not delete all the "chr" ... chr1 should be 1 and so on), convert the Combined or the 23v3 raw-data to 23andme format with DNA Kit Studio.
After converting to plink, sometimes I need to replace the -9 in the .fam file with 1, before recoding bed to ped,
example: change Latin R850 0 0 0 -9 to Latin R850 0 0 0 1
... my convertf version doesn't like -09 ... :)
 
Last edited:
try ... assuming that WGSExtract liftover 19 to 37 (if not delete all the "chr" ... chr1 should be 1), convert the Combined or the 23v3 raw-data to 23andme format with DNA Kit Studio.
After converting to plink, sometimes I need to replace the -9 in the .fam file with 1, then "recode" bed to ped,
example: change Latin R850 0 0 0 -9 to Latin R850 0 0 0 1
... my convertf version doesn't like -09 ... :)
Actually I just tried it for 23andme_V5 and it works. I created the plink files , bim, fam, and bed.

7861 MB RAM detected; reserving 3930 MB for main workspace.
--23file: /mnt/d/Raw_Data_Jovialis/Jovialis_sorted_23andMe_V5_PLINK.bed +
/mnt/d/Raw_Data_Jovialis/Jovialis_sorted_23andMe_V5_PLINK.bim +
/mnt/d/Raw_Data_Jovialis/Jovialis_sorted_23andMe_V5_PLINK.fam written.
Inferred sex: male.


Maybe it didn't work because V3 and Combined format are not compatible with Plink 1.9?
 
Here's the prompt incase it is of use to anyone:

Code:
plink --23file "/mnt/d/Raw_Data_Jovialis/Jovialis_sorted_23andMe_V5.txt" --out "/mnt/d/Raw_Data_Jovialis/Jovialis_sorted_23andMe_V5_PLINK"

Actually I just tried it for 23andme_V5 and it works. I created the plink files , bim, fam, and bed.

7861 MB RAM detected; reserving 3930 MB for main workspace.
--23file: /mnt/d/Raw_Data_Jovialis/Jovialis_sorted_23andMe_V5_PLINK.bed +
/mnt/d/Raw_Data_Jovialis/Jovialis_sorted_23andMe_V5_PLINK.bim +
/mnt/d/Raw_Data_Jovialis/Jovialis_sorted_23andMe_V5_PLINK.fam written.
Inferred sex: male.


Maybe it didn't work because V3 and Combined format are not compatible with Plink 1.9?
 
Actually I just tried it for 23andme_V5 and it works. I created the plink files , bim, fam, and bed.

7861 MB RAM detected; reserving 3930 MB for main workspace.
--23file: /mnt/d/Raw_Data_Jovialis/Jovialis_sorted_23andMe_V5_PLINK.bed +
/mnt/d/Raw_Data_Jovialis/Jovialis_sorted_23andMe_V5_PLINK.bim +
/mnt/d/Raw_Data_Jovialis/Jovialis_sorted_23andMe_V5_PLINK.fam written.
Inferred sex: male.


Maybe it didn't work because V3 and Combined format are not compatible with Plink 1.9?
v3 is compatible with plink, ... maybe there are "unrecognized characters" in the chromosome column of the raw-data file, ... if so it needs some editing ... (v5 is smaller than v3, and maybe it doesn't carry the lines with the errors).
... I would begin by converting the v3 or the comb raw-data with DNA kit Studio to 23andme (no raw-data templates at first).
 
Last edited:
Your command should be

plink --allow-no-sex --23file jovialisrawdnafile.txt --make-bed --out jovialis

After that you need to rename your sample which will be called ID001 in the .fam file (use a code editor) then filter.
 
still getting the same errors.

I am going to have WGSextract try to process the seminal BAM file that was made from the SAM file. That way we could see if there was any issues processing after sorting, indexing, and duplication marking. However the QCs were great, and the processes were successful. Nevertheless, perhaps it is something idiosyncratic to WGSextract. I am using WGSextract Beta 2 (windows), since I have the most experience with that version.
 
still getting the same errors.

I am going to have WGSextract try to process the seminal BAM file that was made from the SAM file. That way we could see if there was any issues processing after sorting, indexing, and duplication marking. However the QCs were great, and the processes were successful. Nevertheless, perhaps it is something idiosyncratic to WGSextract. I am using WGSextract Beta 2 (windows), since I have the most experience with that version.
I get this error if I run the original bam from the sam (unindexed, unsorted, unmarked)

Betav2b_full\WGSExtract\temp\get_samtools_stats.sh"bash.exe: warning: could not find /tmp, please create!samtools idxstats: fail to load index for "D:/Experiment/Jovialis.bam", reverting to slow methodsamtools idxstats: file is not position sortedsamtools idxstats: failed to process "D:/Experiment/Jovialis.bam"

Now instead I will take the one that is fully sorted, indexed, and marked, and have WGSextract create the BAI, and see if that would help with the Out of order chromosome issue.

If not, would a more advanced version of WGSextract work perhaps?
 
I think I may have done it, I had the AI convert the sorted indexed, marked 23andme_V3.txt into a vcf, then I made the PED, MAP, and subsequent Plink files in Plink 1.9:

ChatGPT:
Great, it looks like the PLINK command executed successfully and generated the binary PLINK files: `Jovialis_binary.bed`, `Jovialis_binary.bim`, and `Jovialis_binary.fam` in your `/mnt/d/Experiment/` directory. The log also indicates that 959,286 variants and 1 individual (you) were processed.

A few notes from the log:
- It mentions "Ambiguous sex ID" because we didn't specify the sex in the VCF file. You can usually add this information later or specify it using a `.fam` file if needed.
- "Total genotyping rate is exactly 1" indicates that all variants had genotype information, which is good.
- "No phenotypes present" is expected since we didn't include phenotype information.

You can now use these PLINK binary files for your downstream analyses.

Is there anything else you'd like to do or any other questions you have?
 

This thread has been viewed 8268 times.

Back
Top