Admixtools Utilizing AI for WGS30x FASTQ-Plink conversion & AADR Merge for Downstream Analysis in Admixtools2

Jovialis · Aug 17, 2023

Please defer to this thread:

Thread 'Modern FASTQ to Eigenstrat Format with AADR Merge'

Apr 1, 2025

This bioinformatics pipeline processes raw genomic data from paired-end FASTQ files to generate merged genotype data compatible with the AADR (Allen Ancient DNA Resource) dataset, enabling population genetics analysis. It begins by setting up a working environment and installing essential tools like samtools, bwa, fastp, Picard, pileupCaller, and EIGENSOFT. The pipeline then downloads and prepares the GRCh37 reference genome (with decoy sequences), adjusts its FASTA headers, and indexes it for alignment. Paired-end FASTQ files are trimmed and filtered using fastp, aligned to the reference...

Jovialis · Aug 17, 2023

I'm currently creating the SAM file. At the 24-hour mark, exactly 100 GB has been written. Since the FASTQ files total 105 GB combined, the finished SAM could be anywhere from 300 to 500 GB, so it will likely take a few more days to complete. I expect it to be ready between Saturday and Monday.
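The estimate above is simple rate arithmetic. A minimal sketch, assuming the write rate stays constant at the observed 100 GB per 24 hours and that the final SAM lands somewhere in the 300-500 GB range; the function name `eta_hours` is made up for illustration:

```bash
#!/usr/bin/env bash
# Project total and remaining runtime from an observed write rate.
# Figures come from the post: 100 GB written in the first 24 h,
# and an assumed final SAM size of 300-500 GB.

eta_hours() {
  # usage: eta_hours <final_size_gb> <gb_written> <hours_elapsed>
  # total hours = final size / (written / elapsed), in integer arithmetic
  echo $(( $1 * $3 / $2 ))
}

for size in 300 500; do
  total=$(eta_hours "$size" 100 24)
  echo "${size} GB SAM: ~${total} h total, ~$(( total - 24 )) h remaining"
done
```

With these inputs the script reports 72 h total (48 h remaining) for 300 GB and 120 h total (96 h remaining) for 500 GB, which is the two-to-four-day window mentioned above.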

eupator · Aug 17, 2023

You don't have to convert the plink files to eigenstrat.

admixtools2 reads plink files.

Jovialis · Aug 17, 2023

eupator said:
You don't have to convert the plink files to eigenstrat.

admixtools2 reads plink files.

Thanks!

I just investigated the differences:

Both PLINK and Eigenstrat formats are compatible with AdmixTools, but they are used for different purposes and have some distinct features:

1. **Eigenstrat Format**:
- **Purpose**: Originally designed for the EIGENSOFT package to handle issues related to population stratification in genome-wide association studies (GWAS).
- **Components**: Eigenstrat format consists of three files: `.geno` (genotype data), `.snp` (SNP information), and `.ind` (individual/sample information).
- **Features**:
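- Missing genotypes are coded explicitly (as `9` in the `.geno` file).
- The plain-text `.geno` file stores one character per genotype, so it is larger than PLINK's binary `.bed`; the packed binary variant (PACKEDANCESTRYMAP) is comparable in size.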
- Is the preferred format for some specific tools within the AdmixTools suite, especially when dealing with ancient DNA data or when there's a need to handle a lot of missing data.

2. **PLINK Format**:
- **Purpose**: PLINK format is designed for whole-genome association analysis. It's a widely-used format in various bioinformatics and genetics software.
- **Components**: PLINK format consists of three main files: `.bed` (binary genotype data), `.bim` (extended MAP file with SNP information), and `.fam` (family/individual information).
- **Features**:
- Binary format, making operations faster and storage efficient.
- Contains more detailed information about families and relationships.
- Can store phenotype information.
- Is more universally accepted across various bioinformatics tools.

**Differences in AdmixTools**:
- When using AdmixTools, the primary difference between the two formats lies in how the data is preprocessed and read into the tool.
- Some specific functionalities or operations within AdmixTools might prefer or require one format over the other. For example, certain analyses that require handling missing data might work better with the Eigenstrat format.
- In general, for most applications within AdmixTools, you can use either format. But the choice might be influenced by the specific needs of the analysis and the format of the datasets you're working with.

In conclusion, while both formats can be used with AdmixTools, you might choose one over the other based on the specific requirements of your analysis and the datasets at hand.

Q: Will the results be the same if set to the same parameters?

In theory, if both the PLINK and Eigenstrat data formats are correctly converted from the same original dataset and are subjected to the same parameters and preprocessing steps in AdmixTools, the results should be consistent between the two formats.

However, there are several factors to consider:

1. **Conversion and Preprocessing**: Differences can arise from how the data is converted and preprocessed. Ensure that any data filtering, quality control, or other preprocessing steps are consistent between the two formats.

2. **Handling of Missing Data**: Both formats encode missing genotypes explicitly (Eigenstrat as `9`, PLINK `.bed` as a dedicated two-bit code), so missingness itself survives a correct conversion. If the original dataset has a significant amount of missing data, and the tools or parameters used treat missing data differently, results might differ slightly.

3. **Tool Specificities**: Some tools or functionalities within the AdmixTools suite might have specific behaviors or optimizations based on the input format. Always refer to the specific tool's documentation to understand any nuances.

4. **Numerical Variability**: While the core algorithms should provide consistent results, minor numerical variations due to software implementation can sometimes occur, leading to very small differences in results.

In practice, for most analyses and with proper preprocessing, the differences (if any) between results obtained from PLINK and Eigenstrat formats should be minor and should not impact the overall conclusions. However, it's always a good practice to verify the consistency of results, especially when using different formats or software tools.

eupator · Aug 18, 2023

The results are similar if not identical.

I mostly use PLINK files with admixtools2 on Windows.

Jovialis · Aug 18, 2023

It has been almost a full 48 hours since I started the BWA process. Hopefully it will be ready by the time I get back on Sunday.

Here's a breakdown of the share of total time taken by each step. After BWA, I will have completed perhaps more than 70% of the process.

1. **Obtaining the Reference Genome**:
- Downloading: **1-2%**.

2. **Reference Genome Preparation**:
- Indexing the reference genome: **1-3%**.

3. **Alignment with BWA**:
- Aligning FASTQ files to the reference genome: **24-72%** (this is the most time-consuming part).

4. **SAM to BAM Conversion**:
- Conversion, sorting, indexing, marking duplicates: **2-5%**.

5. **Variant Calling**:
- **2-12%**.

6. **VCF to PLINK Conversion**:
- **1-2%**.

7. **Merging with the Reich Lab Dataset**:
- **1-3%**.

8. **PLINK to Eigenstrat Conversion**:
- **1%**.

eupator · Aug 18, 2023

Inconsistency errors are not taken into account in your work timeline.

Your merge will be a dud if you haven't trimmed.

Jovialis · Aug 18, 2023

eupator said:
Inconsistency errors are not taken into account in your work timeline.

Your merge will be a dud if you haven't trimmed.

The list is very high level; I saw there are multiple steps within it that will eliminate duplicates and other inconsistencies.

Jovialis · Aug 18, 2023

These are steps within those steps:

### SAM to BAM Conversion:

1. **SAM to BAM Conversion**:
- Command: `samtools view -bS input.sam > output.bam`
- This step converts the SAM file into a binary format called BAM. The binary format is more compact and faster to process.

2. **Sort the BAM File**:
- Command: `samtools sort input.bam -o sorted_output.bam`
- Before further processing, the BAM file needs to be sorted based on the genomic coordinates.

3. **Index the BAM File**:
- Command: `samtools index sorted_output.bam`
- Indexing the BAM file allows for quick random access, which is essential for many downstream applications.

4. **Mark Duplicates**:
- Using tools like Picard or Samtools.
- Command (Picard): `picard MarkDuplicates INPUT=sorted_output.bam OUTPUT=marked_output.bam METRICS_FILE=metrics.txt`
- Duplicate reads can arise during the PCR amplification step in the library preparation process. Marking them ensures they don't bias the variant calling process.

5. **(Optional) Local Realignment Around Indels**:
- Using GATK 3's `IndelRealigner` (the tool was removed in GATK 4, where `HaplotypeCaller` performs its own local reassembly).
- This step attempts to correct misalignments due to the presence of indels.

6. **(Optional) Base Quality Score Recalibration (BQSR)**:
- Using GATK's `BaseRecalibrator` and `ApplyBQSR`.
- This step recalibrates the quality scores of the bases, which can improve the accuracy of variant calling.
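Steps 1-4 above can be chained in one script. This is a sketch rather than a tested pipeline: the file names are placeholders, `sam_to_marked_bam` and the `RUN` dry-run switch are illustrative names, and it assumes `samtools` and a `picard` wrapper script are on the PATH.

```bash
#!/usr/bin/env bash
# Sketch of steps 1-4: SAM -> BAM, coordinate sort, index, mark duplicates.
# Dry run by default: each command is printed, not executed.
# Launch as `RUN= bash script.sh` to execute the commands for real.
RUN=${RUN-echo}

sam_to_marked_bam() {
  # usage: sam_to_marked_bam <input.sam> <output_prefix>
  local sam=$1 prefix=$2
  $RUN samtools view -b -o "$prefix.bam" "$sam" &&
  $RUN samtools sort -o "$prefix.sorted.bam" "$prefix.bam" &&
  $RUN samtools index "$prefix.sorted.bam" &&
  $RUN picard MarkDuplicates \
    INPUT="$prefix.sorted.bam" \
    OUTPUT="$prefix.marked.bam" \
    METRICS_FILE="$prefix.metrics.txt"
}

sam_to_marked_bam input.sam sample
```

Chaining with `&&` stops the run at the first failing step, so a truncated BAM is not passed on to the next tool.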

### Variant Calling:

1. **Call Variants**:
- Using tools like GATK's `HaplotypeCaller` or Samtools' `mpileup`.
- Command (GATK): `gatk HaplotypeCaller -R reference.fasta -I marked_output.bam -O output.vcf`
- This step identifies sites in the genome where your sample has a different base than the reference genome.

2. **(Optional) Variant Quality Score Recalibration (VQSR)**:
- Using GATK's `VariantRecalibrator` and `ApplyVQSR`.
- This step refines the quality scores of the variants, separating likely true variants from likely false ones.

3. **Filter Variants**:
- Using tools like GATK's `VariantFiltration` or BCFtools.
- Command (GATK): `gatk VariantFiltration -R reference.fasta -V output.vcf --filter-expression "QD < 2.0" --filter-name "lowQD" -O filtered_output.vcf`
- This step filters out variants based on various criteria to remove potential false positives.

4. **(Optional) Joint Genotyping**:
- If you're processing multiple samples, GATK's `GenotypeGVCFs` can be used to perform joint genotyping on all samples, improving variant calling accuracy.

These steps ensure a comprehensive approach to achieving high-quality BAM files and accurate variant calls. However, always remember to consult the documentation of each tool for specific guidelines and best practices.

Jovialis · Aug 18, 2023

^^I intend to go through each one to ensure I have a high-quality VCF.

Salento · Aug 18, 2023

I think that after the variant calling step, the VCF file needs to be annotated. Otherwise the vcf is still going to be huge and you won’t get the rsid.

…. Edit … nevermind

Jovialis · Aug 19, 2023

Back to the drawing board once again!

This time it was my WSL that maxed out!

So despite putting WSL on my D drive, I didn't know the output gets put on its own virtual drive, separate from C and D. That drive happens to be about 250 GB in my case, which isn't enough, so it crashed.

So now I will make sure to account for that, and start all over again.

Jovialis · Aug 20, 2023

There seems to be an error when installing EIGENSOFT, related to the lapacke file, that I can't resolve. I suspect it has to do with the way I had to set up the virtual disk with Ubuntu. Nevertheless, that is fine for now, because my first objective is to process the FASTQ to at least BAM format. The SAM is a massive hurdle, but once it is processed, I can uninstall everything and reinstall it with the outputs and file paths going where they normally would. At the moment I need to direct everything to the D drive, because it has ample space.

Jovialis · Aug 20, 2023

Nebula does provide you with FASTQ, CRAM, BAM, and VCF files. The issue is that the aligned files are not on hg19 (the reference build the Reich lab uses). The FASTQ, as the rawest of these files, can be aligned to hg19 with BWA instead.

Jovialis · Aug 21, 2023

Jovialis said:
Back to the drawing board once again!

This time it was my WSL that maxed out!

So despite putting WSL on my D drive, I didn't know the output gets put on its own virtual drive, separate from C and D. That drive happens to be about 250 GB in my case, which isn't enough, so it crashed.

So now I will make sure to account for that, and start all over again.

This iteration seems to be going well (Fingers crossed)

I set it up so the virtual disk is on the D drive. However, it is important to make sure the output is not sent to that virtual disk too (not enough storage in my case), but rather to a physical local drive big enough to handle 300-500 GB. The one I am using has 828 GB free, so it is sufficient. I've monitored the virtual disk, and its size has stayed the same, which is a good sign. Meanwhile, the SAM file is being generated on the actual D drive in a dedicated folder.

Should be ready in 3 to 5 days.
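Since the earlier crash came from the output landing on a volume that was too small, a free-space check before launching a multi-day job is cheap insurance. A minimal sketch, assuming a POSIX `df` and using the 500 GB upper estimate from earlier in the thread; `free_gb`, `has_room`, `OUT_DIR` and `REQUIRED_GB` are illustrative names:

```bash
#!/usr/bin/env bash
# Sketch: check that the output directory has enough free space
# before starting a long-running job.

free_gb() {
  # usage: free_gb <directory>
  # Prints available space in whole GB. `df -Pk` reports 1024-byte blocks,
  # with the "Available" figure in column 4 of the second line.
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

has_room() {
  # usage: has_room <directory> <required_gb>; succeeds if enough space is free
  [ "$(free_gb "$1")" -ge "$2" ]
}

out_dir=${OUT_DIR:-.}            # e.g. OUT_DIR=/mnt/d/FASTQ under WSL
required_gb=${REQUIRED_GB:-500}  # upper SAM-size estimate from this thread

if has_room "$out_dir" "$required_gb"; then
  echo "OK: $(free_gb "$out_dir") GB free in $out_dir"
else
  echo "Not enough space: $(free_gb "$out_dir") GB free in $out_dir, need $required_gb GB" >&2
fi
```

The column parsing assumes the filesystem name contains no spaces. Separately, piping `bwa mem` straight into `samtools sort` is a common way to avoid writing the uncompressed SAM to disk at all, which lowers the space requirement considerably.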

Jovialis · Aug 22, 2023

I decided to terminate the process because I first need to run FastQC and clean the data with other tools before alignment.

Nevertheless, at least now I know I can get the SAM creation step to work.

I know that for the purpose of ancestry analysis the difference would have been minimal (on the order of 0.1%), and that most of the really important quality controls come post-alignment. However, I strive for accuracy, and it would really bug me moving forward.

Thus I will begin again.

Jovialis · Aug 22, 2023

Jovialis said:
I decided to terminate the process because I first need to run FastQC and clean the data with other tools before alignment.

Nevertheless, at least now I know I can get the SAM creation step to work.

I know that for the purpose of ancestry analysis the difference would have been minimal (on the order of 0.1%), and that most of the really important quality controls come post-alignment. However, I strive for accuracy, and it would really bug me moving forward.

Thus I will begin again.

About a half-hour in thus far:

Code:

/mnt/d/FASTQ/Jovialis_1.fq.gz /mnt/d/FASTQ/Jovialis_2.fq.gz
Started analysis of Jovialis_1.fq.gz
Approx 5% complete for Jovialis_1.fq.gz

Since I have paired-end FASTQs, I will afterwards run the results through MultiQC to aggregate them appropriately.

Furthermore, I will also go with Trimmomatic to clean the FASTQ.

Jovialis · Aug 23, 2023

Results of the FastQC analysis for 1 of the 2 paired-end FASTQ files. The second FastQC analysis is underway.

Jovialis · Aug 23, 2023

I will use fastp to remove the first 15 bases, which are out of whack, probably due to primers and synthetic constructs. This should eliminate the warning when I re-run FastQC.
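A fixed 5' trim like this could look as follows with fastp. This is only a sketch: the output and report names are hypothetical, `trim_front_15` and the `RUN` dry-run switch are illustrative, and the input paths are the ones from the FastQC log earlier in the thread.

```bash
#!/usr/bin/env bash
# Sketch: hard-trim the first 15 bases of both mates with fastp.
# Dry run by default (prints the command); launch as `RUN= bash script.sh` to execute.
RUN=${RUN-echo}

trim_front_15() {
  # usage: trim_front_15 <R1.fq.gz> <R2.fq.gz>
  # --trim_front1 / --trim_front2 cut bases from the 5' end of R1 / R2.
  $RUN fastp \
    --in1 "$1" --in2 "$2" \
    --out1 trimmed_1.fq.gz --out2 trimmed_2.fq.gz \
    --trim_front1 15 --trim_front2 15 \
    --html fastp_report.html --json fastp_report.json
}

trim_front_15 /mnt/d/FASTQ/Jovialis_1.fq.gz /mnt/d/FASTQ/Jovialis_2.fq.gz
```

Note that fastp also applies adapter trimming and quality filtering by default, so the result is not a pure 15-base cut unless those are switched off.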

Jovialis · Aug 23, 2023

Just to be more precise, I had the AI analyze the fastqc_data file, which summarizes the quality of the sequencing reads.

The FastQC report provides several metrics about the quality of your sequencing reads. Here's a breakdown of the relevant sections for trimming considerations:

1. **Per base sequence quality**:
- This section shows the quality scores across all positions in your reads. For each position, you'll see the average quality score, the median, the 10th and 90th percentiles, etc.
- As a general rule, positions with a mean quality score below 30 or a 10th percentile score below 20 are often considered for trimming, since they can represent unreliable base calls. However, your data seems to have high quality scores across the board, with all mean values exceeding 30 and the 10th percentile values exceeding 20 for the first 15 positions.

2. **Per base sequence content**:
- This section shows the proportion of each base (A, T, C, G) at each position across all reads. A warning in this section might be due to the presence of adaptors, non-random primers, or other contaminating sequences.
- In your report, the base proportions in the first few positions (especially 1 and 2) deviate significantly from the overall base composition, indicating potential adapter or primer sequences. Given the unusual base proportions at positions 1 and 2, you might consider trimming these positions off.

3. **Per base N content**:
- This section shows the proportion of reads with 'N' at each position. 'N' denotes an undetermined base.
- In your report, the N content is extremely low across all positions, so this is not a concern.

4. **Per sequence GC content**:
- This gives a histogram of GC content across all sequences. A warning here might indicate contamination or a non-random subset of sequences.
- The warning in this section likely isn't relevant for trimming considerations, but it might be something you'd want to investigate further, especially if you're doing a de novo assembly or other analysis where unexpected GC content could be problematic.

5. **Sequence Length Distribution**:
- This section shows the distribution of sequence lengths in your dataset. All of your sequences are 100 bp long, so there's no variation to consider here.

6. **Sequence Duplication Levels**:
- This section shows the level of sequence duplication. It can be useful for identifying over-represented sequences, which might be artifacts or contaminants.
- Your dataset seems to have a majority of unique sequences (88.36% deduplicated), which is good.

7. **Adapter Content**:
- This section identifies any known adapter sequences in your reads.
- The report suggests minimal adapter contamination, which is good. However, if you suspect there might be adapter sequences not covered by FastQC's default set, you'd need to trim or filter them using tools that allow for custom adapter sequences.
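The per-base content check in item 2 can be reproduced from `fastqc_data.txt` directly. The sketch below assumes the usual tab-separated layout of that file (a `>>Per base sequence content` module with a `#Base G A T C` header row) and reuses FastQC's own warning rule, a gap of more than 10 percentage points between A and T or between G and C; `skewed_positions` is a made-up helper name.

```bash
#!/usr/bin/env bash
# Sketch: list read positions with skewed base composition, taken from the
# "Per base sequence content" module of a FastQC fastqc_data.txt file.

skewed_positions() {
  # usage: skewed_positions <fastqc_data.txt> [threshold_in_percentage_points]
  awk -F '\t' -v thr="${2:-10}" '
    function abs(x) { return x < 0 ? -x : x }
    /^>>Per base sequence content/ { in_module = 1; next }
    /^>>END_MODULE/                { in_module = 0 }
    in_module && /^#/ {
      # Header row: remember which column holds each base
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    in_module {
      at = abs($(col["A"]) - $(col["T"]))
      gc = abs($(col["G"]) - $(col["C"]))
      if (at > thr || gc > thr)
        printf "%s\tA-T=%.1f\tG-C=%.1f\n", $1, at, gc
    }
  ' "$1"
}

# Only runs if a report is present in the current directory
if [ -f fastqc_data.txt ]; then
  skewed_positions fastqc_data.txt
fi
```

Each output line gives a position (or position range, as FastQC groups later bases) and the two gaps, which makes it easier to decide how many leading bases are actually affected.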

**Recommendation**:
Given the data provided, I'd recommend trimming the first 2 bases from each read, as they show unusual base content which could be indicative of adapter or primer sequences. The rest of the read seems to be of high quality, and there's no strong evidence of adapter contamination in the later positions.

To perform this trimming, you can use tools like `Trim Galore!`, `Cutadapt`, or `Trimmomatic`. For example, using `Cutadapt`:

```bash
cutadapt -u 2 -o output_trimmed.fq Jovialis_1.fq.gz
```

This command trims the first 2 bases from each read in the input file `Jovialis_1.fq.gz` and writes the trimmed reads to `output_trimmed.fq`. Adjust the command as necessary for your environment and filenames.
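Because the FASTQs in this thread are paired-end, both files would normally be trimmed in one call so the mates stay in sync. A sketch of the paired-end form, assuming the second file is `Jovialis_2.fq.gz` as in the FastQC log earlier in the thread; the output names, the `trim_pair` function and the `RUN` dry-run switch are hypothetical:

```bash
#!/usr/bin/env bash
# Sketch: trim the same number of leading bases from both mates with Cutadapt.
# Dry run by default (prints the command); launch as `RUN= bash script.sh` to execute.
RUN=${RUN-echo}

trim_pair() {
  # usage: trim_pair <bases> <R1.fq.gz> <R2.fq.gz>
  # -u trims the start of R1 reads, -U the start of R2 reads;
  # -o and -p name the R1 and R2 output files.
  $RUN cutadapt -u "$1" -U "$1" \
    -o trimmed_1.fq.gz -p trimmed_2.fq.gz \
    "$2" "$3"
}

trim_pair 2 Jovialis_1.fq.gz Jovialis_2.fq.gz
```

Cutadapt picks the output compression from the file extension, so the `.fq.gz` names keep the trimmed files gzipped.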
