NGS Quality Control Check
A Comprehensive Bioinformatics Workshop
Master the essential techniques for ensuring data quality and reconstructing genomes from next-generation sequencing reads.
Foundation
Why Quality Control Matters
Quality control is the foundation of any successful NGS analysis. Without rigorous QC, you risk the classic "garbage-in-garbage-out" scenario where flawed input data produces unreliable results. QC catches systematic issues early, before they cascade through your entire analysis pipeline.
Proper quality assessment saves valuable computational resources by preventing you from processing data that will ultimately yield poor results. It ensures your findings are reproducible and defensible, and critically informs your trimming decisions.
Skipping QC is like skipping a foundation inspection before building a house — the consequences compound over time.
Public Data
SRA: Sequence Read Archive
What is SRA?
NCBI's primary archive for high-throughput sequencing data, storing raw sequencing reads from various platforms (Illumina, PacBio, Nanopore, etc.).
Why it matters:
As the public repository for sharing NGS data, SRA enables reproducibility; most journals require raw reads to be deposited before publication.
What's stored:
Raw FASTQ files, metadata about samples, experimental protocols, and sequencing platforms.
Accessing data:
  • Use SRA Toolkit (fastq-dump, fasterq-dump) to download data.
  • Utilize cloud-based access through AWS or GCP.
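As a hedged sketch of a toolkit download, the commands below fetch reads for one run; the accession and output directory are placeholders, not values from this workshop:

```shell
# Download paired-end reads for a run with the SRA Toolkit.
# SRR000001 is a placeholder accession; substitute your own run ID.
ACCESSION="SRR000001"
OUTDIR="raw_reads"

if command -v fasterq-dump >/dev/null 2>&1; then
    # --split-files writes mates to separate _1.fastq and _2.fastq files
    fasterq-dump --split-files --outdir "$OUTDIR" "$ACCESSION"
else
    echo "SRA Toolkit not on PATH; install it to fetch $ACCESSION"
fi
```

The `command -v` guard lets the sketch degrade gracefully on machines without the toolkit installed.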
Data organization:
Data is structured by BioProject (study), BioSample (biological sample), and SRA Run (sequencing run).
Quality considerations:
Downloaded SRA data still requires quality control with FastQC before assembly — never assume public data is clean.
Practical tip: Always check data quality metrics before starting analysis, even for published datasets.
Understanding Quality Scores
Q20
1 in 100 errors
99% accuracy
Q25
1 in 316 errors
99.7% accuracy
Q30
1 in 1000 errors
99.9% accuracy
Q40
1 in 10,000 errors
99.99% accuracy
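These figures all follow from the Phred definition Q = -10 * log10(P), where P is the per-base error probability. A small sketch to reproduce the table (the set of Q values looped over is just illustrative):

```shell
# Phred quality: Q = -10 * log10(P), so P = 10^(-Q/10).
for q in 20 25 30 40; do
    awk -v q="$q" 'BEGIN {
        p = 10 ^ (-q / 10)                 # per-base error probability
        printf "Q%d: 1 error in %.0f bases (%.2f%% accuracy)\n", q, 1 / p, (1 - p) * 100
    }'
done
```

Note that Q25 works out to 1 error in ~316 bases because 10^2.5 ≈ 316.23.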
Typical Quality Profile
Quality scores follow a predictable pattern across read length. Scores start high at the 5' end where the sequencer is fresh, plateau in the middle region where chemistry is stable, then degrade at the 3' end as reagents deplete and polymerase efficiency drops.
This degradation is normal and expected — it's part of the sequencing chemistry.
  1. Photobleaching - Fluorescent dyes fade over time.
  2. Accumulation of errors - Small mistakes compound as you go deeper into the read.
Understanding this degradation pattern is crucial for deciding where to trim your reads.
Analysis
FastQC Key Metrics
1
Per-Base Quality
Tracks quality scores across read length, revealing positional degradation patterns
2
Per-Sequence GC
Shows GC content distribution, which should match your organism's expected profile
3
N Content
Identifies uncalled bases where sequencer couldn't determine nucleotide (should be near 0%)
4
Duplication Levels
Measures read redundancy, indicating PCR bias or natural library characteristics
5
Adapter Content
Detects leftover sequencing adapters that must be removed before analysis
6
Sequence Length
Confirms expected read length distribution across your dataset
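Generating these metrics for a batch of samples might look like the following sketch; the input glob and output directory are assumptions:

```shell
# Run FastQC over every gzipped FASTQ in raw_reads/ (an assumed path).
OUTDIR="qc_reports"
mkdir -p "$OUTDIR"

if command -v fastqc >/dev/null 2>&1; then
    # One HTML report and one .zip of raw metrics per input file
    fastqc --outdir "$OUTDIR" --threads 4 raw_reads/*.fastq.gz
else
    echo "fastqc not on PATH; reports would be written to $OUTDIR"
fi
```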
MultiQC: Aggregate Analysis
From Hundreds of Reports to One Dashboard
MultiQC transforms the tedious task of reviewing hundreds of individual FastQC HTML reports into an elegant, interactive dashboard experience. It automatically discovers and combines FastQC reports from all your samples, generating a unified view of your entire sequencing run.
The tool excels at cross-sample comparison, making outliers and batch effects immediately visible. You can quickly identify which samples deviate from expected patterns and investigate potential technical issues.
MultiQC exports statistics in machine-readable formats, enabling downstream analysis and record-keeping. One comprehensive report instead of 100+ separate HTML files!
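Aggregation is typically a one-liner; this sketch assumes the FastQC reports live under a `qc_reports/` directory:

```shell
# Combine every FastQC report found under qc_reports/ into one dashboard.
QC_DIR="qc_reports"
REPORT_DIR="multiqc_out"

if command -v multiqc >/dev/null 2>&1; then
    # Scans QC_DIR recursively; writes multiqc_report.html plus
    # machine-readable tables (multiqc_data/) into REPORT_DIR
    multiqc "$QC_DIR" --outdir "$REPORT_DIR"
else
    echo "multiqc not on PATH (installable via pip install multiqc)"
fi
```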
Troubleshooting
Common QC Issues
High N Content
Sequencer couldn't confidently call bases at specific positions. Often indicates optical issues or chemistry problems. Action: Investigate sequencer logs and consider re-sequencing.
Multimodal GC Distribution
Multiple peaks in GC content suggest possible contamination from another organism. Action: BLAST suspicious sequences to identify contaminant source.
High Duplication Rates
Excessive identical reads may indicate PCR over-amplification or library complexity issues. Can be expected in targeted sequencing. Action: Evaluate in context of experimental design.
Quality Score Cliff
Sudden, dramatic quality drop suggests systematic instrument malfunction. Action: Contact sequencing facility and check for known issues with that run.
Persistently Low Base Quality
Overall poor quality across reads requires aggressive trimming or complete rejection. Action: Set stringent trimming parameters or request re-sequencing.
Remember: Always investigate issues thoroughly and document your findings and decisions for future reference.
Trimming: When & Why
The Trimming Decision
Trimming removes low-quality bases from read ends, improving downstream analysis accuracy. However, over-trimming wastes good data and can reduce mapping efficiency.
Quality-Based Guidelines
  • Q > 25 across entire read: No trimming needed — your data is excellent
  • Q 20-25 range: Apply moderate trimming to remove poorest bases
  • Q < 20 consistently: Aggressive trimming required, or discard reads entirely
Recommended Tools
Trimmomatic: Java-based, highly configurable, handles paired-end data well
Cutadapt: Python-based, excellent for adapter removal
Fastp: Ultra-fast C++ implementation with automatic parameter selection
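Applying the quality-based guidelines with one of these tools might look like this fastp sketch; the sample filenames are placeholders, and the cutoffs simply mirror the Q20 guideline above:

```shell
# Paired-end trimming with fastp (filenames are placeholders).
R1="sample_R1.fastq.gz"
R2="sample_R2.fastq.gz"
QUAL=20        # per-base quality cutoff, per the guidelines above
MINLEN=50      # drop reads shorter than this after trimming

if command -v fastp >/dev/null 2>&1; then
    fastp --in1 "$R1" --in2 "$R2" \
          --out1 trimmed_R1.fastq.gz --out2 trimmed_R2.fastq.gz \
          --qualified_quality_phred "$QUAL" --length_required "$MINLEN" \
          --html fastp_report.html --json fastp_report.json
else
    echo "fastp not installed; would trim $R1 / $R2 at Q$QUAL"
fi
```

fastp also detects and removes adapters automatically for paired-end input, which covers the Adapter Content issue flagged by FastQC.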
Protocol
Complete QC Workflow
This systematic workflow ensures comprehensive quality assessment and improvement. Each step builds on the previous one, creating a documented trail of decisions. The verification step is critical — always re-run FastQC after trimming to confirm that it improved your data as expected. The final documentation step enables reproducibility and helps troubleshoot issues in future projects.
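As one hedged sketch of how the full loop might be wired together (directory names, the sample filenames, and the choice of fastp are all assumptions, not prescriptions from this workshop):

```shell
# QC -> trim -> re-QC -> aggregate, degrading gracefully if a tool is missing.
RAW="raw_reads"
QC_PRE="qc_pre"
QC_POST="qc_post"
mkdir -p "$QC_PRE" "$QC_POST" trimmed_reads

run_if_present() {
    # Run a tool only when it is installed, so the sketch stays runnable.
    command -v "$1" >/dev/null 2>&1 && "$@" || echo "skipped: $1 not found"
}

run_if_present fastqc  --outdir "$QC_PRE" "$RAW"/*.fastq.gz          # assess raw reads
run_if_present fastp   --in1 "$RAW/sample_R1.fastq.gz" --in2 "$RAW/sample_R2.fastq.gz" \
                       --out1 trimmed_reads/sample_R1.fastq.gz \
                       --out2 trimmed_reads/sample_R2.fastq.gz        # trim
run_if_present fastqc  --outdir "$QC_POST" trimmed_reads/*.fastq.gz   # verify trimming helped
run_if_present multiqc "$QC_PRE" "$QC_POST" --outdir multiqc_out      # one combined report
```

Comparing the pre- and post-trimming FastQC results inside the single MultiQC report is a convenient way to document that the trimming decision paid off.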