Sunday, January 11, 2015

An in-depth look at data quality (Part 2)



In this episode, we’re going to take a look at the FastQC test results. FastQC runs 10 tests on every .fastq file and gives the data a “PASS”, “FAIL”, or “WARN” based on some criteria. The FastQC manual where I get the information from can be found here. Let’s just jump right into it.

Note: Relevant document on the Google Drive: Scott>Quality Analysis>fastqc_extraction.sh
Raw FastQC output will be put in Scott>Quality Analysis>fastqc

Per base sequence quality
167 PASS, 0 WARN, 25 FAIL

Shows the median quality scorehttp://sensestrand.blogspot.ca/2015/01/basic-analysis-2-fastq-and-base-quality.html   at every base position. Fails if the lower quartile score for any base is less than 5 or if the median for any base is less than 20.

- All second read (R2) files from the old samples failed.
- sample antx_M2_C was the only new sample to fail, happened on the first read (R1)
- all failures happened because of a bad lower quartile score base in the last base position:


No significant problems here.


Per sequence quality scores
192 PASS, 0 WARN, 0 FAIL

Shows the distribution of the mean quality score for every read. Warning is given if the most frequently observed mean quality is less than 27 (0.2% error rate). All passed.



Per base sequence content
0 PASS, 0 WARN, 192 FAIL

Shows the proportion of each base (ATGC) at every position across all reads. Fails if difference between A and T or C and G is >20% at any position.

There’s clearly something weird happening at the beginning. This pattern is strongly consistent across all the sequences I looked at, even between old and new data. I think I understand why, just wait for the answer until after the next test. . .

Per base GC content
0 PASS, 0 WARN, 192 FAIL

Plots GC content for each base across all reads. Fails if any base differs from mean GC by more than 10%.



We see results here consistent with the previous test. Again, this graph is highly consistent, even between old and new data. On the Illumina FAQ, they state: http://support.illumina.com/faqs.html

“Why is GC high in the first few bases?
 It is normal to observe both a slight GC bias and a distinctly non-random base composition over the first 12 bases of the data. In genomic DNA sequencing, the base composition is usually quite uniform across all bases, but in mRNA-Seq, the base composition is noticeably uneven across the first 10 to 12 bases. We believe this effect is caused by the "not so random" nature of the random priming process used in the protocol. This might explain why there is a slight overall G/C bias in the starting positions of each read. The first 12 bases probably represent the sites that were being primed by the hexamers used in the random priming process. This is because the random priming is not truly random and the first 12 bases (the length of two hexamers) are biased towards sequences that prime more efficiently. This is normal and expected.”
This paper http://nar.oxfordjournals.org/content/38/12/e131 also discusses the issue.

Basically, we should expect these tests to fail because there’s a bias for GC sequences in the first 12 positions. In fact, seeing this (and seeing how it’s so consistent) should give us confidence in our data.

Per sequence GC content
15 PASS, 149 WARN, 28 FAIL

Graphs the distribution of GC content per read next to a theoretical distribution calculated from the data. Deviation from the theoretical distribution causes warning or failure.





The graphs above show what the typical results look like for PASS/WARN/FAIL.  There appears to be a set of GC rich reads (circled peaks) that creates a wider theoretical distribution in the WARN and FAIL samples. These could be unusually abundant GC-rich transcripts (PCR duplicates or just highly expressed genes) that are from H. influenzae or it could be contamination.

I pursued this idea and made the following graph:

 

This graph shows a couple things. One is that the samples with the best GC distribution were aligned best to the H. influenzae reference genome. The second thing is that the old samples tended to fail the test more than the new samples (and tend to not align as well). 45% of the old samples fail and only 5% of the new ones do. This supports the idea I offered in the previous post that the newer samples are less contaminated.

If contamination is causing the test to fail in most of the cases, I would say it’s not that big of a deal since more than 97% of the reads still align to the reference genome (meaning contamination probably represents less than 3% of the reads)

Per base N content
168 PASS, 24 WARN, 0 FAIL

An N is used in place of a nucleotide when Illumina cannot determine a base during sequencing. This test raises a warning if any base position in a read has more than 5% N’s. All samples with a warning are old samples and are the second read (R2), similar to the first test. As you may guess, the last nucleotide is the one failing the test – it appears that about 10-15% of the nucleotides in the last position are N’s. This is not surprising considering the base quality in this position is terrible.

Anyway, this is not a problem. Moving on. . .
Sequence Length Distribution
192 PASS, 0 WARN, 0 FAIL

This test makes sure all the reads are the same length and will fail if even one base is not the correct length. Since everything passed, everything should be the same length (which is 101 bases for both old and new samples).

Sequence Duplication Levels
0 PASS, 0 WARN, 192 FAIL

This counts the number of duplicated reads and will fail if more than 50% are duplicated. I have analyzed duplication in the previous post, all that needs to be said here is that because this is an RNA-seq dataset, we expect a lot of duplication so this result is unsurprising.

Overrepresented sequences
0 PASS, 186 WARN, 6 FAIL

I will address this one in the next post in this series.

Kmer Content
0 PASS, 185 WARN, 7 FAIL

Detects if there’s an enrichment of any particular 5-mer in the dataset. Due to bias in “random” priming and high duplication rate, it’s likely that many 5-mers are overrepresented. As such, I won’t take the time to look at this carefully. At least not at this point.

2 comments:

Rosie Redfield said...

Are the new sets of reads longer than the old ones? If so this could cause them to align better.

Rosie Redfield said...

I just checked the base compositions of H. influenzae's ribosomal RNAs: 16S: 51.9%; 23S: 49.8%.