In this
episode, we’re going to take a look at the FastQC test results. FastQC runs 10
tests on every .fastq file and gives the data a “PASS”, “FAIL”, or “WARN” based
on some criteria. The FastQC manual
where I get the information from can be found here. Let’s just jump right into
it.
Note: Relevant document on the Google Drive:
Scott>Quality Analysis>fastqc_extraction.sh
Raw FastQC output will be put in Scott>Quality Analysis>fastqc
Per base sequence quality
167 PASS, 0
WARN, 25 FAIL
Shows the
median quality scorehttp://sensestrand.blogspot.ca/2015/01/basic-analysis-2-fastq-and-base-quality.html at every base position. Fails if the lower
quartile score for any base is less than 5 or if the median for any base is
less than 20.
- All second
read (R2) files from the old samples failed.
- sample
antx_M2_C was the only new sample to fail, happened on the first read (R1)
- all
failures happened because of a bad lower quartile score base in the last base
position:
Per
sequence quality scores
192 PASS, 0
WARN, 0 FAIL
Shows the
distribution of the mean quality score for every read. Warning is given if the
most frequently observed mean quality is less than 27 (0.2% error rate). All
passed.
Per
base sequence content
0 PASS, 0
WARN, 192 FAIL
Shows the
proportion of each base (ATGC) at every position across all reads. Fails if
difference between A and T or C and G is >20% at any position.
There’s
clearly something weird happening at the beginning. This pattern is strongly
consistent across all the sequences I looked at, even between old and new data.
I think I understand why, just wait for the answer until after the next test. .
.
Per
base GC content
0 PASS, 0
WARN, 192 FAIL
Plots GC
content for each base across all reads. Fails if any base differs from mean GC
by more than 10%.
We see
results here consistent with the previous test. Again, this graph is highly
consistent, even between old and new data. On the Illumina FAQ, they state: http://support.illumina.com/faqs.html
“Why is GC high in the first few bases?
It is
normal to observe both a slight GC bias and a distinctly non-random base
composition over the first 12 bases of the data. In genomic DNA sequencing, the
base composition is usually quite uniform across all bases, but in mRNA-Seq, the base composition is
noticeably uneven across the first 10 to 12 bases. We believe this effect is caused by the "not so random"
nature of the random priming process used in the protocol. This might
explain why there is a slight overall G/C bias in the starting positions of
each read. The first 12 bases probably represent the sites that were being
primed by the hexamers used in the random priming process. This is because the
random priming is not truly random and the first 12 bases (the length of two
hexamers) are biased towards sequences
that prime more efficiently. This is normal and expected.”
This paper http://nar.oxfordjournals.org/content/38/12/e131
also discusses the issue.
Basically,
we should expect these tests to fail because there’s a bias for GC sequences in
the first 12 positions. In fact, seeing this (and seeing how it’s so
consistent) should give us confidence in our data.
Per
sequence GC content
15 PASS, 149
WARN, 28 FAIL
Graphs the
distribution of GC content per read next to a theoretical distribution
calculated from the data. Deviation from the theoretical distribution causes
warning or failure.
The graphs
above show what the typical results look like for PASS/WARN/FAIL. There appears to be a set of GC rich reads
(circled peaks) that creates a wider theoretical distribution in the WARN and
FAIL samples. These could be unusually abundant GC-rich transcripts (PCR
duplicates or just highly expressed genes) that are from H. influenzae or it could be contamination.
I pursued
this idea and made the following graph:
This graph
shows a couple things. One is that the samples with the best GC distribution
were aligned best to the H. influenzae
reference genome. The second thing is that the old samples tended to fail the
test more than the new samples (and tend to not align as well). 45% of the old
samples fail and only 5% of the new ones do. This supports the idea I offered
in the previous post that the newer samples are less contaminated.
If
contamination is causing the test to fail in most of the cases, I would say
it’s not that big of a deal since more than 97% of the reads still align to the
reference genome (meaning contamination probably represents less than 3% of the
reads)
Per
base N content
168 PASS, 24
WARN, 0 FAIL
An N is used
in place of a nucleotide when Illumina cannot determine a base during
sequencing. This test raises a warning if any base position in a read has more
than 5% N’s. All samples with a warning are old samples and are the second read
(R2), similar to the first test. As you may guess, the last nucleotide is the
one failing the test – it appears that about 10-15% of the nucleotides in the
last position are N’s. This is not surprising considering the base quality in
this position is terrible.
Anyway, this
is not a problem. Moving on. . .
Sequence
Length Distribution
192 PASS, 0
WARN, 0 FAIL
This test
makes sure all the reads are the same length and will fail if even one base is not the correct length. Since everything passed, everything should be the same length (which is 101 bases for both old and new samples).
Sequence
Duplication Levels
0 PASS, 0
WARN, 192 FAIL
This counts
the number of duplicated reads and will fail if more than 50% are duplicated. I
have analyzed duplication in the previous post, all that needs to be said here
is that because this is an RNA-seq dataset, we expect a lot of duplication so
this result is unsurprising.
Overrepresented
sequences
0 PASS, 186
WARN, 6 FAIL
I will
address this one in the next post in this series.
Kmer
Content
0 PASS, 185
WARN, 7 FAIL
Detects if
there’s an enrichment of any particular 5-mer in the dataset. Due to bias in “random”
priming and high duplication rate, it’s likely that many 5-mers are
overrepresented. As such, I won’t take the time to look at this carefully. At
least not at this point.
2 comments:
Are the new sets of reads longer than the old ones? If so this could cause them to align better.
I just checked the base compositions of H. influenzae's ribosomal RNAs: 16S: 51.9%; 23S: 49.8%.
Post a Comment