I’ve talked
about FastQC in a previous post but now I’ve actually use it to analyze our data. We want to make sure the data
we have is of sufficient quality before we do anything with it, so it’s
important to do this analysis early.
Note: I put all relevant documents on the Google Drive
in the folder Scott>Quality Analysis
Here’s what
I did:
1)
Run FastQC on all .fastq files (96 x 2 reads per
sample = 192 files)
2)
Run Sambamba
to align files to reference genome (to be discussed in future post)
3)
Run Sambamba
to extract alignment statistics
4)
Made a script that runs in the terminal to
extract important statistics from the output of 1) and 3) (uploaded to the
drive as fastqc_extraction.sh)
The output
of my script is called fastq_results.xlsx on the drive. On the first worksheet
of this file I have listed useful values and whether or not a sample passes or
fails the standard FastQC tests. The second worksheet is just a list of every
sequence FastQC has labelled as “overrepresented”.
The first
thing I did was compare some numbers between the old and new RNA seq datasets
in R. The script that makes this plot is called quality_analysis_script.R (and requires
the script multiplot.R)
Number of reads (i.e. number of DNA
fragments sequenced in each sample)
The graph
suggests that the older samples tend to have more reads. Older samples have, on
average, 1.5 million more reads. A t-test says that there is a significant
difference with a p-value of 1e-08. Although 1.5 million sounds like a lot,
keep in mind that some of the older samples differ by close to 6 million reads.
The top 4
files for reads are:
toxx_M0_A.R1.fq.gz
– 10651839
toxx_M0_A.R2.fq.gz
– 10651839
antx_M0_A.R1.fq.gz
– 11311458
antx_M0_A.R2.fq.gz
– 11311458
The bottom 4
files are:
antx_M2_C.R1.fq.gz – 1781164
antx_M2_C.R2.fq.gz – 1781164
antx_M2_E.R1.fq.gz
– 4035388
antx_M2_E.R2.fq.gz
– 4035388
Number of duplicates (i.e. reads that do not
have a unique sequence in the sample)
The graph
suggests that the newer samples tend to have more duplicated reads. Newer
samples have, on average, 4% more duplicates. A t-test says that there is a
significant difference with a p-value of 4e-11.
Duplicate
reads are caused by sequencing two PCR products derived from the same fragment
(technical duplicates) or by sequencing the same region of two distinct RNA
molecules (biological duplicates). Technical duplicates are unwanted whereas
biological duplicates are pretty much the basis of differential expression
analysis. It is impossible to discriminate the two types of duplicates with the data we have.
So it looks
like most samples fall above 80% duplicates – which sounds like a lot (only 20%
of reads are actually unique). However, remember that only one of the two paired
end reads is actually being analyzed here so all reads that begin at the same nucleotide
are considered duplicates. Because we’re doing RNA seq here, most reads come
from the transcriptome (a subset of the whole genome) and in some samples,
there are 3x more reads than nucleotides in the genome; extremely high duplication
(80-90%) is reasonable.
The top 4
files for duplicated % are:
kw20_M3_C.R1.fq.gz
– 89.1
toxx_M1_D.R1.fq.gz
– 88.6
kw20_M3_C.R2.fq.gz
– 88.0
toxx_M1_D.R2.fq.gz
– 87.9
The bottom 4
files are:
kw20_M0_B.R2.fq.gz
– 75.8
sxyx_M1_B.R2.fq.gz
– 75.5
antx_M2_C.R1.fq.gz – 60.4
antx_M2_C.R2.fq.gz – 58.7
Note the
bottom two. They have a lot of contaminated sequences, so it’s not surprising that
almost half of sequences are unique.
GC Percent (i.e. % of nucleotides that are G
or C)
There is on
average 1% more GC in the old data than the new (p=2e-08). H. influenzae has a GC % of 38% and Rosie has claimed that genes
tend to be higher. It seems that the older dataset is much more variable
(except for 4, 2 of which have been shown to be contaminated so far)
The top 4
files for GC % are:
kw20_M3_C.R1.fq.gz
– 46
kw20_M3_C.R2.fq.gz
– 45
antx_M2_C.R1.fq. gz – 45
antx_M2_C.R2.fq.gz – 45
The bottom 4
files are:
Most files
are lowest at 40%
% Alignment (i.e. % of reads that can be
mapped back to the H. influenzae genome)
Note that
for antx_M2_C, only 20% reads aligned to the reference genome. This sample was not included in the graph. On
average, reads from the new samples get aligned better than reads from the old
sample by 0.5%. This is significant (p=1e-10) once the poorly aligned sample is
removed.
This means
that the newer samples could be purer or the quality of sequencing was higher
in the new sample (base calling errors make it harder to align back to a
reference sequence) or maybe something else I haven’t thought of.
The top 4
files for GC % are:
sxy1_B2_G.R1.fq.gz
– 99.42
sxy1_B2_G.R2.fq.gz
– 99.42
sxyx_M0_D.R1.fq.gz
– 99.43
sxyx_M0_D.R2.fq.gz
– 99.43
The bottom 4
files are:
antx_M2_C.R1.fq.gz – 20.1
antx_M2_C.R2.fq.gz – 20.1
toxx_M2_A.R1.fq.gz
– 97.1
toxx_M2_A.R2.fq.gz
– 97.1
Overall:
-The old
data and the new data are different.
Statistics can tell me that, however I don't see any major problem. Sample antx_M2_C is what I would expect a problem would look
like. When samples are prepared by different people on different days, there is
bound to be technical variation.
- Sample antx_M2_C is bad. I feel it may cause problems to keep
in in any future analysis steps (but we can try anyway)
- Looking at
the graphs, my prediction is that the newer data is either less contaminated or contains
more PCR duplicates, because the graphs suggest that there is more H. influenzae DNA in the newer sample
(better alignment, GC content has low
variability)
So far, this
analysis suggests that the new data is probably as good as (if not better) than the old
data. In the next post I will look at the FastQC test results.
3 comments:
You wrote "Note that one sample was aligned to only 20% of the genome and is not on the graph." This doesn't make any biological sense and it doesn't fit its context either, so I think you probably mean that 20% of the reads aligned to the genome.
If this is the correct interpretation then we shouldn't discard the data without more measures of whether or not it will be informative.
I don't think we know the 'expected value' for %G+C of the pooled mRNAs.
1)The way I worded the "aligned to only 20% of the genome" was indeed awkward, and I've updated the post to correct it.
2)I'm not discarding that sample (yet), I just didn't include it in that graph because it caused scaling problems to the graph as a whole.
3)You're right, we don't know what the GC% of the transcriptome is. Intuitively, 40% is what I would expect but I realize now I really have no evidence to support that.
Post a Comment