Sunday, January 11, 2015

An in-depth look at data quality (Part 3)

In this part, I will look at sequences that are overrepresented according to FastQC. The software flags any 50-mer sequence that makes up more than 0.1% of the reads in the fastq files. If there is an abundance of rRNA sequences, or if there is a large, redundant contaminant, then this test will pick it up.
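To make the criterion concrete, here is a minimal sketch of the kind of count involved. It is not FastQC's actual code: it simply tallies the first 50 bases of every read and reports any sequence above 0.1% of the total, whereas FastQC's own module uses additional shortcuts, so the numbers will only approximate its report. The function names and the 4-line FASTQ layout are my assumptions.

```python
from collections import Counter


def read_fastq_sequences(path):
    """Yield the sequence line of each record, assuming plain 4-line FASTQ."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of every record holds the bases
                yield line.strip()


def overrepresented(reads, length=50, threshold=0.001):
    """Return (sequence, count, fraction) for read prefixes above threshold.

    length and threshold mirror the 50-mer / 0.1% rule described above;
    this is a simplified stand-in, not FastQC's implementation.
    """
    counts = Counter(read[:length] for read in reads)
    total = sum(counts.values())
    if total == 0:
        return []
    hits = [
        (seq, n, n / total)
        for seq, n in counts.items()
        if n / total > threshold
    ]
    # Most abundant sequences first
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Usage would look something like overrepresented(read_fastq_sequences("sample.fastq")), where "sample.fastq" is a placeholder file name.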

The file that stores the overrepresented sequences is on Google Drive: Scott>Quality Analysis>fastqc_overrepresented.xlsx

The first thing to note is that every overrepresented sequence is either a sequencing artifact or matches some region of the H. influenzae genome (i.e. there is no overrepresented contaminant).

Here is what the abundant sequences matched:

HI_0139 - outer membrane protein
Tended to be overrepresented in hfq, toxin, antitoxin and toxin/antitoxin knockouts in MIV at t=10 and t=30
Note: this is the only overrepresented gene that encodes a protein product

Between HI_1677-HI_1678 - RNase P
Overrepresented in every sample

Between HI_1281-HI_1282 - tmRNA
Overrepresented in most samples, no apparent pattern.

Between HI_0957-HI_0958 - C4 antisense RNA
Overrepresented in many samples (not a single hypercompetent sample though)
Note: the gene sits right next to CRP

Between HI_0857-HI_0858 - 6S RNA
Overrepresented in the old KW20 samples at t=100 in MIV (and one sxyx sample at the same timepoint)

Between HI_0631-HI_0632 - tRNA Thr
Not sure why only this tRNA showed up. Overrepresented in the old KW20 samples at t=100 in MIV.

The resulting hits tend to have the following characteristics:
-RNA-based function
-transcribed from a small gene
-would expect to be highly expressed

Intuitively, I suppose this makes sense: a highly transcribed gene has an abundance of transcripts floating around, and a smaller transcript gives rise to a smaller range of possible reads, so each individual read sequence from it turns up more often.
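A rough way to put numbers on that intuition is sketched below. It assumes reads start uniformly at every position along a transcript, which real libraries certainly do not do, so treat it as a toy model rather than a description of this dataset. The function names and the transcript lengths in the usage note are hypothetical.

```python
def possible_reads(transcript_length, read_length=50):
    """Number of distinct start positions for a read on one transcript."""
    positions = transcript_length - read_length + 1
    if positions < 1:
        raise ValueError("transcript is shorter than the read length")
    return positions


def per_sequence_fraction(transcript_share, transcript_length, read_length=50):
    """Fraction of all reads expected for any single read sequence.

    transcript_share is the fraction of all reads that come from this
    transcript; under the uniform-start assumption it is split evenly
    across the possible start positions.
    """
    return transcript_share / possible_reads(transcript_length, read_length)


def min_share_to_flag(transcript_length, read_length=50, threshold=0.001):
    """Share of all reads a transcript needs before one sequence passes 0.1%."""
    return threshold * possible_reads(transcript_length, read_length)
```

For example, comparing min_share_to_flag(80) with min_share_to_flag(1500) shows how much larger a share of the library a long transcript would need before any single one of its read sequences crossed the 0.1% line.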

Anyway, the take-home message of this is that nothing strange is seen here. There were no hits for rRNA or any organism outside of H. influenzae. This is good.

So, this wraps up my in-depth look at the data quality. Ignoring a particularly poor antitoxin knockout sample, I have yet to see anything that would suggest that the data quality is poor, which is great. My next goal is to verify that the mutant strains actually carry a mutation and that there is no mixup between timepoints (although I have some evidence this is the case for at least a pair of samples).

1 comment:

Rosie Redfield said...

I don't understand this analysis. Overrepresented according to what criterion? Overrepresented relative to what expectation?