Thursday, April 9, 2015

What's wrong with the RNAseq reads for sample kw20_M3_C?

In his March 20 post, Scott pointed out that the kw20_M3_C sample appears to be very odd.  This is a replicate of samples kw20_M3_A and kw20_M3_B; they are all the wildtype strain KW20 after 100 minutes in the MIV competence induction medium.

The 'C' replicate doesn't cluster with the others in Scott's principal component analysis (see figures in Scott's post), and its quality control analysis gave very weird results, with sharp spikes of reads with much higher G+C than expected for H. influenzae.   But 98% of the reads aligned to the H. influenzae genome, so these high-GC reads aren't due to contamination. Instead they must be very high coverage of GC-rich parts of the genome.


The red line shows the distribution of %G+C in the reads of this sample.  The genome is 38% G+C, and from the QC analysis of the other samples we expect to see a smooth red bell curve peaking at about 40-41%, with some minor spikes at the high G+C end.  Instead we see a series of jagged peaks, one at 38%, one at 51% and one at 55%, with a smaller peak at 60%.  From the heights and widths of the peaks I'd estimate that about half of the reads have anomalously high %G+C.

Scott decided that many of the reads must be ribosomal RNA, but further analysis suggests that they are other abundant high-G+C noncoding RNAs, not rRNAs.

Many of the reads in the 'Overrepresented sequences' list are 'transfer-messenger RNA' (tmRNA).  This ~350 nt RNA has important functions in translation.  The gene is at 1357929-1357564; and looking at the nucleotide coverage of this region in our RNAseq sample (shows that more than 2,000,000 of the reads are of this gene.  Even more reads are from RNase P.  This 360 nt RNA ribozyme's gene is at 1746450-1746790: Coverage over most of this region is more than 1,000,000, so I estimate that about 3,000,000 of our reads are of RNase P.  This analysis would suggest that about 50% of the ~10,000,000 reads from this sample are of RNase P and tmRNA. 

To confirm this I directly checked whether half the reads in the sample do have anomalously high %G+C.  I pulled out the first 28 reads from the original fastq file and directlychecked their %G+C.  Thirteen of these have the low G+C (<41 expected="" i="" most="" of="">H. influenzae
mRNAs.  The rest have a range of very high G+C (47.5-59.3%), and BLAST tells me that all but three of these reads are either RNase P or tmRNA.  One of the others is 16S rRNA, and the other two are mRNAs that are also abundant in the replicate samples.  This suggests that about half of the reads are indeed from RNase P and tmRNA.

Why would one sample have this problem?  Perhaps it's just that much of the mRNA was degraded to fragments so small that they were lost in the library creation step?  RNase P and tmRNA molecules might have survived because of their highly folded structures.  To test this hypothesis, the Co-op tech is looking through the RNAseq notebook to find the gel photos that go with these RNA preps.

2 comments:

Scott M. said...

The degradation hypothesis is reasonable. Would this suggest that most RNAs that are not RNAse P and tmRNA are short? I could verify this.

Do we have bio-analyzer results for this sample?

Rosie Redfield said...

It might suggest that this sample would have more short reads tht the other samples.

I don't trust any of our Bioanalyzer results.