Friday, March 20, 2015

kw20_M3_C and hfqx_M0/M1 mixup

Take a look at this plot:

This is a principal component analysis (PCA) plot. It takes log-normalized counts and plots the samples on a 2D plane whose axes are the first two "principal components", the directions that capture the most variance in the data. The important thing to know is that similar samples cluster together in this plot. In this case, the different MIV timepoints cluster together (only MIV samples were used to make this plot). A few things stick out, though:
1) it looks like hfqx_M1 clusters strongly with the M0 samples, and hfqx_M0 clusters with the M1 samples
2) a similar thing is seen with a couple of the cmnx samples (replicate C)
3) there appear to be two additional samples that are outliers (kw20_M3_C and cmnx_M2_C)
4) interestingly, it seems that all the crpx samples (except M0) cluster independently of timepoint
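For anyone curious how a plot like this is built, here is a minimal, dependency-free sketch. It is not the code used to make the plot above. The sample names and counts are made-up toy values, log2(CPM + 1) is an assumed normalization, and the function names `log_normalize` and `pca_scores` are hypothetical. It gets the principal component scores from the sample-by-sample Gram matrix using power iteration.

```python
import math

def log_normalize(counts):
    """Convert raw counts per sample to log2(counts-per-million + 1)."""
    out = {}
    for sample, values in counts.items():
        total = sum(values)
        out[sample] = [math.log2(v / total * 1e6 + 1) for v in values]
    return out

def pca_scores(matrix, n_components=2, n_iter=500):
    """Return per-sample scores on the top principal components.

    matrix is a list of rows (samples) by columns (genes).
    """
    n, p = len(matrix), len(matrix[0])
    # Center each gene (column) on its mean across samples.
    means = [sum(row[j] for row in matrix) / n for j in range(p)]
    x = [[row[j] - means[j] for j in range(p)] for row in matrix]
    # Gram matrix G = X X^T; its eigenvectors scaled by sqrt(eigenvalue)
    # are the sample scores on each principal component.
    gram = [[sum(a * b for a, b in zip(x[i], x[k])) for k in range(n)]
            for i in range(n)]
    components = []
    for _ in range(n_components):
        v = [1.0 / (i + 1) for i in range(n)]  # arbitrary starting vector
        for _ in range(n_iter):
            w = [sum(gram[i][k] * v[k] for k in range(n)) for i in range(n)]
            norm = math.sqrt(sum(c * c for c in w))
            if norm == 0:
                break
            v = [c / norm for c in w]
        eigval = sum(v[i] * sum(gram[i][k] * v[k] for k in range(n))
                     for i in range(n))
        components.append([math.sqrt(max(eigval, 0.0)) * c for c in v])
        # Deflate so the next pass finds the next component.
        gram = [[gram[i][k] - eigval * v[i] * v[k] for k in range(n)]
                for i in range(n)]
    return [list(s) for s in zip(*components)]

# Toy counts (hypothetical): two M0-like and two M1-like samples, five genes.
toy_counts = {
    "M0_A": [100, 200, 50, 10, 400],
    "M0_B": [110, 190, 55, 12, 390],
    "M1_A": [10, 20, 500, 300, 40],
    "M1_B": [12, 25, 480, 310, 45],
}
normalized = log_normalize(toy_counts)
names = list(normalized)
scores = pca_scores([normalized[s] for s in names])
for name, (pc1, pc2) in zip(names, scores):
    print(f"{name}\tPC1={pc1:.2f}\tPC2={pc2:.2f}")
```

In the toy data the two M0-like samples land close together on PC1 and far from the two M1-like samples. A swapped sample would show up as a point sitting in the wrong group, which is the pattern described in the list above.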

For hfqx, I have suspected a sample swap for a long time now; my original differential expression analysis gave weird results, which were corrected upon switching the samples. As of writing, I have renamed these samples on my hard drive but nowhere else.

The cmnx replicate C samples appear to be swapped too. Perhaps it would be best to throw out these samples and use only the well-behaved ones if cmnx is analyzed in the future.

The light blue (M2) cmnx outlier is the bad sample that I have brought up a few times now. It has very low counts and poor alignment, and should be excluded from all analyses from now on.

The KW20 outlier is odd though.

In this heatmap we can see that hfqx_M0 and hfqx_M1 cluster in the wrong spots and would be corrected by a swap. The two swapped cmnx samples (green arrow) cluster in M0 and appear very similar to each other. We can also see that the ugly cmnx_M2_C sample and kw20_M3_C cluster separately at the top of the heatmap (purple arrow).

I went to the FastQC files I made a few months ago and looked at the kw20_M3_C sample:

I saw this terrifying graph. There are a few things wrong with it. Other samples have a nice, smooth bell-curve shape, but this plot is extremely jagged. Another problem is that the mean GC content is too high: the theoretical blue curve should peak closer to 40% for Haemophilus influenzae. Why do we see this?
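For reference, the quantity behind this kind of plot is just the GC percentage of each read, tallied into a histogram. Below is a minimal sketch of that calculation. The function names `gc_percent` and `gc_histogram` and the toy reads are hypothetical, and FastQC's own implementation may differ in its details.

```python
from collections import Counter

def gc_percent(seq):
    """GC percentage of one read, ignoring bases that are not A, C, G or T."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return 100.0 * sum(base in "GC" for base in called) / len(called)

def gc_histogram(reads):
    """Count reads at each GC percentage, rounded to the nearest whole percent."""
    return Counter(round(gc_percent(read)) for read in reads)

# Toy reads (hypothetical), not taken from kw20_M3_C.
toy_reads = ["ATGC", "AATT", "GGCC", "ATGCATGC"]
histogram = gc_histogram(toy_reads)
for percent in sorted(histogram):
    print(percent, histogram[percent])
```

A smooth histogram centered near the genome's GC content is what the other samples show. A handful of very abundant sequences with their own GC content would add sharp spikes and pull the mean away from it.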

Well, I checked for contamination, but about 98% of reads align to the KW20 genome, so that seems unlikely. Then I thought it could be a ribosomal RNA issue.

The FastQC file also reports overrepresented sequences. There are a lot for this sample. I BLASTed a few of them, in particular this one:


This 50 bp stretch is found in 73523 reads (1.4% of reads in the sample). As I expected, it aligned to the 16S rRNA gene in the KW20 genome. I looked at how many reads aligned to the 16S rRNA sequence and, in some regions, more than 1.5 million reads align to a given base in this sample. This is roughly 5x more than the average for all other samples. I'll repeat that for clarity: it looks like kw20_M3_C has roughly 5x more rRNA than the average sample. This sample also has the most 16S rRNA among all samples.
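As a quick sanity check on those numbers, here is a small sketch that counts the fraction of reads containing a given sequence, and works backwards from 73523 hits at 1.4% to the implied size of the sample. The function names and toy reads are hypothetical, and a plain substring match is an assumption rather than a description of how FastQC flags overrepresented sequences.

```python
def fraction_containing(reads, query):
    """Fraction of reads that contain the query as an exact substring."""
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(query in read for read in reads)
    return hits / len(reads)

def implied_total_reads(hit_count, hit_fraction):
    """Total read count implied by a hit count and the fraction it represents."""
    return hit_count / hit_fraction

# Toy reads (hypothetical): two of the four contain "ACGT".
toy_reads = ["ACGTACGT", "TTTTACGA", "GGACGTCC", "CCCCCCCC"]
toy_fraction = fraction_containing(toy_reads, "ACGT")
print(toy_fraction)

# Numbers from the FastQC report: 73523 reads make up 1.4% of the sample.
total = implied_total_reads(73523, 0.014)
print(f"implied total reads: about {total / 1e6:.2f} million")
```

Since the 1.4% is rounded, this only pins the sample down to somewhere a little over 5 million reads.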

In summary, it looks like this sample clusters away from all the others because a large portion of its reads are from rRNA. As with cmnx_M2_C, the genes in this sample have low counts, which might explain why the two cluster together. I don't know whether or not I should continue to use this sample. It will take a little more investigation.
