Friday, March 20, 2015

Planning more RNAseq runs

We need to decide what additional RNAseq runs we're going to do, given the papers we'd like to write and the limited resources we have.  This post attempts to clarify the issues, following on a long discussion with Josh about this yesterday afternoon.

Here's a screenshot of the latest version of the table showing both what we've done and the runs we might still do.

We anticipate using this data in several papers:

1. A paper about the toxin/antitoxin genes (authors Hailey, Rosie, Josh, Sunita, Scott?):  This paper will consider the gene expression data, the phylogeny, and the phenotype of knockouts in H. influenzae and A. pleuropneumoniae.  For it we will directly need three MIV time-course replicates of KW20, and the toxin, antitoxin and double knockout strains (toxx, antx, taxx in the table above).  We will also need to have identified the set of Sxy-dependent competence-induced genes, which will use the three replicates of the sxy-knockout time-course (sxyx).  We can control for any effects of the spcR cassettes in the 'marked' strains by analyzing data we have.

We therefore need to still do:

  • 2 replicates of the antx (unmarked) MIV time-courses to replace those that t=urned out to be the comN knockout (cmnx).
  • At least 1 replicate of the toxx (unmarked) MIV time course, to complement the 1 unmarked and 2 marked we already have.

2. A paper about the role of hfq in competence (authors Scott, Rosie, Josh, Sunita?):  This paper will consider the phenotype of the hfq knockout and its effect on gene expression and cAMP levels.  For it we will need the three KW20 MIV time courses, the 3 sxyx time courses to identify Sxy-dependent genes, and 2 crpx and hfqx MIV time-courses. We need to sort out ow cAMP levels change - Scott has done beta-gal assays. We may also need to identify the small RNAs that Hfq might interact with, unless this information is available from published hfq or other papers.

We therefore still need to do:

  • At least 1 replicate of the hfqx MIV time-course
  • At least 1 replicate of the crpx MIV time-course
  • Maybe 2 replicates of one or more small-RNA samples (RNAs prepared with Trizol rather than with Qiagen RNeasy columns).

3. A paper about the rpoD hypercompetence mutation (authors Lauri, Rosie, Josh?): This paper will report the identification of a hypercompetence-causing point mutation in rpoD (maybe more mutations if Lauri's work pans out).  In the rpoD mutant in sBHI, sxy RNA is up 1.6-fold at B1, 3.3-fold at B2, and 1.8-fold at B3. We will definitely need another replicate of these samples.  Lauri's analysis of sxy-beta-gal promoter fusions also found that sxy transcription is up in this mutant.  There aren't very many dramatic changes in other genes, and without any replication it's very hard to tell what might be significant.

We need:

  • at least one replicate of the rpod sBHI samples

4. A paper about identification of HI1457/HI1456 as a competence-induced operon. (authors ???; this result might be in the same paper as 3): These RNAs are up strongly in all three hypercompetence mutants in sBHI, and 1.6-2-fold in wildtype cells in MIV.  HI1457 (291 aa) is tagged as 'opacity protein' and HI1456 (179 aa) as 'hypothetical protein'. (Note to self: this is not Hanna Tomb's por gene (dsbA homolog? HI0846).)  Are they CRP-S regulated?  They are way down (8-12-fold) in all the crpx MIV time-course samples, including the M0 sample, which is OD=0.2 in sBHI.  But, oddly, they are down only 1.6-2-fold in the other crpx samples in sBHI (OD=0.02 and OD=0.6) and in the sxy knockout in sBHI and MIV.

5. A paper about regulation of competence (or as part of one of the other papers):  Levels of sxy RNA are up 2-3 fold (highly significant) in the sxy1 hypercompetent mutant samples in sBHI (1.8-fold in B1, 3.1-fold in B2, and 2.7-fold in B3).  This is unexpected under the model that these mutations just affect translatability.  In the KW20 MIV time-courses, sxy RNA is up 11.9-fold at M1, 31.2-fold at M2, and 5.1-fold at M3.

6. A paper about the murE749 hypercompetence mutant: (or as part of one of the other papers):  In the murE749 mutant in sBHI, sxy RNA is up 2.2-fold at B1, 3.1-fold at B2, and 2.1-fold at B3.  These increases aren't significantly higher than those in the rpoD mutant, but the increase in competence is much more dramatic (100-fold more than rpoD753, 1000-or-more higher than KW20).  So the difference in competence between murE749 and rpoD753 must be due to something other than the increase in sxy RNA.  We have two replicates of the sBHI samples - another would be nice but isn't a very high priority.

kw20_M3_C and hfqx_M0/M1 mixup

Take a look at this plot:

This is a principal component analysis (PCA) plot. It takes log normalized counts and plots samples on a 2D plane where each axis is a "principal component" that explains as much variance in the data as possible. The important thing to know is that similar samples will cluster in this plot. We see the different MIV timepoints clustering together in this case (only MIV samples were used to make this plot). A few things stick out though:
1) it looks like hfqx_M1 clusters strongly with the M0 samples; hfqxM0 clusters with M1 samples
2) similar thing seen with a couple of the cmnx samples (replicate C)
3) there appears to be two additional samples that are outliers (kw20_M3_C and cmnx_M2_C)
4) interestingly, it seems that all the crpx samples (except M0) cluster independent of timepoint

For hfqx, I have suspected a sample swap for a long time now; my original differential expression analysis had weird results and were corrected upon switching the samples. As of writing, I have renamed these samples on my harddrive but no where else.

The cmnx appears to be swapped too. Perhaps it would be best to throw out these samples and use the well-behaved ones if cmnx should be analyzed in the future.

The light blue (M2) cmnx outlier is the bad sample that I have brought up a few times now. This sample has very low counts and poor alignment. This sample should be thrown out from all analyses from now on.

The KW20 outlier is odd though.

Here we can see that the hfqx_M0 and  hfqx_M1 cluster in the wrong spots and would be corrected with a swap. The two cmnx samples (green arrow) that are swapped cluster in M0 and appear very similar to each other. We can also see that the ugly cmnx_M2_C sample and kw20_M3_C cluster separately at the top of the heatmap (purple arrow).

I went to the FastQC files I made a few months ago and looked at the kw20_M3_C sample:

I saw this terrifying graph. There's a few things wrong with it. Other samples have a nice, smooth bell-curve shape, however this plot is extremely jagged. Another problem is that the mean GC content is too high. The theoretical blue curve should peak closer to 40% for Haemophilus influenzae. Why do we see this?

Well, I looked to see if there was some contamination but some 98% of reads align to the KW20 genome. Then I thought it could be a ribosomal issue.

The FastQC file also reports overrepresented sequences. There's a lot for this sample. I blasted a few of them, in particular, this one:


This 50bp stretch is found in 73523 reads (1.4% of reads in the sample). To my expectation, it aligned to the 16s rRNA gene in the KW20 genome. I looked at how many reads aligned to the 16s rRNA sequence and in some regions, more than 1.5 million reads are aligning to a given base in this sample. This is roughly 5x more than the average for all other samples. I'll repeat that for clarity, it looks like KW20_M3_C has roughly 5x more rRNA than the average sample. This sample also has the most 16s rRNA among all samples.

In summary, it looks like this sample is clustering away from all other samples because a large portion of the reads are from rRNA. Similar to cmnx_M2_C, the genes in these samples have low counts and that might explain why they cluster together. I don't know whether or not I should continue to use this sample. It will take a little bit more investigation.

Thursday, March 19, 2015

Does having a spec cassette make a difference?

I was asked to check if the strains with the spec cassette have anything unusual in terms of gene expression.

First I needed to determine which samples had the spec cassette. To do this, I ran a grep command on the FASTQ files of may samples and looked for the string "GGGGATCCGTCGACCTGCAGTTCGAAG" which is a sequence in the spec cassette.

gunzip -cd filename.fq.gz | grep GGGGATCCGTCGACCTGCAGTTCGAAG

The following samples had reads that match this string:

The following samples did not have reads that match this string:
everything else

I checked every sample for strains in red and they were all consistent within a given strain. I only checked a single condition for everything else and not a single match was found in these.

For these marked strains, I pulled out all the genes that were significantly different compared to KW20 under the same condition. (ie. cmnx_M0 vs kw20_M0, antx_M1 vs kw20_M1, etc). Among these 12 samples, there was a total of 222 differentially expressed genes compared to KW20 (190 unique genes)


Only one gene was found differentially expressed in more than 3 samples. HI_0658 (ATP-binding cassette, subfamily F, member 3) was found differentially expressed in 4 (33%) of the samples, however all four samples were antx so this is not a spec-specific effect.

A few genes (~10 in total) show up in 3 samples. HI_0062 (dnaK suppressor protein, ~3x downregulated) and HI_0235 (alternative ribosome-rescue factor, ~5x downregulation) are differentially expressed compared to KW20 in cmnx_M3, antx_M3 and toxx_M3.

However, looking at another comparison, sxyx_M3 vs KW20_M3 (chosen at random), I see the same behaviour for these two genes. Perhaps it could be due some skewing of the data due to the problematic KW20_M3_C replicate that Rosie mentioned in the previous post.

A operon containing ribosomal protein HI_0776-779 (~2.5x upregulated) is differentially expressed compared to KW20 in cmnx_M1, antx_M2 and toxx_M2.

In summary, if there is any effect caused by having a spec cassette, is it not strong enough to really alter the set of differentially expressed genes.

Is something off about the KW20_M3_C sample

Josh points out that, in Lauri's analysis, the KW20_M3_C profile is an outlier from all the other samples (its squares are all pale blue, and its location is far from the other two replicates).

We need to find out what's going on here before doing further analysis.