Monday, April 13, 2015
Checking strains, growing cultures
I think this post really belongs on RRResearch, so that's where you'll find it. http://rrresearch.fieldofscience.com/2015/04/new-rna-seq-work.html
Thursday, April 9, 2015
What's wrong with the RNAseq reads for sample kw20_M3_C?
In his March 20 post, Scott pointed out that the kw20_M3_C sample appears to be very odd. This is a replicate of samples kw20_M3_A and kw20_M3_B; they are all the wildtype strain KW20 after 100 minutes in the MIV competence induction medium.
The 'C' replicate doesn't cluster with the others in Scott's principal component analysis (see figures in Scott's post), and its quality control analysis gave very weird results, with sharp spikes of reads with much higher G+C than expected for H. influenzae. But 98% of the reads aligned to the H. influenzae genome, so these high-GC reads aren't due to contamination. Instead they must be very high coverage of GC-rich parts of the genome.
The red line shows the distribution of %G+C in the reads of this sample. The genome is 38% G+C, and from the QC analysis of the other samples we expect to see a smooth red bell curve peaking at about 40-41%, with some minor spikes at the high G+C end. Instead we see a series of jagged peaks, one at 38%, one at 51% and one at 55%, with a smaller peak at 60%. From the heights and widths of the peaks I'd estimate that about half of the reads have anomalously high %G+C.
Scott decided that many of the reads must be ribosomal RNA, but further analysis suggests that they are other abundant high-G+C noncoding RNAs, not rRNAs.
Many of the reads in the 'Overrepresented sequences' list are 'transfer-messenger RNA' (tmRNA). This ~350 nt RNA has important functions in translation. The gene is at 1357929-1357564; and looking at the nucleotide coverage of this region in our RNAseq sample (shows that more than 2,000,000 of the reads are of this gene. Even more reads are from RNase P. This 360 nt RNA ribozyme's gene is at 1746450-1746790: Coverage over most of this region is more than 1,000,000, so I estimate that about 3,000,000 of our reads are of RNase P. This analysis would suggest that about 50% of the ~10,000,000 reads from this sample are of RNase P and tmRNA.
To confirm this I directly checked whether half the reads in the sample do have anomalously high %G+C. I pulled out the first 28 reads from the original fastq file and directlychecked their %G+C. Thirteen of these have the low G+C (<41 expected="" i="" most="" of="">H. influenzae41>
mRNAs. The rest have a range of very high G+C (47.5-59.3%), and BLAST tells me that all but three of these reads are either RNase P or tmRNA. One of the others is 16S rRNA, and the other two are mRNAs that are also abundant in the replicate samples. This suggests that about half of the reads are indeed from RNase P and tmRNA.
Why would one sample have this problem? Perhaps it's just that much of the mRNA was degraded to fragments so small that they were lost in the library creation step? RNase P and tmRNA molecules might have survived because of their highly folded structures. To test this hypothesis, the Co-op tech is looking through the RNAseq notebook to find the gel photos that go with these RNA preps.
The 'C' replicate doesn't cluster with the others in Scott's principal component analysis (see figures in Scott's post), and its quality control analysis gave very weird results, with sharp spikes of reads with much higher G+C than expected for H. influenzae. But 98% of the reads aligned to the H. influenzae genome, so these high-GC reads aren't due to contamination. Instead they must be very high coverage of GC-rich parts of the genome.
The red line shows the distribution of %G+C in the reads of this sample. The genome is 38% G+C, and from the QC analysis of the other samples we expect to see a smooth red bell curve peaking at about 40-41%, with some minor spikes at the high G+C end. Instead we see a series of jagged peaks, one at 38%, one at 51% and one at 55%, with a smaller peak at 60%. From the heights and widths of the peaks I'd estimate that about half of the reads have anomalously high %G+C.
Scott decided that many of the reads must be ribosomal RNA, but further analysis suggests that they are other abundant high-G+C noncoding RNAs, not rRNAs.
Many of the reads in the 'Overrepresented sequences' list are 'transfer-messenger RNA' (tmRNA). This ~350 nt RNA has important functions in translation. The gene is at 1357929-1357564; and looking at the nucleotide coverage of this region in our RNAseq sample (shows that more than 2,000,000 of the reads are of this gene. Even more reads are from RNase P. This 360 nt RNA ribozyme's gene is at 1746450-1746790: Coverage over most of this region is more than 1,000,000, so I estimate that about 3,000,000 of our reads are of RNase P. This analysis would suggest that about 50% of the ~10,000,000 reads from this sample are of RNase P and tmRNA.
To confirm this I directly checked whether half the reads in the sample do have anomalously high %G+C. I pulled out the first 28 reads from the original fastq file and directlychecked their %G+C. Thirteen of these have the low G+C (<41 expected="" i="" most="" of="">H. influenzae41>
mRNAs. The rest have a range of very high G+C (47.5-59.3%), and BLAST tells me that all but three of these reads are either RNase P or tmRNA. One of the others is 16S rRNA, and the other two are mRNAs that are also abundant in the replicate samples. This suggests that about half of the reads are indeed from RNase P and tmRNA.
Why would one sample have this problem? Perhaps it's just that much of the mRNA was degraded to fragments so small that they were lost in the library creation step? RNase P and tmRNA molecules might have survived because of their highly folded structures. To test this hypothesis, the Co-op tech is looking through the RNAseq notebook to find the gel photos that go with these RNA preps.
Tuesday, April 7, 2015
Back to planning the new RNA-seq samples
I just taught my last class of the term, so now I can get back to experiments. The highest priority is preparing the new samples for RNA-seq analysis. The preliminary planning is described in the previous post, which has a table summarizing both the runs we've done and the provisional plans. Here it is again.
Second step: Streak out all the strains. Make especially sure that the antx and toxx strains are the unmarked ones, by checking their StrR and competence phenotypes. In fact, check the StrR and competence phenotypes of everything. Maybe use PCR to check the genotypes?
Third step: Inoculate an overnight culture of each strains, Grow each strain as specified and collect RNA-protect treated cell pellets (store at -80 °C).
Fourth step: Purify RNA from all samples except the 'kw20 Trizol' ones using the Qiagen RNeasy kit. Check nucleic acid concentration using the nanodrop.
Fifth step: Purify RNA from the 'kw20 Trizol' samples using Trizol. Check nucleic acid concentration using the nanodrop.
Sixth step: Treat each RNA sample with DNase ('DNA-free'?). Repurify using the Quigen RNeasy cleanup kit. Check RNA concentration using the nanodrop. If we have more than 24 samples, decide which ones to not sequence.
Seventh step: Give samples to Sunita (?), who will treat them with Ribo-zero using half as much RNA and 1/4 as much Ribo-zero as recommended (thus using only a 6-sample kit rather than a 24-sample kit). (Or maybe we will do this step ourselves.)
Eighth step: Sunita will prepare these samples for sequencing.
So, first step.
Which samples to collect? Has anything changed? Well, before the sabbatical visitor left she did isolate a new rpoD mutation causing hypercompetence. (Here I'll refer to it as rpo2). Should we include it in the RNA-seq samples? If we do, we'll have to cut out something else. Its competence phenotype (from a sBHI tiome course) is very similar to that of the rpoD mutant we are including, so I think we won't include this one.
What do we need to order?
The tech is streaking out all the strains on plain sBHI plates. We'll then restreak them on antibiotic plates to check their resistances, and confirm their transformation phenotypes when we grow them for the RNA preps.
Strains:
The plan is to get the RNA samples for these TO DO runs finished this month, before our Co-op tech finishes her work term and leaves us.
First step: Decide what samples to collect and check what supplies we'll need to order (RNA-Protect, RNeasy kit, Trizol, DNA-free, RNeasy MinElute cleanup kit, Ribo-zero?). We may collect a few more samples than the 24 we will be able to sequence.]
Second step: Streak out all the strains. Make especially sure that the antx and toxx strains are the unmarked ones, by checking their StrR and competence phenotypes. In fact, check the StrR and competence phenotypes of everything. Maybe use PCR to check the genotypes?
Third step: Inoculate an overnight culture of each strains, Grow each strain as specified and collect RNA-protect treated cell pellets (store at -80 °C).
Fourth step: Purify RNA from all samples except the 'kw20 Trizol' ones using the Qiagen RNeasy kit. Check nucleic acid concentration using the nanodrop.
Fifth step: Purify RNA from the 'kw20 Trizol' samples using Trizol. Check nucleic acid concentration using the nanodrop.
Sixth step: Treat each RNA sample with DNase ('DNA-free'?). Repurify using the Quigen RNeasy cleanup kit. Check RNA concentration using the nanodrop. If we have more than 24 samples, decide which ones to not sequence.
Seventh step: Give samples to Sunita (?), who will treat them with Ribo-zero using half as much RNA and 1/4 as much Ribo-zero as recommended (thus using only a 6-sample kit rather than a 24-sample kit). (Or maybe we will do this step ourselves.)
Eighth step: Sunita will prepare these samples for sequencing.
Which samples to collect? Has anything changed? Well, before the sabbatical visitor left she did isolate a new rpoD mutation causing hypercompetence. (Here I'll refer to it as rpo2). Should we include it in the RNA-seq samples? If we do, we'll have to cut out something else. Its competence phenotype (from a sBHI tiome course) is very similar to that of the rpoD mutant we are including, so I think we won't include this one.
What do we need to order?
- RNA-Protect? We have 86 ml of RNA-Protect; this is enough for 21.5 2 ml samples. Rather than buying more RNA-Protect ($278 for 200 ml), we'll just reduce our sample sizes slightly (1.6 ml) so we have enough for all of them. This should still give us plenty of RNA.
- Qiagen RNeasy purification kit: We have a nearly complete kit with 47 of its 50 columns left.
- Trizol: We have a nearly full bottle of Trizol, though it's at least several years old. It's been stored in the dark at 4 °C so it's probbly still OK. A new bottle would cost about $230, so if we decide ours is too old we'll just scrounge the few ml we'll need.
- RNeasy MinElute cleanup kit: our old kit has 23 columns remaining, almost enough for one treatment of each of our smples (before Ribo-zero treatment). But if we do the Ribo-zero work ourselves we'll need to also do a second cleanup after the Ribo-zero, so we'd need to buy another kit (about $400).
- Ribo-zero: If we do the treatment ourselves we'll need to order a 6-sample kit ($650). However our collaborators (the former RA's new lab) might do this for us. I'll check with her.
The tech is streaking out all the strains on plain sBHI plates. We'll then restreak them on antibiotic plates to check their resistances, and confirm their transformation phenotypes when we grow them for the RNA preps.
Strains:
- KW20: our wildtype strain (RR722, in freezer box 12): sensitive to all antibiotics, normal transformation.
- HI0659 (antitoxin) knockout (unmarked) (RR3158, in freezer box 39): should be sensitive to all antibiotics, not transformable at all.
- HI0660 (toxin) knockout (unmarked) (RR3159, in freezer box 39): should be sensitive to all antibiotics, normal transformation.
- rpoD753 (RR753, in freezer box 16): resistant to streptomycin, moderately hypercompetent in sBHI.
- hfq (marked deletion) (RR3187, in freezer box 40): resistant to spectinomycin, transformation down ~ tenfold.
- crp::miniTn10kan (RR540, in freezer box 3): resistant to kanamycin and streptomycin, not transformable at all.
- sxy::miniTn10kan (RR648, in freezer box 8): resistant to kanamycin, not transformable at all.
Friday, March 20, 2015
Planning more RNAseq runs
We need to decide what additional RNAseq runs we're going to do, given the papers we'd like to write and the limited resources we have. This post attempts to clarify the issues, following on a long discussion with Josh about this yesterday afternoon.
Here's a screenshot of the latest version of the table showing both what we've done and the runs we might still do.
We anticipate using this data in several papers:
1. A paper about the toxin/antitoxin genes (authors Hailey, Rosie, Josh, Sunita, Scott?): This paper will consider the gene expression data, the phylogeny, and the phenotype of knockouts in H. influenzae and A. pleuropneumoniae. For it we will directly need three MIV time-course replicates of KW20, and the toxin, antitoxin and double knockout strains (toxx, antx, taxx in the table above). We will also need to have identified the set of Sxy-dependent competence-induced genes, which will use the three replicates of the sxy-knockout time-course (sxyx). We can control for any effects of the spcR cassettes in the 'marked' strains by analyzing data we have.
We therefore need to still do:
2. A paper about the role of hfq in competence (authors Scott, Rosie, Josh, Sunita?): This paper will consider the phenotype of the hfq knockout and its effect on gene expression and cAMP levels. For it we will need the three KW20 MIV time courses, the 3 sxyx time courses to identify Sxy-dependent genes, and 2 crpx and hfqx MIV time-courses. We need to sort out ow cAMP levels change - Scott has done beta-gal assays. We may also need to identify the small RNAs that Hfq might interact with, unless this information is available from published hfq or other papers.
We therefore still need to do:
3. A paper about the rpoD hypercompetence mutation (authors Lauri, Rosie, Josh?): This paper will report the identification of a hypercompetence-causing point mutation in rpoD (maybe more mutations if Lauri's work pans out). In the rpoD mutant in sBHI, sxy RNA is up 1.6-fold at B1, 3.3-fold at B2, and 1.8-fold at B3. We will definitely need another replicate of these samples. Lauri's analysis of sxy-beta-gal promoter fusions also found that sxy transcription is up in this mutant. There aren't very many dramatic changes in other genes, and without any replication it's very hard to tell what might be significant.
We need:
4. A paper about identification of HI1457/HI1456 as a competence-induced operon. (authors ???; this result might be in the same paper as 3): These RNAs are up strongly in all three hypercompetence mutants in sBHI, and 1.6-2-fold in wildtype cells in MIV. HI1457 (291 aa) is tagged as 'opacity protein' and HI1456 (179 aa) as 'hypothetical protein'. (Note to self: this is not Hanna Tomb's por gene (dsbA homolog? HI0846).) Are they CRP-S regulated? They are way down (8-12-fold) in all the crpx MIV time-course samples, including the M0 sample, which is OD=0.2 in sBHI. But, oddly, they are down only 1.6-2-fold in the other crpx samples in sBHI (OD=0.02 and OD=0.6) and in the sxy knockout in sBHI and MIV.
5. A paper about regulation of competence (or as part of one of the other papers): Levels of sxy RNA are up 2-3 fold (highly significant) in the sxy1 hypercompetent mutant samples in sBHI (1.8-fold in B1, 3.1-fold in B2, and 2.7-fold in B3). This is unexpected under the model that these mutations just affect translatability. In the KW20 MIV time-courses, sxy RNA is up 11.9-fold at M1, 31.2-fold at M2, and 5.1-fold at M3.
6. A paper about the murE749 hypercompetence mutant: (or as part of one of the other papers): In the murE749 mutant in sBHI, sxy RNA is up 2.2-fold at B1, 3.1-fold at B2, and 2.1-fold at B3. These increases aren't significantly higher than those in the rpoD mutant, but the increase in competence is much more dramatic (100-fold more than rpoD753, 1000-or-more higher than KW20). So the difference in competence between murE749 and rpoD753 must be due to something other than the increase in sxy RNA. We have two replicates of the sBHI samples - another would be nice but isn't a very high priority.
Here's a screenshot of the latest version of the table showing both what we've done and the runs we might still do.
We anticipate using this data in several papers:
1. A paper about the toxin/antitoxin genes (authors Hailey, Rosie, Josh, Sunita, Scott?): This paper will consider the gene expression data, the phylogeny, and the phenotype of knockouts in H. influenzae and A. pleuropneumoniae. For it we will directly need three MIV time-course replicates of KW20, and the toxin, antitoxin and double knockout strains (toxx, antx, taxx in the table above). We will also need to have identified the set of Sxy-dependent competence-induced genes, which will use the three replicates of the sxy-knockout time-course (sxyx). We can control for any effects of the spcR cassettes in the 'marked' strains by analyzing data we have.
We therefore need to still do:
- 2 replicates of the antx (unmarked) MIV time-courses to replace those that t=urned out to be the comN knockout (cmnx).
- At least 1 replicate of the toxx (unmarked) MIV time course, to complement the 1 unmarked and 2 marked we already have.
2. A paper about the role of hfq in competence (authors Scott, Rosie, Josh, Sunita?): This paper will consider the phenotype of the hfq knockout and its effect on gene expression and cAMP levels. For it we will need the three KW20 MIV time courses, the 3 sxyx time courses to identify Sxy-dependent genes, and 2 crpx and hfqx MIV time-courses. We need to sort out ow cAMP levels change - Scott has done beta-gal assays. We may also need to identify the small RNAs that Hfq might interact with, unless this information is available from published hfq or other papers.
We therefore still need to do:
- At least 1 replicate of the hfqx MIV time-course
- At least 1 replicate of the crpx MIV time-course
- Maybe 2 replicates of one or more small-RNA samples (RNAs prepared with Trizol rather than with Qiagen RNeasy columns).
3. A paper about the rpoD hypercompetence mutation (authors Lauri, Rosie, Josh?): This paper will report the identification of a hypercompetence-causing point mutation in rpoD (maybe more mutations if Lauri's work pans out). In the rpoD mutant in sBHI, sxy RNA is up 1.6-fold at B1, 3.3-fold at B2, and 1.8-fold at B3. We will definitely need another replicate of these samples. Lauri's analysis of sxy-beta-gal promoter fusions also found that sxy transcription is up in this mutant. There aren't very many dramatic changes in other genes, and without any replication it's very hard to tell what might be significant.
We need:
- at least one replicate of the rpod sBHI samples
4. A paper about identification of HI1457/HI1456 as a competence-induced operon. (authors ???; this result might be in the same paper as 3): These RNAs are up strongly in all three hypercompetence mutants in sBHI, and 1.6-2-fold in wildtype cells in MIV. HI1457 (291 aa) is tagged as 'opacity protein' and HI1456 (179 aa) as 'hypothetical protein'. (Note to self: this is not Hanna Tomb's por gene (dsbA homolog? HI0846).) Are they CRP-S regulated? They are way down (8-12-fold) in all the crpx MIV time-course samples, including the M0 sample, which is OD=0.2 in sBHI. But, oddly, they are down only 1.6-2-fold in the other crpx samples in sBHI (OD=0.02 and OD=0.6) and in the sxy knockout in sBHI and MIV.
5. A paper about regulation of competence (or as part of one of the other papers): Levels of sxy RNA are up 2-3 fold (highly significant) in the sxy1 hypercompetent mutant samples in sBHI (1.8-fold in B1, 3.1-fold in B2, and 2.7-fold in B3). This is unexpected under the model that these mutations just affect translatability. In the KW20 MIV time-courses, sxy RNA is up 11.9-fold at M1, 31.2-fold at M2, and 5.1-fold at M3.
6. A paper about the murE749 hypercompetence mutant: (or as part of one of the other papers): In the murE749 mutant in sBHI, sxy RNA is up 2.2-fold at B1, 3.1-fold at B2, and 2.1-fold at B3. These increases aren't significantly higher than those in the rpoD mutant, but the increase in competence is much more dramatic (100-fold more than rpoD753, 1000-or-more higher than KW20). So the difference in competence between murE749 and rpoD753 must be due to something other than the increase in sxy RNA. We have two replicates of the sBHI samples - another would be nice but isn't a very high priority.
kw20_M3_C and hfqx_M0/M1 mixup
Take a look at this plot:

This is a principal component analysis (PCA) plot. It takes log normalized counts and plots samples on a 2D plane where each axis is a "principal component" that explains as much variance in the data as possible. The important thing to know is that similar samples will cluster in this plot. We see the different MIV timepoints clustering together in this case (only MIV samples were used to make this plot). A few things stick out though:
1) it looks like hfqx_M1 clusters strongly with the M0 samples; hfqxM0 clusters with M1 samples
2) similar thing seen with a couple of the cmnx samples (replicate C)
3) there appears to be two additional samples that are outliers (kw20_M3_C and cmnx_M2_C)
4) interestingly, it seems that all the crpx samples (except M0) cluster independent of timepoint
For hfqx, I have suspected a sample swap for a long time now; my original differential expression analysis had weird results and were corrected upon switching the samples. As of writing, I have renamed these samples on my harddrive but no where else.
The cmnx appears to be swapped too. Perhaps it would be best to throw out these samples and use the well-behaved ones if cmnx should be analyzed in the future.
The light blue (M2) cmnx outlier is the bad sample that I have brought up a few times now. This sample has very low counts and poor alignment. This sample should be thrown out from all analyses from now on.
The KW20 outlier is odd though.
Here we can see that the hfqx_M0 and hfqx_M1 cluster in the wrong spots and would be corrected with a swap. The two cmnx samples (green arrow) that are swapped cluster in M0 and appear very similar to each other. We can also see that the ugly cmnx_M2_C sample and kw20_M3_C cluster separately at the top of the heatmap (purple arrow).
I went to the FastQC files I made a few months ago and looked at the kw20_M3_C sample:
I saw this terrifying graph. There's a few things wrong with it. Other samples have a nice, smooth bell-curve shape, however this plot is extremely jagged. Another problem is that the mean GC content is too high. The theoretical blue curve should peak closer to 40% for Haemophilus influenzae. Why do we see this?
Well, I looked to see if there was some contamination but some 98% of reads align to the KW20 genome. Then I thought it could be a ribosomal issue.
The FastQC file also reports overrepresented sequences. There's a lot for this sample. I blasted a few of them, in particular, this one:
This 50bp stretch is found in 73523 reads (1.4% of reads in the sample). To my expectation, it aligned to the 16s rRNA gene in the KW20 genome. I looked at how many reads aligned to the 16s rRNA sequence and in some regions, more than 1.5 million reads are aligning to a given base in this sample. This is roughly 5x more than the average for all other samples. I'll repeat that for clarity, it looks like KW20_M3_C has roughly 5x more rRNA than the average sample. This sample also has the most 16s rRNA among all samples.
In summary, it looks like this sample is clustering away from all other samples because a large portion of the reads are from rRNA. Similar to cmnx_M2_C, the genes in these samples have low counts and that might explain why they cluster together. I don't know whether or not I should continue to use this sample. It will take a little bit more investigation.

This is a principal component analysis (PCA) plot. It takes log normalized counts and plots samples on a 2D plane where each axis is a "principal component" that explains as much variance in the data as possible. The important thing to know is that similar samples will cluster in this plot. We see the different MIV timepoints clustering together in this case (only MIV samples were used to make this plot). A few things stick out though:
1) it looks like hfqx_M1 clusters strongly with the M0 samples; hfqxM0 clusters with M1 samples
2) similar thing seen with a couple of the cmnx samples (replicate C)
3) there appears to be two additional samples that are outliers (kw20_M3_C and cmnx_M2_C)
4) interestingly, it seems that all the crpx samples (except M0) cluster independent of timepoint
For hfqx, I have suspected a sample swap for a long time now; my original differential expression analysis had weird results and were corrected upon switching the samples. As of writing, I have renamed these samples on my harddrive but no where else.
The cmnx appears to be swapped too. Perhaps it would be best to throw out these samples and use the well-behaved ones if cmnx should be analyzed in the future.
The light blue (M2) cmnx outlier is the bad sample that I have brought up a few times now. This sample has very low counts and poor alignment. This sample should be thrown out from all analyses from now on.
The KW20 outlier is odd though.
Here we can see that the hfqx_M0 and hfqx_M1 cluster in the wrong spots and would be corrected with a swap. The two cmnx samples (green arrow) that are swapped cluster in M0 and appear very similar to each other. We can also see that the ugly cmnx_M2_C sample and kw20_M3_C cluster separately at the top of the heatmap (purple arrow).
I went to the FastQC files I made a few months ago and looked at the kw20_M3_C sample:
I saw this terrifying graph. There's a few things wrong with it. Other samples have a nice, smooth bell-curve shape, however this plot is extremely jagged. Another problem is that the mean GC content is too high. The theoretical blue curve should peak closer to 40% for Haemophilus influenzae. Why do we see this?
Well, I looked to see if there was some contamination but some 98% of reads align to the KW20 genome. Then I thought it could be a ribosomal issue.
The FastQC file also reports overrepresented sequences. There's a lot for this sample. I blasted a few of them, in particular, this one:
CACCCATTGAGTTCCTATTTGGTCTTGCTACGAGTGGAGTTTACCCTGCT
This 50bp stretch is found in 73523 reads (1.4% of reads in the sample). To my expectation, it aligned to the 16s rRNA gene in the KW20 genome. I looked at how many reads aligned to the 16s rRNA sequence and in some regions, more than 1.5 million reads are aligning to a given base in this sample. This is roughly 5x more than the average for all other samples. I'll repeat that for clarity, it looks like KW20_M3_C has roughly 5x more rRNA than the average sample. This sample also has the most 16s rRNA among all samples.
In summary, it looks like this sample is clustering away from all other samples because a large portion of the reads are from rRNA. Similar to cmnx_M2_C, the genes in these samples have low counts and that might explain why they cluster together. I don't know whether or not I should continue to use this sample. It will take a little bit more investigation.
Thursday, March 19, 2015
Does having a spec cassette make a difference?
I was asked to check if the strains with the spec cassette have anything unusual in terms of gene expression.
First I needed to determine which samples had the spec cassette. To do this, I ran a grep command on the FASTQ files of may samples and looked for the string "GGGGATCCGTCGACCTGCAGTTCGAAG" which is a sequence in the spec cassette.
gunzip -cd filename.fq.gz | grep GGGGATCCGTCGACCTGCAGTTCGAAG
The following samples had reads that match this string:
cnmx
antx
toxx
The following samples did not have reads that match this string:
taxx
everything else
I checked every sample for strains in red and they were all consistent within a given strain. I only checked a single condition for everything else and not a single match was found in these.
For these marked strains, I pulled out all the genes that were significantly different compared to KW20 under the same condition. (ie. cmnx_M0 vs kw20_M0, antx_M1 vs kw20_M1, etc). Among these 12 samples, there was a total of 222 differentially expressed genes compared to KW20 (190 unique genes)
Results:
Only one gene was found differentially expressed in more than 3 samples. HI_0658 (ATP-binding cassette, subfamily F, member 3) was found differentially expressed in 4 (33%) of the samples, however all four samples were antx so this is not a spec-specific effect.
A few genes (~10 in total) show up in 3 samples. HI_0062 (dnaK suppressor protein, ~3x downregulated) and HI_0235 (alternative ribosome-rescue factor, ~5x downregulation) are differentially expressed compared to KW20 in cmnx_M3, antx_M3 and toxx_M3.
However, looking at another comparison, sxyx_M3 vs KW20_M3 (chosen at random), I see the same behaviour for these two genes. Perhaps it could be due some skewing of the data due to the problematic KW20_M3_C replicate that Rosie mentioned in the previous post.
A operon containing ribosomal protein HI_0776-779 (~2.5x upregulated) is differentially expressed compared to KW20 in cmnx_M1, antx_M2 and toxx_M2.
In summary, if there is any effect caused by having a spec cassette, is it not strong enough to really alter the set of differentially expressed genes.
First I needed to determine which samples had the spec cassette. To do this, I ran a grep command on the FASTQ files of may samples and looked for the string "GGGGATCCGTCGACCTGCAGTTCGAAG" which is a sequence in the spec cassette.
gunzip -cd filename.fq.gz | grep GGGGATCCGTCGACCTGCAGTTCGAAG
The following samples had reads that match this string:
cnmx
antx
toxx
The following samples did not have reads that match this string:
taxx
everything else
I checked every sample for strains in red and they were all consistent within a given strain. I only checked a single condition for everything else and not a single match was found in these.
For these marked strains, I pulled out all the genes that were significantly different compared to KW20 under the same condition. (ie. cmnx_M0 vs kw20_M0, antx_M1 vs kw20_M1, etc). Among these 12 samples, there was a total of 222 differentially expressed genes compared to KW20 (190 unique genes)
Results:
Only one gene was found differentially expressed in more than 3 samples. HI_0658 (ATP-binding cassette, subfamily F, member 3) was found differentially expressed in 4 (33%) of the samples, however all four samples were antx so this is not a spec-specific effect.
A few genes (~10 in total) show up in 3 samples. HI_0062 (dnaK suppressor protein, ~3x downregulated) and HI_0235 (alternative ribosome-rescue factor, ~5x downregulation) are differentially expressed compared to KW20 in cmnx_M3, antx_M3 and toxx_M3.
However, looking at another comparison, sxyx_M3 vs KW20_M3 (chosen at random), I see the same behaviour for these two genes. Perhaps it could be due some skewing of the data due to the problematic KW20_M3_C replicate that Rosie mentioned in the previous post.
A operon containing ribosomal protein HI_0776-779 (~2.5x upregulated) is differentially expressed compared to KW20 in cmnx_M1, antx_M2 and toxx_M2.
In summary, if there is any effect caused by having a spec cassette, is it not strong enough to really alter the set of differentially expressed genes.
Is something off about the KW20_M3_C sample
Josh points out that, in Lauri's analysis, the KW20_M3_C profile is an outlier from all the other samples (its squares are all pale blue, and its location is far from the other two replicates).
We need to find out what's going on here before doing further analysis.
We need to find out what's going on here before doing further analysis.
Subscribe to:
Posts (Atom)