Saturday, January 24, 2015

Antitoxin/comN Sample Mixup

I was looking at the BAM files I produced from the sequencing to verify all the strain had the genotype expected of them. Most of them did, but there was anissue with the supposed “cyaA” strain and I also noticed a problem with the new antitoxin knockout strains (antx). It looks like there is expression of the antitoxin gene (HI0659) in the supposed knockouts, specifically in samples denoted antx_*_C and antx_*_E. I have taken screenshots that provide evidence of this, but I do not have access to them at the moment. If I remember to, I will edit this part and add them.

At this point, I figured that perhaps something else is knocked out in these strains. I tried a bunch of things to figure out if this was the case. I tried looking at the unmapped read to see if I could find some sort of antibiotic resistance cassette, I tried assembling a genome from the sequencing data and looking for an insert (I learned that assembling a genome using raw RNA seq data is not straightforward), and I tried looking though coverage graphs to see if I could detect some deletion. In the end, I ran a differential expression analysis on the problematic samples and noticed that HI0938 was hugely downregulated. Looking at the “antx” files in IGV demonstrates what I saw:

Here, you can see the two KW20 samples (first two tracks) and the old antitoxin sample “A” (third track) had expression of comN. But as for antx samples “C” and “E” (bottom two tracks), there is an obvious deletion. This is consistent with all samples “C” and “E” for the supposed antitoxin knockout samples.

Looking at the fastq files for these comN knockout samples, I've discovered reads for a spectinomycin cassette similar to this one:
 BLASTing the first 21 letters gets me a perfect match to KW20, positions 997240 to 997260, which correspond to the beginning of comN.

In summary, there must have been some mix up at some point between the antitoxin knockout strain and the comN knockout strain. This is what I know:
- old RNA seq data for antitoxin knockout is valid (i.e. antx_*_A)
- new RNA seq data for antitoxin knockout has an issue (i.e. antx_*_C, antx_*_E): the data actually represents a comN deletion mutant where there is a spec cassette in place of the comN gene

Basic Analysis 3 – Aligning, SAM and BAM

This is the third post in a series of posts that will detail the initial steps in analyzing the data. This pipeline is standard for next-generation sequencing data. 

At this point, we’ve sequenced our sample and we’ve taken a look at the output FASTQ files. The next step is to put our sequencing data in the context of our species (i.e. Haemophilus influenzae). To do this, we need to figure out where all these sequenced fragments came from in our H. influenzae genome. This process is known as aligning.

First, we actually need the H. influenzae genome – specifically the KW20 one. This information is stored in what’s known as a reference genome. I uploaded the file that I used for my alignment to the Google drive (Scott>references>sequence.fasta) or you can download it yourself from NCBI.

This is a FASTA file which contains a description of the sequence on the first line followed by the letters making up the entire KW20 genome.

Now that we have our genome, we want to find out where the fragments in the FASTQ files came from. To do this, we need to use an aligner. I have chosen BWA, a software package that’s been cited almost 5000 times since 2009

The BWA package comes with different aligning algorithms – the specific one I used was bwa mem which is recommended for Illumina reads due to optimized speed and alignment accuracy.

Running BWA is a multistep process. The documentation for each command with detailed information on parameters can be found here.
The first step in the BWA pipeline is to index our reference. This can be done with the following command:

bwa index reference.fasta

Why do we have to index? Imagine that you are reading a giant textbook and I ask you to find a particular sentence in the book. You can go line by line in the book to find the sentence, but a more optimal way would be to flip to the index section of the book and identify pages where key words from my sentence are used in the book. In the same way, aligning a small read to a huge genome is more efficient if you can narrow your search by comparing the letters of your read to some index.

The next step is the actual aligning process. I already mentioned that this step was done using bwa mem; here is the command:

bwa mem reference.fasta read1.fq.gz read2.fq.gz > out.sam
(if you are running this command on the UBC Zoology cluster, you have to specify the correct version of BWA in order to use bwa mem. This is done by replacing “bwa mem” in the above command with /Linux/bin/bwa-0.7.10 mem)

For paired end reads, you need to give bwa your (indexed) reference genome file and the two fastq files representing each read. The “>” character designates that you want to output the result of the command into a file called (in this case) out.sam.

And that’s it! This will run BWA using the default settings. If you’d prefer to change settings, it’s possible to specify parameters in the command. For example, if you want to run the command on multiple processors, you can insert –t 4 in the command to tell the program that you want to use (in this case) 4 processors.

The output of this command gives something known as a Sequence Alignment/Map (SAM) file. A SAM file is a tab-delimited file that stores information on where reads map to in the reference sequence and the quality of the mapping. SAM files are very rich in information, for more of an idea of their structure, see the detailed specification file.

The problem with SAM files is that since they contain a rich amount of information, they tend to be very large (many gigabytes in size). They are also formatted in a very human-friendly way which may be beneficial to human consumption but may require more effort on the part of the computer to do any sort of computation on these files. Therefore, it is typically required to convert these SAM files to a compressed binary format known as a BAM file.

A popular tool to do this is SAMTools.  However, in my workflow I used Sambamba which has better performance. The following command will convert a SAM file into a BAM file:

sambamba view -h -f bam -S out.sam > out.bam

For details on the possible parameters, check the documentation . Here, the view command is used to take a SAM file and to “view” it as a BAM file. I use the -h option to denote that my SAM file has a header. I use -f to tell the software that I want the output to be BAM file and the -S to say that the input is a SAM file. As I did earlier, I use “>” to save the results of the command in a file.

Finally, we have a compressed alignment file – but there’s one more step required of us. The original fastq files reports reads in a random order but it is often required by downstream applications that our reads are sorted by position in the genome. So our final step is to sort the reads in the BAM file by position:

sambamba sort -o out.sorted.bam out.bam

I used sambamba sort to sort the files. The -o denotes what I want the output file to be named (out.sorted.bam in this case). More parameters can be found here.

You can also do other things with Sambamba such as marking duplicates using sambamba markdup or can extracting alignment statistics using sambamba flagstat

I ran this entire process using a pair of bash scripts (run a set of terminal commands, extension is .sh) that can be found on the Google drive(Scott>scripts>bash). The first is which does the alignment, BAM conversion and sorting all in one step. The second file is which is specific to my directory layout. It loops though all the fastq files and feeds them into the second script. If this is too vague, see the scripts for more details.

EDIT: I should also mention that I've uploaded the powerpoint slide I used for my lab presentation to my folder in the Drive. I don't know how helpful it will be without explanations, but it's there if you want to see it.

Monday, January 19, 2015

cyaA/crp Sample Mixup

I have made graphs showing the coverage for the crp and cyaA genes for the supposedly cya mutants and two KW20 samples:

The cya graph shows that the whole gene is intact and there is no obvious interruption. The crp graph, however, shows that there is a dramatic interruption in the "cya" strains (and I have verified that this happens only in these strains).

This interruption occurs between position 1015189 and 1015190 in the reference sequence, giving us the following: CAAAATACATCAC[drop in coverage]GTCAAG

The coverage drops about 10x between the C and G, and drops another ~2x after the G. I can't say where exactly the insertion begins with this data is presented (it depends on the first base of whatever is inserted here and may be altered by mismatch tolerance during the alignment process) but it is pretty spot on to where Rosie had predicted in her previous post.

Anyway, what this data tells me is that all our "cyax" strains are actually only "crpx" strains. I'll keep you updated if there are any additional developments in this issue.

Friday, January 16, 2015

Sorting out the identities of the supposedly 'cya' samples

There's some confusion (an euphemism for my lab-notebook errors?) about the identities of the strains that are annotated as cya-.  Scott reported at lab meeting that the sequence reads appeared to have intact cya genes.

Part of the confusion may be because the cya (or crp) knockouts I might have used are not deletion mutants (they're not part of the set of deletion mutants constructed by Sunita).  Instead they're old mutants (> 20 years old) that were created by random insertion of the transposon construct miniTn10Kan, which has a kanamycin cassette (from Tn903) inserted between the IS10 ends of transposon Tn10.  Alignment of sequence reads from such a mutant will show that all the bases of the gene are present and transcribed, but that the reads that include part of the insertion will appear to all begin or end at the same position.

The other problem is genuine confusion about which strain(s) I actually used for these cultures.  So I’ve checked my lab notes, which are also not entirely consistent.

The master_sample_key table lists these samples as strain RR688.  Since RR688 is a clinical strain sent by Simon Kroll, this is probably a typo for strain RR668 (= cya::miniTn10kan). 

For Day F, my notes say that I used a culture of strain RR540 (= crp::miniTn10kan) which I had previously used in Day E.  I think this agrees with what Scott found in the RNA-seq results, that the supposedly cya knockout strains have intact cya genes but could have disrupted crp genes.

My notes for Day E are a bit confused.  They seem to say that the RR540 and RR668 strains didn’t grow and that I instead used a double knockout strain (RR3006) that was cya::Cm and crp::miniTn10kan.  I might have done this, but I suspect my notes are wrong.

Luckily, the sequence reads are the ultimate authority.  So I've eamiled Scott asking him to confirm that the reads from both sets of ‘cya’ cultures (Days E and F) have disrupted crp genes and intact cya and sxy genes.  Because this is a miniTn10kan insertion, all the crp sequences should be present, but the read alignments that span this sequence (CAAAATACATCACG*TCAAGTCACGAAT) should all end at the *.  (I can’t give the genome sequence position number - this location information is taken from a pre-genome paper.)

Later:  Scott, here's the KW20 genome location I get with a BLAST search for this sequence (reassuringly, this is indeed part of the crp gene):

In anticipation that this is what he'll find, I've changed the strain names and descriptions for these samples in the master_sample_key table.

Polishing the sample-key table

Scott has made us an essential table called master_sample_key, which lists all the critical information about each of our 96 RNA-seq samples.  It's still a bit awkward to use, because of the ways we've named the samples.

So I've added a 'Sort_by' column to the table.  This column lets us sort the samples into a experimentally logical order.

And I've added a 'Group' column, with colour shading, to show which samples belong together as part of the same strain/treatment set (e.g. all the KW20 samples in MIV).

Sunday, January 11, 2015

An in-depth look at data quality (Part 3)

In this part, I will look at sequences that are overrepresented according to FastQC. The software flags any 50-mer sequence that makes up more than 0.1% the reads in the fastq files. If there is an abundance of rRNA sequences or if there is a large, redundant contaminant then this test will pick it up.

File that store overrepresented sequences in on Google Drive: Scott>Quality Analysis>fastqc_overrepresented.xlsx

First thing to note is that all overrepresented sequences are either a sequencing artifact or match some region of the H. influenzae genome (i.e. no overrepresented contaminant).

Here is what the abundant sequences matched to:

HI_0139 - outer membrane protein
Tended to be overrepresented in hfq, toxin, antitoxin and toxin/antitoxin knockouts in MIV at t=10 and t=30
Note: this is the only overrepresented gene that encodes a protein product

Between HI_1677-HI_1678 - RNase P
Overrepresented in every sample

Between HI_1281-HI_1282 - tmRNA
Overrepresented in most samples, no apparent pattern.

Between HI_0957-HI_0958 - C4 antisense RNA
Overrepresented in many samples (not a single hypercompetent sample though)
Note: the gene sits right next to CRP

Between HI_0857-HI_0858 - 6S RNA
Overrepresented in the old KW20 samples at t=100 in MIV (and one sxyx sample at the same timepoint)

Between HI_0631-HI_0632 -  tRNA Thr
Not sure why only this tRNA showed up. Overrepresented in the old KW20 samples at t=100 in MIV.

The resulting hits tend to have the following characteristics:
-RNA-based function
-transcribed from a small gene
-would expect to be highly expressed

Intuitively, I suppose this makes sense because a highly transcribed gene has an abundance of transcripts floating around and a smaller transcript gives rise to a smaller range of possible reads.

Anyway, the take-home message of this is that nothing strange is seen here. There were no hits for rRNA or any organism outside of H. influenzae. This is good.

So, this wraps up my in-depth look at the data quality. Ignoring a particularly poor antitoxin knockout sample, I have yet to see anything that would suggest that the data quality is poor - which is great. My next goal is to verify that the mutant strains are actually carry a mutation and that there is no mixup between timepoints (although I have some evidence this is the case for at least a pair of samples).