Saturday, January 24, 2015

Basic Analysis 3 – Aligning, SAM and BAM



This is the third post in a series of posts that will detail the initial steps in analyzing the data. This pipeline is standard for next-generation sequencing data. 

At this point, we’ve sequenced our sample and we’ve taken a look at the output FASTQ files. The next step is to put our sequencing data in the context of our species (i.e. Haemophilus influenzae). To do this, we need to figure out where all these sequenced fragments came from in our H. influenzae genome. This process is known as aligning.

First, we actually need the H. influenzae genome – specifically the KW20 one. This information is stored in what’s known as a reference genome. I uploaded the file that I used for my alignment to the Google drive (Scott>references>sequence.fasta) or you can download it yourself from NCBI.

This is a FASTA file which contains a description of the sequence on the first line followed by the letters making up the entire KW20 genome.

Now that we have our genome, we want to find out where the fragments in the FASTQ files came from. To do this, we need to use an aligner. I have chosen BWA, a software package that’s been cited almost 5000 times since 2009

The BWA package comes with different aligning algorithms – the specific one I used was bwa mem which is recommended for Illumina reads due to optimized speed and alignment accuracy.

Running BWA is a multistep process. The documentation for each command with detailed information on parameters can be found here. http://bio-bwa.sourceforge.net/bwa.shtml
The first step in the BWA pipeline is to index our reference. This can be done with the following command:

bwa index reference.fasta

Why do we have to index? Imagine that you are reading a giant textbook and I ask you to find a particular sentence in the book. You can go line by line in the book to find the sentence, but a more optimal way would be to flip to the index section of the book and identify pages where key words from my sentence are used in the book. In the same way, aligning a small read to a huge genome is more efficient if you can narrow your search by comparing the letters of your read to some index.

The next step is the actual aligning process. I already mentioned that this step was done using bwa mem; here is the command:

bwa mem reference.fasta read1.fq.gz read2.fq.gz > out.sam
(if you are running this command on the UBC Zoology cluster, you have to specify the correct version of BWA in order to use bwa mem. This is done by replacing “bwa mem” in the above command with /Linux/bin/bwa-0.7.10 mem)

For paired end reads, you need to give bwa your (indexed) reference genome file and the two fastq files representing each read. The “>” character designates that you want to output the result of the command into a file called (in this case) out.sam.

And that’s it! This will run BWA using the default settings. If you’d prefer to change settings, it’s possible to specify parameters in the command. For example, if you want to run the command on multiple processors, you can insert –t 4 in the command to tell the program that you want to use (in this case) 4 processors.

The output of this command gives something known as a Sequence Alignment/Map (SAM) file. A SAM file is a tab-delimited file that stores information on where reads map to in the reference sequence and the quality of the mapping. SAM files are very rich in information, for more of an idea of their structure, see the detailed specification file.

The problem with SAM files is that since they contain a rich amount of information, they tend to be very large (many gigabytes in size). They are also formatted in a very human-friendly way which may be beneficial to human consumption but may require more effort on the part of the computer to do any sort of computation on these files. Therefore, it is typically required to convert these SAM files to a compressed binary format known as a BAM file.

A popular tool to do this is SAMTools.  However, in my workflow I used Sambamba which has better performance. The following command will convert a SAM file into a BAM file:

sambamba view -h -f bam -S out.sam > out.bam

For details on the possible parameters, check the documentation . http://lomereiter.github.io/sambamba/docs/sambamba-view.html Here, the view command is used to take a SAM file and to “view” it as a BAM file. I use the -h option to denote that my SAM file has a header. I use -f to tell the software that I want the output to be BAM file and the -S to say that the input is a SAM file. As I did earlier, I use “>” to save the results of the command in a file.

Finally, we have a compressed alignment file – but there’s one more step required of us. The original fastq files reports reads in a random order but it is often required by downstream applications that our reads are sorted by position in the genome. So our final step is to sort the reads in the BAM file by position:

sambamba sort -o out.sorted.bam out.bam

I used sambamba sort to sort the files. The -o denotes what I want the output file to be named (out.sorted.bam in this case). More parameters can be found here.

You can also do other things with Sambamba such as marking duplicates using sambamba markdup or can extracting alignment statistics using sambamba flagstat

I ran this entire process using a pair of bash scripts (run a set of terminal commands, extension is .sh) that can be found on the Google drive(Scott>scripts>bash). The first is align.sh which does the alignment, BAM conversion and sorting all in one step. The second file is align_setup.sh which is specific to my directory layout. It loops though all the fastq files and feeds them into the second script. If this is too vague, see the scripts for more details.

EDIT: I should also mention that I've uploaded the powerpoint slide I used for my lab presentation to my folder in the Drive. I don't know how helpful it will be without explanations, but it's there if you want to see it.

No comments: