This is the third
post in a series of posts that will detail the initial steps in analyzing the
data. This pipeline is standard for next-generation sequencing data.
At this
point, we’ve sequenced our sample and we’ve taken a look at the output FASTQ
files. The next step is to put our sequencing data in the context of our
species (i.e. Haemophilus influenzae).
To do this, we need to figure out where all these sequenced fragments came from
in our H. influenzae genome. This
process is known as aligning.
First, we
actually need the
H. influenzae
genome – specifically the KW20 one. This information is stored in what’s known
as a reference genome. I uploaded the file that I used for my alignment to the Google
drive (Scott>references>sequence.fasta) or you can download it yourself
from
NCBI.
This is a
FASTA file which contains a description of the sequence on the first line
followed by the letters making up the entire KW20 genome.
Now that we
have our genome, we want to find out where the fragments in the FASTQ files
came from. To do this, we need to use an aligner. I have chosen
BWA, a
software package that’s been
cited almost 5000 times since 2009.
The BWA
package comes with different aligning algorithms – the specific one I used was bwa mem which is recommended for Illumina reads due to optimized speed
and alignment accuracy.
Running BWA is a multistep process.
The documentation for each command with detailed information on parameters can
be found here. http://bio-bwa.sourceforge.net/bwa.shtml
The first step in the BWA pipeline
is to index our reference. This can be done with the following command:
bwa index reference.fasta
Why do we have to index? Imagine
that you are reading a giant textbook and I ask you to find a particular
sentence in the book. You can go line by line in the book to find the sentence,
but a more optimal way would be to flip to the index section of the book and
identify pages where key words from my sentence are used in the book. In the
same way, aligning a small read to a huge genome is more efficient if you can
narrow your search by comparing the letters of your read to some index.
The next step is the actual aligning
process. I already mentioned that this step was done using bwa mem; here
is the command:
bwa mem reference.fasta read1.fq.gz read2.fq.gz > out.sam
(if you are running this command on
the UBC Zoology cluster, you have to specify the correct version of BWA in
order to use bwa mem. This is done by replacing “bwa mem” in the above command with /Linux/bin/bwa-0.7.10 mem)
For paired end reads, you need to
give bwa your (indexed) reference genome file and the two fastq files
representing each read. The “>” character designates that you want
to output the result of the command into a file called (in this case) out.sam.
And that’s it! This will run BWA
using the default settings. If you’d prefer to change settings, it’s possible
to specify parameters in the command. For example, if you want to run the
command on multiple processors, you can insert –t 4 in the command to tell the program
that you want to use (in this case) 4 processors.
The output of this command gives
something known as a Sequence Alignment/Map (SAM) file. A SAM file is a tab-delimited
file that stores information on where reads map to in the reference sequence
and the quality of the mapping. SAM files are very rich in information, for
more of an idea of their structure, see the detailed specification file.
The problem with SAM files is that
since they contain a rich amount of information, they tend to be very large
(many gigabytes in size). They are also formatted in a very human-friendly way
which may be beneficial to human consumption but may require more effort on the
part of the computer to do any sort of computation on these files. Therefore,
it is typically required to convert these SAM files to a compressed binary
format known as a BAM file.
A popular tool to do this is
SAMTools. However, in my workflow I used Sambamba which has better performance. The following command will convert a SAM file
into a BAM file:
sambamba view -h -f bam -S out.sam > out.bam
For details on the possible
parameters, check the documentation . http://lomereiter.github.io/sambamba/docs/sambamba-view.html
Here, the view command is used to take a SAM file and to “view” it as a BAM
file. I use the -h option to denote that my SAM file
has a header. I use -f to tell the software that I want
the output to be BAM file and the -S to say that the input is a SAM
file. As I did earlier, I use “>” to save the results of the command
in a file.
Finally, we have a compressed
alignment file – but there’s one more step required of us. The original fastq
files reports reads in a random order but it is often required by downstream
applications that our reads are sorted by position in the genome. So our final
step is to sort the reads in the BAM file by position:
sambamba sort -o out.sorted.bam
out.bam
I used sambamba sort to sort the files. The -o denotes what I want the
output file to be named (out.sorted.bam in this case). More parameters can
be found here.
I ran this entire process using a
pair of bash scripts (run a set of terminal commands, extension is .sh) that
can be found on the Google drive(Scott>scripts>bash). The first is
align.sh which does the alignment, BAM conversion and sorting all in one step.
The second file is align_setup.sh which is specific to my directory layout. It
loops though all the fastq files and feeds them into the second script. If this
is too vague, see the scripts for more details.
EDIT: I should also mention that I've uploaded the powerpoint slide I used for my lab presentation to my folder in the Drive. I don't know how helpful it will be without explanations, but it's there if you want to see it.