Thursday, January 8, 2015

Basic Analysis 2 - FASTQ and Base Quality



This is the second post in a series of posts that will detail the initial steps in analyzing the data. This pipeline is standard for next-generation sequencing data. 

The output from an Illumina platform is the FASTQ file. To understand this file, we’ll jump right into a real example from our data:

QD_0_ACTTGA_L005_R1_001.fastq.gz

First thing to note is the name given by the Illumina platform. The file name is composed of 6 fields separated by underscores:

          - QD_0 = sample name (defined by our lab, in this case Δhfq in MIV at t=0)
          - ACTTGA = barcode sequence ligated to each DNA fragment (unique to each sample)
          - L005 = sequencing lane on the flow cell
          - 001 = set number

Each file is can only hold a certain amount of sequencing data and anything exceeding that size is split into multiple files denoted by the set number. This means that files that differ only by set number are part of the same sample and must be combined before analysis.

FASTQ files have an extension of either .fastq or .fq but as you may have noticed, the file name I gave above is technically not a FASTQ file. The .gz extension denotes that this file is compressed by the Gzip algorithm to minimize the space each file takes up.

To unzip the file, simply navigate to the file in a terminal window and use the following command:

gunzip <filename>.gz

To gzip/compress a file, use this command:

gzip <filename>.gz

It is not recommended to unzip the FASTQ file unless necessary (most bioinformatics software can work on the zipped file) to keep file sizes as small as possible. However, certain actions such as viewing the file will require you to first unzip it. To view the first 12 lines of the FASTQ file without unzipping the whole file, do this:

gunzip -cd <filename>.gz | head -12

For the file I mentioned above, the output looks like this:

@HWI-ST765:225:C5K6GACXX:5:1101:1500:2093 1:N:0:ACTTGA
GCATTAGGTGCGTTAGAAGCAACAAAAGCACACGGTAAAAAATTACCTATCTTTGGTGTGGATGCGTTACCAGAAGCATTACAATTGATTAGTAAAGGCGA
+
CCCFFFFFFHHGHJIJJJJJJGIIJJJJJJJJJJJGIJJJIJJIIJJJJIJIIJJJGHEHEHFFBCCDDDDDDDDCCDDDDDDDDDDDEDECCDEDCC<@>
@HWI-ST765:225:C5K6GACXX:5:1101:1365:2094 1:N:0:ACTTGA
TTTCAAAGTCATAATTGATTAATTTGAGTTTATTTGGATCAAATGCAGTTGGCGATTGTTCTGCGGTCGTGTTAGTTAAGATTTGTAAAGAATCCCCTTGT
+
CCCFFFFFGHHHHJJIJIJIJHIJIIIGHIJIJJJJIIJFHIFJJJIJGGIIJIIIHGGHIIEH/B@BHHFBD@CCAC;>CECCCACEDC>ADD@CDDDDA
@HWI-ST765:225:C5K6GACXX:5:1101:1435:2137 1:N:0:ACTTGA
AGCCCTTAACCGTATTTATACGACCTAGTGGGACAAGTAAACGAGAGGAAAGTCCGAGCTACACAGGGCAGAGTGCCGGATAACGTCCGGGCGGCGTGAGC
+
CCCFFFFFHHHHDGIJJJJIJJJJJJJJIJIIJJJJIFGIJJJJJJJJJIJJHHIIJHHFFFFFEDDDDDDDDCDACBDBDDDDDDDDDDDBDBBD<@BD9

This is what the guts of a FASTQ file looks like. Each read is contained in 4 lines of the file. Since this file is 12 lines, it describes 3 reads.

@HWI-ST765:225:C5K6GACXX:5:1101:1500:2093 1:N:0:ACTTGA

The first line always begins with @ and must be a unique identifier of that read. The information contained here is given by the Illumina platform and tells the exact X and Y coordinate on the flow cell where this read was sequenced. For more detailed information, go here.

GCATTAGGTGCGTTAGAAGCAACAAAAGCACACGGTAAAAAATTACCTATCTTTGGTGTGGATGCGTTACCAGAAGCATTACAATTGATTAGTAAAGGCGA

The second line contains the raw sequence for the read.

+

The third line is a “+” followed by an optional description (nothing, in this case).

CCCFFFFFFHHGHJIJJJJJJGIIJJJJJJJJJJJGIJJJIJJIIJJJJIJIIJJJGHEHEHFFBCCDDDDDDDDCCDDDDDDDDDDDEDECCDEDCC<@>

The final line is the quality score for each base. Each score is ASCII encoded (to save space), where the higher the score, the higher the quality. The following table can be used to calculate the score. Take the value of the character and subtract 33 to find the quality score.
 
  
Ex. for the character C, the quality score is 34 (67 – 33 = 34).

But what does this score mean? The score is actually a Phred quality score where:

Q = –10 log P

Q = quality score 
P = probability base was incorrectly called

A Phred score of 34, therefore, means that a base has a 99.96% chance of being correct whereas a score of 10 has a 90% probability of being correct. We can use this scoring system to evaluate the quality of our sequencing results. An efficient way of doing this is using the tool FastQC.  To run this tool, install the software and simply use the following command: 

fastqc <filename>

The program computes and graphs a bunch of statistics on the quality of the data.





This image shows quality distribution at each base position in a read. As typical of Illumina sequencing data, the quality drops towards the end of the read due to phasing and pre-phasing (see the first post)

 
Overall, FastQC shows that most bases have a quality score greater than 30 (0.1% chance of error for each base). This suggests that the sequencing data has sufficiently high quality. For more information on the output from FastQC, please see this PDF.

I will upload the FastQC results and write a post on it eventually which will be linked here.

The next post in this series will attempt to detail how to map these reads to a reference genome.

No comments: