This is the
second post in a series of posts that will detail the initial steps in
analyzing the data. This pipeline is standard for next-generation sequencing
data.
The output
from an Illumina platform is the FASTQ file. To understand this file, we’ll
jump right into a real example from our data:
QD_0_ACTTGA_L005_R1_001.fastq.gz
First thing
to note is the name given by the Illumina platform. The file name is composed
of 6 fields separated by underscores:
- QD_0 = sample name (defined by our lab, in this
case Δhfq in MIV at t=0)
- ACTTGA = barcode sequence ligated to each DNA
fragment (unique to each sample)
- L005 = sequencing lane on the flow cell
- 001 = set number
Each file is
can only hold a certain amount of sequencing data and anything exceeding that
size is split into multiple files denoted by the set number. This means that
files that differ only by set number are part of the same sample and must be
combined before analysis.
FASTQ files
have an extension of either .fastq or .fq but as you may have noticed, the file name I
gave above is technically not a FASTQ file. The .gz extension
denotes that this file is compressed by the Gzip
algorithm to minimize the space each file takes up.
To unzip the
file, simply navigate to the file in a terminal window and use the following
command:
gunzip
<filename>.gz
To gzip/compress
a file, use this command:
gzip
<filename>.gz
It is not
recommended to unzip the FASTQ file unless necessary (most bioinformatics software
can work on the zipped file) to keep file sizes as small as possible. However,
certain actions such as viewing the file will require you to first unzip it. To
view the first 12 lines of the FASTQ file without unzipping the whole file, do
this:
gunzip
-cd <filename>.gz | head -12
For the file
I mentioned above, the output looks like this:
@HWI-ST765:225:C5K6GACXX:5:1101:1500:2093
1:N:0:ACTTGA
GCATTAGGTGCGTTAGAAGCAACAAAAGCACACGGTAAAAAATTACCTATCTTTGGTGTGGATGCGTTACCAGAAGCATTACAATTGATTAGTAAAGGCGA
+
CCCFFFFFFHHGHJIJJJJJJGIIJJJJJJJJJJJGIJJJIJJIIJJJJIJIIJJJGHEHEHFFBCCDDDDDDDDCCDDDDDDDDDDDEDECCDEDCC<@>
@HWI-ST765:225:C5K6GACXX:5:1101:1365:2094
1:N:0:ACTTGA
TTTCAAAGTCATAATTGATTAATTTGAGTTTATTTGGATCAAATGCAGTTGGCGATTGTTCTGCGGTCGTGTTAGTTAAGATTTGTAAAGAATCCCCTTGT
+
CCCFFFFFGHHHHJJIJIJIJHIJIIIGHIJIJJJJIIJFHIFJJJIJGGIIJIIIHGGHIIEH/B@BHHFBD@CCAC;>CECCCACEDC>ADD@CDDDDA
@HWI-ST765:225:C5K6GACXX:5:1101:1435:2137
1:N:0:ACTTGA
AGCCCTTAACCGTATTTATACGACCTAGTGGGACAAGTAAACGAGAGGAAAGTCCGAGCTACACAGGGCAGAGTGCCGGATAACGTCCGGGCGGCGTGAGC
+
CCCFFFFFHHHHDGIJJJJIJJJJJJJJIJIIJJJJIFGIJJJJJJJJJIJJHHIIJHHFFFFFEDDDDDDDDCDACBDBDDDDDDDDDDDBDBBD<@BD9
This is what
the guts of a FASTQ file looks like. Each read is contained in 4 lines of the
file. Since this file is 12 lines, it describes 3 reads.
@HWI-ST765:225:C5K6GACXX:5:1101:1500:2093
1:N:0:ACTTGA
The first
line always begins with @ and must be a unique identifier of that read. The
information contained here is given by the Illumina platform and tells the
exact X and Y coordinate on the flow cell where this read was sequenced. For
more detailed information, go here.
GCATTAGGTGCGTTAGAAGCAACAAAAGCACACGGTAAAAAATTACCTATCTTTGGTGTGGATGCGTTACCAGAAGCATTACAATTGATTAGTAAAGGCGA
The second
line contains the raw sequence for the read.
+
The third
line is a “+” followed by an optional description (nothing, in this case).
CCCFFFFFFHHGHJIJJJJJJGIIJJJJJJJJJJJGIJJJIJJIIJJJJIJIIJJJGHEHEHFFBCCDDDDDDDDCCDDDDDDDDDDDEDECCDEDCC<@>
The final
line is the quality score for each base. Each score is ASCII encoded (to save
space), where the higher the score, the higher the quality. The following table
can be used to calculate the score. Take the value of the character and
subtract 33 to find the quality score.
Ex. for the character C, the quality score is 34 (67 – 33 = 34).
But what
does this score mean? The score is actually a Phred quality score where:
Q = –10 log
P
P = probability base was incorrectly called
A Phred
score of 34, therefore, means that a base has a 99.96% chance of being correct
whereas a score of 10 has a 90% probability of being correct. We can use this
scoring system to evaluate the quality of our sequencing results. An efficient
way of doing this is using the tool FastQC. To run this tool, install
the software and simply use the following command:
fastqc
<filename>
The program
computes and graphs a bunch of statistics on the quality of the data.
This image
shows quality distribution at each base position in a read. As typical of
Illumina sequencing data, the quality drops towards the end of the read due to
phasing and pre-phasing (see the first post)
Overall,
FastQC shows that most bases have a quality score greater than 30 (0.1% chance
of error for each base). This suggests that the sequencing data has
sufficiently high quality. For more information on the output from FastQC,
please see this PDF.
I will
upload the FastQC results and write a post on it eventually which will be
linked here.
The next
post in this series will attempt to detail how to map these reads to a
reference genome.
No comments:
Post a Comment