Sunday, January 11, 2015

A note about the original RNA-seq data

I have re-analyzed it with the same pipeline I used for the new data. To keep things consistent with the naming system Josh has implemented, I renamed the files as follows:

Original Name    Actual Strain    New Name
K_XXX_A          kw20             kw20_MX_A
K_XXX_B          kw20             kw20_MX_B
S_XXX_A          toxx             toxx_MX_A
S_XXX_B          sxyx             sxyx_MX_B
U_XXX_A          sxyx             sxyx_MX_A
U_XXX_B          antx             antx_MX_A

Note that a few of the samples were originally mixed up (hence the "Actual Strain" column); this naming scheme takes that into account.
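For the record, the renaming itself is just one mv per row of the table (illustrative only: the XXX placeholders and the .fq.gz extension stand in for the real file names):

# One mv per row of the renaming table above (file names are placeholders)
mv K_XXX_A.fq.gz kw20_MX_A.fq.gz
mv K_XXX_B.fq.gz kw20_MX_B.fq.gz
mv S_XXX_A.fq.gz toxx_MX_A.fq.gz
mv S_XXX_B.fq.gz sxyx_MX_B.fq.gz
mv U_XXX_A.fq.gz sxyx_MX_A.fq.gz
mv U_XXX_B.fq.gz antx_MX_A.fq.gz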

Edit: A key for all the samples has been uploaded to the main folder of the drive as a spreadsheet under the name master_sample_key.

An in-depth look at data quality (Part 2)



In this episode, we’re going to take a look at the FastQC test results. FastQC runs 10 tests on every .fastq file and gives each a “PASS”, “WARN”, or “FAIL” based on set criteria. The FastQC manual, which is where the test descriptions below come from, can be found here. Let’s jump right into it.

Note: Relevant document on the Google Drive: Scott>Quality Analysis>fastqc_extraction.sh
Raw FastQC output will be put in Scott>Quality Analysis>fastqc
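For reference, here is a minimal sketch of the kind of extraction that script does (this is not the actual fastqc_extraction.sh; it assumes the FastQC reports have been unzipped so that each <sample>_fastqc folder contains FastQC's standard summary.txt):

# Tally PASS/WARN/FAIL for one FastQC module across all unzipped report folders.
# summary.txt lines look like: PASS<tab>Per base sequence quality<tab>file.fastq.gz
module="Per base sequence quality"
for dir in *_fastqc; do
    grep "$module" "$dir/summary.txt"
done | cut -f 1 | sort | uniq -c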

Per base sequence quality
167 PASS, 0 WARN, 25 FAIL

Shows the median quality score (see this earlier post: http://sensestrand.blogspot.ca/2015/01/basic-analysis-2-fastq-and-base-quality.html) at every base position. Fails if the lower quartile score for any base is less than 5 or if the median for any base is less than 20.

- All second-read (R2) files from the old samples failed.
- Sample antx_M2_C was the only new sample to fail; the failure was on the first read (R1).
- All failures were caused by a bad lower quartile score at the last base position:


No significant problems here.


Per sequence quality scores
192 PASS, 0 WARN, 0 FAIL

Shows the distribution of the mean quality score for every read. Warning is given if the most frequently observed mean quality is less than 27 (0.2% error rate). All passed.



Per base sequence content
0 PASS, 0 WARN, 192 FAIL

Shows the proportion of each base (A, T, G, C) at every position across all reads. Fails if the difference between A and T, or between G and C, is greater than 20% at any position.

There’s clearly something weird happening at the beginning. This pattern is strongly consistent across all the sequences I looked at, even between old and new data. I think I understand why, but I’ll save the answer until after the next test. . .

Per base GC content
0 PASS, 0 WARN, 192 FAIL

Plots GC content for each base across all reads. Fails if any base differs from mean GC by more than 10%.



The results here are consistent with the previous test, and again the graph is highly consistent across samples, even between old and new data. The Illumina FAQ (http://support.illumina.com/faqs.html) states:

“Why is GC high in the first few bases?
 It is normal to observe both a slight GC bias and a distinctly non-random base composition over the first 12 bases of the data. In genomic DNA sequencing, the base composition is usually quite uniform across all bases, but in mRNA-Seq, the base composition is noticeably uneven across the first 10 to 12 bases. We believe this effect is caused by the "not so random" nature of the random priming process used in the protocol. This might explain why there is a slight overall G/C bias in the starting positions of each read. The first 12 bases probably represent the sites that were being primed by the hexamers used in the random priming process. This is because the random priming is not truly random and the first 12 bases (the length of two hexamers) are biased towards sequences that prime more efficiently. This is normal and expected.”
This paper (http://nar.oxfordjournals.org/content/38/12/e131) also discusses the issue.

Basically, we should expect these tests to fail because of the biased base composition (and slight GC bias) over the first ~12 positions. In fact, seeing this, and seeing how consistent it is, should give us confidence in our data.

Per sequence GC content
15 PASS, 149 WARN, 28 FAIL

Graphs the distribution of GC content per read next to a theoretical distribution calculated from the data. Deviation from the theoretical distribution causes warning or failure.





The graphs above show what typical PASS/WARN/FAIL results look like. There appears to be a set of GC-rich reads (circled peaks) that creates a wider theoretical distribution in the WARN and FAIL samples. These could be unusually abundant GC-rich transcripts from H. influenzae (PCR duplicates or just highly expressed genes), or they could be contamination.

I pursued this idea and made the following graph:

 

This graph shows a couple of things. One is that the samples with the best GC distributions also aligned best to the H. influenzae reference genome. The second is that the old samples tended to fail the test more often than the new samples (and tended not to align as well): 45% of the old samples fail versus only 5% of the new ones. This supports the idea I offered in the previous post that the newer samples are less contaminated.

If contamination is causing the test to fail in most of these cases, I would say it’s not a big deal, since more than 97% of the reads still align to the reference genome (meaning contamination probably represents less than 3% of the reads).

Per base N content
168 PASS, 24 WARN, 0 FAIL

An N is used in place of a nucleotide when Illumina cannot determine a base during sequencing. This test raises a warning if any base position has more than 5% N’s. All samples with a warning are old samples, and the warnings are on the second read (R2), similar to the first test. As you might guess, the last position is the one triggering the warning: about 10-15% of the nucleotides in the last position are N’s. This is not surprising considering the base quality at this position is terrible.
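A quick way to confirm this directly from a raw file (a sketch, using the same <filename> placeholder style as the earlier posts):

# Fraction of reads whose final base is an N
gunzip -cd <filename>.fq.gz | awk 'NR % 4 == 2 {tot++; if (substr($0, length($0), 1) == "N") n++}
    END {printf "%.1f%% of reads end in N\n", 100 * n / tot}'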

Anyway, this is not a problem. Moving on. . .
Sequence Length Distribution
192 PASS, 0 WARN, 0 FAIL

This test checks that all the reads are the same length and will fail if even one read is not the expected length. Since everything passed, all reads should be the same length (101 bases for both old and new samples).
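This is easy to verify directly from a file; tallying the read lengths should return a single line (all reads 101 bases). A sketch:

# Count of reads at each observed read length
gunzip -cd <filename>.fq.gz | awk 'NR % 4 == 2 {print length($0)}' | sort -n | uniq -c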

Sequence Duplication Levels
0 PASS, 0 WARN, 192 FAIL

This counts the number of duplicated reads and fails if more than 50% are duplicated. I analyzed duplication in the previous post; all that needs to be said here is that, because this is an RNA-seq dataset, we expect a lot of duplication, so this result is unsurprising.

Overrepresented sequences
0 PASS, 186 WARN, 6 FAIL

I will address this one in the next post in this series.

Kmer Content
0 PASS, 185 WARN, 7 FAIL

Detects whether any particular 5-mer is enriched in the dataset. Given the bias in “random” priming and the high duplication rate, it’s likely that many 5-mers are overrepresented. As such, I won’t take the time to look at this carefully, at least not at this point.

Saturday, January 10, 2015

An in-depth look at data quality (Part 1)



I’ve talked about FastQC in a previous post, but now I’ve actually used it to analyze our data. We want to make sure the data we have is of sufficient quality before we do anything with it, so it’s important to do this analysis early.

Note: I put all relevant documents on the Google Drive in the folder Scott>Quality Analysis

Here’s what I did:
1)      Ran FastQC on all .fastq files (96 samples x 2 reads each = 192 files); see the loop sketch after this list
2)      Ran Sambamba to align files to the reference genome (to be discussed in a future post)
3)      Ran Sambamba to extract alignment statistics
4)      Made a script that runs in the terminal to extract the important statistics from the outputs of 1) and 3) (uploaded to the drive as fastqc_extraction.sh)
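For step 1, the batch run is just a loop over the files; a sketch of the general approach (not the exact command I ran), assuming all the gzipped fastq files sit in one directory:

# Run FastQC on every gzipped fastq file and unpack each report so its
# text output (summary.txt, fastqc_data.txt) can be parsed later
mkdir -p fastqc
for fq in *.fastq.gz; do
    fastqc --extract -o fastqc "$fq"
done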

The output of my script is called fastq_results.xlsx on the drive. On the first worksheet of this file I have listed useful values and whether or not a sample passes or fails the standard FastQC tests. The second worksheet is just a list of every sequence FastQC has labelled as “overrepresented”.

The first thing I did was compare some numbers between the old and new RNA-seq datasets in R. The script that makes this plot is called quality_analysis_script.R (and requires the script multiplot.R).


Number of reads (i.e. number of DNA fragments sequenced in each sample)
The graph suggests that the older samples tend to have more reads: on average, 1.5 million more. A t-test says there is a significant difference, with a p-value of 1e-08. Although 1.5 million sounds like a lot, keep in mind that some of the older samples differ by close to 6 million reads.

The top 4 files for reads are:
toxx_M0_A.R1.fq.gz – 10651839
toxx_M0_A.R2.fq.gz – 10651839
antx_M0_A.R1.fq.gz – 11311458
antx_M0_A.R2.fq.gz – 11311458

The bottom 4 files are:
antx_M2_C.R1.fq.gz – 1781164
antx_M2_C.R2.fq.gz – 1781164
antx_M2_E.R1.fq.gz – 4035388
antx_M2_E.R2.fq.gz – 4035388

Number of duplicates (i.e. reads that do not have a unique sequence in the sample)
The graph suggests that the newer samples tend to have more duplicated reads. Newer samples have, on average, 4% more duplicates. A t-test says that there is a significant difference with a p-value of 4e-11.

Duplicate reads are caused by sequencing two PCR products derived from the same fragment (technical duplicates) or by sequencing the same region of two distinct RNA molecules (biological duplicates). Technical duplicates are unwanted whereas biological duplicates are pretty much the basis of differential expression analysis. It is impossible to discriminate the two types of duplicates with the data we have.

So it looks like most samples fall above 80% duplicates, which sounds like a lot (only 20% of reads are actually unique). However, remember that only one of the two paired-end reads is being analyzed here, so all reads that begin at the same nucleotide are considered duplicates. Because this is RNA-seq, most reads come from the transcriptome (a subset of the whole genome), and in some samples there are 3x more reads than nucleotides in the genome, so extremely high duplication (80-90%) is reasonable.

The top 4 files for duplicated % are:
kw20_M3_C.R1.fq.gz – 89.1
toxx_M1_D.R1.fq.gz – 88.6
kw20_M3_C.R2.fq.gz – 88.0
toxx_M1_D.R2.fq.gz – 87.9

The bottom 4 files are:
kw20_M0_B.R2.fq.gz – 75.8
sxyx_M1_B.R2.fq.gz – 75.5
antx_M2_C.R1.fq.gz – 60.4
antx_M2_C.R2.fq.gz – 58.7

Note the bottom two. They contain a lot of contaminating sequences, so it’s not surprising that almost half of their reads are unique.

GC Percent (i.e. % of nucleotides that are G or C)
There is, on average, 1% more GC in the old data than the new (p=2e-08). H. influenzae has a GC content of 38%, and Rosie has claimed that genes tend to be higher in GC. The older dataset is also much more variable (except for 4 files, 2 of which have already been shown to be contaminated).
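These numbers can be spot-checked straight from the raw reads (a sketch, not part of fastqc_extraction.sh), e.g. for the highest-GC file listed below:

# Overall GC% of the reads in one file
gunzip -cd kw20_M3_C.R1.fq.gz | awk 'NR % 4 == 2 {tot += length($0); gc += gsub(/[GCgc]/, "")}
    END {printf "GC%%: %.1f\n", 100 * gc / tot}'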

The top 4 files for GC % are:
kw20_M3_C.R1.fq.gz – 46
kw20_M3_C.R2.fq.gz – 45
antx_M2_C.R1.fq.gz – 45
antx_M2_C.R2.fq.gz – 45

The bottom of the list is not very informative: most of the lowest files sit at 40%.

% Alignment (i.e. % of reads that can be mapped back to the H. influenzae genome)
Note that for antx_M2_C, only 20% of reads aligned to the reference genome; this sample was not included in the graph. On average, reads from the new samples align better than reads from the old samples by 0.5%. This difference is significant (p=1e-10) once the poorly aligned sample is removed.

This could mean the newer samples are purer, or that the sequencing quality was higher in the new run (base-calling errors make it harder to align reads back to a reference sequence), or maybe something else I haven’t thought of.
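As a sanity check on these percentages, the mapped-read fraction can be pulled from each alignment with sambamba's flagstat (a sketch; it assumes the aligned files are named <sample>.bam and that the output format mirrors samtools flagstat):

# Print each sample's mapped-read percentage
for bam in *.bam; do
    pct=$(sambamba flagstat "$bam" | grep "mapped (" | head -1 | sed 's/.*(\([0-9.]*\)%.*/\1/')
    printf '%s\t%s%%\n' "${bam%.bam}" "$pct"
done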

The top 4 files for % alignment are:
sxy1_B2_G.R1.fq.gz – 99.42
sxy1_B2_G.R2.fq.gz – 99.42
sxyx_M0_D.R1.fq.gz – 99.43
sxyx_M0_D.R2.fq.gz – 99.43

The bottom 4 files are:
antx_M2_C.R1.fq.gz – 20.1
antx_M2_C.R2.fq.gz – 20.1
toxx_M2_A.R1.fq.gz – 97.1
toxx_M2_A.R2.fq.gz – 97.1

Overall:
- The old data and the new data are different. Statistics can tell me that; however, I don't see any major problem. Sample antx_M2_C is what I would expect a real problem to look like. When samples are prepared by different people on different days, there is bound to be some technical variation.
- Sample antx_M2_C is bad. I suspect it may cause problems if we keep it in any future analysis steps (but we can try anyway).
- Looking at the graphs, my prediction is that the newer data is either less contaminated or contains more PCR duplicates, because the graphs suggest that there is more H. influenzae DNA in the newer samples (better alignment, less variable GC content).

So far, this analysis suggests that the new data is probably as good as (if not better than) the old data. In the next post I will look at the FastQC test results.

Thursday, January 8, 2015

A first look at the up-regulated and down-regulated genes in the toxin and antitoxin mutants

First Look: missing down-regulated genes, and an upregulated hsd Type I restriction modification enzyme complex

Three of the strains sequenced for the MIV time course experiments have deletions in the CRP-S operon consisting of genes HI0659 and HI0660.

HI0659 seems to function as an antitoxin that prevents the HI0660 toxin from inhibiting DNA uptake and transformation.  Cells without HI0659 are not transformable, but grow normally, while cells without HI0660 seem to grow and transform normally.

Previous RNA sequencing results showed that knocking out the antitoxin seemed to slightly decrease sxy expression and expression of competence genes (expression patterns were intermediate between wild type and sxy- strains in most cases), but increase toxin expression.  Knockout of the toxin seemed to slightly increase expression of some competence genes.  The HI0659/HI0660 deletion strain was not tested in the last RNAseq.

I have looked at the files Scott put together showing which genes are up-regulated and down-regulated significantly (with a false discovery rate <0.05) for the toxin deletion, antitoxin deletion, and toxin/antitoxin operon deletion strains.

The first thing that jumped out at me is that the antitoxin and toxin/antitoxin deletion strains don't show significant down-regulation of the deleted genes on these lists.  Since the genes have been deleted, we know they shouldn't be expressed at all, whereas during competence they are normally up-regulated over time, so we should see them appear on the list of down-regulated genes relative to wild-type at the later time points. It's possible this means that the criteria for being "significantly" down-regulated are too stringent, but it's also possible that this result is an artifact of how the genes were knocked out.

The genes were knocked out using a spec cassette that replaces the majority of the coding sequence of the gene.  In the case of the toxin HI0660, which is transcribed first in the operon, everything between the start codon and the last 7 codons is replaced, and in the antitoxin HI0659, everything between the first 3 codons and the last 7 codons is replaced.  The first 3 codons are retained because they overlap with the coding sequence of HI0660.  In the previous RNA sequencing, I suspect this inclusion of a few codons from the beginning of the gene caused HI0659 to appear expressed in the HI0659 knockout strain when the data was analysed (see yellow line in image 1 below), even though looking at the coverage of bases across the whole gene confirmed that the middle of the gene was knocked out properly.

Image 1: Expression level (number of reads normalized for gene length) of the gene HI0659 over time in the sequenced strains.  Strains were HI_0659 = antitoxin deletion, HI_0660 = toxin deletion, KW20_A and _B = two replicates of wild type, SXY-_A and B = two replicates of sxy gene knockout
The graph for HI0660 appears as it should (see green line in image 2 below).

Image 2: Expression level (number of reads normalized for gene length) of the gene HI0660 over time in the sequenced strains.  Strains were HI_0659 = antitoxin deletion, HI_0660 = toxin deletion, KW20_A and _B = two replicates of wild type, SXY-_A and B = two replicates of sxy gene knockout
I think this may be what's happening in the new data as well, since the file for genes down-regulated in the HI0660 deletion mutant includes HI0660 at the T=10 min and T=30 min time points as expected.  This issue definitely needs to be looked at before any further analysis is done since other genes knocked out using the same spec cassette recombineering method may show similar patterns depending on which part of the gene was replaced.

Even apart from this issue, I didn't see much consistency in the genes that were up- and down-regulated across the different time points in the toxin- and antitoxin-deletion strains.  For example, sxy did not appear to be downregulated in any of the antitoxin-deletion time points, nor did the toxin appear up-regulated as seen before, and nothing was down-regulated at t=30.

The one consistent result that jumped out at me is that when the whole toxin-antitoxin operon is knocked out, all three subunits of the hsd Type I restriction modification complex (genes 1285, 1286, and 1287) are upregulated across essentially all time points (only 2/3 appear upregulated at t=100, but at all other time points they're all up-regulated).  These are subunits of a "host specificity of DNA" Type I restriction complex, which in E. coli recognises a specific methylation site and can both methylate hemi-methylated host DNA strands (for example, after damage repair), and cleave non-methylated foreign DNA at a non-specific site [B Sain, NE Murray. The hsd (host specificity) genes of E. coli K12. Molecular and General Genetics MGG, 1980]. Since the number of reads for each of these genes remains roughly consistent across all the time points (and they are already up-regulated at time 0), this looks like it could be an effect of the toxin antitoxin system that is not specific to competence.  It's unclear to me at the moment why there would be an effect on these genes only when both the toxin and antitoxin are deleted though.

Result of the BLAST searches on that problem file

Here's the command I used (taken from Scott's example):

 gunzip -cd 9C30_CGTACG_L004_R1_001.fastq.gz | head -32

Here are the first 8 reads:
  1. GCCCGATGCGGAAACCGGTGAAGGCCTGAAGAATCTCGATTACGCCTATATCGCCGCCGAGCTCGGCAAGAACCCGCTCGCCCCCCAGACGATGAACTGCT
  2. AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
  3. TCGGCCTTCAAAGTTCTCATTTGAATATTTGCTACTACCACCAAGATCTGCACCGACGGCCGCTCCGCCCGGGCTCGCGCCCCGGGGTTTGCAGCGGCCGC
  4. CCCACCTATCCTACACTCAGTTGCACAAAAGGATGTGAACAAGACCTGGTAATTACAGCATTATAAACACACACACACATGCTCGCAAACCTCAAAAATGA
  5. CTACTCTTGACATCCTAAGAAGAGCTCAGAGATGAGCTTGTGCCTTCGGGAACTTAGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGT
  6. CCTTTCAACAATTTCACGTACTTTTTCACTCTCTTTTCAAAGTTCTTTTCATCTTTCCATCACTGTACTTGTTCGCTATCGGTCTCTCGCCAATATTTAGC
  7. AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
  8. AATAATATGGTTTGGTTGTTTGTCCCCTTCAAATCTCATGTTGAAATATCATCCCGTGTTGGAGGTGGGGCCTGGTGGAAGGTGTTTGGATCATGGGGATA

OK, two are just strings of As, consistent with what Scott saw.  Let's see what the others are and how closely they match the best BLAST match: 
1. Rhodopseudomonas palustris HaA2, complete genome 100%
3. Lots of strong matches, probably due to the repeats
4. Homo sapiens chromosome 11, clone RP11-324K6, complete sequence 100%
5. Haemophilus influenzae strain Hi375, complete genome 100%
6. Saccharomyces cerevisiae strain NDL373 26S ribosomal RNA gene, partial sequence 100%
8. Human DNA sequence from clone RP13-346H10 on chromosome X, complete sequence 100%

Here are the next 8 reads:
  9. AAAATAGAAAGACCTGTCCCTGTAGAAAACCCCCTGAGCAAAGTAGTCTCTGGAAACCGTCCCTCACCCTCATGTGGAAGTGTGGGTGGGGGGATATTGGT
  10. CNCGTTGCCACCATGGTAGGCCACTATCCTACCATCGACAGTTGATAGGGCAGAAATTTGAATGAACCATCGCCGGCGTAAAGCCATGCGATTCGACAAGT
  11. CGCCCAGCTAATTTTTGTATTTTTAGTAGAGTCAGGGTTTCACCATGTTAGCCAGGCTGGTCTCGAACGCCTGACCTCAGGTGATCCACCTGCCTCAGGCT
  12. ACCCTTCCCCCCTTACCTCAATGGCTGCCTCCCTCACACCCTTCAGCACTCCTCATATGTGTCATCGGCTTCTCCATGAGGCCTCCCTGGGCCCCCCCCCC
  13. CGGGTCACTATGACCTACTTTCGTACCTGCTCGACTTGTCTGTCTCGCAGTTAAGCTTGCTTATACCATTGCACTAACCTCACGATGTCCGACCGTGATTA
  14. TTTCCAGCAGCATGTACGGGGTGCCGGTGCTCAGCTGGGTCTTCAGCACAGCCCAGTGGTCCTGCAGGGTGATGCCGGCCACCACTTCGCAGCCCACGGCC
  15. CCAAGCCATACCTGATCCTTGATCTAAGGAAAATGGGAGCTGATAAGTGTGTGCTGTTTTCAGCTGCCAAGTGTGGGACCCTTTGTGATGCTGGAGTCAGT
  16. TGGGACGTAATCAACGCAAGCTGATGACTTGCGCTTACTAGGGATTCCTCGTTGAAGAGCAATAATTGCAATGCTCTATCCCCAGCACGACAGAGTTTAAC

And their BLAST results:
9. Human chromosome 14 DNA sequence 100%
12. Homo sapiens 12p11-37.2-54.4 BAC RP11-1110J8 100%
15. Homo sapiens 12 BAC RP13-714J12  100%

Conclusion:  The problem isn't that the original RNA sample was contaminated with another species of bacteria, or that the rRNA removal step didn't work.  But it might have been that there was so little mRNA that contamination made a much larger contribution than normal.

I guess this is good news.  There's no reason to discard this sample, we'll just have fewer reads to work with than in the other samples.  

I need help sorting out the files

I was going to write a post about what I just learned from Scott about file quality (this would duplicate the info in his post but make sure I understood it), and then pull out a few reads from the problem sample he noticed and BLAST them to see if they align to anything.

But I've discovered that first I need to get the various files sorted out and properly named.  I have the fastq files from the sequencing centre on a little USB backup drive plugged into my laptop, but these files have the original confusing names, not the new informative names.

I also have a nice table made by Josh that shows both names for each sample, so in principle I could just carefully rename my files.

But issues:
  1. There are 72 files and that's a lot of opportunities for error.  If Scott has the renamed files on the Zoology cluster, he could put a copy of them into my directory there, and then I could download them onto my backup drive.  That would be inefficient but safe.
  2. Should I really be working with files on the Zoology cluster, rather than local files?
  3. I want to get the 24 RNAseq files from last year too.
While waiting for advice I think I'll go ahead and use Josh's table to identify the original name of the problem sample (new name antx_M2_C, old name Sample_9C30), and pull out the reads and do those BLAST searches.
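Here's roughly how the read extraction can be done at the command line (a sketch; the blastn step assumes BLAST+ is installed, and the FASTA can just as easily be pasted into the NCBI web interface):

# Pull the first 8 reads from the problem file and convert them to FASTA
gunzip -cd 9C30_CGTACG_L004_R1_001.fastq.gz | head -32 |
    awk 'NR % 4 == 1 {print ">" substr($0, 2)} NR % 4 == 2 {print}' > problem_reads.fa

# Query NCBI's nt database over the network (slow, but needs no local database)
blastn -query problem_reads.fa -db nt -remote > problem_reads_blast.txt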

Basic Analysis 2 - FASTQ and Base Quality



This is the second post in a series of posts that will detail the initial steps in analyzing the data. This pipeline is standard for next-generation sequencing data. 

The output from an Illumina platform is the FASTQ file. To understand this file, we’ll jump right into a real example from our data:

QD_0_ACTTGA_L005_R1_001.fastq.gz

First thing to note is the name given by the Illumina platform. The file name is composed of 6 fields separated by underscores:

          - QD_0 = sample name (defined by our lab, in this case Δhfq in MIV at t=0)
          - ACTTGA = barcode sequence ligated to each DNA fragment (unique to each sample)
          - L005 = sequencing lane on the flow cell
          - R1 = read number (R1 for the first read of each pair, R2 for the second)
          - 001 = set number

Each file can only hold a certain amount of sequencing data, and anything exceeding that size is split into multiple files denoted by the set number. This means that files that differ only by set number are part of the same sample and must be combined before analysis.
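Because gzip files can be concatenated directly, combining the parts is a one-liner (the _002 file here is hypothetical, just to illustrate):

# Concatenate all set-number parts of one sample/read into a single gzipped file
cat QD_0_ACTTGA_L005_R1_001.fastq.gz QD_0_ACTTGA_L005_R1_002.fastq.gz > QD_0_ACTTGA_L005_R1.fastq.gz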

FASTQ files have an extension of either .fastq or .fq but as you may have noticed, the file name I gave above is technically not a FASTQ file. The .gz extension denotes that this file is compressed by the Gzip algorithm to minimize the space each file takes up.

To unzip the file, simply navigate to the file in a terminal window and use the following command:

gunzip <filename>.gz

To gzip/compress a file, use this command:

gzip <filename>

It is not recommended to unzip a FASTQ file unless necessary (most bioinformatics software can work directly on the zipped file), to keep file sizes as small as possible. However, certain actions, such as viewing the contents, require decompressing at least part of it. To view the first 12 lines of the FASTQ file without unzipping the whole file, do this:

gunzip -cd <filename>.gz | head -12

For the file I mentioned above, the output looks like this:

@HWI-ST765:225:C5K6GACXX:5:1101:1500:2093 1:N:0:ACTTGA
GCATTAGGTGCGTTAGAAGCAACAAAAGCACACGGTAAAAAATTACCTATCTTTGGTGTGGATGCGTTACCAGAAGCATTACAATTGATTAGTAAAGGCGA
+
CCCFFFFFFHHGHJIJJJJJJGIIJJJJJJJJJJJGIJJJIJJIIJJJJIJIIJJJGHEHEHFFBCCDDDDDDDDCCDDDDDDDDDDDEDECCDEDCC<@>
@HWI-ST765:225:C5K6GACXX:5:1101:1365:2094 1:N:0:ACTTGA
TTTCAAAGTCATAATTGATTAATTTGAGTTTATTTGGATCAAATGCAGTTGGCGATTGTTCTGCGGTCGTGTTAGTTAAGATTTGTAAAGAATCCCCTTGT
+
CCCFFFFFGHHHHJJIJIJIJHIJIIIGHIJIJJJJIIJFHIFJJJIJGGIIJIIIHGGHIIEH/B@BHHFBD@CCAC;>CECCCACEDC>ADD@CDDDDA
@HWI-ST765:225:C5K6GACXX:5:1101:1435:2137 1:N:0:ACTTGA
AGCCCTTAACCGTATTTATACGACCTAGTGGGACAAGTAAACGAGAGGAAAGTCCGAGCTACACAGGGCAGAGTGCCGGATAACGTCCGGGCGGCGTGAGC
+
CCCFFFFFHHHHDGIJJJJIJJJJJJJJIJIIJJJJIFGIJJJJJJJJJIJJHHIIJHHFFFFFEDDDDDDDDCDACBDBDDDDDDDDDDDBDBBD<@BD9

This is what the guts of a FASTQ file look like. Each read is contained in 4 lines of the file. Since this output is 12 lines, it describes 3 reads.
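Since every read takes exactly 4 lines, the number of reads in a file is just its line count divided by 4:

# Count the reads in a gzipped FASTQ file
echo $(( $(gunzip -cd QD_0_ACTTGA_L005_R1_001.fastq.gz | wc -l) / 4 ))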

@HWI-ST765:225:C5K6GACXX:5:1101:1500:2093 1:N:0:ACTTGA

The first line always begins with @ and must be a unique identifier for that read. The information here is given by the Illumina platform and records the instrument, run, flow cell, lane, tile, and the exact X and Y coordinates of the cluster on the flow cell where this read was sequenced. For more detailed information, go here.

GCATTAGGTGCGTTAGAAGCAACAAAAGCACACGGTAAAAAATTACCTATCTTTGGTGTGGATGCGTTACCAGAAGCATTACAATTGATTAGTAAAGGCGA

The second line contains the raw sequence for the read.

+

The third line is a “+” followed by an optional description (nothing, in this case).

CCCFFFFFFHHGHJIJJJJJJGIIJJJJJJJJJJJGIJJJIJJIIJJJJIJIIJJJGHEHEHFFBCCDDDDDDDDCCDDDDDDDDDDDEDECCDEDCC<@>

The final line is the quality score for each base. Each score is ASCII-encoded (to save space), and the higher the score, the higher the quality. To calculate a score, take the ASCII value of the character and subtract 33.

Ex. for the character C (ASCII value 67), the quality score is 34 (67 – 33 = 34).
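The same arithmetic can be done right in the terminal (printf with a leading quote prints a character's ASCII value):

# ASCII value of the character minus 33 gives the Phred quality score
char=C
echo "Phred score for $char: $(( $(printf '%d' "'$char") - 33 ))"   # prints 34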

But what does this score mean? The score is actually a Phred quality score where:

Q = –10 × log10(P)

Q = quality score 
P = probability base was incorrectly called

A Phred score of 34, therefore, means that a base has a 99.96% chance of being correct, whereas a base with a score of 10 has a 90% chance of being correct. We can use this scoring system to evaluate the quality of our sequencing results. An efficient way to do this is with the tool FastQC. To run it, install the software and simply use the following command: 

fastqc <filename>

The program computes and graphs a bunch of statistics on the quality of the data.





This image shows the quality distribution at each base position in a read. As is typical of Illumina sequencing data, the quality drops towards the end of the read due to phasing and pre-phasing (see the first post).

 
Overall, FastQC shows that most bases have a quality score greater than 30 (a 0.1% chance of error per base). This suggests that the sequencing data is of sufficiently high quality. For more information on the output from FastQC, please see this PDF.

I will upload the FastQC results and write a post on it eventually which will be linked here.

The next post in this series will attempt to detail how to map these reads to a reference genome.