This is the first of a series of posts that will detail the initial steps in analyzing the data. This pipeline is standard for next-generation sequencing data.
Before we can jump into any data, it is important to understand how the data is obtained. It is therefore useful to understand how to prepare the cDNA library and how the sequencing works.
Here are the steps needed to set up a sample to be sequenced:
I will expand on some of these steps that I feel may need explanation.
3. Synthesize cDNA by random hexamer priming
To convert RNA to cDNA, you must anneal a primer to the RNA strand and use the enzyme reverse transcriptase to synthesize a cDNA strand. In eukaryotes, the poly-A tail at the end of mRNA strands is exploited by the use of a poly-T primer to synthesize the cDNA. For species that don’t attach a poly-A tail to mRNA (such as H. influenzae), a strategy called random hexamer priming is used. A pool of random 6 nucleotide primers (4⨯4⨯4⨯4⨯4⨯4 = 4096 different sequences) is used to initiate cDNA synthesis in (theoretically) random spots along an RNA strand. Note that this priming is not truly random – for instance RNA secondary structure may make certain regions of the RNA more available for binding.
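The size of the primer pool can be checked with a short Python sketch. It simply enumerates every possible 6-nucleotide sequence over A, C, G and T; the function name `hexamer_pool` is made up for this illustration and says nothing about how primer pools are actually manufactured.

```python
from itertools import product

BASES = "ACGT"

def hexamer_pool(k=6):
    """Return every possible primer of length k over the four DNA bases."""
    return ["".join(p) for p in product(BASES, repeat=k)]

pool = hexamer_pool()
print(len(pool))   # 4096, i.e. 4**6
print(pool[:3])    # ['AAAAAA', 'AAAAAC', 'AAAAAG']
```

Every sequence appears exactly once in this list, which corresponds to the "theoretically random" case; the biases mentioned above mean that real priming does not use all 4096 sequences equally.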
4. cDNA shearing and size selection
This step is used to restrict the cDNA library to a certain size range. I’m not sure how the people who did our sequencing sheared the DNA, but typically it’s done by sonication and the desired DNA size is purified by gel extraction.
5. Addition of adapters
Next, segments of synthetic DNA are ligated to the ends of the cDNA library. For every cDNA fragment, you get something that looks like the following:
A) Two segments are added that allow the fragment to bind to the flow cell (a small glass slide with small oligonucleotides grafted to it; DNA sequencing happens on the flow cell)
B) A short index sequence which acts as a barcode for each fragment. In our sequencing job, each sample had a unique index sequence. This was done so that we could pool samples together for sequencing and still associate each fragment with one unique sample.
C) Two regions are added that are complementary to the Illumina sequencing primers used to initiate sequencing. Since we used paired-end sequencing, two sequences are needed, one at each end (more info below).
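To illustrate how the index sequence in B) lets pooled samples be separated again, here is a minimal Python sketch. The index sequences, sample names and reads are all invented for the example, and the reads are represented as simple (index, sequence) pairs rather than real sequencer output.

```python
from collections import defaultdict

# Hypothetical index sequences and sample names, for illustration only.
INDEX_TO_SAMPLE = {
    "ATCACG": "sample_1",
    "CGATGT": "sample_2",
}

def demultiplex(reads):
    """Group (index, sequence) pairs by sample.

    Reads whose index is not in INDEX_TO_SAMPLE are collected
    under the key 'undetermined'.
    """
    by_sample = defaultdict(list)
    for index, sequence in reads:
        sample = INDEX_TO_SAMPLE.get(index, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)

pooled = [
    ("ATCACG", "GGCTTAAC"),
    ("CGATGT", "TTAGGCAT"),
    ("ATCACG", "ACCGTTGA"),
    ("NNNNNN", "GATTACAG"),  # unreadable index
]
print(demultiplex(pooled))
```

This sketch only accepts exact index matches; it is meant to show the idea of the barcode, not how any particular demultiplexing software behaves.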
6. PCR amplification
This step increases the size of the cDNA library so that it is sufficient for sequencing. The problem with this step is that it can cause sequencing duplications, where amplified copies of the same fragment are sequenced multiple times. For more information, check out this post. In many sequencing projects, the duplicates are identified and thrown out. However, the general consensus for RNA-seq data is that the duplicates should not be thrown out without a good reason.
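As a rough illustration of what "identifying duplicates" means, the Python sketch below counts reads that have exactly the same sequence. This is a simplifying assumption made for the example: real pipelines typically decide what counts as a duplicate from where the reads align, not from the raw sequence alone. The function name and the toy reads are made up.

```python
from collections import Counter

def duplication_summary(reads):
    """Summarize how many reads are exact-sequence repeats of another read."""
    counts = Counter(reads)
    total = len(reads)
    unique = len(counts)
    duplicates = total - unique
    fraction = duplicates / total if total else 0.0
    return {
        "total": total,
        "unique": unique,
        "duplicates": duplicates,
        "duplicate_fraction": fraction,
    }

toy_reads = ["ACGTAC", "ACGTAC", "TTGACA", "ACGTAC", "GGCATT"]
print(duplication_summary(toy_reads))
```

Note that a count like this cannot tell a PCR duplicate apart from two fragments that are identical because they came from a highly expressed transcript, which is one reason to be careful about discarding duplicates in RNA-seq data.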
7. Illumina sequencing
There are plenty of resources online that explain how Illumina sequencing works. The general idea is that DNA synthesis is controlled such that an image is created after the addition of every single (fluorescent) nucleotide. By seeing how the fluorescence changes over time, a sequence can be derived.
I will focus on the elements that have consequences for our data.
First, as mentioned above, we have paired-end sequencing data. This means that we actually sequenced each cDNA fragment twice, once per strand. Since DNA polymerase can only synthesize in one direction, the sequencing is initiated at opposite ends of each strand and is directed toward the middle (see diagram in section 5). At this point, each one of these sequences is called a “read”. Because the two reads come from opposite strands, they are reverse complements of each other wherever they cover the same bases, and they may or may not overlap depending on the size of the fragment being sequenced and the length of each read.
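The relationship between fragment size, read length and overlap can be made concrete with a small Python sketch. It assumes an idealized fragment with no adapters and no sequencing errors; the function names and the example fragment are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def paired_end_reads(fragment, read_length):
    """Read 1 starts at one end of the fragment; read 2 starts at the
    other end, on the opposite strand, and runs toward the middle."""
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment)[:read_length]
    return read1, read2

def overlap_length(fragment_length, read_length):
    """Number of fragment positions covered by both reads (0 if none)."""
    covered = min(read_length, fragment_length)
    return max(0, 2 * covered - fragment_length)

fragment = "ACGTTGCAAG"                 # toy 10-base fragment
print(paired_end_reads(fragment, 4))    # ('ACGT', 'CTTG')
print(overlap_length(10, 4))            # 0: the reads do not meet
print(overlap_length(10, 7))            # 4: the reads share 4 positions
```

With short reads and a long fragment the middle of the fragment is never sequenced; once the two read lengths together exceed the fragment length, the reads start to cover the same bases.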
Another thing that should be mentioned is that the length that Illumina technology can sequence is limited. Because the platform sequences a cluster of identical fragments and relies on DNA synthesis, individual fragments sometimes fall out of sync with the rest of the cluster, either by failing to add a base or by adding more than one base in a cycle. This problem accumulates with the addition of every base, and eventually it becomes impossible to identify which base was added because the fragments in a cluster are no longer adding the same base. Falling behind is known as phasing, and running ahead is known as pre-phasing.
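How this loss of synchrony builds up can be sketched with a toy model in Python. The model assumes that, in every cycle, each fragment independently falls one base behind with probability `p_lag` or jumps one base ahead with probability `p_lead`. Both rates, and the function name, are invented for the illustration and are not measured values for any instrument.

```python
def in_phase_fraction(cycles, p_lag=0.005, p_lead=0.005):
    """Fraction of fragments in a cluster that are at the expected
    position after each cycle, under the toy model described above."""
    # Maps offset from the expected position -> fraction of fragments.
    distribution = {0: 1.0}
    p_ok = 1.0 - p_lag - p_lead
    fractions = []
    for _ in range(cycles):
        updated = {}
        for offset, share in distribution.items():
            for step, prob in ((-1, p_lag), (0, p_ok), (1, p_lead)):
                key = offset + step
                updated[key] = updated.get(key, 0.0) + share * prob
        distribution = updated
        fractions.append(distribution.get(0, 0.0))
    return fractions

fractions = in_phase_fraction(150)
for cycle in (1, 50, 100, 150):
    print(cycle, round(fractions[cycle - 1], 3))
```

Even with a small error rate per cycle, the share of fragments that still report the correct base shrinks steadily as the read gets longer, which is the behaviour described above.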
These are the essentials needed to understand the library construction and sequencing. The next post will be about the sequencing output and quality.