I’ve been trying to identify CRP-regulated genes using the
RNA-seq data (comparing crpx and KW20 and looking for differentially expressed
genes). Alongside this, I’ve been trying to identify CRP binding motifs in the
H. influenzae genome. To find CRP
motifs, I’ve put together an iterative process built on motif-finding
software:
1) A sequence motif is created by giving MEME a set of known/predicted 22 bp CRP motifs.
2) Using the motif created above, FIMO is used to find occurrences of the motif in a genome.
3) A custom R script takes every gene in the genome and
finds predicted CRP sites upstream of it (currently set to between 0 and
300 bp upstream). Predicted CRP sites that are found upstream of genes thought
to be CRP-regulated are used as input for step 1. The process then repeats for as
many iterations as desired.
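Step 3 can be sketched roughly as follows. This is a minimal Python illustration, not the R script itself: the names `Gene`, `Site`, `upstream_gap` and `next_seed_set` are hypothetical, the coordinates are toy values, and counting distance as the gap between the site and the gene start (so 0 means directly adjacent) is an assumption about how "0 to 300 bp upstream" is measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # leftmost coordinate (1-based)
    end: int     # rightmost coordinate (1-based, inclusive)
    strand: str  # "+" or "-"


@dataclass(frozen=True)
class Site:
    start: int
    end: int
    sequence: str


def upstream_gap(gene, site):
    """Number of bases between the site and the gene start.

    Negative if the site is not strictly upstream of the gene.
    """
    if gene.strand == "+":
        return gene.start - site.end - 1  # upstream is to the left
    return site.start - gene.end - 1      # upstream is to the right


def sites_upstream(gene, sites, max_dist=300):
    """Predicted sites lying 0..max_dist bp upstream of the gene."""
    return [s for s in sites if 0 <= upstream_gap(gene, s) <= max_dist]


def next_seed_set(genes, sites, regulated_names, max_dist=300):
    """Sequences to feed back into step 1 for the next iteration."""
    seeds = set()
    for gene in genes:
        if gene.name in regulated_names:
            seeds.update(s.sequence for s in sites_upstream(gene, sites, max_dist))
    return sorted(seeds)


# Toy example: two genes and three predicted sites (made-up coordinates).
genes = [Gene("geneA", 1000, 2000, "+"), Gene("geneB", 3000, 4000, "-")]
sites = [
    Site(900, 921, "TTGTGTGATCTGCATCACGCAT"),    # 78 bp upstream of geneA
    Site(4100, 4121, "TATCGTGACCTGGATCACTGTT"),  # 99 bp upstream of geneB
    Site(500, 521, "ATCTGCATCGGAATTTGCAGGC"),    # 478 bp upstream of geneA, too far
]
print(next_seed_set(genes, sites, {"geneA", "geneB"}))
```

The strand check matters: for a gene on the minus strand, "upstream" lies at higher coordinates, so a single left-of-start test would miss those sites.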
To test whether this works, I tried it with CRP sites in E. coli, since E. coli is much better studied.
RegulonDB lists >200 candidate CRP binding sites. I chose
8 at random:
TTGTGTGATCTGCATCACGCAT
TATCGTGACCTGGATCACTGTT
ATCTGCATCGGAATTTGCAGGC
GACCTCGGTTTAGTTCACAGAA
AATCTTTATCTTTGTAGCACTT
AATTTTACTTTTGGTTACATAT
TTATATATGTCAAGTTGTTAAA
TTACCCGCTTTAAAACACGCTA
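For reference, here is one way the eight seeds above could be handed to MEME and the resulting motif to FIMO. It is only a sketch: the file names, the helper names (`to_fasta`, `meme_command`, `fimo_command`) and the MEME settings (`-mod oops`, a single motif of width 22) are assumptions, since the post does not say which options were actually used. The commands are built and printed, not executed.

```python
import shlex

SEEDS = [
    "TTGTGTGATCTGCATCACGCAT",
    "TATCGTGACCTGGATCACTGTT",
    "ATCTGCATCGGAATTTGCAGGC",
    "GACCTCGGTTTAGTTCACAGAA",
    "AATCTTTATCTTTGTAGCACTT",
    "AATTTTACTTTTGGTTACATAT",
    "TTATATATGTCAAGTTGTTAAA",
    "TTACCCGCTTTAAAACACGCTA",
]


def to_fasta(seqs, prefix="seed"):
    """Format sequences as FASTA text with numbered headers."""
    return "".join(f">{prefix}_{i}\n{seq}\n" for i, seq in enumerate(seqs, start=1))


def meme_command(fasta_path, out_dir, width=22):
    """Step 1: build one DNA motif of fixed width from the seed sites."""
    return ["meme", fasta_path, "-dna", "-oc", out_dir,
            "-mod", "oops", "-nmotifs", "1", "-w", str(width)]


def fimo_command(motif_file, genome_fasta, out_dir):
    """Step 2: scan a genome for occurrences of the motif."""
    return ["fimo", "--oc", out_dir, motif_file, genome_fasta]


print(to_fasta(SEEDS[:2]), end="")
print(shlex.join(meme_command("seeds.fasta", "meme_out")))
print(shlex.join(fimo_command("meme_out/meme.txt", "genome.fasta", "fimo_out")))
```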
I ran these 8 sequences through the pipeline I created. Here
are the results (note: the scale on the histograms is not constant):
ITERATION 1:
34/139 genes with known CRP start sites correctly predicted
24/181 CRP sites predicted correctly
286 other genes predicted to have CRP sites (7.1% of genome)
– treat as false positives
ITERATION 2:
62/139 genes with known CRP start sites correctly predicted
46/181 CRP sites predicted correctly
431 other genes predicted to have CRP sites (10.8% of
genome) – treat as false positives
ITERATION 3:
78/139 genes with known CRP start sites correctly predicted
63/181 CRP sites predicted correctly
530 other genes predicted to have CRP sites (13.2% of
genome) – treat as false positives
ITERATION 4:
86/139 genes with known CRP start sites correctly predicted
69/181 CRP sites predicted correctly
552 other genes predicted to have CRP sites (13.8% of
genome) – treat as false positives
From 8 sequences alone, I am able to predict more than one third of
known E. coli CRP sites (69/181 by iteration 4), albeit with a pretty high false
positive rate. After each iteration, the motif logo looks more and more like the CRP
binding motif (in particular after the first iteration). It also seems that
predicted CRP sites cluster more tightly between 0 and 300 bp upstream of the
gene after each iteration.
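To put numbers on "pretty high false positive rate", the counts from the 8-seed run can be turned into fractions. This is a small sketch using only the figures reported above; the second block of results is taken to be iteration 2, and "precision" here is computed at the gene level by treating every other predicted gene as a false positive, as the results themselves suggest.

```python
KNOWN_GENES = 139  # genes with known CRP sites
KNOWN_SITES = 181  # known CRP sites

# (iteration, known genes recovered, known sites recovered, other genes predicted)
RESULTS = [
    (1, 34, 24, 286),
    (2, 62, 46, 431),
    (3, 78, 63, 530),
    (4, 86, 69, 552),
]


def summarize(genes_hit, sites_hit, other_genes):
    """Return (gene recall, site recall, gene-level precision) as fractions."""
    gene_recall = genes_hit / KNOWN_GENES
    site_recall = sites_hit / KNOWN_SITES
    gene_precision = genes_hit / (genes_hit + other_genes)
    return gene_recall, site_recall, gene_precision


for iteration, genes_hit, sites_hit, other in RESULTS:
    g, s, p = summarize(genes_hit, sites_hit, other)
    print(f"iteration {iteration}: gene recall {g:.1%}, "
          f"site recall {s:.1%}, gene precision {p:.1%}")
```

Read this way, recall climbs with each iteration while gene-level precision stays low, because the number of other predicted genes grows alongside the number of correct ones.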
Starting off with the following 30 sequences gives better
results:
TTGTGTGATCTGCATCACGCAT
TAAAGTGATGGTAGTCACATAA
AAACTGAGACTAGTACGACTTT
AGTGGGATTAATTTCCACATTA
TAATTTCCACATTAAAACAGGG
GGCTGTGCTGCGCATAATACTT
TATCGTGACCTGGATCACTGTT
AATTTTGCGCTAAAGCACATTT
ATCTGCATCGGAATTTGCAGGC
AATAGTGACCTCGCGCAAAATG
AAACGTGATTTAACGCCTGATT
TCTCGTGATCAAGATCACATTC
ATTTGTGATGAAGATCACGTCA
TAACGTGATGTGCCTTGTAATT
AAACGTGATCAACCCCTCAATT
TTTTGCAAGCAACATCACGAAA
GACCTCGGTTTAGTTCACAGAA
TTATGTGACAGATAAAACGTTT
TAATGTGATTTATGCCTCACTA
ATTTGAGAGTTGAATCTCAAAT
TTATGTGATTGATATCACACAA
TCTTATGACGCTCTTCACACTC
CTTTGTGATGTGCTTCCTGTTA
AACAGTTATTTTTAACAAATTT
AAATGTATGACAGATCACTATT
AAATGTTATCCACATCACAATT
AATCTTTATCTTTGTAGCACTT
AATTTTACTTTTGGTTACATAT
TTATATATGTCAAGTTGTTAAA
AGTTGTTAAAATGTGCACAGTT
AAATGTGCACAGTTTCATGATT
TTTCATGATTTCAATCAAAACC
TTACCCGCTTTAAAACACGCTA
ITERATION 1:
87/139 genes with known CRP start sites correctly predicted
85/181 CRP sites predicted correctly
488 other genes predicted to have CRP sites (12.2% of
genome) – treat as false positives
ITERATION 6:
98/139 genes with known CRP start sites correctly predicted
92/181 CRP sites predicted correctly
532 other genes predicted to have CRP sites (13.3% of
genome) – treat as false positives