Xihao Hu, Thomas K. F. Wong, Zhi John Lu, Ting Fung Chan, Terrence Chi Kong Lau, Siu Ming Yiu and Kevin Y. Yip
High-throughput sequencing has been used to probe RNA structures, by treating RNAs with reagents that preferentially cleave or mark certain nucleotides according to their local structures, followed by sequencing of the resulting fragments. The data produced contain valuable information for studying various RNA properties.
We developed methods for statistically modeling these structure-probing data and extracting structural features from them. We show that the extracted features can be used to predict RNA "zipcodes" in yeast, the regions bound by the She complex during asymmetric localization. Prediction accuracy was higher than when using raw RNA probing data or sequence features. We further demonstrate the use of the extracted features in identifying the binding sites of RNA-binding proteins from whole-transcriptome gPAR-CLIP data.
The source code of our implemented methods is available below.
Zipcode | Gene | Location in gene | Length | Source |
E1min | Ash1 | 635-683 | 49 | (Jambhekar et al., 2005) |
E2A | Ash1 | 1109-1185 | 77 | (Olivier et al., 2005) |
E2Bmin | Ash1 | 1279-1314 | 36 | (Jambhekar et al., 2005) |
Umin | Ash1 | 1766-1819 | 54 | (Jambhekar et al., 2005) |
EAR1-1 | Ear1 | 1572-1621 | 50 | (Olivier et al., 2005) |
ERG2N | Erg2 | 180-250 | 71 | (Jambhekar et al., 2005) |
SRL1C | Srl1 | 419-596 | 178 | (Jambhekar et al., 2005) |
TPO1N | Tpo1 | 2-178 | 177 | (Jambhekar et al., 2005) |
WSC2C | Wsc2 | | | (Jambhekar et al., 2005) |
WSC2N | Wsc2 | 418-471 | 54 | (Jambhekar et al., 2005) |
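The "Location in gene" values appear to be 1-based, inclusive coordinates, since every listed length equals end - start + 1 (for E1min, 683 - 635 + 1 = 49). The following is a minimal Python sketch under that assumption. The helper names `parse_location`, `region_length` and `extract_region` are hypothetical and are not part of the ProbRNA code; the WSC2C row is omitted because its location is not given in the table.

```python
def parse_location(location):
    """Split a 'start-end' string into two integers (assumed 1-based, inclusive)."""
    start_text, end_text = location.split("-")
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"invalid location: {location!r}")
    return start, end


def region_length(location):
    """Number of nucleotides covered by an inclusive 'start-end' range."""
    start, end = parse_location(location)
    return end - start + 1


def extract_region(sequence, location):
    """Return the subsequence for an inclusive, 1-based 'start-end' range."""
    start, end = parse_location(location)
    if end > len(sequence):
        raise ValueError("location extends past the end of the sequence")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return sequence[start - 1:end]


# (location, length) pairs copied from the table above.
ZIPCODES = {
    "E1min": ("635-683", 49),
    "E2A": ("1109-1185", 77),
    "E2Bmin": ("1279-1314", 36),
    "Umin": ("1766-1819", 54),
    "EAR1-1": ("1572-1621", 50),
    "ERG2N": ("180-250", 71),
    "SRL1C": ("419-596", 178),
    "TPO1N": ("2-178", 177),
    "WSC2N": ("418-471", 54),
}

# Names of rows whose listed length disagrees with the assumed convention.
mismatches = [
    name
    for name, (location, length) in ZIPCODES.items()
    if region_length(location) != length
]
```

Given a gene sequence as a string, `extract_region(sequence, "635-683")` would return the 49-nucleotide stretch for E1min, provided the coordinates are relative to the same sequence start as the one used in the table.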
Source code and required data sets to reproduce the work:
tar zxvf ProbRNA*.tar.gz
unrar e PARS_DATA_SET.rar
mv PARS_DATA_SET public/work/
cd public
sh Produce.sh all &
The script runs in the background and should finish within a day.
Please refer to ReadMe.txt in the archive for further details.
The output of our statistical model is also available for download: Processed PARS data (17.2M).
The columns 'v1' and 's1' are the raw read counts from the V1 and S1 enzymes. The columns 'pbv' and 'pbs' are the probabilities of the hidden status inferred from the V1 and S1 read counts, respectively.
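As an illustration of how these four columns might be read, here is a minimal Python sketch. It assumes the processed file is tab-delimited text with a header row naming the columns 'v1', 's1', 'pbv' and 'pbs'; the actual layout may differ, so check ReadMe.txt before relying on it. The function name `load_pars_rows` is hypothetical, and the two data rows are made-up values used only to exercise the parser.

```python
import csv
import io


def load_pars_rows(handle, delimiter="\t"):
    """Parse rows with v1/s1 read counts and pbv/pbs probabilities.

    Assumes a header row containing the column names; any extra
    columns in the file are ignored.
    """
    rows = []
    for record in csv.DictReader(handle, delimiter=delimiter):
        row = {
            "v1": int(record["v1"]),      # raw V1 read count
            "s1": int(record["s1"]),      # raw S1 read count
            "pbv": float(record["pbv"]),  # probability derived from the V1 count
            "pbs": float(record["pbs"]),  # probability derived from the S1 count
        }
        for key in ("pbv", "pbs"):
            if not 0.0 <= row[key] <= 1.0:
                raise ValueError(f"{key} is not a probability: {row[key]}")
        rows.append(row)
    return rows


# Made-up example rows, not taken from the real data set.
TOY_TABLE = "v1\ts1\tpbv\tpbs\n12\t0\t0.91\t0.03\n0\t7\t0.05\t0.88\n"

rows = load_pars_rows(io.StringIO(TOY_TABLE))
```

For a real file, the same function could be called on an open file handle, for example `with open(path) as handle: rows = load_pars_rows(handle)`, where `path` points to the extracted data file.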
(Last update: May 2015)