Computational identification of protein binding sites on RNAs using high-throughput RNA structure-probing data

Xihao Hu, Thomas K. F. Wong, Zhi John Lu, Ting Fung Chan, Terrence Chi Kong Lau, Siu Ming Yiu and Kevin Y. Yip

Abstract

Motivation

High-throughput sequencing has been used to probe RNA structures, by treating RNAs with reagents that preferentially cleave or mark certain nucleotides according to their local structures, followed by sequencing of the resulting fragments. The data produced contain valuable information for studying various RNA properties.

Results

We developed methods for statistically modeling these structure-probing data and extracting structural features from them. We show that the extracted features can be used to predict RNA "zipcodes" in yeast, which are regions bound by the She complex during asymmetric localization. Predictions based on the extracted features were more accurate than those based on raw RNA probing data or sequence features. We further demonstrate the use of the extracted features in identifying binding sites of RNA binding proteins from whole-transcriptome gPAR-CLIP data.

Availability

The source code of our implemented methods is available below.

Errata

There is a typo in Table 2: the starting location of the WSC2C zipcode is incorrect. Because the correct value was used in the source code, the typo has no effect on our findings.

Table 2. List of zipcodes used in our prediction task

Zipcode  Gene  Location in gene  Length  Source
E1min    Ash1  635-683           49      (Jambhekar et al., 2005)
E2A      Ash1  1109-1185         77      (Olivier et al., 2005)
E2Bmin   Ash1  1279-1314         36      (Jambhekar et al., 2005)
Umin     Ash1  1766-1819         54      (Jambhekar et al., 2005)
EAR1-1   Ear1  1572-1621         50      (Olivier et al., 2005)
ERG2N    Erg2  180-250           71      (Jambhekar et al., 2005)
SRL1C    Srl1  419-596           178     (Jambhekar et al., 2005)
TPO1N    Tpo1  2-178             177     (Jambhekar et al., 2005)
WSC2C    Wsc2  1313-1384         72      (Jambhekar et al., 2005)
WSC2N    Wsc2  418-471           54      (Jambhekar et al., 2005)

Note: the WSC2C row is shown with the corrected values; the superseded entry was location 1354-1384 with length 31.
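
The Length column follows from the inclusive gene coordinates as end - start + 1. The short shell sketch below checks this for one ordinary row and for both coordinate ranges listed for WSC2C. The helper name zip_len is hypothetical and is not part of the ProbRNA archive.

```shell
# zip_len START END: number of nucleotides in an inclusive coordinate range.
# (Hypothetical helper for illustration; not part of the ProbRNA archive.)
zip_len() {
    echo $(( $2 - $1 + 1 ))
}

zip_len 635 683     # E1min: prints 49
zip_len 1313 1384   # WSC2C, longer range: prints 72
zip_len 1354 1384   # WSC2C, shorter range: prints 31
```

Both WSC2C ranges end at position 1384 and differ only in their start, which matches the erratum's statement that the starting location was the affected value.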

Materials

Source code and required data sets to reproduce the work:

Download both files and run the following commands in your shell:

tar zxvf ProbRNA*.tar.gz
unrar e PARS_DATA_SET.rar
mv PARS_DATA_SET public/work/
cd public
sh Produce.sh all &

The script runs in the background and should finish within a day.

Please refer to ReadMe.txt in the archive for further details.

You can also download the output of our statistical model: Processed PARS data (17.2M).

The columns 'v1' and 's1' are the raw read counts from the V1 and S1 enzymes. The columns 'pbv' and 'pbs' are the hidden-state probabilities inferred from the V1 and S1 read counts, respectively.
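
As a rough illustration of working with these columns, the sketch below averages one named column of a table. It assumes a whitespace-delimited text file whose first row is a header containing the column names. The actual layout of the processed file may differ, so consult ReadMe.txt before relying on it. The function name col_mean is hypothetical.

```shell
# col_mean FILE COLUMN: print the mean of a named column.
# Assumption: FILE is whitespace-delimited and its first row is a header;
# the real processed PARS file may be laid out differently.
col_mean() {
    awk -v col="$2" '
        NR == 1 {
            # Find the index of the requested column in the header row.
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) { print "no such column: " col > "/dev/stderr"; exit 1 }
            next
        }
        { sum += $c; n++ }    # accumulate values from the data rows
        END { if (c && n) printf "%.4f\n", sum / n }
    ' "$1"
}
```

Under these assumptions, calling col_mean with the path of the downloaded file and the column name pbv would print the average of that column to four decimal places.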

(Last update: May 2015)