Computational identification of protein binding sites on RNAs using high-throughput RNA structure-probing data

Xihao Hu, Thomas K. F. Wong, Zhi John Lu, Ting Fung Chan, Terrence Chi Kong Lau, Siu Ming Yiu and Kevin Y. Yip

Abstract

Motivation

High-throughput sequencing has been used to probe RNA structures, by treating RNAs with reagents that preferentially cleave or mark certain nucleotides according to their local structures, followed by sequencing of the resulting fragments. The data produced contain valuable information for studying various RNA properties.

Results

We developed methods for statistically modeling these structure-probing data and extracting structural features from them. We show that the extracted features can be used to predict RNA "zipcodes" in yeast, regions bound by the She complex in asymmetric localization. The prediction accuracy was higher than that obtained using raw RNA probing data or sequence features. We further demonstrate the use of the extracted features in identifying binding sites of RNA binding proteins from whole-transcriptome gPAR-CLIP data.

Availability

The source code of our implemented methods is available below.

Errata

There is a typo in Table 2, where the starting location of the WSC2C zipcode is incorrect. As the correct value was used in the source code, the typo has no effect on our findings.

Table 2. List of zipcodes used in our prediction task

Zipcode	Gene	Location in gene	Length	Source
E1min	Ash1	635-683	49	(Jambhekar et al., 2005)
E2A	Ash1	1109-1185	77	(Olivier et al., 2005)
E2Bmin	Ash1	1279-1314	36	(Jambhekar et al., 2005)
Umin	Ash1	1766-1819	54	(Jambhekar et al., 2005)
EAR1-1	Ear1	1572-1621	50	(Olivier et al., 2005)
ERG2N	Erg2	180-250	71	(Jambhekar et al., 2005)
SRL1C	Srl1	419-596	178	(Jambhekar et al., 2005)
TPO1N	Tpo1	2-178	177	(Jambhekar et al., 2005)
WSC2C	Wsc2	~~1354-1384~~ 1313-1384	~~31~~ 72	(Jambhekar et al., 2005)
WSC2N	Wsc2	418-471	54	(Jambhekar et al., 2005)
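
As a quick sanity check on the table, the Length column can be recomputed from the Location column, assuming each location is an inclusive start-end range (an assumption that is consistent with every row above). The helper name `span_length` is ours and is used for illustration only; it is not part of the released source code.

```python
def span_length(location: str) -> int:
    """Length of an inclusive 'start-end' range, e.g. '635-683' -> 49."""
    start, end = (int(x) for x in location.split("-"))
    return end - start + 1


# (location, length) pairs copied from Table 2, using the corrected WSC2C entry.
TABLE2 = {
    "E1min": ("635-683", 49),
    "E2A": ("1109-1185", 77),
    "E2Bmin": ("1279-1314", 36),
    "Umin": ("1766-1819", 54),
    "EAR1-1": ("1572-1621", 50),
    "ERG2N": ("180-250", 71),
    "SRL1C": ("419-596", 178),
    "TPO1N": ("2-178", 177),
    "WSC2C": ("1313-1384", 72),
    "WSC2N": ("418-471", 54),
}

# Names of rows whose stated length disagrees with their location range.
mismatches = [
    name for name, (loc, length) in TABLE2.items() if span_length(loc) != length
]
print(mismatches)  # prints []
```

The same function gives 31 for the superseded range 1354-1384, which is why both the location and the length change in the corrected row.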

Materials

Source code and required data sets to reproduce the work:

Source code v20131213 (307K)
Compressed PARS data (14.2M)

Download both files and run the following commands from your shell:

tar zxvf ProbRNA*.tar.gz
unrar e PARS_DATA_SET.rar
mv PARS_DATA_SET public/work/
cd public
sh Produce.sh all &

The script runs in the background and should finish within a day.

Please refer to ReadMe.txt in the archive for further details.

You can also download the output of our statistical model: Processed PARS data (17.2M)

The columns 'v1' and 's1' are the raw read counts from the V1 and S1 enzymes, respectively. The columns 'pbv' and 'pbs' are the probabilities of the hidden status inferred from the V1 and S1 read counts, respectively.
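
A minimal sketch of how these four columns might be read, assuming the processed file is tab-delimited text with a header row containing the column names above. That layout is an assumption, so check the downloaded file before relying on it. The sample rows, the `load_rows` helper, and the 0.5 cutoff are illustrative placeholders, not values or code from the data set.

```python
import csv
import io

# Hypothetical sample in the assumed layout; the numbers are placeholders.
SAMPLE = "v1\ts1\tpbv\tpbs\n12\t0\t0.91\t0.03\n1\t7\t0.10\t0.85\n"


def load_rows(handle):
    """Parse rows into dicts with integer read counts and float probabilities."""
    rows = []
    for rec in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "v1": int(rec["v1"]),      # raw V1 read count
            "s1": int(rec["s1"]),      # raw S1 read count
            "pbv": float(rec["pbv"]),  # probability derived from the V1 counts
            "pbs": float(rec["pbs"]),  # probability derived from the S1 counts
        })
    return rows


rows = load_rows(io.StringIO(SAMPLE))

# Row indices whose V1-derived probability exceeds an arbitrary cutoff.
high_v1 = [i for i, r in enumerate(rows) if r["pbv"] > 0.5]
print(high_v1)
```

To read a real file, pass an open file handle to `load_rows` in place of the in-memory sample.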

Useful Links

(Last update: May 2015)