p-TAREF (Project funded by DBT, Govt. of India.)
p-TAREF stands for plant TArget REFiner. p-TAREF refines the process of microRNA target
identification by incorporating local interaction information for the microRNA and the target region of the target sequence.
It also estimates how likely a candidate is to be a true microRNA target on the basis of an intrinsic sequence property around the
target site, captured by varying dinucleotide density profiles. Considering dinucleotide density also incorporates the
nearest-neighbor approach of RNA structure determination.
p-TAREF therefore works with two major assumptions:
- Interaction information between microRNA and target is more important than sequence information, and such interactions are highly specific to a given microRNA. Experimentally validated interactions can therefore serve as a preliminary filter that retains only the similar interactions among the predicted ones.
- Varying dinucleotide density profiles in the flanking regions around the target region behave as a dynamic and reasonable signature for determining target sites. These parameters broadly cover the intrinsic properties of the nearby region, and the sequential arrangement of the varying profile around the target region provides the discrimination.
p-TAREF takes the output of RNAhybrid as its template.
p-TAREF runs in two parts. In the first, it uses the library of encoded interaction patterns
generated from experimentally reported interactions in plants. These encoded patterns were derived after careful
alignment of the interaction partners, using a slightly extended target region to obtain a more accurate alignment. The parameters used in this step are retained and reused when encoded patterns are derived from the interactions predicted by RNAhybrid.
Once the encoded patterns have been generated from the RNAhybrid output, the library of experimental
encoded patterns is used to scan the predicted interactions for similar ones. Many candidates are removed at this
level alone. The selected candidates are then subjected to the support vector scan step. The support vector
regression (SVR) machine uses the intrinsic sequence property of the flanking regions around the target region,
in the form of varying dinucleotide density profiles at various intervals. The discrimination capacity of the
SVR models peaked when windows of 20 bases were considered. Based on the classification, the SVR machine assigns a score
to every target found. A positive score indicates a potential target, whereas a negative score indicates a non-target.
The methodology on which p-TAREF is based attained an accuracy of ~95% with high sensitivity and specificity, and the Area Under Curve in ROC analysis was above 0.9. Details are given in the performance tab.
Using the Java Concurrent Library (JCL), p-TAREF can run on multiple
processors, reducing total execution time. p-TAREF creates as many instances of
RNAhybrid as there are processors. The output file from RNAhybrid is the input for the following steps, which also run concurrently, like RNAhybrid.
Work flow of p-TAREF on four processors
All the time-consuming steps (RNAhybrid, alignment, and matching of predicted patterns against
experimentally validated patterns with a specified cut-off), together with the in-house code for distribution, alignment, parsing etc., run concurrently on
multiple processors. The final classification and scoring of targets is done by the associated SVR.
Support Vector Regression
SVR (Support Vector Regression), or Support Vector Machine for regression, is a machine learning technique that solves regression problems with an SVM. An SVM classifies data into positive and negative classes based on a model; SVR classifies the data in the same way and additionally gives each instance a score reflecting the confidence of the classification. p-TAREF uses SVMTorch for SVM classification and for the regression score.
Data set generation:
For our model, we downloaded 104 sequences from TAIR release 10 for Arabidopsis thaliana. Instances for the negative dataset were taken from TAIR as well as
Ensembl Release 62 (Supplementry1.doc), in order to create our training data set. For the testing set we downloaded 226 reported targets from the Arabidopsis Small RNA
Project (ASRP). To make the testing set non-redundant, we removed the sequences already present in the training data set. Of the 226 sequences, 125 remained for testing.
SVMTorch was used to create a plant-specific model with various kernels for classification and scoring. The model classifies data into true targets (true positives) and
non-targets (true negatives) and assigns a regression score that indicates the propensity of the target:miRNA pair to be a genuine one.
The optional step:
p-TAREF uses RNAhybrid output as the template for target prediction. This output is parsed and aligned with the reverse complements of the miRNAs. After alignment
refinement, the output is used to generate single-sequence encoded patterns; these patterns are then scanned against experimentally validated patterns with
a given number of allowed mismatches. G:U wobble pairing is taken into account and implemented in the scoring matrix used by the backend for alignment.
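The pattern scan can be pictured with a minimal sketch. The actual encoding used by p-TAREF is not described here, so the three-symbol alphabet below (M for a Watson-Crick pair, W for a G:U wobble, X for a mismatch) and the function names encode_duplex and matches_library are assumptions made purely for illustration, as is the restriction to gap-free duplexes of equal length.

```python
# Hypothetical encoding, not p-TAREF's own: one symbol per paired position.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def encode_duplex(mirna, target_site):
    """Encode a gap-free duplex; position i of one strand pairs with position i of the other."""
    # Treat DNA input (T) as RNA (U) so cDNA targets can be compared directly.
    a = mirna.upper().replace("T", "U")
    b = target_site.upper().replace("T", "U")
    if len(a) != len(b):
        raise ValueError("paired strands must have equal length")
    symbols = []
    for pair in zip(a, b):
        if pair in WATSON_CRICK:
            symbols.append("M")  # canonical base pair
        elif pair in WOBBLE:
            symbols.append("W")  # G:U wobble, kept distinct from a mismatch
        else:
            symbols.append("X")  # mismatch
    return "".join(symbols)


def matches_library(pattern, library, max_mismatch):
    """True if any library pattern of the same length differs at no more than max_mismatch positions."""
    for reference in library:
        if len(reference) != len(pattern):
            continue
        differences = sum(x != y for x, y in zip(pattern, reference))
        if differences <= max_mismatch:
            return True
    return False
```

In this sketch a predicted interaction is kept only when matches_library returns True for the chosen mismatch cut-off; the wobble symbol lets a G:U pair count as closer to a match than a true mismatch if the comparison is later weighted.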
The critical step: flanking region analysis
The targets which match experimentally validated patterns are mapped back onto the input sequences. For each potential target position, the sequence is cut into a 75 bp upstream
and a 75 bp downstream fragment. These fragments are scanned with windows of 20 nucleotides in which consecutive windows overlap only by their last nucleotide, to
estimate dinucleotide density variation profiles with respect to distance from the target site. The final set of distance-specific dinucleotide density profiles
represents the sequence with respect to the target position. The SVR classification step follows, which also assigns a regression score to each classified instance.
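As a rough illustration of the flanking region scan, the sketch below cuts 75-base flanks, slides 20-nucleotide windows that share one nucleotide with their neighbour, and computes a dinucleotide density per window. The density definition (count divided by the number of dinucleotide positions in the window), the feature ordering, and the names windows, dinucleotide_density and flank_profile are assumptions; the text does not specify how p-TAREF computes or orders these values.

```python
from itertools import product

# The 16 dinucleotides in a fixed order: AA, AC, AG, AT, CA, ...
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def windows(flank, size=20, overlap=1):
    """Full-length windows over a flank; consecutive windows share `overlap` nucleotides."""
    step = size - overlap
    return [flank[i:i + size] for i in range(0, len(flank) - size + 1, step)]


def dinucleotide_density(window):
    """Assumed definition: occurrences of each dinucleotide / dinucleotide positions in the window."""
    positions = len(window) - 1
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(positions):
        dinucleotide = window[i:i + 2]
        if dinucleotide in counts:  # skip ambiguous bases such as N
            counts[dinucleotide] += 1
    return [counts[d] / positions for d in DINUCLEOTIDES]


def flank_profile(sequence, site_start, site_end, flank=75):
    """Concatenate window densities for the upstream flank, then the downstream flank.

    site_start and site_end are 0-based, end-exclusive coordinates of the target site.
    Windows are listed in sequence order, so upstream windows run towards the site
    and downstream windows run away from it.
    """
    seq = sequence.upper().replace("U", "T")
    upstream = seq[max(0, site_start - flank):site_start]
    downstream = seq[site_end:site_end + flank]
    profile = []
    for region in (upstream, downstream):
        for window in windows(region):
            profile.extend(dinucleotide_density(window))
    return profile
```

With these settings a full 75-base flank yields three complete windows (starting at offsets 0, 19 and 38), so one target site gives 6 windows of 16 densities each. A vector of this kind is what would be passed to the SVR step.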
A feature selection step was carried out to identify the most discriminating features using the following equation:
The concurrency:
To speed up prediction we used the Java Concurrent Library (JCL), which lets p-TAREF run on multiple processors. The time-consuming steps (RNAhybrid, alignment of
the template with the reverse-complement miRNA, and encoded pattern matching with the specified cut-off) run concurrently on different processors. p-TAREF splits the input file
into several smaller files according to the number of processors and creates one RNAhybrid thread per processor; e.g., for 8 processors, 8 concurrent RNAhybrid instances
are created and the sequence file is fragmented into 8 files. A single sequence is chopped into several small subsequences, which are distributed across the processors.
The alignment refinement steps start as soon as any single RNAhybrid instance has finished, working on the output that instance
generated; p-TAREF thus runs concurrently on different processors, with each thread running independently on its own logical processor/core. Likewise, the subsequent
steps of encoded pattern generation and matching against experimentally validated patterns run concurrently and independently. The final parsing for SVR input is done
on a single processor, as the SVR step is not time-consuming. The final classification step takes place only when all the threads have completed; while any p-TAREF thread is still running, classification does not start.
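The split, process, and join structure described above can be sketched as follows. p-TAREF itself is written in Java; this is only an outline of the same idea, and the stage functions (hybridize, refine, match, classify) are hypothetical placeholders supplied by the caller, not p-TAREF functions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def split_records(records, n_workers):
    """Distribute records round-robin into at most n_workers non-empty chunks."""
    chunks = [records[i::n_workers] for i in range(n_workers)]
    return [chunk for chunk in chunks if chunk]


def run_pipeline(records, n_workers, hybridize, refine, match, classify):
    """Run the per-chunk stages concurrently, then classify once at the end.

    Each chunk moves on to refinement and matching as soon as its own
    hybridization stage is done, without waiting for the other chunks.
    """
    def per_chunk(chunk):
        return match(refine(hybridize(chunk)))

    selected = []
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(per_chunk, chunk)
                   for chunk in split_records(records, n_workers)]
        for future in as_completed(futures):
            selected.extend(future.result())
    # Reached only after every worker has finished, like the final SVR step.
    return classify(selected)
```

Threads are used here because, in a setup like this, each worker would mostly wait on an external program; a process pool would be the usual alternative if the stages were CPU-bound Python code.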
Readme file for p-TAREF GUI (the stand-alone version is available here)
Linux 64 bit O.S. with Qt 4
Most major Linux distributions ship with Qt 4 pre-installed. If Qt 4 is not installed, follow this simple procedure.
Installing Qt 4:
For Ubuntu, type "sudo apt-get install libqt4-dev build-essential packages" in a terminal, without the quotes.
For other Linux distributions, type "yum install qt4 libqt4-*" in a terminal as root.
Or download the Qt framework from http://qt.nokia.com/downloads
Extract p-TAREF-v1.tar.gz, change to the folder p-TAREF-v1, and in a terminal type "qmake-qt4" and then "make" as root (on Ubuntu, use sudo). This builds a binary named Installer. If you are logged in as root, double-click Installer; otherwise run ./Installer from the command line as root (on Ubuntu, sudo ./Installer). A GUI will appear; tick the checkbox and click the Install button to install all the dependencies needed by p-TAREF.
After installing the dependencies, click the "p-TAREF-1.0-Linux-x86_64-Install" binary; the p-TAREF installer will ask you where to install p-TAREF.
The default installation directory is /home/(username)/p-TAREF, but you can choose another folder. To run p-TAREF, click the p-TAREF binary; a GUI will appear which executes the p-TAREF command.
To run p-TAREF on the command line, type java -jar p-TAREF.jar "number of processors".
p-TAREF takes sequences in FASTA format. Both the standalone and the web server accept multiple sequences; however, it is
advisable to submit only one input to the web server edition and to use the standalone for multiple queries. p-TAREF may take some time depending on the size of your input file. The p-TAREF steps run concurrently on multiple processors.
# Sequences should be in FASTA format, with the header line giving the TAIR ID.
# Sequences can span multiple lines or be on a single line.
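To make the input rules concrete, here is a small sketch of a reader that accepts both single-line and multi-line FASTA records. It is not p-TAREF's own parser; the function name parse_fasta is an assumption, and the TAIR-style IDs in the usage example are illustrative only.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text.

    The header is everything after '>' on the header line (e.g. a TAIR ID).
    Sequence lines belonging to one record are joined, so single-line and
    multi-line records are handled the same way.
    """
    records = []
    header, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line[1:].strip(), []
        elif header is not None:
            parts.append(line)
    if header is not None:
        records.append((header, "".join(parts)))
    return records


example = ">AT1G01010.1\nACGT\nTTGA\n>AT1G01020.1\nGGCC\n"
records = parse_fasta(example)
```

Here the first record is written over two lines and the second on one; both end up as a single sequence string per ID.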
For queries, bug reports or information, please contact email@example.com or firstname.lastname@example.org