p-TAREF (Project funded by DBT, Govt. of India.)

p-TAREF stands for plant TArget REFiner. It refines microRNA target identification in two ways: it incorporates local interaction information between the microRNA and the target region of the target sequence, and it estimates how likely a site is to be a true microRNA target from the intrinsic sequence properties around the target site, captured as varying dinucleotide density profiles. Considering dinucleotide density also incorporates the nearest-neighbor approach of RNA structure determination.
Therefore p-TAREF works with two major assumptions:

  1. Interaction information between a microRNA and its target is more important than sequence information, and such interactions are highly specific to a given microRNA. Experimentally validated interactions can therefore be used to pick out similar interactions among the predicted ones, which defines a preliminary filtering step.
  2. Varying dinucleotide density profiles in the flanking regions around the target region behave as a dynamic and reasonable signature for determining target sites. Such parameters broadly cover the intrinsic properties of the nearby region, and the sequential arrangement of this varying profile around the target region captures the discrimination.

Basically, p-TAREF takes the output of RNAhybrid as the template.

p-TAREF runs in two parts. In the first, it uses the library of encoded interaction patterns generated from experimentally reported interactions in plants. These encoded interaction patterns were derived after careful alignment of the interaction partners, taking a slightly extended target region to get a more accurate alignment. The parameters used during this step are retained and reused when deriving encoded patterns from the interactions predicted by RNAhybrid.

Once the encoded patterns are generated from the output of RNAhybrid, the library of experimental encoded patterns is used to scan the predicted interactions for similar ones. Many candidates can already be removed at this stage. The selected candidates are passed to the support vector scan step. The support vector regression machine uses the intrinsic sequence properties of the flanking regions around the target region, in the form of varying dinucleotide density profiles at various intervals. We found that the discrimination capacity of the SVR models peaked when windows of 20 bases were considered. Based on the classification, the SVR machine assigns a score to every target found. A positive score indicates a potential true target, whereas a negative score indicates a non-target.

The methodology on which p-TAREF is based attained an accuracy of ~95% with high sensitivity and specificity, and the area under the curve (AUC) in ROC analysis was above 0.9. Details are given in the Performance tab.

Using the Java Concurrent Library (JCL), p-TAREF can run on multiple processors, reducing total execution time. p-TAREF creates as many RNAhybrid instances as there are processors. The output file from RNAhybrid is the input for the following steps, which also run concurrently, like RNAhybrid.

 

 

Work flow of p-TAREF on four processors

 

Working of p-TAREF

All the time-consuming steps (RNAhybrid, alignment, and matching of predicted patterns against experimentally validated patterns with a specified cut-off), together with the code written for distribution, alignment, parsing, etc., run concurrently on multiple processors. The final classification and scoring of targets is done by the associated SVR.

Support Vector Regression

SVR (Support Vector Regression), or Support Vector Machine for regression, is a machine learning technique that applies SVM to regression problems. An SVM classifies data into positive and negative classes based on a trained model. SVR is used here in the same way, but it also returns a score for each instance: the sign of the score gives the class, and the value indicates how strongly the instance is assigned to that class. p-TAREF uses SVMTorch for SVM classification and for the regression score.

Methodology

Data set generation:
For our model, we downloaded 104 sequences from TAIR release 10 for Arabidopsis thaliana. Negative dataset instances were taken from TAIR as well as Ensembl Release 62 (Supplementry1.doc), in order to create our training data set. For the testing set we downloaded 226 reported targets from the Arabidopsis Small RNA Project (ASRP). To make the testing set non-redundant, we removed the sequences already present in the training data set. Of the 226 sequences, 125 remained for testing.

Model generation:
SVMTorch was used to create a plant-specific model with various kernels for classification and scoring. The model classifies data into true targets (positives) and non-targets (negatives) and assigns a regression score that indicates the propensity of a target:miRNA pair to be a genuine one.

The optional step:
p-TAREF uses RNAhybrid output as the template for target prediction. This output is parsed and aligned with the reverse complement of the miRNAs. After alignment refinement, the output is used to generate a single-sequence encoded pattern for each interaction; these patterns are then scanned against experimentally validated patterns, allowing a given number of mismatches. G:U wobble pairing is considered and implemented accordingly in the scoring matrix used in the backend for alignment.
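
The pattern-matching idea can be sketched as follows. This page does not describe how p-TAREF encodes an interaction, so the one-letter code used here (M for a Watson-Crick pair, W for a G:U wobble, X for a mismatch, G for a gap) and all function names are hypothetical. Position-by-position comparison stands in for whatever comparison the tool actually performs.

```python
# Hypothetical one-letter encoding of a miRNA:target duplex; not p-TAREF's own code.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def encode_duplex(mirna, target):
    """Encode two equal-length aligned strands position by position.

    Both strands are given in pairing order, so mirna[i] faces target[i];
    '-' marks an alignment gap. DNA input is accepted (T is read as U).
    """
    if len(mirna) != len(target):
        raise ValueError("aligned strands must have the same length")
    mirna = mirna.upper().replace("T", "U")
    target = target.upper().replace("T", "U")
    symbols = []
    for m, t in zip(mirna, target):
        if m == "-" or t == "-":
            symbols.append("G")   # gap in either strand
        elif (m, t) in WATSON_CRICK:
            symbols.append("M")   # canonical pair
        elif (m, t) in WOBBLE:
            symbols.append("W")   # G:U wobble pair
        else:
            symbols.append("X")   # mismatch
    return "".join(symbols)


def count_differences(pattern, reference):
    """Positions where two encoded patterns differ, plus any length difference."""
    shared = sum(a != b for a, b in zip(pattern, reference))
    return shared + abs(len(pattern) - len(reference))


def matches_library(pattern, library, max_mismatch):
    """True if the pattern is within max_mismatch of any validated pattern."""
    return any(count_differences(pattern, ref) <= max_mismatch for ref in library)
```

For example, `matches_library(encode_duplex("GGAU", "UCUC"), ["MMMX"], 1)` accepts the predicted duplex because its encoded pattern differs from the library pattern at a single position.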

The critical step: flanking region analysis
The targets that match experimentally validated patterns are mapped back onto the input sequences. The sequences are fragmented into 75 bp upstream and 75 bp downstream of the potential target position. These fragments are scanned with windows of 20 nucleotides, where consecutive windows overlap only by their last nucleotide, to estimate dinucleotide density variation profiles with respect to distance from the target site. The final set of distance-specific dinucleotide density profiles represents the sequences with respect to the target position. The SVR classification step follows, and it also assigns a regression score to each classified instance.
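
A minimal sketch of the flanking-region features, under several assumptions: density is taken as the fraction of each dinucleotide among the dinucleotide positions of a window, trailing bases that do not fill a whole window are dropped, and the function names are illustrative. The A, T, G, C ordering is chosen because it places AT second and TA fifth within each block of 16 features, which is consistent with the feature numbers listed in the feature-selection table (5, 21, 37, ... for TA; 2, 82, 98 for AT). It remains an assumption about the real feature layout.

```python
from itertools import product

# Assumed dinucleotide order (AA, AT, AG, AC, TA, ...); see the note above.
ALPHABET = "ATGC"
DINUCLEOTIDES = ["".join(pair) for pair in product(ALPHABET, repeat=2)]


def extract_flanks(sequence, site_start, site_end, flank=75):
    """Return (upstream, downstream) flanks around sequence[site_start:site_end]."""
    upstream = sequence[max(0, site_start - flank):site_start]
    downstream = sequence[site_end:site_end + flank]
    return upstream, downstream


def window_density(window):
    """Fraction of each dinucleotide among the valid dinucleotide positions."""
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    total = 0
    for i in range(len(window) - 1):
        pair = window[i:i + 2]
        if pair in counts:          # skips pairs containing N or other symbols
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DINUCLEOTIDES]


def flank_profile(flank_seq, window=20, overlap=1):
    """Concatenate the densities of successive windows that share `overlap` bases."""
    step = window - overlap         # 19 for 20-nt windows sharing one nucleotide
    features = []
    for start in range(0, len(flank_seq) - window + 1, step):
        features.extend(window_density(flank_seq[start:start + window]))
    return features
```

With these settings a 75-nt flank yields three full windows (starting at positions 0, 19 and 38), so 48 values per flank. How p-TAREF treats the remaining bases is not stated here.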
A feature selection step was carried out to identify the most discriminating features using the following equation:


Feature number    F-Score     Feature
53                0.471384    TA
85                0.398806    TA
5                 0.235890    TA
37                0.233669    TA
101               0.223225    TA
69                0.195340    TA
82                0.163212    AT
21                0.146982    TA
2                 0.131172    AT
98                0.121356    AT
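
The F-score equation itself is not reproduced on this page. A widely used definition for feature selection divides the squared distances of the two class means from the overall mean by the sum of the within-class sample variances. The sketch below assumes that form; whether p-TAREF used exactly this variant is an assumption, and the function names are illustrative.

```python
from statistics import mean, variance


def f_score(positive_values, negative_values):
    """F-score of one feature from its values in the positive and negative class.

    Assumed form: (separation of the class means from the overall mean)
    divided by (sum of the within-class sample variances).
    Each class needs at least two values.
    """
    overall = mean(list(positive_values) + list(negative_values))
    numerator = ((mean(positive_values) - overall) ** 2
                 + (mean(negative_values) - overall) ** 2)
    denominator = variance(positive_values) + variance(negative_values)
    if denominator == 0:
        # No spread within either class: perfectly separating or uninformative.
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


def rank_features(positive_rows, negative_rows):
    """Return (feature_number, score) pairs, best first; numbering starts at 1."""
    n_features = len(positive_rows[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row in positive_rows]
        neg = [row[j] for row in negative_rows]
        scores.append((j + 1, f_score(pos, neg)))
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

A larger score means the feature separates targets from non-targets more clearly, which is how a ranking like the table above would be read.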

The concurrency:
To speed up prediction we used the Java Concurrent Library (JCL), which lets p-TAREF run on multiple processors. The time-consuming steps (RNAhybrid, alignment of the template with the reverse complement of the miRNA, and encoded pattern matching with a specified cut-off) run concurrently on different processors. p-TAREF splits the input file into several smaller files according to the number of processors and creates one RNAhybrid thread per processor; e.g., for 8 processors, 8 concurrent RNAhybrid instances are created and the sequence file is fragmented into 8 files. A single sequence is chopped into several small subsequences, which are distributed across the processors. The subsequent alignment refinement steps start as soon as any single RNAhybrid instance has finished, without waiting for the others. Alignment works on the output generated by RNAhybrid, so p-TAREF runs concurrently on different processors and each thread runs independently on its own logical processor/core. Likewise, the subsequent steps of encoded pattern generation and matching against experimentally validated patterns run concurrently and independently. The final parsing for SVR input is done on a single processor, as the SVR step is not time-consuming. The final classification step takes place only when all threads have completed; while any p-TAREF thread is still running, it does not start.
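
The split, run, and join pattern described above can be sketched as follows. Python's `concurrent.futures` is used purely for illustration in place of the Java concurrency utilities that p-TAREF relies on. `process_chunk` is a hypothetical placeholder for the per-chunk work (RNAhybrid, alignment refinement, pattern matching), and the sketch splits by whole records rather than chopping single sequences.

```python
from concurrent.futures import ThreadPoolExecutor


def split_round_robin(records, n_parts):
    """Distribute records over at most n_parts chunks, dropping empty chunks."""
    parts = [[] for _ in range(n_parts)]
    for i, record in enumerate(records):
        parts[i % n_parts].append(record)
    return [part for part in parts if part]


def process_chunk(chunk):
    """Hypothetical stand-in for RNAhybrid + alignment + pattern matching.

    Here it only reports each sequence name with its length.
    """
    return [(name, len(seq)) for name, seq in chunk]


def run_pipeline(records, n_workers):
    """Process chunks concurrently, then hand all results to the final step."""
    chunks = split_round_robin(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each chunk runs its whole chain independently of the other chunks.
        per_chunk = list(pool.map(process_chunk, chunks))
    # The pool has been joined at this point, so every worker has finished;
    # the single-threaded final step (SVR input parsing) would start here.
    return [item for part in per_chunk for item in part]
```

Leaving the `with` block waits for every worker, which mirrors the rule that classification starts only after all threads have completed.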


Readme file for the p-TAREF GUI (the stand-alone version is available here)

1.System Requirement
2.Installation
3.p-TAREF description
4.File formats

1.System requirement:
A 64-bit Linux operating system with Qt 4.
Major Linux distributions ship Qt 4 as an installed package. If Qt 4 is not installed, follow this simple procedure.
Installing Qt 4:

For Ubuntu, type "sudo apt-get install libqt4-dev build-essential packages" in a terminal, without the quotes.
For other Linux distributions, type "yum install qt4 libqt4-*" in a terminal as root.
Or download Qt framework from http://qt.nokia.com/downloads

2.Installing:
Extract p-TAREF-v1.tar.gz and change to the folder p-TAREF-v1. In a terminal, type "qmake-qt4" and then "make" as root (on Ubuntu, use sudo). This builds a binary named Installer. Double-click Installer if you are logged in as root; otherwise run ./Installer as root on the command line (on Ubuntu, sudo ./Installer). A GUI will appear; tick the checkbox and click the Install button to install all the dependencies needed by p-TAREF. After the dependencies are installed, click the "p-TAREF-1.0-Linux-x86_64-Install" binary; the p-TAREF installer will ask where to install p-TAREF. The default installation directory is /home/(username)/p-TAREF, but you can choose another folder. To run p-TAREF, click the p-TAREF binary and a GUI will appear that executes the p-TAREF command. To run p-TAREF from the command line, type java -jar p-TAREF.jar "number of processors".

3.p-TAREF description:
p-TAREF takes sequences in FASTA format. Both the standalone version and the web server accept multiple sequences. However, it is advisable to submit only one input to the web server edition; for multiple queries, the standalone version is the better choice. p-TAREF may take some time depending on the size of the input file. p-TAREF steps run concurrently on multiple processors.

4.File formats:
# Sequences should be in FASTA format, with the TAIR ID in the header line.
# A sequence can span multiple lines or a single line.
>AT1G02860.1
ATGTTCCCAAAACAAACGTTTTGACTTTTTTTTTGTTTTCTCATAT
TCTTTTATTTCACAACTTGTTATTTCCGCCGACTT
CACCAGTCACCACCACCTT CATTTATTCATAAATACGTCTCTGTTCTGTTTTTGTTTCTGTTTCATTTTC
ATACATAATTAAGCCAACACGAGACGCAAGAGAGAGATAGGGAAAGAGA
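
As an illustration of this format, a minimal reader for multi-line FASTA input might look like the sketch below. The function name is hypothetical and this is not the parser p-TAREF uses. It joins the lines of each record and drops any whitespace inside the sequence.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA text with single- or multi-line records."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            name = line[1:].strip()       # e.g. a TAIR ID such as AT1G02860.1
            records[name] = []
        elif name is not None:
            # Remove internal whitespace and normalise to upper case.
            records[name].append("".join(line.split()).upper())
    return {header: "".join(parts) for header, parts in records.items()}
```

Calling `parse_fasta(">AT1G02860.1\nATGTTC\nCCAAAA\n")` gives a single record whose sequence is the two lines joined together.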

For queries, bug reports, or further information, please contact ravish@ihbt.res.in or ashwanijha.bioinfo@gmail.com




Work Flow Illustration







Some Related Links
RNAhybrid
miRBase
TAIR
RiceGE
ASRP
PMRD
weigel world

Administrator: Ashwani Jha and Heikham Russiachand Singh
Copyright © 2011, Institute Of Himalayan Bioresource Technology.