A revolutionary algorithm to identify plant pre-miRNAs

Illustration of the workflow for running the standalone program



Running the standalone program

To build a model with hyperparameter tuning
Example: python3 hyper_param.py file_for_tuning

Input file description

file_for_tuning = tab-separated file with one instance per line: label (0/1), sequence (a pre-miRNA or non-pre-miRNA sequence), and its dot-bracket secondary structure encoded as letters ("(" becomes "M", "." becomes "O", ")" becomes "N").

Label Sequence Secondary Structure
1 UGAUAAACAAAGUGUGUAACAUCACCUCAUCUACAUGUGUGAUUUUUUUUUUGAAUAUAGACAACUU
UUUAGUCAGAGUUUACAUGAGUUUUCACCUAAUUUGUGGUUUAAUUACACCGCAUAUUUGCCCAAUU
UAGUGAGUAUAGUGAGUUUCUGUAGAGAAGCUCAUCUUAGAAUUAUUCAUGUAUUCCACUACUAAA
OOOOOOOOMMMMMMMOMMOOMMMMMOOMMMOOOONNNONNNNNOOOOOOOOOOOOOONNONNONNNN
NMMMMMOMMOMOOMMMMMMMMMOMMMOOOOOOOOOMMMMMOOOOOOOONNNNNMMMMMMMMOOOOOO
OONNNNNNNNOMMMMMMMMMMOOOONNNNNNNNNNOOOONNNOONNNNNNNNNOONOONNNNNNNO
0 AGAUCUACAAGAGAAGAUAAGUUUGAGGCAAAUUCGAGAUCUGGAAGCUGGUUUUCUCUUUACAAAU
AACACUAACCCUACCAUCAAAUCAAGAAAGGAGGCUUUGAACAAAUAGCUUGAUUGAAGUAUGAAGU
GGCUCGGUGGGCGACGAUGACGGGCGAGCUCCGGCGAGGGCCUGGGGGCCUGAGCGACGAUGACGG
OOMMMOOOMMMMMMMMMMOMMMMMOMMMMOOOOOOOONONNNOONNNNNONNNNNNNNNNOOOOOOO
OOMMMMOOOMOMMMOOMMMMOMMMMMMMMMOOOONNNNOOOOOOOOOONNNNNNNNNONNNONONNN
NMMMMMOMOMOOOOOOOOOONONONNNNNOOOOMMOOMMMMMOOOONNNNNOONNOOONNNOOOOO
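
The structure column can be produced from an ordinary dot-bracket string with the mapping given above. The following is a minimal sketch, not the miWords implementation: the function names `encode_structure` and `tuning_line` are hypothetical, and the hairpin in the demo is a toy example rather than data from the tool.

```python
# Mapping stated in the input file description:
# "(" -> "M", "." -> "O", ")" -> "N".
DOT_BRACKET_TO_LETTERS = str.maketrans({"(": "M", ".": "O", ")": "N"})


def encode_structure(dot_bracket: str) -> str:
    """Translate a dot-bracket string into the M/O/N alphabet."""
    unexpected = set(dot_bracket) - set("(.)")
    if unexpected:
        raise ValueError(f"unexpected structure symbols: {sorted(unexpected)}")
    return dot_bracket.translate(DOT_BRACKET_TO_LETTERS)


def tuning_line(label: int, sequence: str, dot_bracket: str) -> str:
    """Build one tab-separated instance: label, sequence, encoded structure."""
    if label not in (0, 1):
        raise ValueError("label must be 0 or 1")
    if len(sequence) != len(dot_bracket):
        raise ValueError("sequence and structure must have equal length")
    return "\t".join([str(label), sequence, encode_structure(dot_bracket)])


if __name__ == "__main__":
    # Toy hairpin used only for illustration (hypothetical input).
    print(tuning_line(1, "GGGAAACCC", "(((...)))"))
```

Writing one such line per instance yields a file in the layout described for file_for_tuning.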


Output file description

param.txt = Optimized hyperparameters for Transformers.

Name             Value
learning_rate    0.583
activation       relu
activation2      selu
activation3      LeakyRelu
batch_size       40
embed_dim        28
epochs           20
num_heads        14
ff_dim           14
neurons          38
neurons2         12
dropout_rate     0.16
dropout_rate2    0.17
Optimizer        Adadelta
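
To reuse the tuned values in another script, the file can be read into a dictionary. This is a rough sketch under an assumption: it supposes that param.txt holds one whitespace-separated name and value per line below a "Name Value" header, which the listing above suggests but does not confirm. `parse_params` is a hypothetical helper, not part of miWords.

```python
def parse_params(text: str) -> dict:
    """Parse 'name value' lines into a dict, converting numbers when possible."""
    params = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or parts == ["Name", "Value"]:
            continue  # skip blank lines, the header, and unexpected lines
        name, raw = parts
        for cast in (int, float):
            try:
                params[name] = cast(raw)
                break
            except ValueError:
                continue
        else:
            params[name] = raw  # keep non-numeric values (e.g. "relu") as text
    return params


if __name__ == "__main__":
    # Assumed layout of param.txt (tab-separated), using values from the listing.
    sample = "Name\tValue\nlearning_rate\t0.583\nactivation\trelu\nbatch_size\t40\n"
    print(parse_params(sample))
```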

miWords.h5 = Hyperparameter optimized trained model for transformer.
miWords.sav = Hyperparameter optimized trained model for XGBoost.

Module 1: Sequence having length <= 400 bases
Example: sh M1.sh folderpath test
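
Because Module 1 accepts sequences of at most 400 bases and Module 2 handles longer ones, it can help to check sequence lengths before choosing a script. The sketch below is illustrative only: `read_fasta` and `choose_module` are hypothetical helpers, and the FASTA text in the demo is made up.

```python
MAX_MODULE1_LENGTH = 400  # Module 1 limit stated in this guide


def read_fasta(text: str) -> dict:
    """Return {sequence ID: sequence} from FASTA text, joining wrapped lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def choose_module(sequence: str) -> str:
    """Pick the module by length: <= 400 bases -> Module 1, else Module 2."""
    return "Module 1" if len(sequence) <= MAX_MODULE1_LENGTH else "Module 2"


if __name__ == "__main__":
    # Hypothetical input: one short and one long sequence.
    fasta = ">short\nACGUACGU\nACGU\n>long\n" + "A" * 401 + "\n"
    for seq_id, seq in read_fasta(fasta).items():
        print(seq_id, len(seq), choose_module(seq))
```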

Output description

sequence_feature.tsv = Classification result of the sequence provided.

Sequence ID    T-Score    Result
seq_1          0.945      pre-miRNA
seq_2          0.154      non pre-miRNA
seq_3          0.389      non pre-miRNA
seq_4          0.458      non pre-miRNA
seq_5          0.875      pre-miRNA
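
The classification result can be post-processed with a few lines of Python, for example to list only the sequences called as pre-miRNA. This sketch assumes that sequence_feature.tsv is tab-separated and carries the header names "Sequence ID", "T-Score" and "Result"; both are assumptions, and `load_results` and `predicted_pre_mirnas` are hypothetical helpers.

```python
import csv
import io


def load_results(text: str) -> list:
    """Parse tab-separated results into (sequence ID, T-Score, result) tuples."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        (row["Sequence ID"], float(row["T-Score"]), row["Result"])
        for row in reader
    ]


def predicted_pre_mirnas(rows: list) -> list:
    """Return the IDs whose Result column reads exactly 'pre-miRNA'."""
    return [seq_id for seq_id, _score, result in rows if result == "pre-miRNA"]


if __name__ == "__main__":
    # Rows copied from the example output; the tab layout is assumed.
    sample = (
        "Sequence ID\tT-Score\tResult\n"
        "seq_1\t0.945\tpre-miRNA\n"
        "seq_2\t0.154\tnon pre-miRNA\n"
        "seq_5\t0.875\tpre-miRNA\n"
    )
    print(predicted_pre_mirnas(load_results(sample)))
```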

Module 2 (A): Genomic sequence longer than 400 bases (Implementing Transformers + T-Score CNN when sRNA-seq data is not available)
Example: sh M2.sh folderpath genomic_sequence A (genomic_sequence is written as t2 in the guide figure.)

Output description

merge.txt = output of Module 2 (Sequence, Secondary Structure, Position wise T-Score).

S. No.    Sequence    Secondary structure    Position wise T-Score
1    GCTAGTAAATTTGTTGATTCATGCTTGTAGATGTACACACCACAGCAGATGCATGATGCATGGTTGGCATAGATAAGAATATAGAGGCAGGGGCA
AAGGAAGCATGAGCTTGTGGGAGTTGTAGGAATGGATGCGTAGAGAAAGAAATGAAATCTGAACCAGA
TTTGTGATTTGATTCTTTCCTATGACGTCCATTCCAATGATTTCCAGTCGCTCCTTCCTTGACCCAAACTGCTTCTTCTCTCCCCATTCCCCAAGGATGCAGCCAAATTACTCCTAA
...(((((.((.((((.......(((((..((((.((.((((..(((.........))).))))))))))..))))).....(((((((
((((..(((((((...((((...((((((((((.(((((((((((((((.(((((((.(.(((((..............
))))).))))))))))))).)))))))))).))))))))))...)))))))))))..)))...))))))))........(((((.....))))))))).)).))))).....
[0.92965674, 0.07397337, 0.07397337, 0.07397337, 0.22203702, 0.92965674]

genomic_sequence.bed = output from T-Score CNN (bed6 format).

Sequence ID    Start    End    Name    Score    Strand
seq_Ath        387      667    abcd    60       +
seq_Ath        407      687    abcd    60       -
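
A bed6 record has six columns: sequence ID, start, end, name, score and strand. The following is a small sketch of reading such lines; the `Bed6` type and `parse_bed6` function are hypothetical names introduced for illustration, and the length calculation relies on the BED convention of 0-based, half-open intervals.

```python
from typing import NamedTuple


class Bed6(NamedTuple):
    """One bed6 record."""
    chrom: str
    start: int
    end: int
    name: str
    score: int
    strand: str

    @property
    def length(self) -> int:
        # BED intervals are 0-based and half-open, so length is end - start.
        return self.end - self.start


def parse_bed6(line: str) -> Bed6:
    """Parse one whitespace- or tab-separated bed6 line."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}")
    chrom, start, end, name, score, strand = fields
    return Bed6(chrom, int(start), int(end), name, int(score), strand)


if __name__ == "__main__":
    # Record taken from the example output above.
    region = parse_bed6("seq_Ath\t387\t667\tabcd\t60\t+")
    print(region.chrom, region.length, region.strand)
```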

seq.txt = T-Score for each window position along the sequence, with the sequence and secondary structure of the region covered by that window.

Sequence ID    Position    T-Score    Sequence    Secondary structure
seq_Ath    0    0.07397337    AAUUAGUCGUAUUCAGUUGUAAAUUCGUUUUUCGUGUAAUGGAGGGAGUACAGGACAAGCUGGCAAGUGGUCUUUGGAUUCCAUGAAGCCUGCUGCCGCGUACAGAAGUCACUAGUGUAGUAGUGGCA
CUGAAACGACGUGUGCAUGCUGAUGCUGUCGCCGCAUCCCAUCCCAGUCCUUUUUUUUUUCGGGGACCCAUC
.....(((((.((((((..........((((((........))))))((((.((.((.((.(((..((((((....))..))))...))).))))))..))))....(((((((......))))))))
)))))))))).((.(.(((..(((((.......))))).))).)))((((((.........)))))).....
seq_Ath    1    0.07397337    AUUAGUCGUAUUCAGUUGUAAAUUCGUUUUUCGUGUAAUGGAGGGAGUACAGGACAAGCUGGCAAGUGGUCUUUGGAUUCCAUGAAGCCUGCUGCCGCGUACAGAAGUCACUAGUGUAGUAGUGGCAC
UGAAACGACGUGUGCAUGCUGAUGCUGUCGCCGCAUCCCAUCCCAGUCCUUUUUUUUUUCGGGGACCCAUCC
....(((((.((((((..........((((((........))))))((((.((.((.((.(((..((((((....))..))))...))).))))))..))))....(((((((......)))))))))
))))))))).((.(.(((..(((((.......))))).))).)))((((((.........))))))......
seq_Ath    2    0.07397337    UUAGUCGUAUUCAGUUGUAAAUUCGUUUUUCGUGUAAUGGAGGGAGUACAGGACAAGCUGGCAAGUGGUCUUUGGAUUCCAUGAAGCCUGCUGCCGCGUACAGAAGUCACUAGUGUAGUAGUGGCACU
GAAACGACGUGUGCAUGCUGAUGCUGUCGCCGCAUCCCAUCCCAGUCCUUUUUUUUUUCGGGGACCCAUCCC
...(((((.((((((..........((((((........))))))((((.((.((.((.(((..((((((....))..))))...))).))))))..))))....(((((((......))))))))))
)))))))).((.(.(((..(((((.......))))).))).)))((((((.........)))))).......
seq_Ath    3    0.07397337    UAGUCGUAUUCAGUUGUAAAUUCGUUUUUCGUGUAAUGGAGGGAGUACAGGACAAGCUGGCAAGUGGUCUUUGGAUUCCAUGAAGCCUGCUGCCGCGUACAGAAGUCACUAGUGUAGUAGUGGCACUG
AAACGACGUGUGCAUGCUGAUGCUGUCGCCGCAUCCCAUCCCAGUCCUUUUUUUUUUCGGGGACCCAUCCCA
..(((((.((((((..........((((((........))))))((((.((.((.((.(((..((((((....))..))))...))).))))))..))))....(((((((......)))))))))))
))))))).((.(.(((..(((((.......))))).))).)))((((((.........))))))........

plot = folder containing "csv" files to construct line plot.
merge = folder containing "csv" files to construct line plot overlapping sequence.

Module 2 (B): Genomic sequence longer than 400 bases, with user-provided sequencing read data (sRNA-seq):

The user needs to convert the FASTQ file into bed6 format by following the steps below. (Note: users may choose their own mapping software.)

Step 1: Build Genome Index: hisat2-build genome index

Step 2: Read Mapping: hisat2 -x index -U read.fastq -S read.sam

Step 3: Convert SAM to bed6 format: sam2bed < read.sam | cut -f1-6 > read.bed

If multiple conditions are available, merge their bed files into one: cat *.bed > read.bed

Step 4: sh M2.sh folderpath genomic_sequence B read.bed (genomic_sequence is written as t2 in the guide figure.)
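
Steps 1 to 3 can be chained from a script. The sketch below only assembles the command strings and runs them on request; it assumes hisat2-build, hisat2, sam2bed and cut are installed and on the PATH. `mapping_commands` and `run_all` are hypothetical helper names, and the file names are the placeholders used in the steps above. Since sam2bed reads SAM data from standard input, the sketch redirects the SAM file into it.

```python
import shlex
import subprocess


def mapping_commands(genome: str, index: str, fastq: str,
                     sam: str = "read.sam", bed: str = "read.bed") -> list:
    """Return the shell commands for steps 1 to 3 as strings."""
    q = shlex.quote  # protect paths that contain spaces or shell characters
    return [
        # Step 1: build the genome index.
        f"hisat2-build {q(genome)} {q(index)}",
        # Step 2: map the reads.
        f"hisat2 -x {q(index)} -U {q(fastq)} -S {q(sam)}",
        # Step 3: convert SAM to BED and keep the first six columns.
        f"sam2bed < {q(sam)} | cut -f1-6 > {q(bed)}",
    ]


def run_all(commands: list) -> None:
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, shell=True, check=True)


if __name__ == "__main__":
    # Print the commands rather than running them, so the plan can be reviewed.
    for command in mapping_commands("genome", "index", "read.fastq"):
        print(command)
```

Calling `run_all(mapping_commands(...))` would execute the steps; Step 4 can then be started with the resulting read.bed.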

Output description

read.bed = read alignment data in bed6 format (converted from SAM) (generated by user).

Sequence ID    Start    End    Name    Score    Strand
seq_Ath        167      179    abcd    60       -
seq_Ath        169      180    abcd    60       -
seq_Ath        237      249    abcd    60       -
seq_Ath        313      325    abcd    60       +
seq_Ath        535      549    abcd    60       +
seq_Ath        542      553    abcd    60       -
seq_Ath        542      553    abcd    60       -
seq_Ath        542      553    abcd    60       -
seq_Ath        542      553    abcd    60       -
seq_Ath        542      554    abcd    60       -

genomic_sequence = Fasta sequence utilized for Bimodal CNN.
genomic_sequence.bed = output from Bimodal CNN (bed6 format).

Sequence ID    Start    End    Name    Score    Strand
seq_Ath        91       371    abcd    60       +
seq_Ath        406      686    abcd    60       +
seq_Ath        435      715    abcd    60       -
seq_Ath        521      801    abcd    60       -

genomic_sequence_rpm.txt = Final result (ID, Start, End, Strand, Sequence, Secondary Structure).
Sequence ID    Start    End    Strand    Sequence    Secondary structure
seq_Ath    397    677    +    UUGUUGAUUCAUGCUUGUAGAUGUACACACCACAGCAGAUGCAUGAUGCAUGGUUGGCAUAGAUAAGAAUAUAGAGGCAGGGGCAAAGGAAGCAUGAGCUUG
UGGGAGUUGUAGGAAUGGAUGCGUAGAGAAAGAAAUGAAAUCUGAACCAGAUUUGUGAUUUGAUUCUUUCCUAUGACGUCCAUUCCAAUGAUUUCCAGUCGCUCCUUCCUUGACCCAAACUGCUUCUU
CUCUCCCCAUUCCCCAAGGAUGCAGCCAAAUUACUCCUAAUUUGCUCCUC
.(((((((.(((((.((((..(((..........)))..))))....))))))))))))......(((....(((((((((((..(((((((...((((...((((((((((.(((((((((((((((.(((((((.
(.(((((..............))))).))))))))))))).)))))))))).))))))))))...)))))))))))..)))...))))))))..)))............((((.(((....(((((....)))))))))))).
seq_Ath    407    687    -    GGCCGACCUGGAGGAGCAAAUUAGGAGUAAUUUGGCUGCAUCCUUGGGGAAUGGGGAGAGAAGAAGCAGUUUGGGUCAAGGAAGGAGCGACUGGAAAUCAUU
GGAAUGGACGUCAUAGGAAAGAAUCAAAUCACAAAUCUGGUUCAGAUUUCAUUUCUUUCUCUACGCAUCCAUUCCUACAACUCCCACAAGCUCAUGCUUCCUUUGCCCCUGCCUCUAUAUUCUUAUCU
AUGCCAACCAUGCAUCAUGCAUCUGCUGUGGUGUGUACAUCUACAAGCAU
(((...(((.(((((((((((((....)))))....))).))))).))).((((((((((.(((.((((...((((.(((((((((((...(((........((((((((.((..((((((((((........
.(((((((...)))))))...))))))).))).)).))))))))........)))...))))...))))))).)))))))).)))...))))).))))))))....(((((...))))).((((((((((....)).)))).)))).


hyper_param.py: Python script to build a model with hyperparameter tuning
M1.sh: miWords Module 1 execution script (sequence length up to 400 bases)
folderpath: /home/user/miWords
test: file containing RNA sequences
>seq_1

UGAUAAACAAAGUGUGUAACAUCACCUCAUCUACAUGUGUGAUUUUUUUUUUGAAUAUAGACAACUUUUUAGUCAGAGUUUACAUGAGUUUUCACCUAAUUUGUGGUUUAAUUACACCGCAUAUUU
GCCCAAUUUAGUGAGUAUAGUGAGUUUCUGUAGAGAAGCUCAUCUUAGAAUUAUUCAUGUAUUCCACUACUAAA
>seq_2
AGAUCUACAAGAGAAGAUAAGUUUGAGGCAAAUUCGAGAUCUGGAAGCUGGUUUUCUCUUUACAAAUAACACUAACCCUACCAUCAAAUCAAGAAAGGAGGCUUUGAACAAAUAGCUUGAUUGAAG
UAUGAAGUGGCUCGGUGGGCGACGAUGACGGGCGAGCUCCGGCGAGGGCCUGGGGGCCUGAGCGACGAUGACGG
>seq_3
AACGGGUCGUGCCGGCACGGCCCACGAGCGGGCGUGCCGUGCCGUUCCUGGGCCGGCUACAGUGCUGCCGUGCUCGGGCCGGCACGCCUUGGCCCGGCCCAUUUGGCCAGGUAUACUAGUCGGCUC
CAGUCCUCCUCCCCCCAGCGACCUAAGCCGCCACCGCCCUCGCCGCGCUACCGCCAGCGCCGCCUGCCGUCCCU
>seq_4
UUUUCCCCUUGAUUUUAGGGUUAGGGUUUCAUGAUUUGGGGAAAAAUUUGGGAUCUUACUGUAGCUAGGGUUUCGGUUCUUGGGGAUUUGUCUGAGAUUUGCAUGAACUUUUGCUUUCCCCUUCUU
CUAAUUUCUUCAACCCGAAACCCUAGAAGGCCUAAUUCCAUUUCUUAUAUUUCGGGAUUGCAUGAUUUGGGCUU
>seq_5
CCGGACGAUAGCAAGCGCUGGCAGUAGAGUAGGCUAGAGUCAUGAGUCUGAGUCAUGCUGGCUUUAUAUAGACAAAAAAUGGUACUACACACAAAUGAAAUUUCUAGCAAAAAUAAUCAAUGCACU
UUCCUUGAUUACACACCAACUUUAUGUAUAUAUAGGCUGGAAUAAUCCAUUGUGCAUGUACAUGAAUAUAGAUU
M2.sh: miWords Module 2 execution script (sequence length more than 400 bases)
genomic_sequence: file containing genomic sequences
>seq_Ath

UGAUAAACAAAGUGUGUAACAUCACCUCAUCUACAUGUGUGAUUUUUUUUUUGAAUAUAGACAACUUUUUAGUCAGAGUUUACAUGAGUUUUCACCUAAUUUGUGGUUUAAUUACACCGCAUAUUU
GCCCAAUUUAGUGAGUAUAGUGAGUUUCUGUAGAGAAGCUCAUCUUAGAAUUAUUCAUGUAUUCCACUACUAAAAGAUCUACAAGAGAAGAUAAGUUUGAGGCAAAUUCGAGAUCUGGAAGCUGGU
UUUCUCUUUACAAAUAACACUAACCCUACCAUCAAAUCAAGAAAGGAGGCUUUGAACAAAUAGCUUGAUUGAAGUAUGAAGUGGCUCGGUGGGCGACGAUGACGGGCGAGCUCCGGCGAGGGCCUG
GGGGCCUGAGCGACGAUGACGGAACGGGUCGUGCCGGCACGGCCCACGAGCGGGCGUGCCGUGCCGUUCCUGGGCCGGCUACAGUGCUGCCGUGCUCGGGCCGGCACGCCUUGGCCCGGCCCAUUU
GGCCAGGUAUACUAGUCGGCUCCAGUCCUCCUCCCCCCAGCGACCUAAGCCGCCACCGCCCUCGCCGCGCUACCGCCAGCGCCGCCUGCCGUCCCUUUUUCCCCUUGAUUUUAGGGUUAGGGUUUC
AUGAUUUGGGGAAAAAUUUGGGAUCUUACUGUAGCUAGGGUUUCGGUUCUUGGGGAUUUGUCUGAGAUUUGCAUGAACUUUUGCUUUCCCCUUCUUCUAAUUUCUUCAACCCGAAACCCUAGAAGG
CCUAAUUCCAUUUCUUAUAUUUCGGGAUUGCAUGAUUUGGGCUUCCGGACGAUAGCAAGCGCUGGCAGUAGAGUAGGCUAGAGUCAUGAGUCUGAGUCAUGCUGGCUUUAUAUAGACAAAAAAUGG
UACUACACACAAAUGAAAUUUCUAGCAAAAAUAAUCAAUGCACUUUCCUUGAUUACACACCAACUUUAUGUAUAUAUAGGCUGGAAUAAUCCAUUGUGCAUGUACAUGAAUAUAGAUU
A: T-Score CNN
B: Bimodal CNN
An example of a FASTQ file
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
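
Each FASTQ record spans four lines: an "@" header, the read sequence, a "+" separator, and a quality string with one character per base. The sketch below reads such records and decodes the qualities. It assumes the common Phred+33 encoding, which this guide does not state, and `read_fastq` is a hypothetical helper.

```python
PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ quality encoding


def read_fastq(text: str):
    """Yield (read ID, sequence, list of Phred scores) for each record."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ records must span exactly four lines")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        scores = [ord(char) - PHRED_OFFSET for char in quality]
        yield header[1:], sequence, scores


if __name__ == "__main__":
    # Tiny made-up record, not real sRNA-seq data.
    for read_id, sequence, scores in read_fastq("@r1\nACGT\n+\n!5I#\n"):
        print(read_id, sequence, scores)
```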
Sequence ID Start End Name Score Strand
chr1 0 100 abcd 255 +
chr1 10 125 abcd 255 -
sam2bed: script that parses SAM data from standard input and prints sorted BED to standard output
Elaboration: cat read1.bed read2.bed read3.bed read4.bed > read.bed
HISAT2: a fast and sensitive alignment program for mapping next-generation sequencing reads against a reference genome
genome: file with reference sequences
index: index filename prefix (minus trailing .X.ht2)
read.fastq: sRNA-seq data
read.sam: read alignment data in SAM format
read.bed: read alignment data in bed6 format
Sequence Alignment Map (SAM): a text-based format originally for storing biological sequences aligned to a reference sequence
CSV files: seq_Ath_0.csv, seq_Ath_8.csv, seq_Ath_12.csv
