Illustration to running workflow of standalone program
Running the standalone program
To build model implementing hyperparameter tuning Example: python3 hyper_param.py file_for_tuning
1. file_for_tuning = file containing label (0/1), sequence (sequence containing pre-miRNAs and non-pre-miRNAs), and dot bracket ("(">"M", ".">"O", ")">"N"). All in one line separated by tab for a single instance. 2. param.txt = Optimized hyparameters for Transformers.
Module 1: Sequence having length <= 400 bases Example: sh M1.sh folderpath test
1. sequence_feature.tsv = Classification result of the sequence provided.
Module 2: Sequence having length > 400 bases (Implementing T-Score CNN) Example: sh M2.sh folderpath t2 A
1. merge.fa = output of module 2 (Sequence, Secondary Struture, Position wise T-Score) 2. test.bed = output of module 2 (BED6 format) 3. seq.txt = Chunks wise probability score of the sequence provided. 4. plot = folder containing "csv" files to construct line plot. 5. merge = folder containing "csv" files to construct line plot overlapping sequence.
If provided with read data(sRNA-seq):
User needs to process the fastq file into bed6 format by instructions mentioned below (Note: User can select their own mapping software)
Step 1: Build Genome: hisat2-build genome index Step 2: Mapping: hisat2 -x index -U read.fastq -S read.sam Step 3: Convert SAM to BED6 format: sam2bed read.sam |cut -f1-6 >read.bed
If multiple condition are available then merge them into one bed file: cat *.bed >read.bed
Module 2: Bimodal (RPM and T-scoring) CNN module Example: sh M2.sh folderpath t2 B read.bed
1. read.bed = read alignment data in BED6 format (converted from SAM) (generated by user). 2. t2 = dummy fasta sequence utilized for Bimodal CNN 3. test.bed = output from Bimodal CNN (BED6 format) 4. test_rpm.txt = Final result (ID, Start, End, Strand, Sequence, Secondary Struture)
Discovering pre-miRNAs is the core of miRNA discovery. Using ...traditional sequence/structural features many tools have been published to discover miRNAs. However, in practical applications like genomic annotations, their actual performance has been very low. This becomes more grave in plants where unlike animals pre-miRNAs are much more complex and difficult to identify. A huge gap exists between animals and plants for the available software for miRNA discovery and species specific miRNAs information. Here, we present miWords, a composite deep-learning system of transformers and CNNs which sees genome as a pool of sentences made of words with specific occurrence preferences and contexts, to accurately identify pre-miRNA regions across plant genomes. A comprehensive bench-marking was done involving >10 software representing different genre and many experimentally validated datasets. miWords emerged as the best one while breaching accuracy of 98% and performance lead of ~10%. miWords was also evaluated across Arabidopsis genome where also it outperformed the compared tools. As a demonstration, miWords was run across the Tea genome, reporting 803 pre-miRNA regions, all validated by sRNA-seq reads from multiple samples and most of them were functionally supported by the degradome sequencing data. miWords is freely available as stand-alone source codes at https://scbb.ihbt.res.in/miWords/index.php.
In order to run the complete miWords system (Transformers + T-Score + RPM-CNN), you are encouraged to download and run the standalone code from here.
Just to see how the transformers only part of miWords performs on some sequences (<= 400 bases), please run the below given online program.
miWords: Transformers based composite deep-learning for highly accurate discovery of pre-miRNA regions across plant genomes
Sagar Gupta†, Ravi Shankar*
Briefings in Bioinformatics, 2023 Read research article here