miRNAs are major post-transcriptional regulators. Discovering pre-miRNAs is central to locating miRNAs and annotating them in genomes. Many tools that rely on traditional sequence and structural features have been published for miRNA discovery. However, in practical applications such as genome annotation, their actual performance has been far from acceptable. The problem is more serious in plants, where, unlike in animals, pre-miRNAs are much more complex and difficult to identify. This is reflected in the large gap between the software available for animal and for plant miRNA discovery. Here, we present miWords, a deep-learning approach that combines an attention-based genomic language processing transformer with context scoring to accurately identify pre-miRNAs in plants, and which can also be extended to other eukaryotes. In comprehensive benchmarking, the transformer part of miWords alone significantly outperformed the compared published tools, performing consistently and maintaining an accuracy of ~98% across a large number of experimentally validated data. The performance of miWords was also evaluated on Arabidopsis genome annotation, where it outperformed even software that relies on sRNA-seq reads to identify miRNAs. miWords was run across the tea genome, reporting 803 pre-miRNAs, all validated by RNA-seq data. Ten randomly selected novel pre-miRNAs among them were also experimentally validated through qRT-PCR.
It is highly recommended to use the standalone version of miWords for sequences longer than 400 bases. The standalone version of miWords is available on GitHub.
Sentences, Words, Attention: A "Transforming" Aphorism of miRNA Discovery
Sagar Gupta†, Vishal Saini, Rajiv Kumar, Ravi Shankar*
bioRxiv, 2022