Nikhil S, Shaik Mohideen H, Natesan Sella R
Long non-coding RNAs (lncRNAs) are a major transcript category that lacks protein-coding capability and shows relatively low abundance and complex expression patterns. Distinguishing lncRNAs from protein-coding genes is a complex process involving multiple filtering steps. We developed an automated pipeline, LncRAnalyzer, featuring retrained models for 60 species. The workflow combines eight distinct approaches to reduce the likelihood of retaining protein-coding or partially protein-coding transcripts during lncRNA identification. We performed 10-fold cross-validation to compare the retrained sorghum models and training sets with the standard ones and with other approaches, using real-life RNA-Seq datasets and known lncRNA and CDS sequences of sorghum. The results showed that the sorghum models and training sets outperformed the standard ones. The pipeline output comprises UpSet plots illustrating the number of lncRNAs/NPCTs identified by each approach, the commonly identified lncRNAs and their classes, NPCTs, and expression count tables. A feature-level comparison and benchmarking analysis of LncRAnalyzer against four existing pipelines, namely LncPipe, LncEvo, lncRNA-Annotation, and Plant-LncPipe, demonstrated that LncRAnalyzer is more comprehensive, easier to implement, and more accurate in lncRNA prediction. The workflow also ascertains the origins of lncRNAs from various Transposable Elements (TEs) in plants using TE annotations from APTEdb [http://apte.cp.utfpr.edu.br/]. LncRAnalyzer is publicly available on GitLab [https://gitlab.com/nikhilshinde0909/LncRAnalyzer.git] for academic users.
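
The idea of "commonly identified" lncRNAs can be pictured as a set intersection over the per-approach predictions, which is also what an UpSet plot summarizes. The following is a minimal sketch under that assumption, not LncRAnalyzer's actual code; the function name, approach labels, and transcript IDs are all hypothetical.

```python
from functools import reduce

def consensus_lncrnas(predictions):
    """Return transcript IDs that every approach called non-coding.

    `predictions` maps an approach label to the IDs it predicted as lncRNA.
    Strict intersection is an assumption made for illustration only.
    """
    if not predictions:
        return set()
    return reduce(set.intersection, (set(ids) for ids in predictions.values()))

# Hypothetical per-approach predictions (labels and IDs are made up).
predictions = {
    "approach_A": {"tx1", "tx2", "tx3"},
    "approach_B": {"tx2", "tx3", "tx4"},
    "approach_C": {"tx2", "tx3", "tx5"},
}

common = consensus_lncrnas(predictions)
print(sorted(common))  # ['tx2', 'tx3']
```

Requiring agreement across all approaches is the most conservative choice: it lowers the chance of keeping a protein-coding transcript at the cost of discarding lncRNAs that only some approaches detect.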