Fishing for a reelGene: evaluating gene models with evolution and machine learning.

Schulz AJ, Zhai J, AuBuchon-Elder T, Andorf CM, El-Walid MZ, Ferebee TH, Gilmore EH, Hufford MB, Johnson LC, Kellogg EA, La T, Long E, Miller ZR, Portwood JL, Romay MC, Seetharam AS, Stitzer MC, Woodhouse MR, Wrightsman T, Buckler ES, Monier B, Hsu SK

Published: 8 September 2025 in The Plant journal : for cell and molecular biology
Keywords: evolution, gene annotation, gene models, genome biology, machine learning, maize
PubMed ID: 40983054
DOI: 10.1111/tpj.70483

Assembled genomes and their associated annotations have transformed our study of gene function. However, each new annotated assembly generates new gene models. Inconsistencies between annotations likely arise from biological and technical causes, including pseudogene misclassification, transposon activity, and intron retention from sequencing of unspliced transcripts. To evaluate gene model predictions, we developed reelGene, a pipeline of machine learning models focused on (1) transcription boundaries, (2) mRNA integrity, and (3) protein structure. The first two models leverage sequence characteristics and evolutionary conservation across related taxa to learn the grammar of conserved transcription boundaries and mRNA sequences, while the third uses the conserved evolutionary grammar of protein sequences to predict whether a gene can produce a protein. Evaluating 1.8 million transcript models in Zea mays ssp. mays (maize), reelGene classified 28% as incorrectly annotated or non-functional. We find that reelGene classifies 92.2% of genes in the maize proteome and 99.2% of genes within the maize classical gene list as functional. reelGene also provides a way to further investigate genome biology. For instance, reelGene indicates that 10.3% of dispensable genes in B73 are functional, and within retained duplicate genes, reelGene identifies a 30% bias toward the retention of the M1 subgenome when one copy is functional and the other is non-functional. As an annotation-evaluating tool, reelGene is directly applicable to species of the Andropogoneae tribe, including other important crops like sorghum and miscanthus. As a community resource, reelGene has been integrated into MaizeGDB both as a browser track and as an individual Shiny App, allowing researchers to evaluate gene model accuracy and further investigate genome biology.