GOFlowLLM—curating miRNA literature with large language models and flowcharts
Panni, Simona
2026-01-01
Abstract
MOTIVATION: The exponential growth of non-coding RNA research, with over 230 000 papers published since 2000, has created an urgent knowledge management crisis in molecular biology. Despite their crucial regulatory roles, microRNAs (miRNAs) face a significant curation bottleneck, with only 1400 articles manually curated to the Gene Ontology (GO) knowledgebase over a decade. This highlights the critical need for automated systems that can accelerate biocuration while maintaining high-quality standards.

RESULTS: We present GOFlowLLM, an automated curation pipeline powered by reasoning-enabled Large Language Models (LLMs) that follows established GO curation flowcharts to extract and structure miRNA-mediated gene silencing data at scale. When evaluated on existing curation, GOFlowLLM selects the correct GO term in 90% of cases, with curators agreeing with 95% of the system's reasoning steps and 90% of the evidence selected. Applied to 6996 previously uncurated articles using the Qwen QwQ-32B model, our system identified 2538 new candidate GO annotations on 1785 articles in just 58 hours, potentially doubling the available miRNA GO curation. Manual review of these candidates shows that curators agreed with the selected term in 87% of cases, with the model's reasoning in 92% of cases, and with the extracted evidence in 93%. The integration of reasoning traces provides transparent justification for annotations that can be reviewed by human curators, addressing a key challenge in adopting AI for scientific curation.

AVAILABILITY AND IMPLEMENTATION: GOFlowLLM is implemented as an automated pipeline that follows expert-designed reasoning frameworks to maintain curation quality. The system is available on GitHub: https://github.com/RNAcentral/GO_Flow_LLM.


