Are Two LLMs Better Than One? A Student-Teacher Dual-Head LLMs Architecture for Pharmaceutical Content Optimization
Suyash Mishra, Qiang Li, Anubhav Girdhar
TL;DR
This paper tackles quality-control bottlenecks in pharmaceutical content generation by introducing LRBTC, a modular QC framework that splits validation into Language, Regulatory, Brand, Technical, and Content Structure checks. It deploys a plug-and-play Student–Teacher dual-head LLM architecture with HITL oversight and a hierarchical waterfall rule-filtering process to achieve scalable, verifiable validation of text and multimodal content. Empirical evaluation on the AIReg-Bench and CSpelling benchmarks shows that the approach improves regulatory compliance detection (recall up to $97.5\%$) and medical-language quality control (mean accuracy improvement of $26.7\%$), with ablations suggesting that performance is best when the teacher and student models come from the same family and routing is cost-efficient. The work emphasizes interpretability, auditability, and practical deployment in high-stakes domains, and it identifies two directions for future work: addressing the remaining punctuation and grammar errors, and broadening applicability to other regulated industries. Plug-and-play QC with hierarchical rule filtering offers a scalable path to reliable content optimization in regulated settings.
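To make the control flow concrete, the following is a minimal sketch of one possible reading of the Student–Teacher routing combined with a waterfall of checks. It is not the authors' implementation. The stage order (taken from the LRBTC acronym), the confidence threshold, the early stop on the first failure, and every name in the code (`run_waterfall`, `Verdict`, `toy_student`, `toy_teacher`) are assumptions made for illustration, and the two "models" are keyword stubs rather than real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed stage order, taken from the LRBTC acronym; the real order may differ.
STAGES = ["Language", "Regulatory", "Brand", "Technical", "Content Structure"]

# A checker maps (text, stage) to (passed, confidence in [0, 1]).
Checker = Callable[[str, str], Tuple[bool, float]]


@dataclass
class Verdict:
    stage: str
    passed: bool
    confidence: float
    decided_by: str  # "student" or "teacher"


def run_waterfall(text: str, student: Checker, teacher: Checker,
                  escalate_below: float = 0.8) -> List[Verdict]:
    """Run stages in order; escalate low-confidence student calls to the teacher."""
    verdicts: List[Verdict] = []
    for stage in STAGES:
        passed, confidence = student(text, stage)
        decided_by = "student"
        if confidence < escalate_below:
            # Hypothetical routing rule: only uncertain cases reach the teacher.
            passed, confidence = teacher(text, stage)
            decided_by = "teacher"
        verdicts.append(Verdict(stage, passed, confidence, decided_by))
        if not passed:
            break  # waterfall: later stages are skipped after a failure
    return verdicts


# Toy stand-ins for the two LLM heads (keyword rules, not real models).
def toy_student(text: str, stage: str) -> Tuple[bool, float]:
    if stage == "Regulatory":
        return True, 0.5  # deliberately unsure, so the teacher is consulted
    return True, 0.95


def toy_teacher(text: str, stage: str) -> Tuple[bool, float]:
    if stage == "Regulatory" and "cures" in text.lower():
        return False, 0.9  # flags an absolute efficacy claim
    return True, 0.9


result = run_waterfall("Product X cures all headaches.", toy_student, toy_teacher)
for verdict in result:
    print(verdict)
```

In this sketch, a failed stage ends the run, and the list of verdicts is what a human reviewer would see in a HITL step. Whether the actual system stops early or always runs every check is not stated in the summary above.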
Abstract
Large language models (LLMs) are increasingly used to create content in regulated domains such as pharmaceuticals, where outputs must be scientifically accurate and legally compliant. Manual quality control (QC) is slow, error-prone, and can become a publication bottleneck. We introduce LRBTC, a modular QC architecture driven by LLMs and vision-language models (VLMs) that covers Language, Regulatory, Brand, Technical, and Content Structure checks. LRBTC combines a Student-Teacher dual-model architecture with a human-in-the-loop (HITL) workflow and waterfall rule filtering to enable scalable, verifiable content validation and optimization. On AIReg-Bench, our approach achieves 83.0% F1 and 97.5% recall, reducing missed violations by 5x compared with Gemini 2.5 Pro. On CSpelling, it improves mean accuracy by 26.7%. Error analysis further reveals that while current models are strong at detecting misspellings (92.5% recall), they often fail to identify complex medical grammatical errors (25.0% recall) and punctuation errors (41.7% recall), highlighting a key area for future work. This work provides a practical, plug-and-play solution for reliable, transparent quality control of content in high-stakes, compliance-critical industries. We also provide access to our demo under the MIT License.
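The reported AIReg-Bench figures can be related to each other with the standard F1 identity. The sketch below is a back-of-the-envelope check, not a result from the paper: it assumes the 83.0% F1 and 97.5% recall refer to the same class and operating point, and it treats "5x" as an exact ratio of miss rates on the same test set. The function names are hypothetical.

```python
def precision_from_f1_recall(f1: float, recall: float) -> float:
    """Invert F1 = 2PR / (P + R) to get P = F1 * R / (2R - F1)."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("F1 and recall are inconsistent (need F1 < 2 * recall)")
    return f1 * recall / denominator


def miss_rate(recall: float) -> float:
    """Fraction of true violations that go undetected."""
    return 1.0 - recall


# Figures quoted in the abstract.
f1, recall = 0.830, 0.975

# Derived values (assumptions stated above; none of these are reported numbers).
implied_precision = precision_from_f1_recall(f1, recall)
our_miss_rate = miss_rate(recall)
implied_baseline_miss_rate = 5 * our_miss_rate  # if "5x" is an exact ratio

print(f"implied precision:          {implied_precision:.3f}")
print(f"miss rate:                  {our_miss_rate:.3f}")
print(f"implied baseline miss rate: {implied_baseline_miss_rate:.3f}")
```

Under these assumptions the implied precision is roughly 72%, which suggests that the system trades some false alarms for very few missed violations. That trade-off is plausible in a compliance setting where a human reviewer handles flagged items, but the exact precision should be taken from the paper's results rather than from this estimate.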
