Are Two LLMs Better Than One? A Student-Teacher Dual-Head LLMs Architecture for Pharmaceutical Content Optimization

Suyash Mishra, Qiang Li, Anubhav Girdhar

TL;DR

This paper tackles quality-control bottlenecks in pharmaceutical content generation by introducing LRBTC, a modular QC framework that splits validation into Language, Regulatory, Brand, Technical, and Content Structure checks. It deploys a plug-and-play Student–Teacher dual-head LLM architecture with HITL oversight and a hierarchical waterfall rule-filtering process to achieve scalable, verifiable content validation across text and multimodal content. Empirical evaluation on industrial benchmarks AIReg-Bench and CSpelling shows the approach improves regulatory compliance detection (recall up to 97.5%) and medical-language quality control (average 26.7% improvement), with ablations suggesting best performance when using same-family teacher–student models and cost-efficient routing. The work emphasizes interpretability, auditability, and practical deployment implications for high-stakes domains, and suggests future work to address remaining punctuation and grammar errors and to broaden applicability to other regulated industries. Plug-and-play QC with hierarchical rule filtering offers a scalable path for reliable content optimization in regulated settings.

Abstract

Large language models (LLMs) are increasingly used to create content in regulated domains such as pharmaceuticals, where outputs must be scientifically accurate and legally compliant. Manual quality control (QC) is slow, error-prone, and can become a publication bottleneck. We introduce LRBTC, a modular QC architecture driven by LLMs and vision-language models (VLMs) that covers Language, Regulatory, Brand, Technical, and Content Structure checks. LRBTC combines a Student-Teacher dual-model architecture and a human-in-the-loop (HITL) workflow with waterfall rule filtering to enable scalable, verifiable content validation and optimization. On AIReg-Bench, our approach achieves 83.0% F1 and 97.5% recall, reducing missed violations by 5x compared with Gemini 2.5 Pro. On CSpelling, it improves mean accuracy by 26.7%. Error analysis further reveals that while current models are strong at detecting misspellings (92.5% recall), they often fail to identify complex medical grammatical errors (25.0% recall) and punctuation errors (41.7% recall), highlighting a key area for future work. This work provides a practical, plug-and-play solution for reliable, transparent quality control of content in high-stakes, compliance-critical industries. We also provide access to our demo under the MIT License.
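
The abstract reports F1 and recall on AIReg-Bench but not precision. As a rough sanity check, the sketch below shows how the three metrics relate under the standard F1 definition. The precision it derives is only implied by the two reported numbers; it is not a figure stated by the authors, and the function names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, given F1 and R."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("no valid precision: F1 must be below 2 * recall")
    return f1 * recall / denom


reported_f1 = 0.830      # from the abstract (AIReg-Bench)
reported_recall = 0.975  # from the abstract (AIReg-Bench)

# Derived values: assumptions of this sketch, not results from the paper.
precision = implied_precision(reported_f1, reported_recall)
miss_rate = 1 - reported_recall  # share of true violations that go undetected

print(f"implied precision: {precision:.3f}")
print(f"miss rate: {miss_rate:.3f}")
```

A high recall paired with a lower implied precision would be consistent with a system tuned to avoid missed violations at the cost of some false alarms, which is a common trade-off in compliance screening.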

Paper Structure (11 sections, 7 figures, 12 tables)

Figures (7)

  • Figure 1: System architecture of the Student-Teacher model for verifying rule adoption. The teacher model guides which knowledge is executed; the student model verifies the commonly adopted rules and surfaces conflicts and new ideas. The two models iterate until they converge on a common agreement, with a human in the loop to resolve any leftover conflicting rules. The core idea is that sharing knowledge reinforces common knowledge and solidifies agreement, while conflicts or new ideas are often raised by the counterpart model. Waterfall filtering can reduce the number of rules that need to be executed or checked: rules can be filtered in the order IP - Country - Use case - Topics - Subtasks (grammar, spelling, etc.), which narrows and tracks the rules executed under each block.
  • Figure 2: Comprehensive Solution Architecture for Content Optimization.
  • Figure 3: Our model outperforms Gemini on all 7 samples from CSpelling, with per-sample gains ranging from +16.8% to +32.9% and an overall improvement of 26.7%. However, the relatively large standard deviations indicate substantial variability across samples, suggesting notable data heterogeneity. Both our method and Gemini 2.5 Pro are very good at detecting misspelling errors (ca. 92%) but perform poorly on punctuation, informality, and to-split/to-merge errors (ca. 41%).
  • Figure 4: Confusion-matrix heatmaps on AIReg-Bench. These heatmaps visualize the raw counts of True Positives (violation correctly detected), False Positives (compliant system flagged as a violation), True Negatives (compliant system correctly identified as compliant), and False Negatives (violation classified as compliant) across 120 EU AI system test cases. Color intensity indicates count.
  • Figure 5: CSpelling test data distribution. Both our method and Gemini 2.5 Pro are very good at detecting misspelling errors (ca. 92%) but perform poorly on punctuation, informality, and to-split/to-merge errors (ca. 41%).
  • ...and 2 more figures
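
The Figure 1 caption describes two mechanisms: a waterfall that narrows the rule set level by level, and a teacher/student comparison that separates agreed rules from conflicts left for human review. The following is a minimal sketch of both, assuming a simple data model. The `Rule` class, the scope keys, the example values, and the set-based agreement check are all hypothetical and are not taken from the paper's implementation; only the level order comes from the caption.

```python
from dataclasses import dataclass, field

# Filter order from the Figure 1 caption; the key names are hypothetical.
WATERFALL_LEVELS = ("ip", "country", "usecase", "topic", "subtask")


@dataclass
class Rule:
    rule_id: str
    # level -> required value; a level that is absent applies to every context
    scope: dict = field(default_factory=dict)


def waterfall_filter(rules, context):
    """Narrow the rule set one level at a time, recording survivors per level."""
    remaining = list(rules)
    trace = []
    for level in WATERFALL_LEVELS:
        wanted = context.get(level)
        remaining = [
            r for r in remaining
            if r.scope.get(level) is None or r.scope[level] == wanted
        ]
        trace.append((level, len(remaining)))
    return remaining, trace


def split_by_agreement(teacher_ids, student_ids):
    """Rules flagged by both models are agreed; the rest go to human review."""
    teacher_ids, student_ids = set(teacher_ids), set(student_ids)
    agreed = teacher_ids & student_ids
    conflicts = teacher_ids ^ student_ids  # flagged by exactly one model
    return agreed, conflicts


# Illustrative rules and context (all values are made up for this sketch).
rules = [
    Rule("R1", {"country": "DE", "subtask": "spelling"}),
    Rule("R2", {"country": "US", "subtask": "spelling"}),
    Rule("R3", {"subtask": "grammar"}),
    Rule("R4", {"country": "DE", "usecase": "email", "subtask": "spelling"}),
    Rule("R5"),  # unscoped rule: applies everywhere
]
context = {
    "ip": "example-ip",
    "country": "DE",
    "usecase": "email",
    "topic": "dosage",
    "subtask": "spelling",
}

remaining, trace = waterfall_filter(rules, context)
agreed, conflicts = split_by_agreement({"R1", "R4"}, {"R1", "R5"})

print([r.rule_id for r in remaining])
print(trace)
print(sorted(agreed), sorted(conflicts))
```

Recording the survivor count at each level gives a simple audit trail of which stage excluded which rules, in the spirit of the caption's remark that the waterfall tracks the rules executed under each block.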