Continually Self-Improving Language Models for Bariatric Surgery Question-Answering

Yash Kumar Atri; Thomas H Shin; Thomas Hartvigsen

TL;DR

This work introduces bRAGgen, an adaptive retrieval-augmented generation framework for bariatric surgery question answering that autonomously integrates up-to-date medical evidence when response confidence declines. Its companion, bRAGq, is an expert-validated dataset of 1,302 postoperative bariatric questions for benchmarking domain-specific QA. bRAGgen combines semantic caching, MDP-guided web retrieval, LoRA-enhanced generation, online learning, and safety constraints, and it outperforms state-of-the-art baselines in both expert and LLM-as-Judge evaluations on factuality, relevance, and comprehensiveness. The approach aims to provide scalable, evidence-based, patient-centric support from preoperative preparation through long-term postoperative bariatric care, with broader implications for continual learning in healthcare AI.

Abstract

While metabolic and bariatric surgery (MBS) is considered the gold standard treatment for severe and morbid obesity, its therapeutic efficacy hinges on active, longitudinal engagement with multidisciplinary providers, including surgeons, dietitians/nutritionists, psychologists, and endocrinologists. This engagement spans the entire patient journey, from preoperative preparation to long-term postoperative management. However, the process is often hindered by healthcare disparities, such as logistical and access barriers, that limit patients' access to timely, evidence-based, clinician-endorsed information. To address these gaps, we introduce bRAGgen, a novel adaptive retrieval-augmented generation (RAG)-based model that autonomously integrates real-time medical evidence when response confidence dips below dynamic thresholds. This self-updating architecture is designed to keep responses current and accurate, reducing the risk of misinformation. Additionally, we present bRAGq, a curated dataset of 1,302 bariatric surgery-related questions, validated by an expert bariatric surgeon. bRAGq constitutes the first large-scale, domain-specific benchmark for comprehensive MBS care. In a two-phase evaluation, bRAGgen is benchmarked against state-of-the-art models using both large language model (LLM)-based metrics and expert surgeon review. Across all evaluation dimensions, bRAGgen demonstrates substantially superior performance in generating clinically accurate and relevant responses.
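
The confidence-triggered retrieval described above can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how confidence is computed or how the dynamic threshold is set, so the rule used here (rolling mean minus k standard deviations, with a fixed floor) is an assumption, and `ConfidenceGate`, `answer_with_gate`, `generate`, and `retrieve` are hypothetical names standing in for the model and the web retriever.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Callable, Deque, List, Tuple

# Hypothetical signature: (question, evidence) -> (answer, confidence in [0, 1]).
Generator = Callable[[str, List[str]], Tuple[str, float]]
# Hypothetical signature: question -> list of evidence passages.
Retriever = Callable[[str], List[str]]


class ConfidenceGate:
    """Flags answers whose confidence falls below a moving threshold.

    Assumed rule (not from the paper): threshold = mean - k * std of the
    most recent confidences, never lower than a fixed floor.
    """

    def __init__(self, window: int = 50, k: float = 1.0, floor: float = 0.5) -> None:
        self.history: Deque[float] = deque(maxlen=window)
        self.k = k
        self.floor = floor

    def threshold(self) -> float:
        # With fewer than two observations there is no spread to estimate.
        if len(self.history) < 2:
            return self.floor
        return max(self.floor, mean(self.history) - self.k * pstdev(self.history))

    def needs_retrieval(self, confidence: float) -> bool:
        # Compare against the threshold first, then record the observation.
        low = confidence < self.threshold()
        self.history.append(confidence)
        return low


def answer_with_gate(
    question: str, generate: Generator, retrieve: Retriever, gate: ConfidenceGate
) -> Tuple[str, float, bool]:
    """Answer once without evidence; retrieve and regenerate only if confidence is low."""
    answer, confidence = generate(question, [])
    if gate.needs_retrieval(confidence):
        evidence = retrieve(question)
        answer, confidence = generate(question, evidence)
        return answer, confidence, True
    return answer, confidence, False
```

In this sketch the third return value records whether retrieval was triggered, which would let a caller log how often the model falls back to external evidence. The full system described in the TL;DR also involves caching, online learning, and safety constraints, none of which are modeled here.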
