UMB@PerAnsSumm 2025: Enhancing Perspective-Aware Summarization with Prompt Optimization and Supervised Fine-Tuning
Kristin Qi, Youxiang Zhu, Xiaohui Liang
TL;DR
This work addresses perspective span identification and perspective-aware summarization in medical CQA. It labels spans with an ensemble of transformer models and uses Chain-of-Thought prompting to produce structured, perspective-aligned summaries. Summary quality is further improved through DSPy-based automatic prompt optimization and supervised fine-tuning of Llama-3 with LoRA on domain-specific data. The ensemble reaches a span F1 of 82.9% on the test set and 83.9% on the validation set, and the DSPy-optimized CoT prompts combined with SFT improve both relevance and factuality metrics, which suggests that iterative prompt refinement and domain adaptation are effective for complex, multi-perspective medical summaries. The approach supports efficient, accurate extraction and synthesis of diverse viewpoints in medical CQA, and it leaves comparison with larger LLMs and refinement of metric-driven optimization as open directions.
Abstract
We present our approach to the PerAnsSumm Shared Task, which involves perspective span identification and perspective-aware summarization in community question-answering (CQA) threads. For span identification, we adopt ensemble learning that integrates three transformer models through averaging to exploit individual model strengths, achieving an 82.91% F1-score on test data. For summarization, we design a suite of Chain-of-Thought (CoT) prompting strategies that incorporate keyphrases and guidance information to break summary generation into manageable steps. To further enhance summary quality, we apply prompt optimization using the DSPy framework and supervised fine-tuning (SFT) on Llama-3 to adapt the model to domain-specific data. Experimental results on validation and test sets show that structured prompts with keyphrases and guidance produce summaries that align more closely with the references, while combining prompt optimization with fine-tuning yields significant improvements in both relevance and factuality evaluation metrics.
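The averaging step in the span-identification ensemble can be illustrated with a minimal sketch. It assumes that each of the three models outputs per-token class probabilities and that the ensemble takes their unweighted mean before choosing a label; the abstract does not say whether probabilities or logits are averaged, and the label names and numbers below are hypothetical toy values, not the task's data or the authors' implementation.

```python
from typing import List, Sequence

# Hypothetical label set for illustration only; the task's real labels are not listed here.
LABELS = ["O", "INFORMATION", "SUGGESTION"]


def average_ensemble(
    model_probs: Sequence[Sequence[Sequence[float]]],
) -> List[List[float]]:
    """Return the unweighted mean of per-token class probabilities.

    model_probs[m][t][c] is the probability that model m assigns class c to token t.
    """
    n_models = len(model_probs)
    n_tokens = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    averaged = []
    for t in range(n_tokens):
        averaged.append([
            sum(model_probs[m][t][c] for m in range(n_models)) / n_models
            for c in range(n_classes)
        ])
    return averaged


def predict_labels(model_probs: Sequence[Sequence[Sequence[float]]]) -> List[str]:
    """Pick the highest-probability label for each token after averaging."""
    averaged = average_ensemble(model_probs)
    return [LABELS[max(range(len(row)), key=row.__getitem__)] for row in averaged]


# Toy probabilities for two tokens from three hypothetical models.
model_a = [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]]
model_b = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
model_c = [[0.8, 0.1, 0.1], [0.2, 0.3, 0.5]]

predicted = predict_labels([model_a, model_b, model_c])
print(predicted)
```

In this toy case, `model_a` alone would label the second token `INFORMATION`, but the averaged distribution favors `SUGGESTION`, which shows how averaging lets the other two models outvote a single model's error.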
