YaleNLP @ PerAnsSumm 2025: Multi-Perspective Integration via Mixture-of-Agents for Enhanced Healthcare QA Summarization

Dongsuk Jang, Alan Li, Arman Cohan

TL;DR

This work tackles perspective-aware healthcare QA summarization for PerAnsSumm by combining a training-based route (QLoRA fine-tuning of LLaMA-3.3-70B-Instruct) with a Mixture-of-Agents ensemble that integrates outputs from diverse LLMs. The study finds that GPT-4o zero-shot generally outperforms large open-source models on both span identification and perspective-based summarization, while MoA improves open-source model performance and embedding-based exemplar selection often surpasses manually curated exemplars. QLoRA fine-tuning did not provide the expected gains under the tested conditions, and a two-layer MoA configuration yielded the best trade-off between accuracy and reliability. The results highlight the potential of ensemble and prompt-based strategies for perspective-aware healthcare summarization, especially when data is scarce, though access to frontier LLMs remains a key performance driver. Future work will focus on data augmentation, more robust MoA designs, and dynamic prompting techniques to enhance generalizability and practicality in real-world healthcare contexts.

Abstract

Automated summarization of healthcare community question-answering forums is challenging because each question attracts multiple user responses that present diverse perspectives. The PerAnsSumm Shared Task was proposed to tackle this challenge by identifying perspectives across different answers and then generating a comprehensive answer to the question. In this study, we address the PerAnsSumm Shared Task using two complementary paradigms: (i) a training-based approach through QLoRA fine-tuning of LLaMA-3.3-70B-Instruct, and (ii) agentic approaches, including zero- and few-shot prompting with frontier LLMs (LLaMA-3.3-70B-Instruct and GPT-4o) and a Mixture-of-Agents (MoA) framework that leverages a diverse set of LLMs by combining their outputs through multi-layer feedback aggregation. For perspective span identification/classification, GPT-4o zero-shot achieves an overall score of 0.57, substantially outperforming the 0.40 score of the LLaMA baseline. With a 2-layer MoA configuration, we improve LLaMA performance by 28 percent, to 0.51. For perspective-based summarization, GPT-4o zero-shot attains an overall score of 0.42 compared to 0.28 for the best LLaMA zero-shot, and our 2-layer MoA approach boosts LLaMA performance by 32 percent, to 0.37. Furthermore, in the few-shot setting, our results show that sentence-transformer embedding-based exemplar selection provides more gain than manually selected exemplars on LLaMA models, although few-shot prompting is not always helpful for GPT-4o. The YaleNLP team's approach ranked second overall in the shared task.
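
The relative gains quoted in the abstract can be checked against the reported scores. The following is a minimal sketch, not part of the authors' evaluation code; the helper name `relative_gain` is hypothetical, and the only inputs are the overall scores stated above (0.40 to 0.51 for span identification, 0.28 to 0.37 for summarization).

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the improvement over the baseline, in percent of the baseline."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (improved - baseline) / baseline * 100.0


# Overall scores as reported in the abstract (LLaMA zero-shot vs. 2-layer MoA).
reported_scores = {
    "span identification/classification": (0.40, 0.51),
    "perspective-based summarization": (0.28, 0.37),
}

for task, (baseline, with_moa) in reported_scores.items():
    gain = relative_gain(baseline, with_moa)
    print(f"{task}: {baseline:.2f} -> {with_moa:.2f} ({gain:.1f}% relative gain)")
```

Under this reading, the two gains come out at about 27.5 percent and 32.1 percent, consistent with the rounded figures of 28 and 32 percent. Both are relative to the LLaMA zero-shot score, not absolute point differences (0.11 and 0.09 respectively).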

Paper Structure

This paper contains 40 sections, 2 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: PerAnsSumm Shared Task overview.
  • Figure 2: Performance comparison across different MoA layer counts.
  • Figure 3: Confusion matrix for GPT-4o zero-shot on Task A. Each cell indicates the number of samples in the corresponding gold-predicted label pair.