Chain-of-Discussion: A Multi-Model Framework for Complex Evidence-Based Question Answering

Mingxu Tao, Dongyan Zhao, Yansong Feng

TL;DR

This paper introduces Chain-of-Discussion (CoD), a multi-model framework in which several open-source LLMs summarize, criticize, and revise each other's outputs to improve complex evidence-based QA. CoD combines two stages (question analysis and evidence analysis) with a structured critique-and-revision loop to improve correctness and comprehensiveness, particularly in legal consultation tasks. The authors provide a Chinese civil-law dataset of 200 questions and show that CoD improves evidence-centric metrics and human evaluations, though the gains vary with model size and capability. The work suggests that collaborative reasoning among small LLMs can mitigate hallucination and broaden scenario coverage, offering a practical path toward reliable, evidence-grounded open-source QA systems.

Abstract

Open-ended question answering requires models to find appropriate evidence to form well-reasoned, comprehensive and helpful answers. In practical applications, models also need to engage in extended discussions on potential scenarios closely relevant to the question. When augmented with a retrieval module, open-source Large Language Models (LLMs) can produce coherent answers, often with different focuses, but are still sub-optimal in terms of reliable evidence selection and in-depth question analysis. In this paper, we propose a novel Chain-of-Discussion framework to leverage the synergy among multiple open-source LLMs, aiming to provide more correct and more comprehensive answers for open-ended QA, although they are not strong enough individually. Our experiments show that discussions among multiple LLMs play a vital role in enhancing the quality of answers.
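
The summarize, criticize, and revise flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Model` interface, the function names, the prompt wording, and the assignment of summarization to question analysis and of critique and revision to evidence analysis are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical interface: a "model" is any function mapping a prompt to text.
Model = Callable[[str], str]


def analyze_question(question: str, target: Model, peers: List[Model]) -> str:
    """Every model analyzes the question; the target model summarizes the results."""
    analyses = [m(f"Analyze this question:\n{question}") for m in [target, *peers]]
    joined = "\n".join(f"- {a}" for a in analyses)
    return target(f"Summarize these analyses of the question:\n{joined}")


def analyze_evidence(
    question: str, evidence: List[str], target: Model, peers: List[Model]
) -> str:
    """The target drafts an evidence analysis, peers criticize it, the target revises."""
    docs = "\n".join(evidence)
    draft = target(f"Question:\n{question}\nAnalyze this evidence:\n{docs}")
    critiques = [p(f"Criticize this evidence analysis:\n{draft}") for p in peers]
    if not critiques:
        return draft  # nothing to revise against
    feedback = "\n".join(f"- {c}" for c in critiques)
    return target(f"Revise the analysis.\nDraft:\n{draft}\nFeedback:\n{feedback}")


def chain_of_discussion(
    question: str, evidence: List[str], target: Model, peers: List[Model]
) -> str:
    """Run both stages, then let the target model write the final answer."""
    q_analysis = analyze_question(question, target, peers)
    e_analysis = analyze_evidence(question, evidence, target, peers)
    return target(
        f"Question:\n{question}\n"
        f"Question analysis:\n{q_analysis}\n"
        f"Evidence analysis:\n{e_analysis}\n"
        "Write the final answer."
    )
```

Treating each model as a plain callable keeps the sketch independent of any particular inference library; real models would be wrapped behind the same prompt-in, text-out signature.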

Paper Structure (42 sections, 2 equations, 2 figures, 8 tables)

This paper contains 42 sections, 2 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: The process of Chain-of-Discussion (bottom part), compared with chain-of-thought (middle part). Green parts are necessary to answer the user's question. Blue parts are closely related to the question and useful for detailed or extended discussion. Red parts are irrelevant content that should be avoided.
  • Figure 2: Human preference evaluation, comparing the CoD settings of Baichuan2-7B and Xverse-7B to their corresponding baseline settings across 50 randomly sampled examples.