Chain-of-Action: Faithful and Multimodal Question Answering through Large Language Models

Zhenyu Pan, Haozheng Luo, Manling Li, Han Liu

TL;DR

Chain-of-Action (CoA) introduces a modular, plug-and-play reasoning-retrieval framework for multimodal QA that decomposes complex questions into action chains controlled by in-context prompts. It integrates three domain-adaptable actions (web-querying, knowledge-encoding, and data-analyzing) and uses a Multi-Reference Faith Score (MRFS) to verify answers against retrieved data, reducing unfaithful outputs and token usage. The approach requires no additional training and outperforms the compared methods on classical QA benchmarks and in a Web3 case study, including real-time information retrieval. This work advances faithful, efficient real-world QA by enabling externally grounded reasoning across text and tabular data, with extensibility to new modalities.

Abstract

We present a Chain-of-Action (CoA) framework for multimodal and retrieval-augmented Question-Answering (QA). Compared to the literature, CoA overcomes two major challenges of current QA applications: (i) unfaithful hallucination that is inconsistent with real-time or domain facts and (ii) weak reasoning performance over compositional information. Our key contribution is a novel reasoning-retrieval mechanism that decomposes a complex question into a reasoning chain via systematic prompting and pre-designed actions. Methodologically, we propose three types of domain-adaptable "Plug-and-Play" actions for retrieving real-time information from heterogeneous sources. We also propose a multi-reference faith score (MRFS) to verify and resolve conflicts in the answers. Empirically, we exploit both public benchmarks and a Web3 case study to demonstrate the capability of CoA over other methods.

Paper Structure

This paper contains 34 sections, 3 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: Chain-of-Action prompting empowers LLMs to generate (1) faithful, informative, concrete analysis grounded in heterogeneous sources (open web, domain knowledge, tabular data, etc.) as well as (2) well-reasoned chains for complex questions to better interpret human goals and intents. This improves on previous approaches, which yield generic, ambiguous, high-level responses.
  • Figure 2: Overview of the Chain-of-Action framework. We use in-context learning to prompt the LLM to generate the action chain. The chain consists of nodes, each containing a sub-question (Sub), a missing flag (MF), and an LLM-generated guess answer (A). The actions then handle multimodal retrieval for each node in three steps: (i) retrieving related information, (ii) verifying whether the LLM-generated answer needs to be corrected using the retrieved information, and (iii) checking whether missing content needs to be filled in from the retrieval. Finally, the LLM generates the final answer from the processed action chain.
  • Figure 3: Two samples from our Chain-of-Action Framework.
  • Figure 4: Prompt to generate the action chain in Chain-of-Action (CoA). This template combines the user's question with a description of each available action. The resulting action chain comprises actions, sub-questions, guess answers, and missing flags. The prompt not only decomposes a complex question into multiple sub-questions, guided by the features of the actions, but also allows the LLM to answer certain sub-questions from its own internal knowledge. This process exemplifies our proposed reasoning-retrieval mechanism.
  • Figure 5: Pseudocode for calculating the MRFS.
  • ...and 6 more figures
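
The node structure and the three processing steps described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Node`, `process_chain`, `toy_retrieve`, and `toy_is_faithful` are hypothetical, the exact-match check is only a placeholder for the MRFS (whose formula is not given in this excerpt), and overwriting a guess with the raw retrieved text is a simplification of whatever correction step the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Node:
    """One node of an action chain, loosely following the Figure 2 caption."""
    action: str    # which action handles the node, e.g. "web-querying"
    sub: str       # sub-question (Sub)
    guess: str     # LLM-generated guess answer (A)
    missing: bool  # missing flag (MF): True when the LLM gave no answer


def process_chain(
    chain: List[Node],
    retrieve: Callable[[Node], str],
    is_faithful: Callable[[str, str], bool],
) -> List[Node]:
    """Apply the three steps from the caption to every node, in place."""
    for node in chain:
        # (i) retrieve information related to the sub-question
        evidence = retrieve(node)
        # (ii) verify the guess; correct it if it conflicts with the evidence
        if not node.missing and not is_faithful(node.guess, evidence):
            node.guess = evidence
        # (iii) fill in missing content from the retrieval
        if node.missing:
            node.guess = evidence
            node.missing = False
    return chain


# Hypothetical stand-ins so the sketch runs end to end.
TOY_FACTS: Dict[str, str] = {"q1": "right", "q2": "filled", "q3": "OK"}


def toy_retrieve(node: Node) -> str:
    """Look the sub-question up in a fixed table instead of a real source."""
    return TOY_FACTS[node.sub]


def toy_is_faithful(guess: str, evidence: str) -> bool:
    """Placeholder check (case-insensitive exact match); NOT the MRFS."""
    return guess.strip().lower() == evidence.strip().lower()


def demo() -> List[Node]:
    """Run a three-node chain covering the correct, fill, and keep cases."""
    chain = [
        Node("web-querying", "q1", "wrong", False),   # guess gets corrected
        Node("knowledge-encoding", "q2", "", True),   # missing, gets filled
        Node("data-analyzing", "q3", "ok", False),    # guess is kept
    ]
    return process_chain(chain, toy_retrieve, toy_is_faithful)
```

Passing `retrieve` and `is_faithful` in as callables mirrors the plug-and-play idea: a different source or a different faithfulness check can be swapped in without touching the loop.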