CT2C-QA: Multimodal Question Answering over Chinese Text, Table and Chart

Bowen Zhao, Tianhao Cheng, Yuejie Zhang, Ying Cheng, Rui Feng, Xiaobo Zhang

TL;DR

CT2C-QA is a pioneering Chinese reasoning-based QA dataset that combines text, tables, and charts compiled from 200 selectively sourced webpages. It tests a model's ability to analyze and reason over multimodal data.

Abstract

Multimodal Question Answering (MMQA) is crucial because it enables comprehensive understanding and accurate responses by integrating insights from diverse data representations such as tables, charts, and text. Most existing MMQA research focuses on only two modalities, as in image-text QA, table-text QA, and chart-text QA, and few studies investigate the joint analysis of text, tables, and charts. In this paper, we present C$\text{T}^2$C-QA, a pioneering Chinese reasoning-based QA dataset that includes an extensive collection of text, tables, and charts, compiled from 200 selectively sourced webpages. Our dataset simulates real webpages and tests a model's ability to analyze and reason over multimodal data, because the answer to a question may appear in any of several modalities or may not exist at all. Additionally, we present AED (Allocating, Expert and Decision), a multi-agent system implemented through collaborative deployment, information interaction, and collective decision-making among different agents. Specifically, the Assignment Agent selects and activates expert agents, including those proficient in text, tables, and charts. The Decision Agent delivers the final verdict, drawing on the analyses provided by these expert agents. We conduct a comprehensive analysis, comparing AED with various state-of-the-art MMQA models, including GPT-4. The experimental results show that current methods, including GPT-4, do not yet meet the benchmarks set by our dataset.
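
The allocate, expert, decide flow described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Page` container, the rule that activates one expert per modality present, and the majority-vote decision are hypothetical stand-ins, not the paper's implementation, and the experts are plain callables rather than real models.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Page:
    """Hypothetical container for one webpage's content, split by modality."""
    text: str = ""
    tables: List[str] = field(default_factory=list)
    charts: List[str] = field(default_factory=list)


# An expert maps (question, page) to a candidate answer, or None if it finds nothing.
Expert = Callable[[str, Page], Optional[str]]


def allocate(question: str, page: Page) -> List[str]:
    """Assumed allocation rule: activate one expert per modality present on the page."""
    active = []
    if page.text.strip():
        active.append("text")
    if page.tables:
        active.append("table")
    if page.charts:
        active.append("chart")
    return active


def decide(candidates: Dict[str, Optional[str]]) -> str:
    """Assumed decision rule: majority vote over non-empty answers."""
    answers = [a for a in candidates.values() if a is not None]
    if not answers:
        # The dataset allows questions whose answer is absent from the page.
        return "unanswerable"
    return Counter(answers).most_common(1)[0][0]


def run_aed(question: str, page: Page, experts: Dict[str, Expert]) -> str:
    """Run allocation, then the activated experts, then the final decision."""
    active = allocate(question, page)
    candidates = {name: experts[name](question, page) for name in active if name in experts}
    return decide(candidates)


# Toy experts that only look for a literal string; real experts would be models.
toy_experts: Dict[str, Expert] = {
    "text": lambda q, p: "5%" if "5%" in p.text else None,
    "table": lambda q, p: "5%" if any("5%" in t for t in p.tables) else None,
    "chart": lambda q, p: None,
}

page = Page(text="Output grew 5% this year.", tables=["year | growth\n2023 | 5%"])
answer = run_aed("How much did output grow?", page, toy_experts)
empty_answer = run_aed("How much did output grow?", Page(), toy_experts)
```

In this toy run the text and table experts agree, so the decision step returns their shared answer, while an empty page falls through to the "unanswerable" case.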

Paper Structure

This paper contains 17 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Example of a C$\text{T}^2$C-QA question, answer and context. The distinct keywords in the question are highlighted using various colors. Corresponding information on the webpage is similarly marked with matching colors for easy reference. The answer is specifically indicated with a red font. Each question is associated with a webpage, where the answer might reside in various modal data forms within that page, or it might be that the answer cannot be deduced from the available information. In the example question, the webpage related to the question contains text, three charts and three tables at the same time, and the answer to the question can be found from the text and the table, but there is no relevant information in the chart.
  • Figure 2: An illustration of the dataset construction. The orange box represents text data, the pink box contains tables and the purple box contains charts. Following format conversion, these data types are stored within the same Markdown file but in distinct formats. Each chart tag links to a local storage path for the corresponding chart and to an image-hosting service (image bed).
  • Figure 3: The categories of questions in C$\text{T}^2$C-QA for the 6 most common first words (statistics after translation).
  • Figure 4: Distribution of domains in StatChina.
  • Figure 5: The overall architecture of AED, which takes the full webpage content and a question as input. a) The overview of AED, which shows the interplay and scheduling among the agents. b) The structure of each agent. The agents are color-coded for clarity: The Allocating Agent, in pink, serves as the initial distributor of tasks and information. The Text Expert Agent, in blue, specializes in handling and interpreting textual content. The Table Expert Agent, in green, focuses on processing and understanding table-based information. The Chart Expert Agent, in purple, analyzes chart data. The Decision Agent, in yellow, makes the final determination.
  • ...and 1 more figure
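
The Figure 2 caption says that text, tables, and charts are stored in a single Markdown file in distinct formats. A minimal sketch of separating such a file by modality follows. The concrete formats are assumptions, since the caption does not specify them: tables are taken to be pipe-delimited rows, chart tags are taken to be standard Markdown image tags, and the sample path `charts/example.png` is hypothetical.

```python
import re
from typing import Dict, List

# Assumed chart tag format: a standard Markdown image tag, ![alt](path-or-url).
CHART_TAG = re.compile(r"!\[[^\]]*\]\(([^)]+)\)")


def split_modalities(markdown: str) -> Dict[str, List[str]]:
    """Split a Markdown page into text lines, table blocks, and chart paths."""
    text_lines: List[str] = []
    tables: List[str] = []
    charts: List[str] = []
    current_table: List[str] = []

    for line in markdown.splitlines():
        stripped = line.strip()
        # Assumed table format: consecutive lines that start with a pipe.
        if stripped.startswith("|"):
            current_table.append(stripped)
            continue
        # A non-table line closes any table block that was being collected.
        if current_table:
            tables.append("\n".join(current_table))
            current_table = []
        match = CHART_TAG.search(stripped)
        if match:
            charts.append(match.group(1))
        elif stripped:
            text_lines.append(stripped)

    # Flush a table that runs to the end of the file.
    if current_table:
        tables.append("\n".join(current_table))
    return {"text": text_lines, "tables": tables, "charts": charts}


sample = (
    "Intro sentence.\n"
    "| year | value |\n"
    "| --- | --- |\n"
    "| 2020 | 1 |\n"
    "![chart](charts/example.png)\n"
    "Closing sentence."
)
parts = split_modalities(sample)
```

Keeping the three modalities in separate lists makes it straightforward to hand each one to a different downstream reader.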