AutoThinkRAG: Complexity-Aware Control of Retrieval-Augmented Reasoning for Image-Text Interaction

Jiashu Yang, Chi Zhang, Abudukelimu Wuerkaixi, Xuxin Cheng, Cao Liu, Ke Zeng, Xu Jia, Xunliang Cai

TL;DR

This work proposes AutoThinkRAG, a framework that enhances the understanding of complex documents by synergizing the capabilities of multiple models, and introduces a Query Complexity Router to allocate reasoning paths based on the analysis of query difficulty.

Abstract

Information-intensive Document Question Answering (DocQA) is often constrained by long contexts and information overload, which hinders Vision-Language Models (VLMs) from performing precise direct reasoning. Although multimodal GraphRAG has achieved preliminary breakthroughs, existing frameworks still face dual challenges: (1) the necessity of large-scale models for handling queries of diverse complexities and (2) the inherent reasoning bottlenecks of end-to-end VLMs. To address these issues, we propose AutoThinkRAG, a framework that enhances the understanding of complex documents by synergizing the capabilities of multiple models. Specifically, we introduce a Query Complexity Router to allocate reasoning paths based on the analysis of query difficulty. Furthermore, to overcome the reasoning boundaries of VLMs, we propose a functional decoupling architecture: a small-scale VLM serves as a high-fidelity visual interpreter to transform query-relevant visual cues into textual representations, which are subsequently processed by an LLM for logical deduction and synthesis. Extensive experiments on DocBench and MMLongBench demonstrate that AutoThinkRAG significantly reduces inference costs while achieving new state-of-the-art performance. Further ablation studies verify the effectiveness of our proposed method.
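
As a rough illustration of the two ideas in the abstract (routing by query difficulty, then a VLM-to-text-to-LLM hand-off), the flow might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how difficulty is scored, so `REASONING_CUES`, `estimate_complexity`, `route_query`, `answer`, and the `vlm_describe` / `llm_generate` callables are all hypothetical names, and the keyword heuristic merely stands in for whatever the actual router does.

```python
import re
from typing import Callable, List

# Hypothetical cue words suggesting multi-step reasoning (an assumption;
# the paper's router is not specified in the abstract).
REASONING_CUES = {"compare", "difference", "trend", "across", "total", "both"}


def estimate_complexity(query: str) -> int:
    """Toy difficulty score: number of distinct reasoning cues in the query."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return len(words & REASONING_CUES)


def route_query(query: str, threshold: int = 2) -> str:
    """Pick a reasoning path: 'think' for harder queries, else 'no_think'."""
    return "think" if estimate_complexity(query) >= threshold else "no_think"


def answer(
    query: str,
    text_chunks: List[str],
    images: List[str],
    vlm_describe: Callable[[str, str], str],
    llm_generate: Callable[[str], str],
) -> str:
    """Decoupled pipeline: the VLM only perceives, the LLM only reasons."""
    mode = route_query(query)
    # Perception step: turn each query-relevant image into a text description.
    visual_notes = [vlm_describe(image, query) for image in images]
    # Reasoning step: the LLM sees text only (retrieved chunks plus notes).
    context = "\n".join(list(text_chunks) + visual_notes)
    prompt = f"[mode={mode}]\nContext:\n{context}\nQuestion: {query}"
    return llm_generate(prompt)
```

Passing the two models in as plain callables keeps the sketch independent of any particular VLM or LLM API; in practice each would wrap a real model call.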

Paper Structure (26 sections, 1 equation, 3 figures, 3 tables)

This paper contains 26 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of AutoThinkRAG. The framework consists of four primary stages: (1) Information Extraction: Heterogeneous multimodal data are ingested. (2) Data Processing: Through layout analysis, the system partitions data into text chunks and image indices, followed by an iterative "Extract-Judge" loop to construct a Graph Knowledge Base. (3) Data Linker: A hybrid retrieval mechanism integrates traditional Top-K vector search with graph-based merging to capture complex entity relationships. (4) Reasoning & Response: The Query Router decomposes the original query into sub-queries ($Q_1, Q_2, Q_3$), which are then processed through a decoupled perception-reasoning architecture leveraging both "related text" and visual cues to synthesize the final response.
  • Figure 2: Comparison of routing strategy distributions across different document lengths. This figure illustrates the selection proportions of graphs and hypergraphs within length intervals under Think and No-Think modes.
  • Figure 3: Passing VLM outputs to an LLM as text, rather than answering with the VLM directly, leads to a marked boost in accuracy on multi-page long documents.
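
The "Data Linker" stage in the Figure 1 caption combines Top-K vector search with graph-based merging. One way to picture that combination is the sketch below. It rests on assumptions that are not from the paper: the graph is simplified to a chunk-to-chunk adjacency map (the caption describes entity relationships), merging is a plain one-hop expansion, and `cosine`, `top_k`, and `hybrid_retrieve` are hypothetical helper names.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Return the ids of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs,
        key=lambda cid: cosine(query_vec, chunk_vecs[cid]),
        reverse=True,
    )
    return ranked[:k]


def hybrid_retrieve(
    query_vec: Sequence[float],
    chunk_vecs: Dict[str, Sequence[float]],
    graph: Dict[str, Set[str]],
    k: int = 2,
) -> List[str]:
    """Vector-search seeds first, then their one-hop graph neighbours."""
    seeds = top_k(query_vec, chunk_vecs, k)
    merged = list(seeds)
    for cid in seeds:
        # Sort neighbours so the output order is deterministic.
        for neighbour in sorted(graph.get(cid, ())):
            if neighbour not in merged:
                merged.append(neighbour)
    return merged
```

The point of the merge step is that a chunk with low embedding similarity to the query can still be retrieved when the graph links it to a strong seed.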