TableDART: Dynamic Adaptive Multi-Modal Routing for Table Understanding

Xiaobo Xing, Wei Yuan, Tong Chen, Quoc Viet Hung Nguyen, Xiangliang Zhang, Hongzhi Yin

TL;DR

TableDART tackles the challenge of tabular data understanding by introducing dynamic adaptive routing that selects, per instance, among Text-only, Image-only, or Fusion pathways. It reuses frozen single-modality experts and trains only a compact 2.59M-parameter gating network, augmented by a Fusion agent for cross-modal synthesis when needed, thereby achieving high accuracy with improved training efficiency. The method employs a resource-aware objective to balance performance and inference cost, and demonstrates state-of-the-art results on seven benchmarks with strong generalization, including zero-shot settings. Overall, TableDART provides a practical, plug-and-play framework for multimodal table understanding that avoids heavy MLLM fine-tuning while delivering robust, dataset-adaptive reasoning capabilities.

Abstract

Modeling semantic and structural information from tabular data remains a core challenge for effective table understanding. Existing Table-as-Text approaches flatten tables for large language models (LLMs), but lose crucial structural cues, while Table-as-Image methods preserve structure yet struggle with precise semantics. Recent Table-as-Multimodality strategies attempt to combine textual and visual views, but they (1) statically process both modalities for every query-table pair within large multimodal LLMs (MLLMs), inevitably introducing redundancy and even conflicts, and (2) depend on costly fine-tuning of MLLMs. In light of this, we propose TableDART, a training-efficient framework that integrates multimodal views by reusing pretrained single-modality models. TableDART introduces a lightweight 2.59M-parameter MLP gating network that dynamically selects the optimal path (Text-only, Image-only, or Fusion) for each table-query pair, reducing redundancy and avoiding conflicts that arise when textual and visual views of the same table provide inconsistent cues. By routing to the most appropriate view, our framework improves both accuracy and efficiency. In addition, we propose a novel agent to mediate cross-modal knowledge integration by analyzing outputs from text- and image-based models, either selecting the best result or synthesizing a new answer through reasoning. This design avoids the prohibitive costs of full MLLM fine-tuning. Extensive experiments on seven benchmarks show that TableDART establishes new state-of-the-art performance among open-source models, surpassing the strongest baseline by an average of 4.02%. The code is available at: https://github.com/xiaobo-xing/TableDART.
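
The routing idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the TableDART code: the class name `GatingMLP`, the feature and hidden sizes, the random initialization, the per-path cost values, and the exact form of the loss (cross-entropy plus a weight `lam` times the expected path cost). It only shows how a small MLP could score the three paths and how a resource term could penalize expensive ones.

```python
import math
import random

# The three inference paths named in the abstract.
PATHS = ("text_only", "image_only", "fusion")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class GatingMLP:
    """Hypothetical two-layer MLP: features -> hidden (ReLU) -> 3 path logits."""

    def __init__(self, in_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
                   for _ in range(len(PATHS))]
        self.b2 = [0.0] * len(PATHS)

    def logits(self, features):
        hidden = [max(0.0, sum(w * x for w, x in zip(row, features)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return [sum(w * h for w, h in zip(row, hidden)) + b
                for row, b in zip(self.w2, self.b2)]

    def route(self, features):
        """Return the highest-probability path and the full distribution."""
        probs = softmax(self.logits(features))
        best = max(range(len(PATHS)), key=probs.__getitem__)
        return PATHS[best], probs


def resource_aware_loss(probs, target_index, path_costs, lam):
    """Assumed objective: cross-entropy on the target path plus
    lam times the expected inference cost under the gate's distribution."""
    task_loss = -math.log(probs[target_index])
    expected_cost = sum(p * c for p, c in zip(probs, path_costs))
    return task_loss + lam * expected_cost


if __name__ == "__main__":
    gate = GatingMLP(in_dim=4, hidden_dim=8)
    path, probs = gate.route([0.2, -0.1, 0.5, 0.3])
    print(path, [round(p, 3) for p in probs])
    # Illustrative costs only: Fusion is assumed to be the most expensive path.
    print(resource_aware_loss(probs, target_index=0,
                              path_costs=[1.0, 1.0, 2.0], lam=0.1))
```

In this sketch, raising `lam` increases the penalty on probability mass assigned to costly paths, which is the kind of trade-off a resource loss weight would control.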

Paper Structure

This paper contains 41 sections, 6 equations, 11 figures, 11 tables, 1 algorithm.

Figures (11)

  • Figure 1: Architecture of TableDART. The framework operates in three main stages: Multimodal Encoding (Section \ref{sec:encoding_multimodality}), Gating Network (Section \ref{sec:gating_training}), and Dynamic Inference Pathways (Section \ref{sec:inference_multimodality}).
  • Figure 2: Performance analysis of inference paths. (a) The chart counts instances based on which path(s), if any, produced a correct answer, across all datasets. (b) A per-dataset analysis of two key metrics: the Complementarity Rate, which is the percentage of instances where correctness is achieved by only one of the two single-modality models, and the Synergy Success Rate, which measures the fraction of hard cases (instances where both single-modality models fail) that are successfully resolved by the Fusion path.
  • Figure 3: Inference path selection distribution vs. the resource loss weight ($\lambda$) on two representative benchmarks. Each bar shows the percentage of instances routed to the Text-only (blue), Image-only (green), and Fusion (orange) paths. A red star (*) marks the configuration with the highest performance for each dataset (see Table \ref{tab:lambda_ablation}). This selection highlights TableDART's adaptability to diverse challenges. Full results on all seven datasets are provided in Appendix \ref{sec:appendix_full_charts}.
  • Figure 4: Case studies illustrate the key synthesis roles of the Fusion Model, which is implemented by an LLM agent. (a) As an Arbitrator (example from the TABMWP dataset), it resolves a conflict between a correct and an incorrect numerical reasoning path. (b) As a Rescuer (example from the HiTab dataset), it demonstrates synergy by synthesizing a correct answer from two distinct, incorrect outputs, showcasing its ability to combine partially correct reasoning fragments.
  • Figure 5: The data flow and logical structure of the prompt for the Fusion path's LLM agent. This structure guides the synthesis of a final answer from the outputs of the single-modality models.
  • ...and 6 more figures
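
The two metrics defined in the Figure 2 caption can be computed from per-instance correctness flags. The sketch below follows the caption's wording; the function names, the boolean-list input format, and the choice to return 0.0 for empty inputs are assumptions for illustration, and the sample flags are made up rather than taken from the paper's results.

```python
def complementarity_rate(text_correct, image_correct):
    """Percentage of instances where exactly one single-modality model is correct."""
    pairs = list(zip(text_correct, image_correct))
    if not pairs:
        return 0.0
    exactly_one = sum(1 for t, i in pairs if t != i)
    return 100.0 * exactly_one / len(pairs)


def synergy_success_rate(text_correct, image_correct, fusion_correct):
    """Fraction of hard cases (both single-modality models wrong)
    that the Fusion path answers correctly."""
    hard_cases = [f for t, i, f in zip(text_correct, image_correct, fusion_correct)
                  if not t and not i]
    if not hard_cases:
        return 0.0
    return sum(1 for f in hard_cases if f) / len(hard_cases)


if __name__ == "__main__":
    # Made-up correctness flags for five instances.
    text = [True, False, False, True, False]
    image = [False, False, True, True, False]
    fusion = [True, True, True, True, False]
    print(complementarity_rate(text, image))          # 40.0
    print(synergy_success_rate(text, image, fusion))  # 0.5
```

Here two of the five instances are solved by exactly one single-modality model, and one of the two hard cases is recovered by Fusion.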