How to Utilize Complementary Vision-Text Information for 2D Structure Understanding

Jiancheng Dong, Pengyue Jia, Derong Xu, Jiawei Cheng, Jingyu Peng, Chao Zhang, Bowen Liu, Xin Sun, Lixin Su, Shuaiqiang Wang, Dawei Yin, Xiangyu Zhao

Abstract

LLMs typically linearize 2D tables into 1D sequences to fit their autoregressive architecture, which weakens row-column adjacency and other layout cues. In contrast, purely visual encoders can capture spatial cues, yet often struggle to preserve exact cell text. Our analysis reveals that these two modalities provide highly distinct information to LLMs and exhibit strong complementarity. However, direct concatenation and other fusion methods yield limited gains and frequently introduce cross-modal interference. To address this issue, we propose DiVA-Former, a lightweight architecture designed to effectively integrate vision and text information. DiVA-Former leverages visual tokens as dynamic queries to distill long textual sequences into digest vectors, thereby effectively exploiting complementary vision-text information. Evaluated across 13 table benchmarks, DiVA-Former improves upon the pure-text baseline by 23.9% and achieves consistent gains over existing baselines using visual inputs, textual inputs, or a combination of both.
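
To make the query/digest idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes single-head scaled dot-product attention with no learned projections, and it assumes the gate enters as a scalar on a residual branch, so that a gate of zero returns the visual tokens unchanged. The function name `gated_cross_attention` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def gated_cross_attention(visual, text, gate):
    """Return one digest vector per visual token.

    visual: list of query vectors (visual tokens)
    text:   list of key/value vectors (linearized table text tokens)
    gate:   scalar; ASSUMPTION: digest = visual + gate * attended_text
    """
    dim = len(visual[0])
    scale = 1.0 / math.sqrt(dim)
    digests = []
    for v in visual:
        # Each visual token scores every text token, then mixes them.
        weights = softmax([dot(v, c) * scale for c in text])
        attended = [
            sum(w * c[k] for w, c in zip(weights, text)) for k in range(dim)
        ]
        # Residual form: gate = 0 leaves the visual token untouched.
        digests.append([v[k] + gate * attended[k] for k in range(dim)])
    return digests


# Toy inputs: 2 visual tokens, 3 text tokens, dimension 2 (illustrative only).
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
digest = gated_cross_attention(visual_tokens, text_tokens, gate=0.1)
```

In this sketch the number of digest vectors always equals the number of visual tokens, however long the text sequence is, which is one way to read "distill long textual sequences into digest vectors".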

Paper Structure (46 sections, 5 equations, 5 figures, 5 tables)


Figures (5)

  • Figure 1: Pilot study on TCR. “Any-Correct” denotes the proportion of examples for which either the vision or text modality yields the correct answer.
  • Figure 2: The overall architecture of DiVA-Former. The model leverages spatial visual tokens ($\boldsymbol{v}_i$) extracted from a table image as dynamic queries. These queries attend to the linearized table text tokens ($\boldsymbol{c}_j$) via Cross-Attention to generate compact digest vectors ($\boldsymbol{d}_i$). A gating mechanism maintains an approximate identity mapping from the initial visual tokens to the output digest, preserving high-quality 2D spatial priors.
  • Figure 3: Prediction alignment barcode for the subset of examples where both unimodal settings fail. Each column represents a single example, and the color intensity indicates the number of correctly localized target coordinates. We categorize the comparison between DiVA-Former and the direct concatenation baseline into Win, Tie, and Loss groups. DiVA-Former successfully recovers substantially more of these difficult examples than direct concatenation, demonstrating that its performance gains stem from leveraging complementary cross-modal information to solve cases that unimodal approaches cannot.
  • Figure 4: Hyperparameter analysis of DiVA-Former. Left: training loss curves under different initial gate values $g_0$. Middle: final average score across the 13 benchmarks under different initial gate values $g_0$. Right: final average score under different numbers of DiVA-Former layers.
  • Figure 5: Full-set per-example prediction alignment analysis on TCR. The left region corresponds to examples where DiVA-Former outperforms direct concatenation, the middle region indicates ties, and the right region contains examples where DiVA-Former underperforms. Notably, many examples in the left region are cases that unimodal models and direct concatenation fail to solve at all, while even on some examples in the right region DiVA-Former still recovers a substantial number of coordinates.
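
The two bookkeeping quantities used in the captions above, "Any-Correct" (Figure 1) and the Win/Tie/Loss grouping (Figures 3 and 5), can be sketched as follows. This is an illustrative reading of the captions, not the authors' evaluation code: the function names are hypothetical, the inputs are made-up toy values, and the grouping assumes each example is compared by its count of correctly localized coordinates.

```python
def any_correct_rate(vision_correct, text_correct):
    """Fraction of examples where at least one modality is correct."""
    if len(vision_correct) != len(text_correct):
        raise ValueError("both lists must cover the same examples")
    if not vision_correct:
        return 0.0
    hits = sum(1 for v, t in zip(vision_correct, text_correct) if v or t)
    return hits / len(vision_correct)


def win_tie_loss(ours, baseline):
    """Group examples by comparing per-example correct-coordinate counts.

    ours / baseline: lists of integers, one per example
    (ASSUMPTION: higher count means a better prediction).
    """
    if len(ours) != len(baseline):
        raise ValueError("both lists must cover the same examples")
    groups = {"win": 0, "tie": 0, "loss": 0}
    for a, b in zip(ours, baseline):
        if a > b:
            groups["win"] += 1
        elif a == b:
            groups["tie"] += 1
        else:
            groups["loss"] += 1
    return groups


# Toy data for four examples (illustrative only, not results from the paper).
rate = any_correct_rate([True, False, False, True], [False, False, True, True])
groups = win_tie_loss([3, 1, 0, 2], [1, 1, 2, 0])
```

A large gap between the Any-Correct rate and either unimodal accuracy is what signals complementarity: it gives the accuracy an ideal selector between the two modalities would reach.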