TabFlash: Efficient Table Understanding with Progressive Question Conditioning and Token Focusing

Jongha Kim, Minseong Bae, Sanghyeok Lee, Jinsung Yoon, Hyunwoo J. Kim

TL;DR

TabFlash tackles the unique challenges of table image understanding by making visual features question-aware and compact. It introduces progressive question conditioning to inject questions into ViT layers with increasing frequency, a background pruning strategy based on L2 norms to discard uninformative tokens, and a token focusing training objective to minimize information loss from pruning. Combined, these components yield state-of-the-art results on seven table QA benchmarks while cutting FLOPs by about 27% and memory by about 30% versus the second-best open-source model. The approach also outperforms several proprietary models, demonstrating strong practical impact for efficient table reasoning in multimodal systems.

Abstract

Table images present unique challenges for effective and efficient understanding due to the need for question-specific focus and the presence of redundant background regions. Existing Multimodal Large Language Model (MLLM) approaches often overlook these characteristics, resulting in uninformative and redundant visual representations. To address these issues, we aim to generate visual features that are both informative and compact to improve table understanding. We first propose progressive question conditioning, which generates question-aware visual features by injecting the question into Vision Transformer layers with gradually increasing frequency, in line with each layer's capacity to handle additional information. To reduce redundancy, we introduce a pruning strategy that discards background tokens, thereby improving efficiency. To mitigate information loss from pruning, we further propose token focusing, a training strategy that encourages the model to concentrate essential information in the retained tokens. By combining these approaches, we present TabFlash, an efficient and effective MLLM for table understanding. TabFlash achieves state-of-the-art performance, outperforming both open-source and proprietary MLLMs, while requiring 27% fewer FLOPs and 30% less memory than the second-best MLLM.

Paper Structure

This paper contains 29 sections, 11 equations, 6 figures, 15 tables.

Figures (6)

  • Figure 1: Performance-cost comparison. TFLOPs (x-axis) and average accuracy on 7 benchmarks (y-axis) are plotted. We propose TabFlash, an efficient MLLM with superior table understanding capability (Tab. \ref{tab:main_table}) with significantly lower computational cost and GPU memory usage (Tab. \ref{tab:cost_comparison}).
  • Figure 2: Overall pipeline of TabFlash. Progressive question conditioning injects question information into ViT layers with a progressively increasing frequency, producing a question-relevant visual token set $\mathbf{V}$ (Sec. \ref{sec:progressive_q_cond}). The tokens are divided into a pruned set $\mathbf{V}_p$ and a retained set $\mathbf{V}_r$, where only $\mathbf{V}_r$ is used during inference for efficiency. To concentrate information in $\mathbf{V}_r$, token focusing encourages accurate prediction with $\mathbf{V}_r$ while suppressing prediction using $\mathbf{V}_p$ (Sec. \ref{sec:token_focusing}).
  • Figure 3: Visualization of $L_2$ norms of ViT output tokens (left) and norm-based pruning results (right). Red and blue colors denote high and low $L_2$ norms, respectively. The 30% of tokens with the lowest norms are pruned ($p=0.3$). More examples are provided in the supplementary material.
  • Figure 4: Qualitative results. Near-white regions indicate low attention, while stronger red colors represent higher attention scores. Best viewed when zoomed in. Please refer to the supplementary material for further qualitative results.
  • Figure C1: Further visualization of $L_2$ norms of ViT output tokens (top) and norm-based pruning results (bottom). Red and blue colors denote high and low $L_2$ norms, respectively. The 30% of tokens with the lowest norms are pruned ($p=0.3$).
  • ...and 1 more figure
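
The norm-based pruning described in the Figure 3 caption (the fraction $p$ of tokens with the lowest $L_2$ norms is pruned) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `l2_norm` and `split_by_norm`, the plain-list token representation, and the choice to round the pruned count down are all assumptions made for illustration. It only shows how a token set could be split into retained and pruned index sets, corresponding to $\mathbf{V}_r$ and $\mathbf{V}_p$ in Figure 2.

```python
import math
from typing import List, Sequence, Tuple


def l2_norm(token: Sequence[float]) -> float:
    """Euclidean (L2) norm of one token embedding."""
    return math.sqrt(sum(x * x for x in token))


def split_by_norm(
    tokens: Sequence[Sequence[float]], p: float = 0.3
) -> Tuple[List[int], List[int]]:
    """Split token indices into (retained, pruned) by L2 norm.

    The fraction p of tokens with the lowest norms is pruned.
    Rounding the pruned count down is an assumption of this sketch.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    num_pruned = int(len(tokens) * p)
    # Token indices in ascending order of norm (stable for ties).
    order = sorted(range(len(tokens)), key=lambda i: l2_norm(tokens[i]))
    # Restore the original token order within each subset.
    pruned = sorted(order[:num_pruned])
    retained = sorted(order[num_pruned:])
    return retained, pruned


# Hypothetical toy input: 10 two-dimensional tokens with norms 1..10, shuffled.
toy_tokens = [[float(n), 0.0] for n in (5, 1, 9, 3, 7, 2, 10, 4, 8, 6)]
retained_idx, pruned_idx = split_by_norm(toy_tokens, p=0.3)
```

With the toy input, the three lowest-norm tokens end up in `pruned_idx` and the remaining seven in `retained_idx`. In a real model the norms would be computed on the ViT output tokens, and only the retained tokens would be passed on at inference.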