A Closer Look into LLMs for Table Understanding

Jia Wang, Chuanyu Qin, Mingyu Zheng, Qingyi Si, Peize Li, Zheng Lin

Abstract

Despite the success of Large Language Models (LLMs) in table understanding, their internal mechanisms remain unclear. In this paper, we conduct an empirical study on 16 LLMs, covering general LLMs, specialist tabular LLMs, and Mixture-of-Experts (MoE) models, to explore how LLMs understand tabular data and perform downstream tasks. Our analysis focuses on four dimensions: attention dynamics, effective layer depth, expert activation, and the impact of input designs. Key findings include: (1) LLMs follow a three-phase attention pattern, in which early layers scan the table broadly, middle layers localize relevant cells, and late layers amplify their contributions; (2) tabular tasks require deeper layers than math reasoning to reach stable predictions; (3) MoE models activate table-specific experts in middle layers, while early and late layers share general-purpose experts; (4) Chain-of-Thought prompting increases table attention, an effect further enhanced by table-tuning. We hope these findings and insights can facilitate interpretability and future research on table-related tasks.

Paper Structure (45 sections, 8 equations, 27 figures, 6 tables)

Figures (27)

  • Figure 1: The attention dynamics of Llama3.1-8B-Instruct and Qwen2.5-7B-Instruct on tabular tasks. Each input consists of three segments: system prompt, table content, and user question. Upper: a case study at selected layers; left sub-panels show the step-wise attention ratio across generation steps, and right sub-panels show cell-level attention heatmaps. Lower: aggregated results. Layer-wise Segment Attn Ratio shows the proportion of attention allocated to each segment per layer; Table Attn Entropy measures the degree of focus on table cells (lower entropy indicates a more concentrated attention distribution on specific cells); Attn Contribution measures the influence of each segment on the final output. Notably, the entropy minima at Layer 10 and Layer 14 for Llama3.1-8B-Instruct align with the concentrated attention on the answer cell "China" in the upper heatmap.
  • Figure 2: Performance drop when masking table attention in different layer ranges across 5 models (averaged over 3 benchmarks $\times$ 2 formats). Llama-family models show minimal impact from Deep masking, while Qwen-family models degrade uniformly across all ranges.
  • Figure 3: Prediction stability analysis using LogitLens. Bars show KL divergence between each layer's decoded distribution and the final output (lower = closer to final prediction). Lines show top-5 token overlap with the final output. Vertical dashed lines mark where predictions stabilize.
  • Figure 4: Expert activation analysis in MoE models across three architectures (DeepSeek-V2, Qwen3-30B-A3B, OLMoE). Left: Activation heatmaps showing table-specific experts per layer (darker = higher activation probability). Middle: Entropy of expert activation distribution for table tasks (lower entropy indicates concentrated activation on fewer experts). Right: Number of overlapping experts between table and math (GSM8K) tasks per layer.
  • Figure 5: Comparison between Markdown and HTML table formats.
  • ...and 22 more figures
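
The captions above name several quantities: a segment attention ratio, an attention entropy, a KL divergence against the final output, and a top-5 token overlap. The sketch below shows one plausible way to compute each from plain lists of weights. It is illustrative only and not the authors' implementation: the paper's exact definitions are not reproduced in this excerpt, all function names are hypothetical, and the KL direction (layer distribution relative to the final distribution) is an assumption.

```python
import math

def segment_attention_ratio(attn, segments):
    """Share of attention mass falling in each input segment.

    attn: attention weights over input tokens for one layer and one
          generation step (assumed already averaged over heads).
    segments: dict mapping a segment name to a half-open (start, end)
              token index range. Hypothetical layout, for illustration.
    """
    total = sum(attn)
    return {name: sum(attn[s:e]) / total for name, (s, e) in segments.items()}

def entropy(weights):
    """Shannon entropy (in nats) of the weights after renormalizing.

    Lower values mean the mass is concentrated on fewer items,
    e.g. fewer table cells or fewer experts.
    """
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two distributions over the same vocabulary.

    Assumption: p is an intermediate layer's decoded distribution and
    q is the final output distribution; the source does not state the
    direction.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def topk_overlap(p, q, k=5):
    """Fraction of the top-k token indices shared by p and q."""
    def top(dist):
        order = sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)
        return set(order[:k])
    return len(top(p) & top(q)) / k

# Toy input: 2 system-prompt tokens, 3 table tokens, 1 question token.
attn = [0.1, 0.1, 0.2, 0.4, 0.1, 0.1]
segments = {"system": (0, 2), "table": (2, 5), "question": (5, 6)}
ratios = segment_attention_ratio(attn, segments)
table_entropy = entropy(attn[2:5])
```

On this toy input the table segment receives roughly 70% of the attention mass, and `table_entropy` falls below the uniform-distribution maximum of `math.log(3)` because one of the three table tokens dominates.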