HIPPO: Enhancing the Table Understanding Capability of Large Language Models through Hybrid-Modal Preference Optimization

Zhenghao Liu, Haolan Wang, Xinze Li, Qiushi Xiong, Xiaocui Yang, Yu Gu, Yukun Yan, Qi Shi, Fangfang Li, Ge Yu, Maosong Sun

TL;DR

HIPPO tackles the challenge of tabular reasoning in large language models by jointly leveraging text-based and image-based table representations. It introduces Hybrid-Modal Preference Optimization with modality-consistent sampling and Direct Preference Optimization to train MLLMs, achieving over $4\%$ improvements on TQA and TFV and demonstrating robustness across unimodal and multimodal inputs. The method constructs a diverse training signal by sampling from multiple modalities and selecting representative negatives, mitigating modality bias and enriching semantic extraction from tables. Empirical results across TQA and TFV tasks, along with ablations and case studies, show that HIPPO both strengthens multi-modal reasoning and generalizes to unimodal table representations, highlighting the value of integrating textual and visual table semantics for robust table understanding.

Abstract

Tabular data contains rich structural semantics and plays a crucial role in organizing and manipulating information. To better capture these structural semantics, this paper introduces the HybrId-modal Preference oPtimizatiOn (HIPPO) model, which represents tables using both text and image, and optimizes MLLMs to effectively learn more comprehensive table information from these multiple modalities. Specifically, HIPPO samples model responses from hybrid-modal table representations and designs a modality-consistent sampling strategy to enhance response diversity and mitigate modality bias during DPO training. Experimental results on table question answering and table fact verification tasks demonstrate the effectiveness of HIPPO, achieving a 4% improvement over various table reasoning models. Further analysis reveals that HIPPO not only enhances reasoning abilities based on unimodal table representations but also facilitates the extraction of crucial and distinct semantics from different modal representations. All data and code are available at https://github.com/NEUIR/HIPPO.
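
For readers unfamiliar with the DPO objective mentioned in the abstract, the following is a minimal sketch of the standard Direct Preference Optimization loss for a single preference pair. It is not taken from the HIPPO codebase. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for illustration, and the sketch does not cover how HIPPO samples or selects the chosen and rejected responses.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed token log-probability of a full response
    under either the trainable policy or the frozen reference model.
    All names and the default beta are illustrative assumptions.
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy favors the chosen response
# more than the reference does, so the loss falls below log(2).
example = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss shrinks as the policy raises the chosen response relative to the rejected one, measured against the reference model.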

Paper Structure

This paper contains 20 sections, 5 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Illustration of the Effectiveness of Text-Based and Image-Based Table Representations in Question Answering. We present the answers generated by the MLLM based on both text-based and image-based table representations.
  • Figure 2: The Framework of Our HIPPO Method.
  • Figure 3: Output Similarity of Models Between Unimodal and Multi-Modal Table Representations. The TAT-QA dataset is used for evaluation.
  • Figure 4: Performance of Different Models Based on Unimodal Table Representations.
  • Figure 5: Case Study. The correct reasoning, incorrect reasoning, and final answer are highlighted.
  • ...and 3 more figures