HIPPO: Enhancing the Table Understanding Capability of Large Language Models through Hybrid-Modal Preference Optimization
Zhenghao Liu, Haolan Wang, Xinze Li, Qiushi Xiong, Xiaocui Yang, Yu Gu, Yukun Yan, Qi Shi, Fangfang Li, Ge Yu, Maosong Sun
TL;DR
HIPPO tackles tabular reasoning in large language models by jointly leveraging text-based and image-based table representations. It introduces Hybrid-Modal Preference Optimization, which combines modality-consistent sampling with Direct Preference Optimization (DPO) to train multi-modal large language models (MLLMs), achieving improvements of over 4% on table question answering (TQA) and table fact verification (TFV) and remaining robust across unimodal and multi-modal inputs. The method builds a diverse training signal by sampling responses from multiple modalities and selecting representative negatives, which mitigates modality bias and enriches the semantics extracted from tables. Empirical results on TQA and TFV, together with ablations and case studies, show that HIPPO both strengthens multi-modal reasoning and generalizes to unimodal table representations, highlighting the value of integrating textual and visual table semantics for robust table understanding.
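The idea of sampling from several modalities and picking a representative negative can be illustrated with a minimal sketch. The function name `build_preference_pair`, the modality keys, and the rule "the most frequent incorrect answer across all modalities becomes the rejected response" are assumptions made for illustration; the summary above does not specify the exact selection rule, and this is not the released implementation.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and strip an answer so that trivial variants compare equal."""
    return answer.strip().lower()


def build_preference_pair(responses_by_modality, gold_answer):
    """Build one (chosen, rejected) pair from responses sampled per modality.

    responses_by_modality: dict mapping a modality name (hypothetical keys
    such as "text", "image", "hybrid") to a list of sampled answer strings.
    Returns None when no pair can be formed.
    """
    gold = normalize(gold_answer)
    # Pool responses from every modality into a single candidate list.
    pooled = [r for rs in responses_by_modality.values() for r in rs]

    correct = [r for r in pooled if normalize(r) == gold]
    wrong = [normalize(r) for r in pooled if normalize(r) != gold]
    if not correct or not wrong:
        return None  # a pair needs both a correct and an incorrect response

    # Assumed rule: the incorrect answer that recurs most often across
    # modalities is treated as the representative negative.
    rejected, _count = Counter(wrong).most_common(1)[0]
    return {"chosen": correct[0], "rejected": rejected}


samples = {
    "text": ["12", "15", "15"],
    "image": ["15", "12", "9"],
    "hybrid": ["12", "15", "12"],
}
pair = build_preference_pair(samples, gold_answer="12")
```

Pooling before counting means an error that appears in only one modality is less likely to be selected than one the model repeats regardless of input form, which is one plausible reading of "representative negatives".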
Abstract
Tabular data contains rich structural semantics and plays a crucial role in organizing and manipulating information. To better capture these structural semantics, this paper introduces the HybrId-modal Preference oPtimizatiOn (HIPPO) model, which represents tables using both text and images and optimizes multi-modal large language models (MLLMs) to learn more comprehensive table information from these modalities. Specifically, HIPPO samples model responses from hybrid-modal table representations and designs a modality-consistent sampling strategy to enhance response diversity and mitigate modality bias during Direct Preference Optimization (DPO) training. Experimental results on table question answering and table fact verification tasks demonstrate the effectiveness of HIPPO, which achieves a 4% improvement over various table reasoning models. Further analysis reveals that HIPPO not only enhances reasoning abilities based on unimodal table representations but also facilitates the extraction of crucial and distinct semantics from different modal representations. All data and code are available at https://github.com/NEUIR/HIPPO.
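For readers unfamiliar with DPO, the standard per-pair objective is the negative log-sigmoid of a scaled log-probability margin between the preferred and rejected responses, measured relative to a frozen reference model. The sketch below computes that loss for a single pair from summed sequence log-probabilities. The function name `dpo_loss` and the default `beta=0.1` are illustrative assumptions, not values taken from the paper.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the policy (logp_*) or the frozen reference model (ref_logp_*).
    beta=0.1 is an assumed default, not a setting reported in the paper.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); branch on the
    # sign so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0 and the loss is log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Policy favors the chosen response more than the reference does.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls below log(2) once the policy assigns relatively more probability to the chosen response than the reference does, which is the signal that preference pairs built from text, image, and hybrid table inputs would provide during training.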
