TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition

Junyuan Zhang, Bin Wang, Qintong Zhang, Fan Wu, Zichen Wen, Jialin Lu, Junjie Shan, Ziqi Zhao, Shuya Yang, Ziling Wang, Ziyang Miao, Huaping Zhong, Yuhang Zang, Xiaoyi Dong, Ka-Ho Chow, Conghui He

TL;DR

TRivia proposes a self-supervised fine-tuning framework for table recognition that learns from unlabeled table images via QA-based rewards and GRPO. It introduces an adaptive data engine with response-consistency sampling and attention-guided QA generation to produce diverse, verifiable supervision. The resulting TRivia-3B model achieves state-of-the-art TR performance across multiple benchmarks while remaining open-source and deployable offline. TRivia also demonstrates utility as a scalable data annotation tool that can distill high-quality pseudo-labels for downstream tasks.

Abstract

Table recognition (TR) aims to transform table images into semi-structured representations such as HTML or Markdown. As a core component of document parsing, TR has long relied on supervised learning, with recent efforts dominated by fine-tuning vision-language models (VLMs) using labeled data. While VLMs have brought TR to the next level, pushing performance further demands large-scale labeled data that is costly to obtain. Consequently, although proprietary models have continuously pushed the performance boundary, open-source models, often trained with limited resources and, in practice, the only viable option for many due to privacy regulations, still lag far behind. To bridge this gap, we introduce TRivia, a self-supervised fine-tuning method that enables pretrained VLMs to learn TR directly from unlabeled table images in the wild. Built upon Group Relative Policy Optimization, TRivia automatically identifies unlabeled samples that most effectively facilitate learning and eliminates the need for human annotations through a question-answering-based reward mechanism. An attention-guided module generates diverse questions for each table image, and the ability to interpret the recognition results and answer them correctly provides feedback to optimize the TR model. This closed-loop process allows the TR model to autonomously learn to recognize, structure, and reason over tables without labeled data. Leveraging this pipeline, we present TRivia-3B, an open-sourced, compact, and state-of-the-art TR model that surpasses existing systems (e.g., Gemini 2.5 Pro, MinerU2.5) on three popular benchmarks. Model and code are released at: https://github.com/opendatalab/TRivia
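The QA-based reward and the group-relative update described in the abstract can be illustrated with a minimal sketch. It is not the released TRivia implementation: the function names `qa_reward` and `group_relative_advantages`, the case-insensitive exact-match rule, and the use of the population standard deviation are all assumptions made for illustration. The sketch assumes that some answering model has already read each recognition result and produced one answer per question.

```python
from statistics import mean, pstdev


def qa_reward(predicted_answers, reference_answers):
    """Fraction of questions answered correctly from one recognition result.

    Hypothetical scoring rule: case-insensitive exact match after stripping
    whitespace. The paper's actual answer-matching rule may differ.
    """
    if not reference_answers:
        return 0.0
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predicted_answers, reference_answers)
    )
    return hits / len(reference_answers)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    GRPO scores each response relative to the other responses sampled for
    the same input: subtract the group mean, divide by the group standard
    deviation. Using the population standard deviation is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four recognition results sampled for the same table image, each scored
# by how many of the two generated questions it lets the answerer get right.
reference = ["12", "total"]
group_answers = [
    ["12", "Total"],   # both correct
    ["12", "sum"],     # one correct
    ["21", "total"],   # one correct
    ["21", "sum"],     # none correct
]
rewards = [qa_reward(answers, reference) for answers in group_answers]
advantages = group_relative_advantages(rewards)
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal, which is one reason to prefer training images on which the model's outputs differ.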

Paper Structure

This paper contains 18 sections, 3 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: TRivia-3B learns from unlabeled table images and achieves TR quality beyond the limit attainable by fine-tuning with labeled data. Unlike proprietary systems such as Gemini 2.5 Pro, it is open-sourced, compact, and can be deployed offline for privacy-sensitive document processing.
  • Figure 2: TRivia features (a) an adaptive dataset curation module to set the stage for (b) reinforcement learning to learn TR from unlabeled data. During dataset curation, TRivia uses a response-consistency sampling strategy (Section \ref{sec:sampling}) to identify informative samples and generate verifiable, diverse QA for each image through an attention-guided module (Section \ref{sec:qa_generation}). Based on curated data, TRivia fine-tunes the VLM to recognize, structure, and reason over tables through QA-based rewards (Section \ref{sec:rl_training}).
  • Figure 3: Single-pass QA generation captures limited table content, while repeated sampling introduces redundant or overlapping QA pairs. The proposed attention-guided QA generation leverages attention distributions to diversify question sources, producing concise and comprehensive QA pairs.
  • Figure 4: The teacher model $M_\text{QA}$ used for QA generation cannot directly produce correct annotations, but it is sufficient for generating the QA pairs that TRivia uses to fine-tune the base model, yielding TRivia-3B, which can handle complex structures.
  • Figure 5: TRivia-3B benefits significantly from the diverse QAs generated by the attention-guided mechanism and the training samples that yield diverse outputs.
  • ...and 1 more figure
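The response-consistency sampling mentioned in the captions of Figures 2 and 5 can be sketched as follows. This is one possible reading, not the paper's procedure: it assumes that several recognition outputs are sampled per image, that consistency is the mean pairwise similarity between those outputs, and that images below a threshold are kept as informative. The names `response_consistency` and `select_informative`, the `max_consistency` threshold, and the use of `difflib.SequenceMatcher` as a stand-in for a table-aware similarity metric are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def response_consistency(responses):
    """Mean pairwise string similarity among sampled outputs for one image.

    SequenceMatcher is a generic stand-in; a table-specific metric would
    be a more faithful choice. Returns 1.0 when fewer than two responses
    are given, since there is nothing to disagree with.
    """
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


def select_informative(samples, max_consistency=0.9):
    """Keep image ids whose sampled outputs disagree with each other.

    `samples` maps an image id to the list of outputs sampled for it.
    The threshold value is an illustrative assumption.
    """
    return [
        image_id
        for image_id, responses in samples.items()
        if response_consistency(responses) < max_consistency
    ]


samples = {
    "easy": ["<td>1</td>", "<td>1</td>", "<td>1</td>"],
    "hard": ["<td>1</td>", "<td>7</td><td>x</td>", "<table></table>"],
}
selected = select_informative(samples)
```

Images on which all sampled outputs agree would yield identical rewards within a group, so filtering them out concentrates training on images where the reward can distinguish better outputs from worse ones.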