Data Metabolism: An Efficient Data Design Schema For Vision Language Model

Jingyuan Zhang, Hongzhi Zhang, Zhou Haonan, Chenxi Sun, Xingguang Ji, Jiakang Wang, Fanheng Kong, Yahui Liu, Qi Wang, Fuzheng Zhang

TL;DR

The paper tackles data quality and its central role in Visual Language Model performance by introducing Data Metabolism, a data-centric, closed-loop lifecycle with Data Anabolism (data construction and quality enhancement) and Data Catabolism (diagnosis and dataset updating). It provides actionable steps, a codebook-style data processing recipe, and demonstrates Capybara-VL-7B, a compact VLM that matches or exceeds larger open-source and some proprietary models on diverse multimodal tasks. The work combines iterative data filtering, answer augmentation with CoT data, and incremental validation to show that targeted data improvements yield outsized gains, including notable gains in mathematical and scientific reasoning as well as OCR/text-rich understanding. This approach underscores the practical impact of data-centric design in building smaller, efficient VLMs and sets a framework for ongoing dataset refinement guided by model behavior and diagnostics.

Abstract

Data curation plays a crucial role in training powerful Visual Language Models (VLMs). In this work, we introduce the concept of Data Metabolism and present our data-centric framework to build VLMs throughout the development lifecycle. Starting from a standard model architecture, we discuss and provide insights into two crucial development steps: data curation and iteration, forming a closed-loop system that continuously improves model performance. We show a detailed codebook on how to process existing massive datasets and build a user-specific data flywheel. As a demonstration, we release a VLM, named Capybara-VL, which excels in typical multimodal tasks (e.g., visual question answering, scientific reasoning, and text-rich tasks). Despite its relatively compact size, Capybara-VL surpasses several open-source models that are up to 10 times larger. Moreover, it achieves results on par with those of several leading proprietary models, demonstrating its competitiveness. These results highlight the power of our data-centric framework and the potential of training smaller and more efficient VLMs.

Paper Structure

This paper contains 40 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: A concise illustration of data metabolism, where training data are iteratively collected, filtered, and improved, while diagnosing the model to guide actions for the next iteration.
  • Figure 2: Comparison of the data used in the three training stages before and after Data Metabolism. See the Appendix for details on the data categories (i.e., Table \ref{tab:data_mixture_stage2} and Table \ref{tab:data_mixture_stage3}).
  • Figure 3: (A) It is necessary to decompose the inference process and attribute results step-by-step for complex tasks. (B) For tasks with large distribution gaps, additional targeted data are needed to improve coverage. (C)-(F) Examples of common methods for diagnosing data issues.
  • Figure 4: Samples filtered by our pipeline: (a) repetitive generation, (b) abnormal behavior of the teacher model, (c) question-image mismatch, (d) hallucination from synthetic data.
  • Figure 5: Quality improvement through CoT and answer rewriting.
  • ...and 1 more figure
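
The Figure 4 caption names repetitive generation as one category of filtered samples. This summary does not say how the paper detects it, so the following is only a minimal sketch of one common heuristic: flag an answer when too many of its word n-grams repeat an earlier n-gram. The function names, the `answer` field, the n-gram size, and the 0.3 threshold are all assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Return the fraction of word n-grams that repeat an earlier n-gram."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        # Text shorter than n words cannot contain a repeated n-gram.
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


def is_repetitive(text: str, n: int = 3, threshold: float = 0.3) -> bool:
    """Flag text whose repeated n-gram ratio exceeds the (assumed) threshold."""
    return repeated_ngram_ratio(text, n) > threshold


def filter_samples(samples, n: int = 3, threshold: float = 0.3):
    """Keep samples whose hypothetical 'answer' field is not flagged."""
    return [s for s in samples if not is_repetitive(s["answer"], n, threshold)]


if __name__ == "__main__":
    samples = [
        {"answer": "the cat sat on the mat"},
        {"answer": "a b c a b c a b c a b c"},
    ]
    for sample in filter_samples(samples):
        print(sample["answer"])
```

A surface heuristic like this only targets category (a). The other categories in the caption (teacher-model anomalies, question-image mismatch, hallucination) depend on the image or on model behavior and would need different checks.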