Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion

Jacob Hansen, Wei Lin, Junmo Kang, Muhammad Jehanzeb Mirza, Hongyin Luo, Rogerio Feris, Alan Ritter, James Glass, Leonid Karlinsky

TL;DR

This work tackles the cost, reproducibility, and openness barriers in visual instruction tuning by introducing Instructify, an open-source pipeline that converts diverse image metadata from 40+ datasets into high-quality VisIT conversations using open LLMs. It introduces an ASCII-tree representation for grounded bounding-box data, a four-component workflow (loading/organization, formatting/conversion, quality-controlled instruction generation, and prompt management), and iterative, checked generation to ensure factual accuracy. Empirical results show open-model data can reproduce or exceed GPT-4–generated baselines, with up to 12% gains on certain benchmarks, and demonstrate strong scalability when expanding metadata sources and instruction counts. The approach emphasizes reproducibility, scalability, and broad applicability to niche domains, with code and pipeline details made openly available for the community.

Abstract

Visual Instruction Tuning (VisIT) data, commonly available as human-assistant conversations with images interleaved in the human turns, are currently the most widespread vehicle for aligning strong LLMs to understand visual inputs, converting them to strong LMMs. While many VisIT datasets are available, most are constructed using ad-hoc techniques developed independently by different groups. They are often poorly documented, lack reproducible code, and rely on paid, closed-source model APIs such as GPT-4, Gemini, or Claude to convert image metadata (labels) into VisIT instructions. This leads to high costs and makes it challenging to scale, enhance quality, or generate VisIT data for new datasets. In this work, we address these challenges and propose an open and unified recipe and approach, Instructify, for converting available metadata to VisIT instructions using open LLMs. Our multi-stage Instructify features an efficient framework for metadata grouping, quality control, data and prompt organization, and conversation sampling. We show that our approach can reproduce or enhance the data quality of available VisIT datasets when applied to the same image data and metadata sources, improving GPT-4 generated VisIT instructions by ~3% on average and up to 12% on individual benchmarks using open models, such as Gemma 2 27B and LLaMa 3.1 70B. Additionally, our approach enables effective performance scaling - both in quantity and quality - by enhancing the resulting LMM performance across a wide range of benchmarks. We also analyze the impact of various factors, including conversation format, base model selection, and resampling strategies. Our code, which supports the reproduction of equal or higher-quality VisIT datasets and facilitates future metadata-to-VisIT data conversion for niche domains, is released at https://github.com/jacob-hansen/Instructify.

Paper Structure

This paper contains 30 sections, 8 figures, 10 tables, 2 algorithms.

Figures (8)

  • Figure 1: Overview of Instructify. We present a unified framework that automatically transforms diverse metadata from publicly available datasets into multimodal instruction-tuning conversations. Our approach merges metadata from multiple sources, categorizing it into captions, bounding boxes, and question-answer (QA) pairs. While LLMs effectively convert captions and QA into natural language, we find that grounded annotations - such as bounding boxes - benefit significantly from a hierarchical representation using an ASCII tree, which captures the geometric and semantic structure of the image. The serialized metadata then undergoes an iterative instruction-generation process, where context is refined and quality is rigorously controlled, with an LLM in the loop.
  • Figure 2: Pipeline of our method. We start with loading and organizing multiple open-source datasets containing captions, question answers, and bounding boxes as metadata. Next, we reformat the metadata in a unified natural language interface which is then converted into multi-modal conversations. These LLM-generated conversations undergo an iterative refinement procedure involving multiple automated tests for quality improvement.
  • Figure 3: ASCII tree example. Our approach automatically collects all available metadata from the available sources (top right) for the image (left). As detailed in the text, we propose a way to re-organize this information into a hierarchical textual data structure - the ASCII tree (middle right) that is effectively converted by an LLM into a detailed high-quality context comprised of multiple factual statements organized in sentences, densely describing the image objects, their relations, and attributes. The resulting context is later fed into our iterative instruction turn generation process detailed in the text.
  • Figure 4: ASCII tree example. Our approach automatically collects metadata from the different sources available for the image. The ASCII tree (top right) is a hierarchical textual data structure that includes the attributes, positions, sizes, and depth information of objects in the image. The ASCII tree is further converted by an LLM into a high-quality dense description (bottom) consisting of multiple factual statements.
  • Figure 5: LLaVA prompt.
  • ...and 3 more figures
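
The captions for Figures 3 and 4 describe the ASCII tree as a hierarchical textual structure built from bounding-box metadata. The following is a minimal sketch of that idea, not the authors' implementation. It assumes that the hierarchy comes from geometric containment (each box is attached to the smallest box that fully contains it) and that boxes are given as (x1, y1, x2, y2) corners. The names Box, build_forest, and render are hypothetical, and the attribute and depth information mentioned in the Figure 4 caption is omitted.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Box:
    """A labeled bounding box with (x1, y1) top-left and (x2, y2) bottom-right corners."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float
    children: list[Box] = field(default_factory=list)

    @property
    def area(self) -> float:
        return (self.x2 - self.x1) * (self.y2 - self.y1)

    def contains(self, other: Box) -> bool:
        """True if `other` lies entirely inside this box (edges may touch)."""
        return (self.x1 <= other.x1 and self.y1 <= other.y1
                and self.x2 >= other.x2 and self.y2 >= other.y2)


def build_forest(boxes: list[Box]) -> list[Box]:
    """Attach each box to the smallest box containing it (assumed nesting rule)."""
    # Largest first, so every possible parent is seen before its children.
    ordered = sorted(boxes, key=lambda b: b.area, reverse=True)
    roots: list[Box] = []
    for i, box in enumerate(ordered):
        parents = [p for p in ordered[:i] if p.contains(box)]
        if parents:
            min(parents, key=lambda p: p.area).children.append(box)
        else:
            roots.append(box)
    return roots


def render(nodes: list[Box], prefix: str = "") -> list[str]:
    """Serialize the forest as ASCII-tree lines, one object per line."""
    lines: list[str] = []
    for i, node in enumerate(nodes):
        last = i == len(nodes) - 1
        connector = "`-- " if last else "|-- "
        coords = f"[{node.x1:g}, {node.y1:g}, {node.x2:g}, {node.y2:g}]"
        lines.append(f"{prefix}{connector}{node.label} {coords}")
        # Children are indented; a vertical bar continues non-final branches.
        lines.extend(render(node.children, prefix + ("    " if last else "|   ")))
    return lines


# Illustrative boxes only; these values are not taken from the paper.
boxes = [
    Box("table", 0, 50, 100, 100),
    Box("cup", 10, 55, 20, 70),
    Box("person", 30, 0, 70, 60),
    Box("hat", 40, 0, 60, 10),
]
tree_lines = render(build_forest(boxes))
print("\n".join(tree_lines))
```

In this sketch the cup nests under the table and the hat under the person, while the person and table stay as separate roots because neither fully contains the other. The resulting text could then be placed in an LLM prompt as serialized context, which is the role the captions assign to the ASCII tree.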