
Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning

Rongjie Li, Yu Wu, Xuming He

TL;DR

The paper tackles the high labeling cost of second-stage instruction tuning for zero-shot vision-language tasks. It introduces Image-Conditioned Caption Correction (ICCC), a pre-training task that uses unlabeled image-text data and a lightweight dependency parser to generate concept-mismatched captions, which the model learns to correct conditioned on the image. The data pipeline consists of a concept extractor and a correction data constructor, and the generated correction samples are balanced with the original data. ICCC improves zero-shot image-text generation (ITG) tasks on BLIP-2 and InstructBLIP without extra annotations, with gains across VQA and image captioning benchmarks, offering a practical and scalable route to better cross-modal reasoning and generation.

Abstract

Generative vision-language models (VLMs) have shown impressive performance in zero-shot vision-language tasks like image captioning and visual question answering. However, improving their zero-shot reasoning typically requires second-stage instruction tuning, which relies heavily on human-labeled or large language model-generated annotations, incurring high labeling costs. To tackle this challenge, we introduce Image-Conditioned Caption Correction (ICCC), a novel pre-training task designed to enhance VLMs' zero-shot performance without the need for labeled task-aware data. The ICCC task compels VLMs to rectify mismatches between visual and language concepts, thereby enhancing instruction following and text generation conditioned on visual inputs. Leveraging language structure and a lightweight dependency parser, we construct data samples for the ICCC task from image-text datasets with low labeling and computation costs. Experimental results on BLIP-2 and InstructBLIP demonstrate significant improvements in zero-shot image-text generation-based VL tasks through ICCC instruction tuning.

Paper Structure (31 sections, 2 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Comparison of second-stage tuning approaches for zero-shot VL task adaptation. The instruction tuning in recent works requires human-labeled or LLM-generated data; in contrast, our image caption correction tuning is conducted on unlabeled image-text data with an NLP parser.
  • Figure 2: Overall pipeline of ICCC. The concept extractor parses the sentence to obtain the linguistic units of concepts. The task data constructor produces samples according to the sentence structure using the "replace" and "swap" operations. Finally, the generated ICCC data is used for image-to-text generative training of VLMs.
  • Figure 3: Hyper-parameter search over $p_c$ and $p_s$.
  • Figure 4: Visualization of model output examples and attention gradients on images. The first block shows three examples from the GQA testdev set, and the second block shows three examples from the NoCaps validation set. With our training, the model focuses more accurately on prompt-relevant image regions and generates captions with more detailed descriptions of scenes and actions.
  • Figure 5: Examples of constructed mismatched captions, categorized by the type of concept modified. The linguistic units modified by the replace operation are highlighted in red, and those modified by the swap operation are highlighted in orange.
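
The "replace" and "swap" operations named in the Figure 2 and Figure 5 captions can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Concept` spans are hand-annotated here, whereas the paper obtains them from a dependency parser. The names `Concept`, `replace_op`, `swap_op`, `build_sample`, the `vocabulary` of substitutes, and the `swap_prob` parameter are all hypothetical and are not claimed to correspond to $p_c$ or $p_s$.

```python
import random
from dataclasses import dataclass


@dataclass
class Concept:
    """A linguistic unit inside a tokenized caption (hypothetical structure)."""
    start: int  # index of the unit's first token
    end: int    # index one past the unit's last token
    kind: str   # illustrative type label, e.g. "object" or "attribute"


def replace_op(tokens, concept, vocabulary, rng):
    """Substitute one unit with a different unit of the same kind."""
    original = tokens[concept.start:concept.end]
    # Assumes the vocabulary holds at least one alternative for this kind.
    candidates = [c for c in vocabulary[concept.kind] if c.split() != original]
    substitute = rng.choice(candidates).split()
    return tokens[:concept.start] + substitute + tokens[concept.end:]


def swap_op(tokens, a, b):
    """Exchange the positions of two non-overlapping units."""
    if a.start > b.start:
        a, b = b, a
    assert a.end <= b.start, "units must not overlap"
    return (tokens[:a.start] + tokens[b.start:b.end] + tokens[a.end:b.start]
            + tokens[a.start:a.end] + tokens[b.end:])


def build_sample(caption, concepts, vocabulary, swap_prob=0.5, seed=0):
    """Return a (mismatched caption -> original caption) training pair."""
    rng = random.Random(seed)
    tokens = caption.split()
    # Only units of the same kind are considered swappable in this sketch.
    pairs = [(a, b) for i, a in enumerate(concepts)
             for b in concepts[i + 1:] if a.kind == b.kind]
    if pairs and rng.random() < swap_prob:
        a, b = rng.choice(pairs)
        corrupted, operation = swap_op(tokens, a, b), "swap"
    else:
        concept = rng.choice(concepts)
        corrupted = replace_op(tokens, concept, vocabulary, rng)
        operation = "replace"
    return {"input": " ".join(corrupted), "target": caption,
            "operation": operation}


caption = "a brown dog chases a white cat"
concepts = [Concept(1, 2, "attribute"), Concept(2, 3, "object"),
            Concept(5, 6, "attribute"), Concept(6, 7, "object")]
vocabulary = {"attribute": ["brown", "white", "black"],
              "object": ["dog", "cat", "horse"]}

print(build_sample(caption, concepts, vocabulary))
```

In this sketch the mismatched caption is the model input and the original caption is the generation target, so training on such pairs together with the image asks the model to restore the concept that no longer matches the visual content.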