Altogether: Image Captioning via Re-aligning Alt-text

Hu Xu, Po-Yao Huang, Xiaoqing Ellen Tan, Ching-Feng Yeh, Jacob Kahn, Christine Jou, Gargi Ghosh, Omer Levy, Luke Zettlemoyer, Wen-tau Yih, Shang-Wen Li, Saining Xie, Christoph Feichtenhofer

TL;DR

Altogether introduces a principled framework to improve image captions by re-aligning existing alt-text with image content. It combines multi-round human annotation of alt-text with a lightweight, scalable captioner that grounds alt-text into dense captions via a mapping network and a frozen image encoder. The approach yields richer captions and translates into gains for text-to-image generation and zero-shot classification/retrieval, while preserving transparency by grounding in alt-text rather than relying on opaque captioning pipelines. The results indicate that targeted augmentation with alt-text-grounded synthetic data can improve alignment, grounding, and downstream multimodal tasks, with practical implications for scalable, transparent image understanding systems.
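
The captioner design summarized above (a frozen image encoder, a mapping network, and alt-text conditioning) can be illustrated with a minimal shape-level sketch. It is not the authors' implementation: the single linear layer standing in for the mapping network, the function names, and all dimensions are assumptions chosen for illustration, and the image embedding is a placeholder rather than a real CLIP output.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Hypothetical mapping-network weights: an (out_dim x in_dim) matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


def apply_linear(weights, vec):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def map_to_visual_tokens(image_embedding, weights, num_tokens, token_dim):
    """Project one image embedding, then split it into num_tokens vectors."""
    flat = apply_linear(weights, image_embedding)
    if len(flat) != num_tokens * token_dim:
        raise ValueError("weights must produce num_tokens * token_dim values")
    return [flat[i * token_dim:(i + 1) * token_dim]
            for i in range(num_tokens)]


def build_decoder_input(visual_tokens, alt_text_embeddings):
    """Concatenate visual tokens and alt-text token embeddings (assumed order)."""
    return visual_tokens + alt_text_embeddings


rng = random.Random(0)
embed_dim, num_tokens, token_dim = 8, 4, 6  # illustrative sizes only

# Placeholder for the output of a frozen image encoder (never updated).
image_embedding = [rng.random() for _ in range(embed_dim)]

# Only the mapping network (and the decoder, not shown) would be trained.
weights = make_linear(embed_dim, num_tokens * token_dim, rng)
visual_tokens = map_to_visual_tokens(image_embedding, weights,
                                     num_tokens, token_dim)

# Placeholder embeddings for three alt-text tokens.
alt_text_embeddings = [[0.0] * token_dim for _ in range(3)]
decoder_input = build_decoder_input(visual_tokens, alt_text_embeddings)
```

A text decoder would then generate the re-aligned caption conditioned on this combined sequence. Placing the visual tokens before the alt-text tokens is also an assumption of this sketch.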

Abstract

This paper focuses on creating synthetic data to improve the quality of image captions. Existing works typically have two shortcomings: first, they caption images from scratch, ignoring existing alt-text metadata; second, they lack transparency if the captioners' training data (e.g. GPT) is unknown. In this paper, we study a principled approach, Altogether, based on the key idea of editing and re-aligning existing alt-texts associated with the images. To generate training data, we perform human annotation where annotators start with the existing alt-text and re-align it to the image content in multiple rounds, consequently constructing captions with rich visual concepts. This differs from prior work that carries out human annotation as a one-time description task solely based on images and annotator knowledge. We train a captioner on this data that generalizes the process of re-aligning alt-texts at scale. Our results show that our Altogether approach leads to richer image captions that also improve text-to-image generation and zero-shot image classification tasks.

Paper Structure

This paper contains 48 sections, 3 equations, 6 figures, 15 tables.

Figures (6)

  • Figure 1: A Venn diagram illustrating caption quality improvement via multiple rounds of re-aligning previous captions (starting from alt-text) to the image.
  • Figure 2: Re-aligning alt-texts: Our captioner takes visual and alt-text input. We extract image embeddings from a frozen CLIP encoder and transform them into a fixed number of visual tokens. Given the alt-text, the decoder can ground this information (e.g., by carrying over concrete visual concepts) to generate a better caption that is aligned with the image.
  • Figure 3: Human evaluation of generated captions on alignment and hallucination ("which caption has the best alignment with the image and least hallucination"), specificity ("which caption contains more named entities"), and usefulness of alt-text information ("which caption contains the most useful information from alt-texts").
  • Figure 4: Zero-shot classification accuracy on ImageNet and averaged over 26 CLIP tasks, for various CLIP ViT-B/32 models trained with different ratios of synthetic captions mixed into the training data.
  • Figure 5: Annotation guideline.
  • ...and 1 more figure
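
The caption mixing studied in Figure 4 can be sketched as a per-sample choice between the original alt-text and a synthetic caption. This is a minimal sketch assuming independent random selection per training sample; the function name and the sampling scheme are hypothetical, and the paper's actual mixing procedure may differ.

```python
import random


def pick_caption(alt_text, synthetic_caption, synthetic_ratio, rng):
    """Return the synthetic caption with probability synthetic_ratio,
    otherwise the original alt-text (assumed mixing scheme)."""
    if not 0.0 <= synthetic_ratio <= 1.0:
        raise ValueError("synthetic_ratio must lie in [0, 1]")
    # rng.random() is in [0, 1), so ratio 0 never picks the synthetic
    # caption and ratio 1 always does.
    if rng.random() < synthetic_ratio:
        return synthetic_caption
    return alt_text


rng = random.Random(0)
# Illustrative strings only, not taken from the paper's data.
alt_text = "example alt-text"
synthetic_caption = "example synthetic caption"

# Draw captions for a small batch at a 50% mixing ratio.
batch = [pick_caption(alt_text, synthetic_caption, 0.5, rng)
         for _ in range(8)]
```

Sweeping `synthetic_ratio` from 0 to 1 and training one model per setting would correspond to the kind of comparison the figure reports.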