Masked Diffusion Captioning for Visual Feature Learning

Chao Feng, Zihao Wei, Andrew Owens

TL;DR

This work tackles learning visual representations from image-caption pairs without relying on autoregressive sequence conditioning. It introduces masked diffusion captioning (MDC), an image-conditioned masked diffusion language model that masks text tokens with a time-based schedule and trains a decoder conditioned on visual features to reconstruct captions. The approach yields visual representations that are competitive with autoregressive and contrastive baselines in linear probing, demonstrates reasonable captioning capability, and shows strong vision-language compositionality. The results suggest that masked diffusion language models are a viable alternative to autoregressive captioning for visual feature learning and that they scale to larger caption collections.

Abstract

We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text. After training, the learned visual features can be applied to downstream vision tasks. Unlike autoregressive captioning, the strength of the visual learning signal in MDC does not depend on each token's position in the sequence, reducing the need for auxiliary objectives. Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those produced by autoregressive and contrastive approaches.
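
The training step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: `MASK_ID`, `mask_caption`, and `mdc_loss` are hypothetical names, and the `1/t` cross-entropy weight is an assumption borrowed from common masked diffusion language models with a linear schedule (the text only states that the time step determines a masking ratio and a loss weight).

```python
import numpy as np

MASK_ID = 0  # hypothetical id reserved for the [MASK] token


def mask_caption(tokens, t, rng):
    """Replace each token with MASK_ID independently with probability t."""
    tokens = np.asarray(tokens)
    is_masked = rng.random(tokens.shape) < t
    noisy = np.where(is_masked, MASK_ID, tokens)
    return noisy, is_masked


def mdc_loss(log_probs, tokens, is_masked, t):
    """Cross-entropy on masked positions only, scaled by 1/t.

    log_probs: (N, V) per-position log-probabilities from an
    image-conditioned decoder (not modeled here).
    The 1/t weight is an assumption, not taken from the source.
    """
    tokens = np.asarray(tokens)
    nll = -log_probs[np.arange(len(tokens)), tokens]
    return float((nll * is_masked).sum() / t)


# Usage: sample a time step, mask the caption, score a dummy decoder output.
rng = np.random.default_rng(0)
caption = [5, 7, 9, 11]
t = rng.uniform(1e-3, 1.0)  # avoid t = 0, where the weight is undefined
noisy, is_masked = mask_caption(caption, t, rng)
uniform = np.log(np.full((len(caption), 16), 1.0 / 16))
loss = mdc_loss(uniform, caption, is_masked, t)
```

Because every position is masked with the same probability, each token contributes to the loss regardless of where it sits in the sequence, which is the property the abstract contrasts with autoregressive captioning.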

Paper Structure

This paper contains 38 sections, 9 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: Learning visual features by masked diffusion language modeling. We learn visual features by captioning images using an image-conditioned masked diffusion language model. After training, features from the visual encoder can be transferred to downstream computer vision tasks.
  • Figure 2: Learning visual features using masked diffusion captioning. (a) We train an image-conditioned masked diffusion language model to learn visual features. Given an image and its corresponding text caption, we randomly mask text tokens in the caption. We then reconstruct the caption using a decoder that is conditioned on visual features (obtained from a separate encoder network) and the text tokens. In each training iteration, we sample a time step $t$ that determines a masking ratio and a cross-entropy weight: $t=0$ means no tokens are masked, while $t=1$ means the sequence is fully masked. (b) During sampling, we start with a fully masked sequence containing $N'$ mask tokens. We then iteratively denoise for $N'$ steps to obtain a full caption.
  • Figure 3: Dataset caption length distribution. We visualize the caption length distribution, after tokenization, for CC3M (Sharma et al., 2018), CC12M (Changpinyo et al., 2021), and a randomly sampled 10M subset of Recap-DataComp (Li et al., 2024).
  • Figure 4: Comparison to image-conditioned BERT with different masking ratios. We compare our method against BERT with varying masking ratios, including 100$\%$ (parallel decoding). While BERT with certain masking ratios achieves performance close to ours, our method adopts a unified schedule, avoiding the need to tune the masking ratio on each dataset.
  • Figure 5: Linear probing performance with varying numbers of image–text pairs. We randomly sample 5M, 10M, 20M, and 30M pairs from Recap-DataComp-1B (Li et al., 2024) to pretrain our method. As the number of image–text pairs increases, the linear probing performance on IN-1K improves.
  • ...and 2 more figures
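
The sampling procedure in the Figure 2(b) caption (start from $N'$ mask tokens, denoise for $N'$ steps) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's algorithm: `sample_caption` and `toy_denoiser` are hypothetical names, the denoiser is a stand-in for the image-conditioned decoder, and revealing the single most confident position per step is an assumed unmasking rule that the caption does not specify.

```python
import numpy as np

MASK_ID = 0  # hypothetical id reserved for the [MASK] token


def sample_caption(denoiser, image_features, length):
    """Start fully masked and reveal one token per step for `length` steps."""
    seq = np.full(length, MASK_ID, dtype=np.int64)
    for _ in range(length):
        scores = np.array(denoiser(image_features, seq), dtype=float)  # (length, V)
        scores[:, MASK_ID] = -np.inf  # never predict the mask token itself
        masked = np.flatnonzero(seq == MASK_ID)
        # Assumed rule: unmask the position with the highest top score.
        pos = masked[np.argmax(scores[masked].max(axis=1))]
        seq[pos] = int(np.argmax(scores[pos]))
    return seq


def toy_denoiser(image_features, seq, vocab_size=8):
    """Hypothetical stand-in decoder: prefers token i + 1 at position i."""
    scores = np.full((len(seq), vocab_size), -5.0)
    idx = np.arange(len(seq))
    scores[idx, idx % (vocab_size - 1) + 1] = 0.0
    return scores


caption = sample_caption(toy_denoiser, image_features=None, length=5)
```

With $N'$ steps and one token revealed per step, the loop ends with no mask tokens left in the sequence.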