Masked Diffusion Captioning for Visual Feature Learning
Chao Feng, Zihao Wei, Andrew Owens
TL;DR
This work learns visual representations from image-caption pairs without the left-to-right conditioning used in autoregressive captioning. It introduces masked diffusion captioning (MDC), an image-conditioned masked diffusion language model that masks caption tokens according to a time-dependent schedule and trains a decoder, conditioned on visual features, to reconstruct the original caption. In linear probing, the learned visual representations are competitive with autoregressive and contrastive baselines; the model also shows reasonable captioning capability and strong vision-language compositionality. The results suggest that masked diffusion language models are a viable alternative to autoregressive captioning for visual feature learning and that the approach can scale to larger caption collections.
Abstract
We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text. After training, the learned visual features can be applied to downstream vision tasks. Unlike autoregressive captioning, the strength of the visual learning signal in MDC does not depend on each token's position in the sequence, reducing the need for auxiliary objectives. Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those produced by autoregressive and contrastive approaches.
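
The masking step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `[MASK]` string, the uniform sampling of the ratio, the helper names `mask_caption` and `reconstruction_loss`, and the unweighted average over masked positions are all assumptions made for illustration. The abstract does not specify the schedule or any loss weighting, and the image-conditioned decoder is left out entirely; the loss helper simply takes the per-token log-probabilities that such a decoder would produce.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, chosen for illustration


def mask_caption(tokens, rng):
    """Mask each caption token independently with a randomly drawn ratio t.

    Assumption: t is sampled uniformly from (0, 1]; the actual schedule
    used in the paper is not given in the abstract.
    """
    t = 1.0 - rng.random()  # rng.random() is in [0, 1), so t is in (0, 1]
    is_masked = [rng.random() < t for _ in tokens]
    noisy = [MASK if m else tok for tok, m in zip(tokens, is_masked)]
    return t, noisy, is_masked


def reconstruction_loss(token_log_probs, is_masked):
    """Average negative log-likelihood over the masked positions only.

    token_log_probs[i] stands in for the log-probability that a
    (hypothetical) image-conditioned decoder assigns to the original
    token at position i. Unmasked positions contribute nothing.
    """
    masked = [lp for lp, m in zip(token_log_probs, is_masked) if m]
    if not masked:
        return 0.0
    return -sum(masked) / len(masked)


rng = random.Random(0)
caption = "a dog runs on the beach".split()
t, noisy, is_masked = mask_caption(caption, rng)
print(t, noisy)
```

Because every position is masked with the same probability, no token is privileged by where it sits in the caption, which is the property the abstract contrasts with autoregressive captioning.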
