How Do Inpainting Artifacts Propagate to Language?

Pratham Yashwante, Davit Abrahamyan, Shresth Grover, Sukruth Rao

TL;DR

We use a two-stage diagnostic setup: masked image regions are first reconstructed by inpainting, and the reconstructions are then passed to captioning models. This enables controlled comparisons between captions generated from original and reconstructed inputs.

Abstract

We study how visual artifacts introduced by diffusion-based inpainting affect language generation in vision-language models. We use a two-stage diagnostic setup in which masked image regions are reconstructed and then provided to captioning models, enabling controlled comparisons between captions generated from original and reconstructed inputs. Across multiple datasets, we analyze the relationship between reconstruction fidelity and downstream caption quality. We observe consistent associations between pixel-level and perceptual reconstruction metrics and both lexical and semantic captioning performance. Additional analysis of intermediate visual representations and attention patterns shows that inpainting artifacts lead to systematic, layer-dependent changes in model behavior. Together, these results provide a practical diagnostic framework for examining how visual reconstruction quality influences language generation in multimodal systems.
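The fidelity-versus-caption-quality analysis described above can be illustrated with a minimal sketch. It assumes PSNR as the pixel-level metric and a Pearson coefficient as the measure of association; the abstract does not name the specific metrics or correlation statistic, so both choices, the function names `psnr` and `pearson`, and the numbers in the demo are hypothetical placeholders rather than the authors' pipeline.

```python
import numpy as np


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two same-shaped images."""
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((original - reconstructed) ** 2)
    if mse == 0.0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    return float(np.corrcoef(x, y)[0, 1])


if __name__ == "__main__":
    # Hypothetical per-image values, made up purely for illustration:
    # one fidelity score and one caption-quality score per reconstruction.
    fidelity_scores = [18.2, 21.5, 24.9, 27.3, 30.1]
    caption_scores = [0.31, 0.38, 0.44, 0.52, 0.55]
    print("Pearson r:", pearson(fidelity_scores, caption_scores))
```

In practice, each fidelity score would come from comparing an original image with its inpainted reconstruction, and each caption score from comparing the caption of the reconstruction against a reference caption.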

Paper Structure (46 sections, 13 figures, 9 tables)

Figures (13)

  • Figure 1: Qualitative examples illustrating captioning errors induced by center-region inpainting. Incorrect semantic attributes introduced by inpainting are highlighted in red, while the correct interpretation is shown in green.
  • Figure 2: Degradation–reconstruction–captioning framework used to evaluate how inpainting artifacts propagate into downstream language outputs.
  • Figure 3: Masking examples illustrating center-region degradations on (A) Flickr and (B) RefCOCOg.
  • Figure 4: Correlations between reconstruction fidelity metrics and caption quality metrics on Flickr, RefCOCOg, and TRUCE. Points correspond to Stable Diffusion inpainting variants under three masking strategies: cm (hard center mask), gc (Gaussian-blurred center), and ld (low-dimensional center degradation). Caption quality is evaluated using BLIP for Flickr, and Qwen2.5-VL for RefCOCOg and TRUCE.
  • Figure 5: Layer-wise attention drift and entropy under inpainting on Flickr. Drift increases with depth and is higher for center-masked reconstructions.
  • ...and 8 more figures
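The layer-wise attention entropy and drift mentioned in the Figure 5 caption can be sketched as follows. Entropy here is the standard Shannon entropy of each attention row. The caption does not define drift, so this sketch assumes the mean total-variation distance between attention rows computed on the original and the reconstructed image; that definition, the function names, and the array layout (rows are probability distributions over keys) are all assumptions, not the paper's stated method.

```python
import numpy as np


def attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (nats) over attention rows.

    attn: array of shape (..., num_keys) whose last axis sums to 1.
    """
    attn = np.asarray(attn, dtype=np.float64)
    row_entropy = -np.sum(attn * np.log(attn + eps), axis=-1)
    return float(np.mean(row_entropy))


def attention_drift(attn_original, attn_reconstructed):
    """Assumed drift measure: mean total-variation distance between
    matching attention rows from the original and reconstructed inputs."""
    a = np.asarray(attn_original, dtype=np.float64)
    b = np.asarray(attn_reconstructed, dtype=np.float64)
    return float(np.mean(0.5 * np.sum(np.abs(a - b), axis=-1)))


def layerwise_stats(layers_original, layers_reconstructed):
    """Return (drift, entropy of reconstructed attention) per layer."""
    return [
        (attention_drift(a, b), attention_entropy(b))
        for a, b in zip(layers_original, layers_reconstructed)
    ]
```

Calling `layerwise_stats` with one attention array per layer for each input yields a per-layer list that could be plotted against depth, in the spirit of Figure 5.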