Token Painter: Training-Free Text-Guided Image Inpainting via Mask Autoregressive Models

Longtao Jiang, Jie Huang, Mingfei Han, Lei Chen, Yongqiang Yu, Feng Zhao, Xiaojun Chang, Zhihui Li

TL;DR

Token Painter tackles text-guided image inpainting with Mask AutoRegressive (MAR) models, which preserve the background while following prompt details. It introduces two training-free modules: Dual-Stream Encoder Information Fusion (DEIF), which fuses semantic cues from the text and contextual cues from the background in the encoder to produce novel guidance tokens $T_{gf}$, and Adaptive Decoder Attention Score Enhancing (ADAE), which modulates attention scores during decoding. Built on the NOVA MAR backbone, Token Painter outperforms prior state-of-the-art methods on almost all EditBench and BrushBench metrics without fine-tuning on inpainting datasets. The approach improves both prompt alignment and background consistency, highlighting the potential of AR-based models for controllable, high-quality inpainting.

Abstract

Text-guided image inpainting aims to fill masked image regions according to a textual prompt while preserving the background. Although diffusion-based methods have become dominant, they model the entire image in latent space, which makes it difficult for the results to align well with prompt details and to keep the background consistent. To address these issues, we explore Mask AutoRegressive (MAR) models for this task. MAR naturally supports image inpainting by generating only the latent tokens that correspond to mask regions, enabling better local controllability without altering the background. However, directly applying MAR to this task causes the inpainted content either to ignore the prompt or to clash with the background context. By analyzing the attention maps of the inpainted images, we identify the impact of background tokens on text tokens during MAR generation and leverage it to design Token Painter, a training-free text-guided image inpainting method based on MAR. Our approach introduces two key components: (1) Dual-Stream Encoder Information Fusion (DEIF), which fuses the semantic and context information from text and background in the frequency domain to produce novel guidance tokens, allowing MAR to generate text-faithful inpainting content that remains harmonious with the background context; and (2) Adaptive Decoder Attention Score Enhancing (ADAE), which adaptively enhances attention scores on guidance tokens and inpainting tokens to further improve the alignment with prompt details and the visual quality of the content. Extensive experiments demonstrate that our training-free method outperforms prior state-of-the-art methods across almost all metrics. Code: https://github.com/longtaojiang/Token-Painter.
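
The abstract's claim that MAR generates only the latent tokens in the mask region, leaving the background untouched, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the function names `inpaint_tokens` and `mean_of_known`, the random generation order, the number of steps, and the toy predictor that stands in for the MAR model. Text conditioning, DEIF and ADAE are not modeled.

```python
import numpy as np


def inpaint_tokens(tokens, mask, predict_fn, num_steps=4, seed=0):
    """Fill only the masked token positions over several steps.

    tokens:     (N, D) array of latent tokens; rows under the mask are overwritten.
    mask:       (N,) boolean array, True where a token must be generated.
    predict_fn: hypothetical stand-in for the model; called as
                predict_fn(tokens, known, targets) and expected to return
                an array of shape (len(targets), D).
    """
    rng = np.random.default_rng(seed)
    out = tokens.copy()                 # never modify the caller's tokens
    known = ~mask                       # background tokens are known from the start
    order = rng.permutation(np.flatnonzero(mask))
    for chunk in np.array_split(order, num_steps):
        if chunk.size == 0:             # more steps than masked tokens
            continue
        out[chunk] = predict_fn(out, known, chunk)
        known[chunk] = True             # generated tokens become context
    return out


def mean_of_known(tokens, known, targets):
    """Toy predictor (assumption): every target gets the mean of the known tokens."""
    mean = tokens[known].mean(axis=0)
    return np.tile(mean, (len(targets), 1))


tokens = np.arange(24, dtype=float).reshape(6, 4)
mask = np.array([False, True, True, False, True, False])
result = inpaint_tokens(tokens, mask, mean_of_known, num_steps=2)

# Background rows are bit-identical to the input; only masked rows changed.
print(np.array_equal(result[~mask], tokens[~mask]))
```

Because writes are restricted to the masked indices, background preservation holds by construction in this sketch, whatever the predictor returns.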

Paper Structure

This paper contains 16 sections, 11 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Comparison of the inpainting process of diffusion models (above) and our MAR-based method (below).
  • Figure 2: Comparison of vanilla T&B and T-only approaches.
  • Figure 3: Overview of Token Painter, which includes DEIF at the encoder stage and ADAE at the decoder stage. DEIF produces novel guidance tokens $T_{gf}$ that contain both text and context information through information fusion in the frequency domain. ADAE enhances two parts of the attention map $A$ to further improve prompt detail alignment and content visual quality.
  • Figure 4: Qualitative results of our Token Painter with previous text-guided inpainting methods. The first three rows of samples are from EditBench with loose masks, and the last three rows of samples are from BrushBench with tight masks.
  • Figure 5: Visualization of effects of each component. From left to right, we progressively add each proposed component.
  • ...and 1 more figure