A$^\text{T}$A: Adaptive Transformation Agent for Text-Guided Subject-Position Variable Background Inpainting

Yizhe Tang, Zhimin Sun, Yuzhen Du, Ran Yi, Guangben Lu, Teng Hu, Luying Li, Lizhuang Ma, Fangyuan Zou

TL;DR

The paper tackles the problem of harmoniously inpainting a background given a foreground subject and a text prompt, by allowing the subject to adapt its position and scale rather than remaining fixed. It introduces the Adaptive Transformation Agent (A$^\text{T}$A), which uses a multi-block PosAgent-based Reverse Displacement Transform to progressively shift hierarchical subject features from deep to shallow, guided by text and a position switch that toggles between variable and fixed positioning. A hybrid training strategy with a Position Switch Embedding enables end-to-end learning for both subject-position variable and fixed tasks. Empirical results on a large, diverse dataset show state-of-the-art performance in image quality, subject placement rationality, and text alignment, while maintaining good results in fixed-position inpainting. This work provides a flexible framework for text-guided background inpainting with controllable subject positioning, with potential impact on advertising, design, and media production where layout harmony is essential.

Abstract

Image inpainting aims to fill the missing region of an image. Recently, there has been a surge of interest in foreground-conditioned background inpainting, a sub-task that fills the background of an image while the foreground subject and associated text prompt are provided. Existing background inpainting methods typically strictly preserve the subject's original position from the source image, resulting in inconsistencies between the subject and the generated background. To address this challenge, we propose a new task, the "Text-Guided Subject-Position Variable Background Inpainting", which aims to dynamically adjust the subject position to achieve a harmonious relationship between the subject and the inpainted background, and propose the Adaptive Transformation Agent (A$^\text{T}$A) for this task. Firstly, we design a PosAgent Block that adaptively predicts an appropriate displacement based on given features to achieve variable subject-position. Secondly, we design the Reverse Displacement Transform (RDT) module, which arranges multiple PosAgent blocks in a reverse structure, to transform hierarchical feature maps from deep to shallow based on semantic information. Thirdly, we equip A$^\text{T}$A with a Position Switch Embedding to control whether the subject's position in the generated image is adaptively predicted or fixed. Extensive comparative experiments validate the effectiveness of our A$^\text{T}$A approach, which not only demonstrates superior inpainting capabilities in subject-position variable inpainting, but also ensures good performance on subject-position fixed inpainting.
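To make the interplay between displacement prediction, deep-to-shallow processing, and the position switch more concrete, here is a toy sketch in plain Python. It is not the paper's implementation: the names `shift` and `displace_pyramid`, the hard-coded per-level `steps` (standing in for what a learned PosAgent block would predict), the translation-only displacement, and the boolean `variable_position` flag (standing in for the Position Switch Embedding) are all illustrative assumptions.

```python
from typing import List, Tuple

Grid = List[List[float]]  # one 2D feature map (single channel, for simplicity)


def shift(feature: Grid, dx: int, dy: int) -> Grid:
    """Translate a feature map by (dx, dy) cells, zero-filling vacated cells."""
    h, w = len(feature), len(feature[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = feature[y][x]
    return out


def displace_pyramid(deep_to_shallow: List[Grid],
                     steps: List[Tuple[float, float]],
                     variable_position: bool) -> List[Grid]:
    """Visit feature maps from deepest to shallowest.

    Each level contributes a hypothetical displacement step, expressed as a
    fraction of the map size. Steps accumulate, so shallower levels inherit
    the displacement chosen at deeper levels. When variable_position is
    False, every step is ignored and the subject stays where it was.
    """
    total_x, total_y = 0.0, 0.0
    moved = []
    for feature, (sx, sy) in zip(deep_to_shallow, steps):
        if variable_position:
            total_x += sx
            total_y += sy
        h, w = len(feature), len(feature[0])
        # Convert the fractional displacement to whole cells at this resolution.
        moved.append(shift(feature, round(total_x * w), round(total_y * h)))
    return moved


# A subject marker in the top-left corner at two resolutions.
deep = [[1.0, 0.0], [0.0, 0.0]]
shallow = [[1.0, 0.0, 0.0, 0.0]] + [[0.0] * 4 for _ in range(3)]
steps = [(0.5, 0.0), (0.0, 0.0)]  # assumed values, not learned predictions

variable = displace_pyramid([deep, shallow], steps, variable_position=True)
fixed = displace_pyramid([deep, shallow], steps, variable_position=False)
```

With the switch on, the marker moves half the map width to the right at both resolutions. With the switch off, both feature maps come back unchanged, which mirrors the subject-position fixed setting described above.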

Paper Structure

This paper contains 28 sections, 13 equations, 17 figures, 4 tables.

Figures (17)

  • Figure 1: For foreground-conditioned background inpainting, (a) fixing the subject at the position specified by the input image (top left) may contradict the generated background, while (b) our model performs subject-position variable background inpainting: it adaptively determines a suitable location for the subject and generates an image with a harmonious subject-background relationship.
  • Figure 2: Adaptive Transformation Agent (A$^\text{T}$A) comprises $4$ modules: Feature extraction, Reverse displacement transform, Feature fusion, and Diffusion denoising. We use Hunyuan-DiT li2024hunyuan as the base model, and mainly develop the subject feature extraction, displacement transformation prediction, and displaced feature injection mechanisms to achieve subject-position variable inpainting. We also design a position switch embedding to control whether the position of the subject in the generated image is adaptively predicted or fixed.
  • Figure 3: Comparison of different structures for Displacement Transform. To transform the hierarchical feature maps from deep to shallow based on semantic information, we propose a novel Reverse Displacement Transform module.
  • Figure 4: Training samples and corresponding image conditions. From left to right: masked subject image $I_S$, mask $m$, depth map $d_S$, ground truth image $I$.
  • Figure 5: Comparison between our A$^\text{T}$A and the baseline methods. We highlight the unreasonable extension parts with orange boxes and the unreasonable layouts with purple boxes, and label the missing objects with corresponding colors. Please zoom in for more details.
  • ...and 12 more figures