Appearance Matching Adapter for Exemplar-based Semantic Image Synthesis in-the-Wild
Siyoon Jin, Jisu Nam, Jiyoung Kim, Dahyun Chung, Yeong-Seok Kim, Joonhyung Park, Heonjeong Chu, Seungryong Kim
TL;DR
This work tackles exemplar-based semantic image synthesis in complex, real-world scenes by introducing AM-Adapter, a learning-based module that augments self-attention with semantic-informed cross-image matching. It combines a dual-branch architecture (Appearance Net and Structure Net) with a categorical matching cost derived from segmentation maps and a 4D cost aggregation to robustly align exemplar appearance to target structure. A stage-wise training regime, plus an automated exemplar retrieval mechanism, keeps the diffusion backbone fixed while enriching it with precise local appearance transfer, yielding state-of-the-art results in semantic alignment and appearance fidelity across driving and indoor scenes. The approach enables controllable one-to-one transfer and segmentation-guided editing with strong generalization and practical applicability, as evidenced by extensive ablations and quantitative/user studies.
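To make the idea of a categorical matching cost concrete, here is a minimal NumPy sketch. It assumes the simplest possible definition: the cost is 1 where a target pixel and an exemplar pixel share a segmentation class and 0 otherwise. This summary does not give the exact formulation used by AM-Adapter, so the function name `categorical_matching_cost` and the binary definition are illustrative assumptions rather than the authors' method. The result is a 4D tensor indexed by target and exemplar positions, which is the kind of volume a 4D cost aggregation step would operate on.

```python
import numpy as np

def categorical_matching_cost(target_seg, exemplar_seg):
    """Binary class-agreement cost of shape (Ht, Wt, He, We).

    Assumption for illustration: an entry is 1.0 when the target pixel and the
    exemplar pixel carry the same segmentation label, and 0.0 otherwise.
    """
    t = np.asarray(target_seg)
    e = np.asarray(exemplar_seg)
    # Broadcast every target pixel (first two axes) against every exemplar
    # pixel (last two axes) and compare class labels.
    return (t[:, :, None, None] == e[None, None, :, :]).astype(np.float32)

# Toy 2x2 label maps with three classes (0, 1, 2).
target_seg = np.array([[0, 1],
                       [1, 2]])
exemplar_seg = np.array([[1, 1],
                         [0, 2]])

cost = categorical_matching_cost(target_seg, exemplar_seg)
print(cost.shape)   # 4D volume: target (H, W) by exemplar (H, W)
print(cost[0, 0])   # exemplar positions that share the class of target pixel (0, 0)
```

In this sketch each 2D slice `cost[i, j]` acts as a mask over the exemplar, marking which exemplar regions are semantically valid sources of appearance for target pixel `(i, j)`.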
Abstract
Exemplar-based semantic image synthesis generates images aligned with semantic content while preserving the appearance of an exemplar. Conventional structure-guidance models such as ControlNet are limited because they rely solely on text prompts to control appearance and cannot take exemplar images as input. Recent tuning-free approaches address this by transferring local appearance via implicit cross-image matching in the augmented self-attention mechanism of pre-trained diffusion models. However, prior works are often restricted to single-object cases or foreground object appearance transfer, and struggle with complex scenes involving multiple objects. To overcome this, we propose AM-Adapter (Appearance Matching Adapter) to address exemplar-based semantic image synthesis in-the-wild, enabling multi-object appearance transfer from a single scene-level image. AM-Adapter automatically transfers local appearances from the scene-level input. Alternatively, it provides controllability, letting users map user-defined object details to specific locations in the synthesized images. Our learnable framework enhances cross-image matching within augmented self-attention by integrating semantic information from segmentation maps. To disentangle generation and matching, we adopt stage-wise training: we first train the structure-guidance and generation networks, and then train the matching adapter while keeping the others frozen. During inference, we introduce an automated exemplar retrieval method for efficiently selecting exemplar image-segmentation pairs. Despite using minimal learnable parameters, AM-Adapter achieves state-of-the-art performance, excelling in both semantic alignment and local appearance fidelity. Extensive ablations validate our design choices. Code and weights will be released: https://cvlab-kaist.github.io/AM-Adapter/
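The augmented self-attention mentioned in the abstract can be sketched as follows. This is a minimal NumPy illustration of one common form used by tuning-free appearance-transfer methods, in which queries from the target image attend over keys and values from both the target and the exemplar. It is an assumption-laden sketch, not the AM-Adapter implementation: the function names, the single-head layout, and the random features are all hypothetical, and the learned matching adapter and segmentation-derived cost are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def augmented_self_attention(q_t, k_t, v_t, k_e, v_e):
    """Single-head attention where target queries see target and exemplar tokens.

    q_t, k_t, v_t: (N_t, d) target queries, keys, values.
    k_e, v_e:      (N_e, d) exemplar keys and values.
    Returns the attended features (N_t, d) and the attention map (N_t, N_t + N_e).
    """
    d = q_t.shape[-1]
    # Concatenate exemplar tokens after the target tokens along the token axis.
    k = np.concatenate([k_t, k_e], axis=0)
    v = np.concatenate([v_t, v_e], axis=0)
    attn = softmax(q_t @ k.T / np.sqrt(d), axis=-1)
    return attn @ v, attn

# Hypothetical toy features: 4 target tokens, 6 exemplar tokens, 8 channels.
rng = np.random.default_rng(0)
n_t, n_e, d = 4, 6, 8
q_t, k_t, v_t = (rng.standard_normal((n_t, d)) for _ in range(3))
k_e, v_e = (rng.standard_normal((n_e, d)) for _ in range(2))

out, attn = augmented_self_attention(q_t, k_t, v_t, k_e, v_e)
print(out.shape, attn.shape)
```

The last `n_e` columns of `attn` form the implicit cross-image matching between target and exemplar tokens; a learned adapter that incorporates segmentation information would refine that part of the map.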
