FROMAT: Multiview Material Appearance Transfer via Few-Shot Self-Attention Adaptation

Hubert Kompanowski, Varun Jampani, Aaryaman Vasishta, Binh-Son Hua

TL;DR

The paper tackles the challenge of controlling appearance in multiview diffusion models without compromising geometry or view coherence. It introduces FROMAT, a lightweight, three-stream self-attention adaptation that decouples object identity from appearance and learns per-layer mixing to transfer materials across views with few-shot training. The approach achieves state-of-the-art results in material appearance transfer, maintains multiview consistency, and is compatible with multiple backbones like SEVA and Era3D. This work enables practical, view-consistent implicit 3D editing and paves the way for more flexible appearance manipulation in diffusion-based 3D content pipelines.

Abstract

Multiview diffusion models have rapidly emerged as a powerful tool for content creation with spatial consistency across viewpoints, offering rich visual realism without requiring explicit geometry and appearance representation. However, compared to meshes or radiance fields, existing multiview diffusion models offer limited appearance manipulation, particularly in terms of material, texture, or style. In this paper, we present a lightweight adaptation technique for appearance transfer in multiview diffusion models. Our method learns to combine object identity from an input image with appearance cues rendered in a separate reference image, producing multi-view-consistent output that reflects the desired materials, textures, or styles. This allows explicit specification of appearance parameters at generation time while preserving the underlying object geometry and view coherence. We leverage three diffusion denoising processes responsible for generating the original object, the reference, and the target images, and perform reverse sampling to aggregate a small subset of layer-wise self-attention features from the object and the reference to influence the target generation. Our method requires only a few training examples to introduce appearance awareness to pretrained multiview models. The experiments show that our method provides a simple yet effective way toward multiview generation with diverse appearance, advocating the adoption of implicit generative 3D representations in practice.
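
The abstract does not give the exact mixing formula, so the following is only a minimal sketch of how layer-wise self-attention features from three streams might be combined. It assumes that, in a given layer, the target stream's queries attend separately to the keys and values of the target, object, and reference streams, and that the three outputs are blended with softmax-normalized learnable weights. The function names (`attention`, `mixed_self_attention`), the `mix_logits` parameter, and the blending rule itself are hypothetical illustrations, not the authors' implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        probs = softmax(scores)
        out.append([sum(p * v[d] for p, v in zip(probs, values))
                    for d in range(len(values[0]))])
    return out

def mixed_self_attention(q_target, kv_streams, mix_logits):
    """Hypothetical mixing rule (an assumption, not taken from the paper).

    kv_streams: [(keys, values)] for the target, object, and reference streams.
    mix_logits: one learnable logit per stream for this layer.
    """
    weights = softmax(mix_logits)  # non-negative, sums to 1
    per_stream = [attention(q_target, k, v) for k, v in kv_streams]
    return [[sum(w * s[t][d] for w, s in zip(weights, per_stream))
             for d in range(len(per_stream[0][0]))]
            for t in range(len(q_target))]

# Toy inputs: two tokens with two feature dimensions per stream.
q = [[1.0, 0.0], [0.0, 1.0]]
target = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
obj = ([[0.5, 0.5], [1.0, -1.0]], [[0.0, 1.0], [1.0, 0.0]])
ref = ([[-1.0, 0.0], [0.0, 2.0]], [[5.0, 5.0], [6.0, 6.0]])

# Equal logits give each stream the same weight of 1/3.
mixed = mixed_self_attention(q, [target, obj, ref], [0.0, 0.0, 0.0])
```

Under this assumed rule, pushing the first logit far above the others recovers the unmodified target-stream attention, so a pretrained model's behavior is one point in the weight space. In a few-shot setting, only the per-layer logits would need to be trained.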

Paper Structure

This paper contains 27 sections, 2 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Multiview material appearance transfer from a single input image (leftmost, the dragon head) and a reference appearance image. Please see the videos for all views.
  • Figure 2: Overview of our method. a) Attention mixing mechanism. b) Training data generation pipeline. c) Three-stream denoising framework. We add an Attention Mixing module, with its own mixing weights, to every self-attention block in the main-stream denoising network. The mixing weights are optimized on a few 3D objects rendered to multiview images. At inference, the main stream can then perform multiview appearance transfer for arbitrary object-reference image pairs.
  • Figure 3: Qualitative comparison of our method with baselines on real photo inputs and generated image inputs. Our method preserves object identity and applies the material plausibly, while the baselines struggle to lift the modified input image to multiview.
  • Figure 4: Additional visual results on rendered images. Our method preserves object identity, keeps fine details, and applies the material plausibly.
  • Figure 5: Data efficiency: our method needs as few as one training sample to learn appearance control.
  • ...and 7 more figures