Match-and-Fuse: Consistent Generation from Unstructured Image Sets

Kate Feingold, Omri Kaduri, Tali Dekel

TL;DR

The paper addresses generating coherent edits from unstructured image sets by introducing Match-and-Fuse, a zero-shot, training-free framework that preserves cross-image consistency for shared content. It models the set as a Pairwise Consistency Graph and employs Multiview Feature Fusion guided by dense 2D correspondences to enforce global coherence across all pairwise grids, extending the grid prior without masks. Per-image prompts are automatically composed from set-level prompts, and a lightweight Feature Guidance term further aligns features across views. Extensive experiments demonstrate superior cross-image consistency and visual fidelity compared to strong baselines, with diverse applications including storyboard-style edits and flow-based localized adjustments. The work advances set-to-set generation and lays groundwork for scalable, consistent editing across image collections and potentially video collections.

Abstract

We present Match-and-Fuse - a zero-shot, training-free method for consistent controlled generation of unstructured image sets - collections that share a common visual element, yet differ in viewpoint, time of capture, and surrounding content. Unlike existing methods that operate on individual images or densely sampled videos, our framework performs set-to-set generation: given a source set and user prompts, it produces a new set that preserves cross-image consistency of shared content. Our key idea is to model the task as a graph, where each node corresponds to an image and each edge triggers a joint generation of image pairs. This formulation consolidates all pairwise generations into a unified framework, enforcing their local consistency while ensuring global coherence across the entire set. This is achieved by fusing internal features across image pairs, guided by dense input correspondences, without requiring masks or manual supervision. It also allows us to leverage an emergent prior in text-to-image models that encourages coherent generation when multiple views share a single canvas. Match-and-Fuse achieves state-of-the-art consistency and visual quality, and unlocks new capabilities for content creation from image collections.
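The graph formulation can be illustrated with a small sketch. It assumes a complete graph (every image pair is an edge) and uses toy latents stored as plain lists of floats; the helper names `build_pairwise_graph` and `aggregate_edge_latents` are hypothetical and are not taken from the paper. The averaging step follows the Figure 3 caption, which states that per-image latents are obtained by averaging over adjacent edges. The joint denoising of each pair is not modelled here.

```python
from itertools import combinations


def build_pairwise_graph(num_images):
    """Return the edges of a complete graph over image indices.

    Assumption: every pair of images is connected; the paper may use a sparser graph.
    """
    return list(combinations(range(num_images), 2))


def aggregate_edge_latents(num_images, edge_latents):
    """Average each image's latent copies over all edges adjacent to it.

    edge_latents maps an edge (i, j) to a pair (latent_i, latent_j),
    where each latent is a list of floats of the same length.
    Images with no adjacent edge get None.
    """
    sums = [None] * num_images
    counts = [0] * num_images
    for (i, j), (lat_i, lat_j) in edge_latents.items():
        for node, lat in ((i, lat_i), (j, lat_j)):
            if sums[node] is None:
                sums[node] = [0.0] * len(lat)
            sums[node] = [s + v for s, v in zip(sums[node], lat)]
            counts[node] += 1
    return [
        [s / counts[n] for s in sums[n]] if counts[n] else None
        for n in range(num_images)
    ]


# Toy usage: 4 images, one-dimensional "latents" per edge.
edges = build_pairwise_graph(4)
toy_edge_latents = {(i, j): ([float(j)], [float(i)]) for i, j in edges}
per_image = aggregate_edge_latents(4, toy_edge_latents)
```

With N images the complete graph has N(N-1)/2 edges, so each image takes part in N-1 pairwise generations before its latents are averaged.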

Paper Structure

This paper contains 20 sections, 9 equations, 18 figures, 3 tables.

Figures (18)

  • Figure 1: Given a source set of images (top), depicting shared objects in varied settings (e.g., pose, environment, viewpoint), our method, Match-and-Fuse, jointly generates an output set in which the consistency among the shared content is preserved (bottom). The output adheres to the user-provided prompts that describe the target shared content ($\mathcal{P}^\textit{shared}$), and the scene's style/theme ($\mathcal{P}^\textit{theme}$).
  • Figure 2: Image grid generation vs. our method. (a) Source image set. (b) Joint generation of two-image grid results in partial consistency, where several regions often remain inconsistent in appearance or semantic meaning (e.g., a dog's face). (c) Extending this to more images further reduces consistency. (d) Our method leverages this prior yet overcomes its limitations of consistency and scale.
  • Figure 3: Match-and-Fuse pipeline. Example for 4 images. In pre-processing, pairwise matches are computed between all inputs, and per-image prompts are generated from the set-level prompts. At each denoising step, noisy image latents form a Pairwise Consistency Graph, whose edges $z^{t}_{ij}$ are jointly denoised with Multiview Feature Fusion (MFF) and aggregated back into per-image latents $z^{t-1}_i$ by averaging over adjacent edges. The latents are further refined with Feature Guidance via a feature-level matching objective.
  • Figure 4: MFF Denoising step. (a) Two-image grids on all edges are denoised with a frozen DiT. Selected blocks average K,V along adjacent edges into $\bar{\mathbf{f}}_i$, which are then fused via source matches (b). Images are fused jointly, illustrated by arrows for $i{=}1$.
  • Figure 5: Matched feature similarity vs. visual consistency. We consider increasingly consistent generation (left to right): random images $\to$ adding descriptive prompts $\to$ adding control signals $\to$ generating in a grid $\to$ DDIM inversion [song2020denoising], which reconstructs fully consistent source images. Keys and values differ in scale but follow the same pattern: cosine similarity at matched locations rises with consistency. Dashed lines show the baseline all-to-all similarity of feature maps. Points are averaged over 40 image pairs, all correspondences, blocks, and timesteps.
  • ...and 13 more figures
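
The measurement described in the Figure 5 caption, cosine similarity at matched locations compared against an all-to-all baseline, can be sketched as follows. This is a minimal illustration under stated assumptions: feature maps are flattened into lists of per-location vectors, correspondences are given as index pairs, and the function names (`cosine`, `matched_similarity`, `all_to_all_similarity`) are hypothetical. The paper additionally averages over blocks, timesteps, and image pairs, which is omitted here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def matched_similarity(feats_a, feats_b, matches):
    """Mean cosine similarity over matched locations.

    matches is a list of (index_in_a, index_in_b) correspondences.
    """
    sims = [cosine(feats_a[p], feats_b[q]) for p, q in matches]
    return sum(sims) / len(sims)


def all_to_all_similarity(feats_a, feats_b):
    """Baseline: mean cosine similarity over every pair of locations."""
    sims = [cosine(u, v) for u in feats_a for v in feats_b]
    return sum(sims) / len(sims)


# Toy feature maps with two locations each; location 0 in A matches location 1 in B.
feats_a = [[1.0, 0.0], [0.0, 1.0]]
feats_b = [[0.0, 2.0], [3.0, 0.0]]
matches = [(0, 1), (1, 0)]

matched = matched_similarity(feats_a, feats_b, matches)
baseline = all_to_all_similarity(feats_a, feats_b)
```

In this toy case the matched similarity exceeds the all-to-all baseline, which is the pattern the caption reports for increasingly consistent generations.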