Coupled Inference in Diffusion Models for Semantic Decomposition

Calvin Yeung, Ali Zakeri, Zhuowen Zou, Mohsen Imani

TL;DR

This work addresses semantic decomposition of binding-based scene representations by framing it as a coupled-inference inverse problem. It develops a diffusion-model framework that couples multiple factor priors through a reconstruction-driven energy, and derives analytic score functions and an iterative sampling scheme to jointly infer the factor codewords. The method covers Gaussian and similarity energy variants in both codebook and latent spaces, and unifies diffusion posterior sampling with resonator-network concepts. On synthetic tasks it shows higher decomposition accuracy and robustness than resonator networks. The results indicate that diffusion models can perform compositional reasoning and factor recovery, suggesting practical pathways for object-centric analysis and editing without retraining new priors.

Abstract

Many visual scenes can be described as compositions of latent factors. Effective recognition, reasoning, and editing often require not only forming such compositional representations, but also solving the decomposition problem. One popular choice for constructing these representations is through the binding operation. Resonator networks, which can be understood as coupled Hopfield networks, were proposed as a way to perform decomposition on such bound representations. Recent works have shown notable similarities between Hopfield networks and diffusion models. Motivated by these observations, we introduce a framework for semantic decomposition using coupled inference in diffusion models. Our method frames semantic decomposition as an inverse problem and couples the diffusion processes using a reconstruction-driven guidance term that encourages the composition of factor estimates to match the bound vector. We also introduce a novel iterative sampling scheme that improves the performance of our model. Finally, we show that attention-based resonator networks are a special case of our framework. Empirically, we demonstrate that our coupled inference framework outperforms resonator networks across a range of synthetic semantic decomposition tasks.
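
To make the decomposition problem concrete, the following is a minimal sketch, not the authors' method. It assumes bipolar (+1/-1) codewords and binding by elementwise (Hadamard) product, a common choice in vector-symbolic architectures that this summary does not confirm for the paper. The helper names (`random_codebook`, `bind`, `decompose_exhaustive`) and the sizes `K`, `n`, `D` are illustrative. The brute-force search visits all `n**K` index combinations, which is the cost that iterative schemes such as resonator networks or coupled diffusion processes aim to avoid.

```python
import itertools
import random


def random_codebook(n, dim, rng):
    """Return n random bipolar (+1/-1) codewords of length dim."""
    return [[rng.choice((-1, 1)) for _ in range(dim)] for _ in range(n)]


def bind(vectors):
    """Bind vectors with an elementwise (Hadamard) product (assumed binding op)."""
    bound = [1] * len(vectors[0])
    for vec in vectors:
        bound = [b * v for b, v in zip(bound, vec)]
    return bound


def dot(a, b):
    """Inner product, used here as the similarity measure."""
    return sum(x * y for x, y in zip(a, b))


def decompose_exhaustive(bound, codebooks):
    """Brute-force baseline: score every combination of codeword indices."""
    index_ranges = [range(len(cb)) for cb in codebooks]
    return max(
        itertools.product(*index_ranges),
        key=lambda idx: dot(bound, bind([cb[i] for cb, i in zip(codebooks, idx)])),
    )


rng = random.Random(0)
K, n, D = 3, 8, 256  # factors, codewords per factor, vector dimension (illustrative)
codebooks = [random_codebook(n, D, rng) for _ in range(K)]

true_idx = (2, 5, 7)
s = bind([cb[i] for cb, i in zip(codebooks, true_idx)])

# Searches n**K = 512 candidate factorizations.
recovered = decompose_exhaustive(s, codebooks)

# Bipolar vectors are their own inverse under this binding, so multiplying
# the bound vector by two known factors yields the remaining one.
unbound = bind([s, codebooks[1][5], codebooks[2][7]])
```

Under these assumptions, `recovered` equals `true_idx` and `unbound` equals the first factor's codeword. The exhaustive search is only practical for small `n` and `K`, because the number of candidates grows as `n**K`.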

Paper Structure

This paper contains 51 sections, 31 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Overview of the semantic decomposition task and methods. Left: Images are passed through a neural network to generate binding-based compositional representations. We would like to identify each constituent using the codebooks that encode their possible values. Right: Coupled diffusion processes for semantic decomposition. The diffusion processes corresponding to each factor operate in parallel and are jointly guided to converge towards the correct factorization.
  • Figure 2: Decomposition accuracy for varying search space sizes. Codebook vectors have dimension $D=1000$ and models are run for 100 iterations.
  • Figure 3: Decomposition accuracy as codebook vector dimension $D$ varies, for $K=3,n=50$.
  • Figure 4: Decomposition accuracy when varying restart ratio $\rho$ and number of restarts $R$ for $D=1000$, $K=3$, and $n=40$.
  • Figure 5: Decomposition accuracy when varying the number of discretized diffusion steps for $D=1000$, $K=3$, and $n=40$.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Remark 1