Coupled Inference in Diffusion Models for Semantic Decomposition
Calvin Yeung, Ali Zakeri, Zhuowen Zou, Mohsen Imani
TL;DR
This work addresses semantic decomposition of binding-based scene representations by framing it as a coupled-inference inverse problem. It develops a diffusion-model framework that couples multiple factor priors through a reconstruction-driven energy, deriving analytic score functions and an iterative sampling scheme to jointly infer factor codewords. The method covers Gaussian and similarity energy variants in both codebook and latent spaces, and unifies diffusion posterior sampling with resonator-network concepts. On synthetic tasks it outperforms resonator networks in decomposition accuracy and robustness. The results indicate that diffusion models can perform compositional reasoning and factor recovery, suggesting practical pathways for object-centric analysis and editing without retraining the factor priors.
Abstract
Many visual scenes can be described as compositions of latent factors. Effective recognition, reasoning, and editing often require not only forming such compositional representations, but also solving the decomposition problem, that is, recovering the individual factors from the composed representation. One popular way to construct these representations is the binding operation. Resonator networks, which can be understood as coupled Hopfield networks, were proposed as a way to decompose such bound representations. Recent works have shown notable similarities between Hopfield networks and diffusion models. Motivated by these observations, we introduce a framework for semantic decomposition using coupled inference in diffusion models. Our method frames semantic decomposition as an inverse problem and couples the diffusion processes through a reconstruction-driven guidance term that encourages the composition of the factor estimates to match the bound vector. We also introduce a novel iterative sampling scheme that improves the performance of our model. Finally, we show that attention-based resonator networks are a special case of our framework. Empirically, we demonstrate that our coupled inference framework outperforms resonator networks across a range of synthetic semantic decomposition tasks.
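
To make the decomposition problem concrete, the following is a minimal sketch of binding and of a classic resonator-style update, which is the baseline the abstract compares against and not the diffusion-based method itself. It assumes random bipolar codebooks and elementwise (Hadamard) binding, which is one common choice; the names `bind`, `reconstruction_energy`, and `resonator_step`, and the sizes `D`, `K`, and `F`, are illustrative and do not come from the paper. The squared-error energy is likewise only one plausible form of a reconstruction objective.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, F = 1024, 8, 3  # dimension, codewords per factor, number of factors (illustrative)

# One random bipolar codebook per factor, each of shape (K, D).
codebooks = [rng.choice([-1, 1], size=(K, D)) for _ in range(F)]

def bind(vectors):
    """Bind vectors with the elementwise (Hadamard) product."""
    return np.prod(np.stack(vectors), axis=0)

# Compose a bound vector from one codeword per factor.
true_idx = [int(rng.integers(K)) for _ in range(F)]
s = bind([codebooks[f][true_idx[f]] for f in range(F)])

def reconstruction_energy(s, estimates):
    """Squared error between the bound vector and the bound estimates (assumed form)."""
    return 0.5 * float(np.sum((s - bind(estimates)) ** 2))

def resonator_step(s, estimates):
    """One sweep: unbind the other factors' estimates, then clean up via the codebook."""
    new = list(estimates)
    for f in range(F):
        others = bind([new[g] for g in range(F) if g != f])
        unbound = s * others                 # bipolar vectors are their own inverse
        sims = codebooks[f] @ unbound        # similarity to every codeword
        new[f] = np.where(codebooks[f].T @ sims >= 0, 1, -1)  # project and binarize
    return new

# Initialize each estimate as the binarized superposition of its codebook.
estimates = [np.where(cb.sum(axis=0) >= 0, 1, -1) for cb in codebooks]
for _ in range(50):
    estimates = resonator_step(s, estimates)

# Decode each factor as the nearest codeword to its final estimate.
decoded = [int(np.argmax(codebooks[f] @ estimates[f])) for f in range(F)]
```

Comparing `decoded` with `true_idx` shows whether the iteration recovered the factors; convergence is typical at this small problem size but is not guaranteed. In the framework described above, the hard codebook clean-up would be replaced by diffusion priors over each factor, coupled through a reconstruction-driven guidance term.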
