TopoDiffusionNet: A Topology-aware Diffusion Model

Saumya Gupta, Dimitris Samaras, Chao Chen

TL;DR

This work addresses the challenge that diffusion models often fail to enforce specific image topology, as captured by Betti numbers $\beta_0$ and $\beta_1$. It introduces TopoDiffusionNet (TDN), which couples diffusion-based denoising with a topology-aware loss derived from persistent homology to preserve a target topology while generating masks that guide downstream rendering. The method defines $\mathcal{L}_{\text{top}} = \mathcal{L}_{\text{preserve}} + \mathcal{L}_{\text{denoise}}$, operating on the predicted noiseless image $\hat{x}_0^t$ via persistence diagrams $\mathcal{D}(\hat{x}_0^t)$ to preserve the top $c$ salient structures and suppress extras, and trains end-to-end with $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{simple}} + \lambda\mathcal{L}_{\text{top}}$. Experiments on four datasets demonstrate large improvements in topological fidelity for both 0- and 1-dimensional constraints, validating the effectiveness of integrating topology with diffusion models and suggesting new directions for topology-aware generative control in real-world applications.
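
The quantities named above can be written out as a short worked sketch. It assumes the standard DDPM parameterization, where $\bar{\alpha}_t$ is the cumulative noise schedule; that symbol and the exact form of the first three lines are assumptions drawn from common diffusion-model practice, not from this summary. Only the last two lines restate what the summary gives.

$$
\begin{aligned}
x_t &= \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \\
\hat{x}_0^t &= \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_{\theta}(x_t, c, t)}{\sqrt{\bar{\alpha}_t}} \\
\mathcal{L}_{\text{simple}} &= \mathbb{E}_{x_0,\,\epsilon,\,t}\left[\lVert \epsilon - \epsilon_{\theta}(x_t, c, t) \rVert^2\right] \\
\mathcal{L}_{\text{top}} &= \mathcal{L}_{\text{preserve}} + \mathcal{L}_{\text{denoise}} \\
\mathcal{L}_{\text{total}} &= \mathcal{L}_{\text{simple}} + \lambda\,\mathcal{L}_{\text{top}}
\end{aligned}
$$

Under this reading, $\mathcal{L}_{\text{top}}$ is evaluated on $\hat{x}_0^t$ rather than on the noisy $x_t$, so the topological signal is computed on an estimate of the clean image at every sampled timestep $t$.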

Abstract

Diffusion models excel at creating visually impressive images but often struggle to generate images with a specified topology. The Betti number, which represents the number of structures in an image, is a fundamental measure in topology. Yet, diffusion models fail to satisfy even this basic constraint. This limitation restricts their utility in applications requiring exact control, like robotics and environmental modeling. To address this, we propose TopoDiffusionNet (TDN), a novel approach that enforces diffusion models to maintain the desired topology. We leverage tools from topological data analysis, particularly persistent homology, to extract the topological structures within an image. We then design a topology-based objective function to guide the denoising process, preserving intended structures while suppressing noisy ones. Our experiments across four datasets demonstrate significant improvements in topological accuracy. TDN is the first to integrate topology with diffusion models, opening new avenues of research in this area. Code available at https://github.com/Saumya-Gupta-26/TopoDiffusionNet
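
To make the Betti-number constraint concrete, the following is a minimal sketch of how $\beta_0$ (connected components) and $\beta_1$ (holes) could be counted on a small binary mask. It is not the authors' code: the helper names `count_components` and `betti_numbers`, the example mask, and the connectivity convention (8-connectivity for foreground, 4-connectivity for background) are all assumptions made for illustration.

```python
from collections import deque

# Neighbor offsets (illustrative convention, not taken from the paper).
N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def count_components(grid, value, neighbors):
    """Count connected regions of cells equal to `value` using BFS."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != value or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in neighbors:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == value and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def betti_numbers(mask):
    """Return (beta_0, beta_1) for the foreground (1s) of a 2D binary mask."""
    cols = len(mask[0])
    # Pad with background so the outer background forms a single region.
    padded = [[0] * (cols + 2)]
    padded += [[0] + list(row) + [0] for row in mask]
    padded += [[0] * (cols + 2)]
    beta_0 = count_components(padded, 1, N8)
    # Holes are the background regions other than the outer one.
    beta_1 = count_components(padded, 0, N4) - 1
    return beta_0, beta_1


# A ring (one component, one hole) next to a separate vertical bar.
mask = [
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 0, 1],
]
print(betti_numbers(mask))  # (2, 1)
```

A generated mask would satisfy a 0-dim constraint $c$ when `beta_0 == c`, and a 1-dim constraint when `beta_1 == c`, under the conventions assumed here.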

Paper Structure

This paper contains 10 sections, 4 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Comparison of existing diffusion models in preserving topological constraints. Top row: 0-dim topological constraint to generate exactly five giraffes. Bottom row: 1-dim topological constraint to generate a road layout with exactly four distinct regions. Text-to-image methods like (a) Stable Diffusion [rombach2022high] and (b) DALL·E 3 [dalle3] struggle to respect both 0-dim and 1-dim constraints. (c) Attention Refocusing (AR) [phung2024grounded] requires bounding boxes for each object but struggles with higher object counts and often creates partitioned images. (d)-(e) show a two-step process: mask generation followed by ControlNet [zhang2023adding] rendering. (d) ADM-T generates masks by fine-tuning ADM [dhariwal2021diffusion] with the topological constraint as a condition, but this alone is insufficient. (e) Our TopoDiffusionNet, trained with a topology-based objective function, generates masks with the precise number of objects or regions, which, when fed to ControlNet, yield the desired image of five giraffes (top row) and four regions (bottom row). Giraffe/region counts are noted in the bottom-right inset of each image.
  • Figure 2: Illustration of topological structures.
  • Figure 3: (a) TDN overview: We condition the diffusion model on the topological constraint $c$ (here $c=2$). During training, we first add noise $\epsilon$ to the input $x_0$ using the forward process (Eq. \ref{eq:rforward}) to obtain $x_t$, where $t$ is sampled uniformly. The U-Net is trained as part of the reverse process, predicting the added noise $\epsilon_{\theta}(x_t, c, t)$, with which we obtain the noiseless image $\hat{x}_0^t$ (Eq. \ref{eq:x0approx}). Alongside the standard loss $\mathcal{L}_{\text{simple}}$, we propose $\mathcal{L}_{\text{top}}$ to enforce the topological integrity of $\hat{x}_0^t$. (b) To compute $\mathcal{L}_{\text{top}}$, the persistence diagram $\mathcal{D}(\hat{x}_0^t)$ captures all the topological structures in $\hat{x}_0^t$, partitioning them into salient/desired structures $\mathcal{P}$ and noisy ones $\mathcal{Q}$. Terms $\mathcal{L}_{\text{preserve}}$ and $\mathcal{L}_{\text{denoise}}$ respectively amplify $\mathcal{P}$ and suppress $\mathcal{Q}$, guiding the training to eventually satisfy $c$.
  • Figure 4: Illustration of persistent homology and persistence diagrams of 0-dim topological structures (connected components). Despite the noise, we can visually see three prominent structures $\alpha_1, \alpha_2, \alpha_3$ in $I$. In the topological space, $\alpha_1, \alpha_2, \alpha_3$ thus appear in the top-left corner of the persistence diagram $\mathcal{D}(I)$, persisting through most of the filtration $\mathcal{S}$. All the remaining connected components are noisy: they persist over only a short range of thresholds in $\mathcal{S}$ and thus appear closer to the diagonal in $\mathcal{D}(I)$. Persistence diagrams are therefore useful for distinguishing between salient and noisy structures in an image.
  • Figure 5: Illustration of $\mathcal{L}_{\text{preserve}}$ and $\mathcal{L}_{\text{denoise}}$ for 0-dim connected components, with $c = 2$ as seen in $x_0$. After computing $\mathcal{D}(\hat{x}_0^t)$, we partition it into sets $\mathcal{P}$ (the top $c$ structures) and $\mathcal{Q}$ (remaining ones). For each dot $p \in \mathcal{D}(\hat{x}_0^t)$, the birth and death values respectively correspond to local maxima $m_p$ and saddles $s_p$ in $\hat{x}_0^t$. In the terrain view of $\hat{x}_0^t$, structures $(m_1,s_1)$ and $(m_2, s_2)$ belong to $\mathcal{P}$; hence optimizing $\mathcal{L}_{\text{preserve}}$ increases their saliency by increasing $\hat{x}_0^t(m_1), \hat{x}_0^t(m_2)$ and decreasing $\hat{x}_0^t(s_1), \hat{x}_0^t(s_2)$. All the remaining structures $(m_3, s_3), (m_4, s_4), \cdots, (m_n, s_n)$ belong to $\mathcal{Q}$. Optimizing $\mathcal{L}_{\text{denoise}}$ suppresses these noisy structures by decreasing $\hat{x}_0^t(m_3), \hat{x}_0^t(m_4), \cdots, \hat{x}_0^t(m_n)$ and increasing $\hat{x}_0^t(s_3), \hat{x}_0^t(s_4), \cdots, \hat{x}_0^t(s_n)$.
  • ...and 3 more figures
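
The mechanism described in the captions of Figures 4 and 5 can be sketched in a few lines of plain Python. The sketch below pairs each local maximum with the saddle where its component merges into an older one (0-dim persistence of superlevel sets, computed with union-find), splits the pairs into the top-$c$ set $\mathcal{P}$ and the rest $\mathcal{Q}$, and evaluates a loss. Several things here are assumptions rather than details from the source: the function names `persistence_pairs_0dim` and `topo_loss`, the 4-connectivity, the convention of pairing the global maximum with the global minimum, and the squared-persistence form of the two loss terms. A real implementation would compute the same quantities on a differentiable tensor so gradients reach the critical pixels.

```python
def persistence_pairs_0dim(img):
    """0-dim persistence of superlevel sets of a 2D grid (4-connectivity).

    Returns (max_pixel, saddle_pixel) pairs, one per connected component.
    """
    rows, cols = len(img), len(img[0])
    # Sweep pixels from brightest to darkest.
    order = sorted(((r, c) for r in range(rows) for c in range(cols)),
                   key=lambda p: -img[p[0]][p[1]])
    parent = {}  # union-find forest over pixels added so far
    peak = {}    # root pixel -> pixel holding that component's maximum

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    pairs = []
    for p in order:
        parent[p] = p
        peak[p] = p
        r, c = p
        for q in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if q not in parent:  # out of bounds or not yet added
                continue
            a, b = find(p), find(q)
            if a == b:
                continue
            pa, pb = peak[a], peak[b]
            # Elder rule: the component with the lower peak dies at pixel p.
            if img[pa[0]][pa[1]] <= img[pb[0]][pb[1]]:
                a, b, pa, pb = b, a, pb, pa
            if pb != p:  # skip the trivial case of p joining a component
                pairs.append((pb, p))
            parent[b] = a
    # Assumed convention: the surviving component dies at the global minimum.
    pairs.append((peak[find(order[0])], order[-1]))
    return pairs


def topo_loss(img, c):
    """Return (loss, P, Q) with P the top-c most persistent structures."""
    def persistence(pair):
        (mr, mc), (sr, sc) = pair
        return img[mr][mc] - img[sr][sc]

    pairs = sorted(persistence_pairs_0dim(img), key=persistence, reverse=True)
    P, Q = pairs[:c], pairs[c:]
    # Assumed loss form: reward persistence in P, penalize it in Q.
    l_preserve = -sum(persistence(p) ** 2 for p in P)
    l_denoise = sum(persistence(q) ** 2 for q in Q)
    return l_preserve + l_denoise, P, Q


# One-row "image": two strong peaks (0.9, 0.8) and one weak bump (0.5).
x0_hat = [[0.9, 0.2, 0.8, 0.4, 0.5, 0.1]]
loss, P, Q = topo_loss(x0_hat, c=2)
print("P:", P)
print("Q:", Q)
```

With $c = 2$, the two strong peaks land in $\mathcal{P}$ and the weak bump at column 4 lands in $\mathcal{Q}$. Minimizing the assumed loss would raise the maxima and lower the saddles in $\mathcal{P}$ while flattening the bump in $\mathcal{Q}$, which matches the behavior described for Figure 5.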