Patch-wise Graph Contrastive Learning for Image Translation

Chanyong Jung, Gihyun Kwon, Jong Chul Ye

TL;DR

This paper addresses the challenge of semantically faithful image translation by introducing a patch-wise graph contrastive learning framework. It constructs patch graphs from a pretrained encoder, shares a single adjacency matrix to couple the input and translated output graphs, and applies graph pooling to capture hierarchical semantics, all while maximizing mutual information between patch nodes via an InfoNCE loss. The approach yields state-of-the-art results on five unpaired translation benchmarks and shows qualitative improvements in preserving structure and spatial coherence, including single-image high-resolution translation. By explicitly modeling patch topology and focusing on task-relevant regions, the method offers a principled way to use topology-aware representations for semantically consistent image translation.

Abstract

Recently, patch-wise contrastive learning has drawn attention in image translation for exploring the semantic correspondence between input and output images. To further explore patch-wise topology for high-level semantic understanding, we use a graph neural network to capture topology-aware features. Specifically, we construct a graph from the patch-wise similarity given by a pretrained encoder, and share its adjacency matrix between the input and the output to enhance the consistency of their patch-wise relations. We then obtain node features from the graph neural network and strengthen the correspondence between nodes by increasing their mutual information with a contrastive loss. To capture the hierarchical semantic structure, we further propose graph pooling. Experimental results demonstrate state-of-the-art image translation performance, thanks to the semantic encoding provided by the constructed graphs.
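The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the k-nearest-neighbor rule for building the adjacency matrix, the single GCN-style layer, the sharing of the layer weights between the two graphs, the temperature `tau`, and the function names `build_adjacency`, `graph_conv`, and `info_nce` are all assumptions made for illustration. Random arrays stand in for the encoder features of input and output patches.

```python
import numpy as np

def build_adjacency(feats, k=4):
    """Binary adjacency from cosine similarity of patch features (N, D).
    The top-k neighbor rule is an assumption, not taken from the paper."""
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T
    idx = np.argsort(-sim, axis=1)[:, :k]      # k most similar patches per row
    adj = np.zeros_like(sim)
    np.put_along_axis(adj, idx, 1.0, axis=1)
    adj = np.maximum(adj, adj.T)               # make the graph undirected
    np.fill_diagonal(adj, 1.0)                 # add self-loops
    return adj

def graph_conv(adj, feats, weight):
    """One GCN-style hop: ReLU(D^{-1/2} A D^{-1/2} X W)."""
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    norm_adj = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm_adj @ feats @ weight, 0.0)

def info_nce(z, v, tau=0.07):
    """InfoNCE loss: node i of v is the positive for node i of z,
    all other nodes of z act as negatives."""
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-8)
    logits = (v @ z.T) / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
f_in = rng.normal(size=(16, 8))                  # stand-in for input patch features
f_out = f_in + 0.1 * rng.normal(size=(16, 8))    # stand-in for output patch features

adj = build_adjacency(f_in, k=4)   # adjacency comes from the input only
w = rng.normal(size=(8, 8))        # layer weights, shared here by assumption
z = graph_conv(adj, f_in, w)       # node features of the input graph
v = graph_conv(adj, f_out, w)      # output graph reuses the same adjacency
loss = info_nce(z, v)
```

The point of the sketch is that `adj` is computed once from the input features and reused for the output graph, so both sets of node features aggregate over the same patch neighborhoods before the contrastive loss compares them.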

Paper Structure

This paper contains 20 sections, 9 equations, 10 figures, and 2 tables.

Figures (10)

  • Figure 1: The semantic connectivity of input is extracted by the encoder, and shared to construct the graph network. We maximize the mutual information between the nodes.
  • Figure 2: (a) Overall framework of the proposed method. We impose patch-wise regularization by the GNN constructed by the encoder $E$. We extract the node feature $Z, V$ and maximize $I(Z; V)$. Pooled graphs are utilized to focus on task-relevant nodes. (b) The motivation of the proposed approach to use patch-wise connection of input image as the prior knowledge.
  • Figure 3: Construction of the graphs $g_o, g_i$ with a shared adjacency matrix $A$. Each graph extracts $l$-hop features $Z, V$ from the given node features $F_i, F_o$.
  • Figure 4: Top-$K$ graph pooling (Graph U-Net). The pooling vector $p$ provides the focused view of the graph for the given task. The final node feature is also weighted by $p$.
  • Figure 5: Top-$K$ graph pooling allocates higher weights to the informative nodes, similarly to the attention mechanism. (a) Top-$K$ graph pooling. (b) Attention method.
  • ...and 5 more figures
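The top-$K$ pooling described in the captions for Figures 4 and 5 can be sketched as follows. This follows the general Graph U-Net recipe (project each node onto a pooling vector, keep the $K$ highest-scoring nodes, and gate their features by the score) and is only a sketch: the sigmoid gate, the function name `top_k_pool`, and the toy graph are assumptions, and in practice the pooling vector `p` would be learned rather than fixed.

```python
import numpy as np

def top_k_pool(adj, feats, p, k):
    """Keep the k nodes with the largest projection onto the pooling vector p."""
    scores = feats @ p / (np.linalg.norm(p) + 1e-8)   # one scalar score per node
    idx = np.sort(np.argsort(-scores)[:k])            # indices of the kept nodes
    gate = 1.0 / (1.0 + np.exp(-scores[idx]))         # sigmoid gate (assumed form)
    pooled_feats = feats[idx] * gate[:, None]         # features weighted by score
    pooled_adj = adj[np.ix_(idx, idx)]                # subgraph on the kept nodes
    return pooled_adj, pooled_feats, idx

# Toy graph: 4 nodes on a ring with self-loops, 2-dimensional node features.
adj = np.array([[1, 1, 0, 1],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [1, 0, 1, 1]], dtype=float)
feats = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [2.0, 0.0],
                  [0.0, 3.0]])
p = np.array([1.0, 0.0])   # hypothetical pooling vector favoring the first channel

pooled_adj, pooled_feats, kept = top_k_pool(adj, feats, p, k=2)
```

Because the gate multiplies the kept features by a function of their scores, nodes that align more strongly with `p` contribute more to the pooled graph, which is the sense in which Figure 5 compares the pooling to attention.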