CSHNet: A Novel Information Asymmetric Image Translation Method

Xi Yang, Haoyuan Shi, Zihan Wang, Nannan Wang, Xinbo Gao

TL;DR

CSHNet tackles information-asymmetric image translation by integrating CNN-driven detail with Swin Transformer-based structure in a novel SEC-CES-Bottleneck. The framework introduces Interactive Guided Connection to fuse low-level detail with high-level semantics and Adaptive Edge Perception Loss to preserve clear region boundaries. Empirical results on SEN12 and Sketch2Anime show state-of-the-art performance in structural fidelity and perceptual quality, with robust ablations validating the SCB design and the IGC/AEPL components. The work demonstrates that a carefully designed CNN–Transformer hybrid can outperform both pure CNN and pure Transformer approaches in cross-domain translation tasks relevant to remote sensing and multimedia domains.

Abstract

Despite advancements in cross-domain image translation, challenges persist in asymmetric tasks such as SAR-to-Optical and Sketch-to-Instance conversions, which involve transforming data from a less detailed domain into one with richer content. Traditional CNN-based methods are effective at capturing fine details but struggle with global structure, leading to unwanted merging of image regions. To address this, we propose the CNN-Swin Hybrid Network (CSHNet), which combines two key modules: Swin Embedded CNN (SEC) and CNN Embedded Swin (CES), forming the SEC-CES-Bottleneck (SCB). SEC leverages CNN's detailed feature extraction while integrating the Swin Transformer's structural bias. CES, in turn, preserves the Swin Transformer's global integrity, compensating for CNN's lack of focus on structure. Additionally, CSHNet includes two components designed to enhance cross-domain information retention: the Interactive Guided Connection (IGC), which enables dynamic information exchange between SEC and CES, and Adaptive Edge Perception Loss (AEPL), which maintains structural boundaries during translation. Experimental results show that CSHNet outperforms existing methods in both visual quality and performance metrics across scene-level and instance-level datasets. Our code is available at: https://github.com/XduShi/CSHNet.
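
As a rough illustration of what an edge-aware loss with an adaptive threshold might look like, the sketch below compares binary edge maps of a predicted and a target image. It is not the authors' AEPL: the Sobel operator, the `mean + k * std` threshold rule, the parameter `k`, and the function names `sobel_magnitude`, `edge_map`, and `edge_loss` are all assumptions made here for illustration, since the abstract does not specify how the loss is computed.

```python
import numpy as np


def sobel_magnitude(img):
    """Gradient magnitude of a 2-D array using 3x3 Sobel kernels (edge-padded)."""
    p = np.pad(img.astype(np.float64), 1, mode="edge")
    # Horizontal response: right column minus left column, weights 1-2-1.
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) - (
        p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2]
    )
    # Vertical response: bottom row minus top row, weights 1-2-1.
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) - (
        p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:]
    )
    return np.hypot(gx, gy)


def edge_map(img, k=1.0):
    """Binary edge map whose threshold adapts to each image (assumed rule: mean + k * std)."""
    mag = sobel_magnitude(img)
    threshold = mag.mean() + k * mag.std()
    return (mag > threshold).astype(np.float64)


def edge_loss(pred, target, k=1.0):
    """Mean absolute difference between the binary edge maps of pred and target."""
    return float(np.abs(edge_map(pred, k) - edge_map(target, k)).mean())
```

For an 8x8 target whose right half is 1 and left half is 0, `edge_loss(target, target)` is 0.0, while a completely flat prediction misses the 16 boundary pixels and scores 16/64 = 0.25. Because the hard threshold is not differentiable, a version used for training would presumably need a soft edge map instead; this NumPy sketch only illustrates the idea of penalizing missing or merged region boundaries.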

Paper Structure

This paper contains 17 sections, 9 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: The network architectures adopted in existing methods and the associated challenges. (a) Visualization of the network architectures for the methods described in (b). Detailed structural information is highlighted in the blue and orange boxes, representing the CNN-based bottleneck (GlobalG), Transformer-based bottleneck (SwinG), and the proposed SEC-CES-Bottleneck. (b) Image-to-image translation network based on the Encoder-Bottleneck-Decoder paradigm. The proposed SCB has a hierarchical structure similar to GlobalG and SwinG but is not a simple stacking of RMs and SMs. It consists of a cross-combination of SEC and CES, drawing on both CNN and Transformer.
  • Figure 2: The proposed CSHNet framework for information asymmetric image translation. It contains three main components: SEC-CES-Bottleneck (SCB), Interactive Guided Connection (IGC) and Adaptive Edge Perception Loss (AEPL).
  • Figure 3: Flowchart of dynamic threshold-based AEPL.
  • Figure 4: Visualization results of ablation experiments with different components on SEN12 dataset.
  • Figure 5: Visualization results of the feature maps before and after the action of IGC. $x_i$ and $x_{CES}$ are the results before the optimization of parameters $a$ and $b$. $x_{i}^{\prime}$ and $x_{CES}^{\prime}$ are the feature maps after IGC.
  • ...and 3 more figures