Component-Aware Sketch-to-Image Generation Using Self-Attention Encoding and Coordinate-Preserving Fusion

Ali Zia, Muhammad Umer Ramzan, Usman Ali, Muhammad Faheem, Abdelwahed Khamis, Shahnawaz Qureshi

TL;DR

This paper proposes a component-aware, self-refining framework for sketch-to-image generation that consistently outperforms state-of-the-art GAN and diffusion models, achieving significant gains in image fidelity, semantic accuracy, and perceptual quality.

Abstract

Translating freehand sketches into photorealistic images remains a fundamental challenge in image synthesis, particularly due to the abstract, sparse, and stylistically diverse nature of sketches. Existing approaches, including GAN-based and diffusion-based models, often struggle to reconstruct fine-grained details, maintain spatial alignment, or adapt across different sketch domains. In this paper, we propose a component-aware, self-refining framework for sketch-to-image generation that addresses these challenges through a novel two-stage architecture. A Self-Attention-based Autoencoder Network (SA2N) first captures localised semantic and structural features from component-wise sketch regions, while a Coordinate-Preserving Gated Fusion (CGF) module integrates these into a coherent spatial layout. Finally, a Spatially Adaptive Refinement Revisor (SARR), built on a modified StyleGAN2 backbone, enhances realism and consistency through iterative refinement guided by spatial context. Extensive experiments across both facial (CelebAMask-HQ, CUFSF) and non-facial (Sketchy, ChairsV2, ShoesV2) datasets demonstrate the robustness and generalizability of our method. The proposed framework consistently outperforms state-of-the-art GAN and diffusion models, achieving significant gains in image fidelity, semantic accuracy, and perceptual quality. On CelebAMask-HQ, our model improves over prior methods by 21% (FID), 58% (IS), 41% (KID), and 20% (SSIM). These results, along with higher efficiency and visual coherence across diverse domains, position our approach as a strong candidate for applications in forensics, digital art restoration, and general sketch-based image synthesis.
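
The percentage gains quoted above mix metrics where lower is better (FID, KID) with metrics where higher is better (IS, SSIM). The abstract does not state how the percentages were computed; a minimal sketch, assuming they are relative improvements over a baseline score, might look like this. The function name `relative_improvement` and the scores are hypothetical and not taken from the paper.

```python
def relative_improvement(baseline: float, ours: float, higher_is_better: bool) -> float:
    """Return the improvement of `ours` over `baseline` as a percentage.

    Assumption: improvement is measured relative to the baseline score.
    For lower-is-better metrics (e.g. FID, KID) a drop counts as a gain;
    for higher-is-better metrics (e.g. IS, SSIM) a rise counts as a gain.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    change = (ours - baseline) if higher_is_better else (baseline - ours)
    return 100.0 * change / abs(baseline)


# Hypothetical scores, NOT from the paper; chosen only to show the arithmetic.
fid_gain = relative_improvement(baseline=50.0, ours=40.0, higher_is_better=False)
ssim_gain = relative_improvement(baseline=0.5, ours=0.625, higher_is_better=True)
print(f"FID gain: {fid_gain:.1f}%, SSIM gain: {ssim_gain:.1f}%")
```

Under this reading, a positive value always means the new model is better, regardless of the metric's direction.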

Paper Structure (22 sections, 2 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Illustration of the proposed sketch-to-image generation architecture. (Top) Component-based face representation learning: a self-attention mechanism is applied to each component of the sketch face to refine the feature representations in the autoencoder. (Bottom) CGF-based adversarial face generation: the feature descriptors of each facial component are converted into feature maps, which then go through training in the AFIG module. The bidirectional arrow connecting the discriminator represents joint fine-tuning of the trained component encoders during the second training stage. Finally, the generated image is passed through the SARR module, which refines its details and quality using an identity-preserving loss.
  • Figure 2: Qualitative comparison between our method and state-of-the-art approaches for sketch-to-image translation. Each column represents different sketch inputs and their corresponding outputs, generated by various methods.
  • Figure 3: Zero-shot comparison of DFD and Ground Truth (GT) with our method, trained on the CelebAMask-HQ dataset and tested on diverse sketch types.
  • Figure 4: Qualitative results for sketch-to-image translation on three non-facial datasets: the Sketchy Database [Sangkloy et al., 2016], ShoesV2, and ChairsV2. Each block shows four rows: Sketches, Ground Truth (GT), Generated Images, and CycleGAN outputs.
  • Figure 5: The pipeline of the image-to-image comparison module built on FaceNet-512.
  • ...and 1 more figure