SketchFusion: Learning Universal Sketch Features through Fusing Foundation Models

Subhadeep Koley, Tapas Kumar Dutta, Aneeshan Sain, Pinaki Nath Chowdhury, Ayan Kumar Bhunia, Yi-Zhe Song

TL;DR

This work exposes two limitations of Stable Diffusion (SD) for sketch understanding: difficulty with abstract, sparse sketches, and a frequency-domain bias that suppresses the low-frequency components sketches rely on. It introduces SketchFusion, a hybrid framework that injects CLIP visual features into SD's denoising process and uses a dynamic aggregation network to fuse multi-layer SD features with semantic cues from CLIP. Only the lightweight injection and aggregation modules are trained; SD and CLIP stay frozen. The approach achieves state-of-the-art results across sketch-based image retrieval, recognition, segmentation, and sketch-photo correspondence, which the authors present as a universal sketch feature representation. The work highlights the complementary strengths of foundation models and offers an adaptive, task-agnostic fusion strategy for sketch-centric vision tasks.

Abstract

While foundation models have revolutionised computer vision, their effectiveness for sketch understanding remains limited by the unique challenges of abstract, sparse visual inputs. Through systematic analysis, we uncover two fundamental limitations: Stable Diffusion (SD) struggles to extract meaningful features from abstract sketches (unlike its success with photos), and exhibits a pronounced frequency-domain bias that suppresses essential low-frequency components needed for sketch understanding. Rather than costly retraining, we address these limitations by strategically combining SD with CLIP, whose strong semantic understanding naturally compensates for SD's spatial-frequency biases. By dynamically injecting CLIP features into SD's denoising process and adaptively aggregating features across semantic levels, our method achieves state-of-the-art performance in sketch retrieval (+3.35%), recognition (+1.06%), segmentation (+29.42%), and correspondence learning (+21.22%), demonstrating the first truly universal sketch feature representation in the era of foundation models.
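
The frequency-domain bias mentioned in the abstract can be made concrete with a small diagnostic. The following is a minimal sketch, not the authors' analysis: it assumes a single 2D feature channel and measures what share of its spectral energy lies near the zero frequency. The function name `low_frequency_energy_fraction` and the `cutoff` threshold are hypothetical choices made for illustration.

```python
import numpy as np


def low_frequency_energy_fraction(feature_map, cutoff=0.25):
    """Share of spectral energy within a normalised radius `cutoff` of DC.

    feature_map: 2D array (H, W), e.g. one channel of an intermediate feature.
    cutoff: hypothetical threshold; 1.0 equals the Nyquist frequency along an axis.
    """
    h, w = feature_map.shape
    # 2D spectrum with the zero-frequency (DC) term shifted to the centre.
    spectrum = np.fft.fftshift(np.fft.fft2(feature_map))
    energy = np.abs(spectrum) ** 2
    # Frequency coordinates in cycles per sample, shifted to match the spectrum.
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2) / 0.5
    low_mask = radius <= cutoff
    total = energy.sum()
    return float(energy[low_mask].sum() / total) if total > 0 else 0.0


# A flat map is pure low frequency; a checkerboard is pure high frequency.
flat = np.ones((8, 8))
rows, cols = np.indices((8, 8))
checkerboard = np.where((rows + cols) % 2 == 0, 1.0, -1.0)
flat_fraction = low_frequency_energy_fraction(flat)
checker_fraction = low_frequency_energy_fraction(checkerboard)
```

Under these assumptions, a feature extractor that suppresses low frequencies would show a small fraction on inputs whose content is mostly coarse shape, which is the regime the abstract associates with sketches.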

Paper Structure

This paper contains 16 sections, 4 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Given the frozen SD [rombach2022high] and CLIP [radford2021learning] models, the proposed method learns the aggregation network, 1D convolutional layers, and branch-weights with sketch-photo pairs, via different losses for different downstream tasks (details in the experiments section).
  • Figure 2: Sketch-photo correspondence (left $\rightarrow$ right: source $\rightarrow$ target) results on PSC6K. Green circles and squares depict source and GT points respectively, while red squares denote predicted points.
  • Figure 3: Qualitative results for sketch-based image segmentation. Given a query sketch, our method generates separate segmentation masks for all images of that category. (Zoom-in for the best view.)
  • Figure 4: Choice of timestep ($t$).
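
The learned branch-weights mentioned in the Figure 1 caption can be illustrated with a small fusion routine. This is a sketch under stated assumptions rather than the paper's aggregation network: it assumes every branch has already been projected to the same shape, and it assumes the weights are normalised with a softmax, which the caption does not specify. The names `fuse_branches` and `branch_logits` are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1D array."""
    shifted = np.exp(x - np.max(x))
    return shifted / shifted.sum()


def fuse_branches(branch_features, branch_logits):
    """Weighted sum of same-shaped feature maps.

    branch_features: list of B arrays, each of shape (C, H, W).
    branch_logits: B unnormalised scores (learnable in a real model; fixed here).
    Softmax normalisation is an assumption made for this illustration.
    """
    weights = softmax(np.asarray(branch_logits, dtype=np.float64))
    stacked = np.stack(branch_features, axis=0)      # (B, C, H, W)
    fused = np.tensordot(weights, stacked, axes=1)   # contracts B -> (C, H, W)
    return fused, weights


# Two toy branches with equal logits contribute equally to the fused map.
branch_a = np.ones((4, 8, 8))
branch_b = 3.0 * np.ones((4, 8, 8))
fused, weights = fuse_branches([branch_a, branch_b], [0.0, 0.0])
```

In a trainable version the logits would be parameters updated by the task loss while the backbones that produce the branches stay frozen, matching the setup the caption describes.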