Freeview Sketching: View-Aware Fine-Grained Sketch-Based Image Retrieval
Aneeshan Sain, Pinaki Nath Chowdhury, Subhadeep Koley, Ayan Kumar Bhunia, Yi-Zhe Song
TL;DR
This work tackles view-awareness in Fine-Grained Sketch-Based Image Retrieval (FG-SBIR): a sketch query can be drawn from a different viewpoint than the gallery target, which degrades standard single-view FG-SBIR. The authors propose a view-aware framework that (i) uses sketch-independent multi-view 2D projections of 3D objects to inject view semantics and (ii) develops a disentangled cross-modal encoder producing content $f_c$ and view $f_v$ features, so that a single unified model supports both view-agnostic ($f_c$) and view-specific ($f_c+f_v$) retrieval, guided by a final objective $L_{trn}$. Key contributions include the multi-view projection strategy, a cross-modal disentanglement approach enabling a simple view-switch, and comprehensive experiments on chairs and lamps showing improvements over baselines. The results demonstrate flexible, user-controllable retrieval in FG-SBIR and illustrate how 2D projections can sensitize a cross-modal sketch-photo model to 3D view variation without full 3D representations, with potential impact on practical sketch-based search systems.
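The view-switch described above can be illustrated with a small toy sketch. It is not the authors' implementation: the feature vectors are hand-picked, the helper names (`combine`, `cosine_distance`, `retrieve`) are hypothetical, cosine distance is an assumed ranking metric, and $f_c+f_v$ is read here as an element-wise sum, which is only one plausible interpretation of the notation.

```python
import math

def combine(f_c, f_v, view_specific):
    """Build the retrieval feature.

    View-agnostic mode uses the content feature f_c alone.
    View-specific mode uses f_c + f_v (ASSUMPTION: element-wise sum).
    """
    if not view_specific:
        return list(f_c)
    return [c + v for c, v in zip(f_c, f_v)]

def cosine_distance(a, b):
    """1 - cosine similarity (ASSUMPTION: the ranking metric is illustrative)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def retrieve(query, gallery, view_specific):
    """Rank gallery items by distance to the query in the chosen mode."""
    q = combine(query["f_c"], query["f_v"], view_specific)
    scored = []
    for name, feats in gallery.items():
        g = combine(feats["f_c"], feats["f_v"], view_specific)
        scored.append((cosine_distance(q, g), name))
    return [name for _, name in sorted(scored)]

# Hand-picked toy features (hypothetical): the first two dimensions encode
# object identity, the third encodes viewpoint (+1 front, -1 side).
gallery = {
    "chair_A_front": {"f_c": [1.0, 0.0, 0.0], "f_v": [0.0, 0.0, 1.0]},
    "chair_A_side":  {"f_c": [1.0, 0.0, 0.0], "f_v": [0.0, 0.0, -1.0]},
    "chair_B_side":  {"f_c": [0.0, 1.0, 0.0], "f_v": [0.0, 0.0, -1.0]},
}

# A sketch of chair A drawn from the side.
query = {"f_c": [0.9, 0.1, 0.0], "f_v": [0.0, 0.0, -1.0]}

agnostic_ranking = retrieve(query, gallery, view_specific=False)
specific_ranking = retrieve(query, gallery, view_specific=True)

if __name__ == "__main__":
    print("view-agnostic:", agnostic_ranking)
    print("view-specific:", specific_ranking)
```

In this toy setup, view-agnostic mode ranks both photos of chair A above chair B regardless of viewpoint, while view-specific mode puts the side view of chair A first and pushes its front view to the bottom. The same features serve both modes; only the `view_specific` flag changes.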
Abstract
In this paper, we delve into the intricate dynamics of Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) by addressing a critical yet overlooked aspect: the choice of viewpoint during sketch creation. Unlike photo systems, which handle diverse views seamlessly thanks to extensive datasets, sketch systems have only limited data collected from fixed perspectives and therefore struggle with viewpoint variation. Our pilot study, employing a pre-trained FG-SBIR model, highlights the system's struggle when query-sketches differ in viewpoint from target instances. Interestingly, however, a questionnaire shows that users desire autonomy, with a significant percentage favouring view-specific retrieval. To reconcile this, we advocate for a view-aware system that seamlessly accommodates both view-agnostic and view-specific tasks. To overcome dataset limitations, our first contribution leverages multi-view 2D projections of 3D objects, instilling cross-modal view awareness. The second contribution introduces a customisable cross-modal feature through disentanglement, allowing effortless mode switching. Extensive experiments on standard datasets validate the effectiveness of our method.
