O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-Language Model

Rishi Gupta, Mukilan Karuppasamy, Shyam Marjit, Aditay Tripathi, Anirban Chakraborty

TL;DR

The paper addresses the gap in sketch understanding within open-weight LVLMs by introducing SketchVCL, a large-scale image–sketch–instruction dataset, and O3SLM, a unified LVLM trained on this data. It employs a two-stage training regime (Sketch Alignment followed by Instruction Tuning) and a SketchMIX data pool to enable robust sketch–image–text reasoning across counting, localization, SBIR, and VQA. Empirical results show state-of-the-art performance among open-weight models on sketch-based tasks, with notable generalization to unseen sketch styles and tasks, as well as competitive image-only performance. This work advances open, sketch-aware multimodal understanding, offering a scalable path toward broader accessibility and application of LVLMs in sketch-centric reasoning tasks.

Abstract

While Large Vision Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually. We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions: (1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and (2) O3SLM, an LVLM trained on this dataset. Comprehensive evaluations on multiple sketch-based tasks, namely (a) object localization, (b) counting, (c) image retrieval (SBIR and fine-grained SBIR), and (d) visual question answering (VQA), conducted on three existing sketch datasets (QuickDraw!, Sketchy, and TU-Berlin) along with our generated SketchVCL dataset, show that O3SLM achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.

Paper Structure

This paper contains 38 sections, 1 equation, 17 figures, 8 tables.

Figures (17)

  • Figure 1: Limitations of LVLMs in Sketch Understanding. Although current LVLMs can interpret sketches at some level of abstraction, they struggle to apply that understanding to downstream tasks such as detection and reasoning.
  • Figure 2: Capabilities of our model, O3SLM. Our model is the first Large Vision-Language Model (LVLM) to demonstrate advanced alignment between sketches, images, and text, where existing LVLMs consistently fail (see the counting benchmark results table). Through extensive pretraining on our proposed SketchVCL dataset, the model develops a robust understanding of crude hand-drawn sketches and how they relate to the visual and textual modalities in which current LVLMs already excel. This training enables cross-modal transfer, allowing the model to handle fine-grained queries using sketch-text pairs, even though it was originally trained with sketches alone. O3SLM is trained across multiple tasks, including Visual Question Answering (VQA), Sketch-Based Image Retrieval (SBIR), sketch-based counting, and sketch-based object detection.
  • Figure 3: Automated Large-Scale Sketch Generation Pipeline. For each object instance, we use the SAM2-generated segmentation maps to mask the background and pass the foreground through Pix2Pix (Li et al., 2019) for sketch generation. These sketches are enhanced with edges detected using morphological gradients. The final sketch is an aggregation of the edges and the Pix2Pix sketch.
  • Figure 4: Summary of O3SLM. We use CLIP-L-336 as the visual backbone. The hand-drawn sketch and natural image are encoded using this backbone, then the multimodal connector projects the sketch and image features to the input space of the LLM. Finally, the sketch, image, and text tokens are concatenated and passed through the LLM.
  • Figure 5: Effect of Pretraining. We assess the impact of our large-scale pretraining stage on two tasks. SBIR tasks significantly benefit from pretraining (Right), whereas the effect on counting is minimal (Left).
  • ...and 12 more figures
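
The edge-enhancement and aggregation steps described in the Figure 3 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the 3×3 structuring element, the `edge_threshold` value, and the use of a pixelwise union as the "aggregation" are all assumptions made for illustration, and the SAM2 mask and Pix2Pix output are represented by plain arrays supplied by the caller rather than by real model calls.

```python
import numpy as np

def _shifted_stack(img, k=3):
    """Stack every k x k neighbourhood shift of img (borders are edge-padded)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    h, w = img.shape
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(k) for dx in range(k)])

def morphological_gradient(img, k=3):
    """Dilation minus erosion with a k x k square structuring element."""
    stack = _shifted_stack(img.astype(np.int32), k)
    return (stack.max(axis=0) - stack.min(axis=0)).astype(np.uint8)

def make_sketch(image, mask, generated_sketch, edge_threshold=32):
    """Combine morphological-gradient edges with a generated stroke map.

    image:            (H, W) uint8 grayscale photo
    mask:             (H, W) bool foreground mask (stand-in for a SAM2 mask)
    generated_sketch: (H, W) bool stroke map (stand-in for a Pix2Pix output)
    edge_threshold:   hypothetical cut-off for calling a gradient an edge
    """
    # Mask out the background so only the object contributes edges.
    foreground = np.where(mask, image, 0).astype(np.uint8)
    edges = morphological_gradient(foreground) >= edge_threshold
    # Assumed aggregation rule: a pixel is a stroke if either source marks it.
    return edges | generated_sketch
```

For example, calling `make_sketch(image, mask, strokes)` on a grayscale photo with a boolean object mask returns a boolean stroke map of the same shape. A pixelwise union keeps every stroke from both sources; a weighted blend would be an equally plausible reading of "aggregation".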