Learning More by Seeing Less: Structure First Learning for Efficient, Transferable, and Human-Aligned Vision

Tianqin Li, George Liu, Tai Sing Lee

TL;DR

This paper introduces a structure-first learning paradigm that begins training with line drawings to bias models toward structural information, aiming for efficient, transferable, and human-aligned vision. By converting photographs to line drawings and augmenting with stylized sketches, the authors train two-stage curricula (Line→Color) that improve shape bias, attention focus, and data efficiency across classification, segmentation, and detection, while yielding compact representations that transfer well to lightweight models. The approach demonstrates consistent gains across CNN and transformer backbones, improves downstream task performance, and enhances distillation effectiveness, suggesting structure over texture as a robust inductive bias. Overall, the work provides a computational perspective on human-like perception and offers a practical strategy for building more robust, data-efficient vision systems.

Abstract

Despite remarkable progress in computer vision, modern recognition systems remain fundamentally limited by their dependence on rich, redundant visual inputs. In contrast, humans can effortlessly understand sparse, minimal representations like line drawings, suggesting that structure, rather than appearance, underlies efficient visual understanding. In this work, we propose a novel structure-first learning paradigm that uses line drawings as an initial training modality to induce more compact and generalizable visual representations. We demonstrate that models trained with this approach develop a stronger shape bias, more focused attention, and greater data efficiency across classification, detection, and segmentation tasks. Notably, these models also exhibit lower intrinsic dimensionality, requiring significantly fewer principal components to capture representational variance, which mirrors observations of low-dimensional, efficient representations in the human brain. Beyond performance improvements, structure-first learning produces more compressible representations, enabling better distillation into lightweight student models. Students distilled from teachers trained on line drawings consistently outperform those trained from color-supervised teachers, highlighting the benefits of structurally compact knowledge. Together, our results support the view that structure-first visual learning fosters efficiency, generalization, and human-aligned inductive biases, offering a simple yet powerful strategy for building more robust and adaptable vision systems.
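The intrinsic-dimensionality claim in the abstract can be made concrete with a small sketch: count how many principal components are needed to explain a fixed share of the variance in a feature matrix. This is an illustration only, not the authors' evaluation code. The 90% threshold, the function name `components_for_variance`, and the synthetic feature matrices are all assumptions introduced here.

```python
import numpy as np

def components_for_variance(features, threshold=0.9):
    """Count the principal components needed to explain at least
    `threshold` of the total variance in `features` (rows = samples).
    The 0.9 default is an assumed value, not taken from the paper."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Squared singular values of the centered matrix are proportional
    # to the variance captured by each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    cumulative = np.cumsum(variances) / variances.sum()
    # Index of the first component at which the cumulative ratio
    # reaches the threshold, converted to a count.
    return int(np.searchsorted(cumulative, threshold) + 1)

rng = np.random.default_rng(0)
# Hypothetical stand-ins for learned representations: one matrix lies in a
# 4-dimensional subspace, the other spreads variance over all 64 dimensions.
low_rank = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 64))
full_rank = rng.normal(size=(500, 64))

k_low = components_for_variance(low_rank)
k_full = components_for_variance(full_rank)
print(k_low, k_full)
```

Under this measure, a more compact representation is one that reaches the threshold with fewer components, which is the sense in which the abstract describes structure-first models as lower-dimensional.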

Paper Structure

This paper contains 24 sections, 12 figures, 5 tables.

Figures (12)

  • Figure 1: A structure-first curriculum, beginning with line drawings, leads to efficient, transferable, and human-like visual representations.
  • Figure 2: Different augmentations of the original image.
  • Figure 3: ResNet-18 data efficiency results of training with various augmentations of STL10 and finetuning on the base STL10 data.
  • Figure 4: ViT-Tiny data efficiency results of training on STL10 with our Line$\rightarrow$Color curriculum vs. Color only.
  • Figure 5: GradCAM visualizations comparing attention heat maps for models trained on color images vs. models trained with our curriculum.
  • ...and 7 more figures