Learning to Infer Generative Template Programs for Visual Concepts

R. Kenny Jones, Siddhartha Chaudhuri, Daniel Ritchie

TL;DR

This work tackles the challenge of learning flexible, general-purpose visual concepts without domain-specific priors by introducing Template Programs, a neurosymbolic framework that encodes concepts as DSL-based partial programs with HOLEs. Three inference networks (TemplateNet, ExpansionNet, and ParamNet) together infer and instantiate Template Programs from groups of visual inputs, trained with a two-stage protocol of synthetic pretraining followed by bootstrapped finetuning with self-supervised objectives. The approach is validated across three visual domains (2D layouts, Omniglot, and 3D shapes), where it improves on domain-general baselines and is competitive with domain-specific methods for few-shot generation and co-segmentation, while also enabling unconditional concept generation. The study highlights the framework's robustness to out-of-distribution inputs and the importance of bootstrapped finetuning and HOLE-based expansions. It also outlines avenues for extending relational expressivity and handling variable input group sizes.

Abstract

People grasp flexible visual concepts from a few examples. We explore a neurosymbolic system that learns how to infer programs that capture visual concepts in a domain-general fashion. We introduce Template Programs: programmatic expressions from a domain-specific language that specify structural and parametric patterns common to an input concept. Our framework supports multiple concept-related tasks, including few-shot generation and co-segmentation through parsing. We develop a learning paradigm that allows us to train networks that infer Template Programs directly from visual datasets that contain concept groupings. We run experiments across multiple visual domains: 2D layouts, Omniglot characters, and 3D shapes. We find that our method outperforms task-specific alternatives, and performs competitively against domain-specific approaches for the limited domains where they exist.

Paper Structure (67 sections, 4 equations, 8 figures, 4 tables)

This paper contains 67 sections, 4 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Our inference process. First, a group of visual inputs is encoded (Step 1). Next, our TemplateNet uses these encodings to infer a Template Program ($TP$, Step 2). The $TP$ and each encoding are then sent to the ExpansionNet to produce a Structural Expansion ($SE$) for each input (Step 3). Finally, each $SE$ is passed to the ParamNet to produce a set of complete programs that explain the inputs (Step 4).
  • Figure 2: We learn to infer Template Programs that capture input concepts (Inp). Template Programs produce consistent concept parses (Seg) and synthesize new generations (Gen). Our framework flexibly extends across different visual domains and input representations.
  • Figure 3: Comparing few-shot generations of Omniglot characters.
  • Figure 4: We compare co-segmentations produced from voxelized shapes (Input) to ground-truth annotations (GT).
  • Figure 5: Qualitative examples of unconditional concept generations on the Omniglot domain. We show 30 concepts synthesized by our method where each concept is associated with two rows of five images. The bottom five images depict five samples from each concept, and the top five images show the nearest neighbor in the training set by Chamfer distance to each sample.
  • ...and 3 more figures
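
The four-step inference process described in the Figure 1 caption can be sketched as a simple data flow. The following is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`infer_group`, `toy_encode`, and the other `toy_*` stand-ins), the token-list program representation, the `Union`/`Box` tokens, and the `?` parameter placeholder are all hypothetical. Passing the encoding to the ParamNet stand-in is likewise an assumption. The learned networks are replaced by trivial rule-based functions so that the flow is runnable.

```python
from typing import Callable, List, Sequence, Tuple

HOLE = "HOLE"  # unfilled structural slot in a Template Program
PARAM = "?"    # unfilled parameter slot (hypothetical notation)

Program = List[str]  # a program as a flat token list (assumed representation)


def infer_group(
    inputs: Sequence[int],
    encode: Callable[[int], int],
    template_net: Callable[[List[int]], Program],
    expansion_net: Callable[[Program, int], Program],
    param_net: Callable[[Program, int], Program],
) -> Tuple[Program, List[Program], List[Program]]:
    """Run the four steps from the Figure 1 caption over one input group."""
    encodings = [encode(x) for x in inputs]                         # Step 1
    template = template_net(encodings)                              # Step 2: one TP per group
    expansions = [expansion_net(template, e) for e in encodings]    # Step 3: one SE per input
    programs = [param_net(se, e) for se, e in zip(expansions, encodings)]  # Step 4
    return template, expansions, programs


# Toy stand-ins for the learned components (assumptions, for illustration only).
def toy_encode(x: int) -> int:
    return x  # a real encoder would map a visual input to a feature vector


def toy_template_net(encodings: List[int]) -> Program:
    # Shared structure for the whole group, with one HOLE left open.
    return ["Union", f"Box({PARAM})", HOLE]


def toy_expansion_net(template: Program, enc: int) -> Program:
    # Fill each HOLE with input-specific structure; parameters stay open.
    out: Program = []
    for tok in template:
        if tok == HOLE:
            out.extend([f"Box({PARAM})"] * enc)
        else:
            out.append(tok)
    return out


def toy_param_net(expansion: Program, enc: int) -> Program:
    # Fill every open parameter slot with a concrete value.
    return [tok.replace(PARAM, str(enc)) for tok in expansion]


template, expansions, programs = infer_group(
    [1, 2], toy_encode, toy_template_net, toy_expansion_net, toy_param_net
)
```

In this toy run, both inputs share the same template, while their structural expansions differ in length because the HOLE is filled per input. That mirrors the caption's split between group-level structure (the $TP$) and input-level structure and parameters (the $SE$s and the complete programs).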