Visual Program Distillation with Template-Based Augmentation

Michal Shlapentokh-Rothman, Yu-Xiong Wang, Derek Hoiem

TL;DR

This work tackles the high cost of adapting visual programming to specialized VQA tasks by introducing template-based augmentation and auto-context distillation, which train small language models (≤1B parameters) to generate executable visual programs. By decoupling program structure into templates and arguments, the approach enables synthetic data generation without human-written programs and reduces annotation costs to roughly $1 per dataset. The three-stage pipeline (teacher annotation, template-based augmentation, and LoRA-based student training) yields distilled models with competitive answer accuracy, higher program accuracy, and up to 30.8× faster inference than teacher models. Auto-context generation often matches or surpasses manual annotations, and data augmentation further improves several metrics; the results also indicate that API/vision-model reliability largely governs final performance. Overall, template-based visual program distillation enables rapid, cost-effective specialization of visual programming systems for targeted applications, easing deployment on consumer hardware.

Abstract

Adapting visual programming, or prompting large language models (LLMs) to generate executable code for visual tasks like visual question answering (VQA), to specialized tasks or domains remains challenging due to high annotation and inference costs. We propose a low-cost visual program distillation method that can be used for models with at most 1 billion parameters and requires no human-generated program annotations. We achieve this through synthetic data augmentation based on decoupling programs into higher-level skills, called templates, and their corresponding arguments. Experimental results show that, with a relatively small amount of question/answer data, small language models can generate high-quality specialized visual programs with the added benefit of much faster inference.

Paper Structure

This paper contains 37 sections, 6 figures, 14 tables, 1 algorithm.

Figures (6)

  • Figure 1: Accuracy vs. throughput for visual program generation on GQA. Generalist LLMs (teacher models) offer high accuracy at the cost of low throughput and large model size (proportional to marker size). With our template-based augmentation method, specialized distilled student models achieve comparable answer accuracy with a small percentage of question/answer data ($\approx 0.1\%$) and no human program annotations.
  • Figure 2: An overview of our augmentation method. Programs are first separated into templates and arguments; new arguments are then selected and plugged back into the question/program pair. Templates are created by renaming variables and removing question-specific concepts. A single teacher-generated question/program pair can turn into hundreds of new question/program pairs.
  • Figure 3: An example of our data augmentation approach. Both the new and old questions have the same template, so the template matcher should predict the same template for both. The arguments of the new and old programs differ: (dog, sofa, brown) should be replaced with (bear, desk, green).
  • Figure 4: The frequency of errors across the different categories for GQA program evaluation. Augmentation reduces the number of 'Does Not Answer Question' mistakes.
  • Figure 5: Three question/program pairs generated with no augmentation, with augmentation, and with the auto-context teacher. Simple comparison questions (left-hand side) are almost always correct, while questions with negations are almost always incorrect across the different methods.
  • ...and 1 more figure
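
As a rough illustration of the template/argument decoupling described in the captions for Figures 2 and 3, the Python sketch below turns one question/program pair into a template and then fills it with new arguments. It reuses the (dog, sofa, brown) and (bear, desk, green) arguments from the Figure 3 caption. Everything else is an assumption made for illustration and is not taken from the paper: the function names `make_template` and `fill_template`, the `{ARGi}` placeholder syntax, the example question, the program calls `find`, `relation`, and `verify_attribute`, and the argument pools. The sketch also omits the variable renaming and the template matcher that the captions mention.

```python
import re
from itertools import product


def make_template(question, program, arguments):
    """Replace each question-specific argument with a numbered placeholder.

    Hypothetical helper: a whole-word match keeps identifiers that merely
    contain an argument (for example a variable name) untouched.
    """
    for i, arg in enumerate(arguments):
        placeholder = f"{{ARG{i}}}"          # e.g. "{ARG0}"
        pattern = rf"\b{re.escape(arg)}\b"
        question = re.sub(pattern, placeholder, question)
        program = re.sub(pattern, placeholder, program)
    return question, program


def fill_template(q_template, p_template, new_arguments):
    """Plug new arguments into both halves of a templated pair."""
    for i, arg in enumerate(new_arguments):
        placeholder = f"{{ARG{i}}}"
        q_template = q_template.replace(placeholder, arg)
        p_template = p_template.replace(placeholder, arg)
    return q_template, p_template


# One teacher-style pair; the program API here is invented for illustration.
question = "Is the dog on the sofa brown?"
program = (
    'candidates = find(image, "dog")\n'
    'matches = [c for c in candidates if relation(c, "on", "sofa")]\n'
    'answer = verify_attribute(matches[0], "brown")'
)

q_t, p_t = make_template(question, program, ("dog", "sofa", "brown"))
new_q, new_p = fill_template(q_t, p_t, ("bear", "desk", "green"))

# Small made-up argument pools: every combination yields a new pair.
pools = [("bear", "cat"), ("desk", "bed"), ("green", "white")]
augmented = [fill_template(q_t, p_t, combo) for combo in product(*pools)]

print(q_t)
print(new_q)
print(len(augmented))
```

Because the number of generated pairs is the product of the pool sizes, even modest pools multiply quickly, which is consistent with the Figure 2 caption's remark that one pair can yield hundreds of new ones. A real pipeline would presumably also need to check that the substituted arguments make sense together, which this sketch does not attempt.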