Visual Program Distillation with Template-Based Augmentation
Michal Shlapentokh-Rothman, Yu-Xiong Wang, Derek Hoiem
TL;DR
This work tackles the high cost of adapting visual programming to specialized VQA tasks by introducing template-based augmentation and auto-context distillation, which train small language models (≤1B parameters) to generate executable visual programs. Decoupling program structure into templates and arguments enables synthetic data generation without human-written programs and reduces annotation costs to roughly $1 per dataset. The three-stage pipeline (teacher annotation, template-based augmentation, and LoRA-based student training) yields distilled models with competitive answer accuracy, higher program accuracy, and up to 30.8× speedups over teacher models. Auto-context generation often matches or surpasses manual annotations, and data augmentation further improves several metrics; the results also indicate that API/vision-model reliability largely governs final performance. Overall, template-based visual program distillation enables rapid, cost-effective specialization of visual programming systems for targeted applications, easing deployment on consumer hardware.
Abstract
Adapting visual programming, or prompting large language models (LLMs) to generate executable code for visual tasks like visual question answering (VQA), to specialized tasks or domains remains challenging due to high annotation and inference costs. We propose a low-cost visual program distillation method that can be used for models with at most 1 billion parameters and requires no human-generated program annotations. We achieve this through synthetic data augmentation based on decoupling programs into higher-level skills, called templates, and their corresponding arguments. Experimental results show that, with a relatively small amount of question/answer data, small language models can generate high-quality specialized visual programs with the added benefit of much faster inference.
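To make the template/argument decoupling concrete, the following is a minimal Python sketch under stated assumptions, not the authors' implementation. It assumes that arguments are the single-quoted string literals in a program and that everything else forms the template; the paper's actual definition of a template may differ. The visual API calls `find` and `count`, and the helper names `split_program`, `fill_template`, and `augment`, are hypothetical and chosen only for illustration.

```python
import re

# Assumption: arguments are single-quoted string literals in the program text.
ARG_PATTERN = re.compile(r"'([^']*)'")


def split_program(program):
    """Split a program into (template, arguments) by masking quoted literals."""
    args = ARG_PATTERN.findall(program)
    counter = iter(range(len(args)))
    # Replace each literal with a numbered placeholder such as '{arg0}'.
    template = ARG_PATTERN.sub(lambda m: "'{arg%d}'" % next(counter), program)
    return template, args


def fill_template(template, args):
    """Substitute arguments back into a template's numbered placeholders."""
    program = template
    for i, arg in enumerate(args):
        program = program.replace("{arg%d}" % i, arg)
    return program


def augment(template, arg_pool):
    """Create synthetic programs by pairing one template with new argument lists."""
    return [fill_template(template, args) for args in arg_pool]


# Hypothetical teacher-generated program; `find` and `count` are placeholder APIs.
program = "boxes = find(image, 'dog')\nanswer = count(boxes)"
template, args = split_program(program)
synthetic = augment(template, [["cat"], ["traffic light"]])
```

In this sketch a single teacher program yields a reusable template, and each new argument list produces another program with the same structure, which is the sense in which augmentation needs no additional human-written programs. Pairing the synthetic programs with matching questions is left out here.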
