A Stepwise Distillation Learning Strategy for Non-differentiable Visual Programming Frameworks on Visual Reasoning Tasks
Wentao Wan, Nan Kang, Zeqing Wang, Zhuojie Yang, Liang Lin, Keze Wang
TL;DR
This work tackles the non-differentiability and weak task-specific performance of Visual Programming (VProg) in Visual Reasoning (VR) by introducing SDVP, a Stepwise Distillation learning strategy for non-differentiable VProg. SDVP distills knowledge from small, task-specific teachers into the larger pre-trained visual models that power VProg's sub-modules, enabling targeted gains on VR tasks such as GQA and NLVRv2 while preserving cross-task generalization. The method relies on an Adapter to align interfaces, pseudo-labels from task-specific teachers, and a stepwise distillation loss to guide sub-task learning, with sequential self-distillation to mitigate forgetting. Empirical results show significant target-task improvements and evidence of anti-forgetting, supported by ablations on data size, sub-module choices, and cross-framework transfer. The approach highlights a practical path to continual learning for non-differentiable multi-task VR systems, leveraging task decomposition and distillation instead of end-to-end fine-tuning.
Abstract
Recently, Visual Programming (VProg) has emerged as a significant framework for visual reasoning (VR) tasks due to its interpretability and cross-task generality. However, even when it invokes powerful pre-trained Vision-Language Models (VLMs) as visual sub-modules, VProg performs markedly worse on specific VR tasks than well-trained task-specific networks. Invoking task-specific models can improve VProg's performance on specific VR tasks, but it greatly diminishes VProg's cross-task generalization ability. Moreover, the non-differentiable nature of VProg prevents direct fine-tuning on specific VR tasks for further performance improvement. To address these issues, we propose SDVP, a Stepwise Distillation learning strategy for non-differentiable VProg across various VR tasks. Specifically, SDVP distills, step by step, the capabilities of existing, well-trained, small task-specific models for the decomposed visual sub-tasks in VProg into the much larger VLMs invoked by the corresponding visual sub-modules. Distilling the knowledge of small task-specific models into the larger pre-trained VLMs, rather than replacing those VLMs, helps preserve the cross-task abilities of VProg. Extensive experimental results on different VProg frameworks demonstrate that SDVP obtains significant performance gains on specific VR benchmarks, i.e., GQA (+2.4%) and NLVRv2 (+6.2%) for VisProg and GQA (+6.5%) and NLVRv2 (+4.0%) for ViperGPT, while maintaining promising performance on unseen and previous VR tasks.
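
To make the idea of stepwise distillation concrete, the following is a minimal illustrative sketch, not the loss used in the paper. It assumes each decomposed visual sub-task step yields a vector of teacher logits (from a small task-specific model) and a vector of student logits (from the larger VLM behind the corresponding sub-module), and it averages a temperature-softened KL divergence over the steps. The function names, the KL objective, and the temperature value are all hypothetical choices made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def stepwise_distillation_loss(teacher_steps, student_steps, temperature=2.0):
    """Average per-step KL between teacher and student output distributions.

    teacher_steps / student_steps: one logit vector per sub-task step.
    This is an assumed, generic formulation, not the paper's exact loss.
    """
    if len(teacher_steps) != len(student_steps):
        raise ValueError("teacher and student must cover the same steps")
    if not teacher_steps:
        raise ValueError("at least one sub-task step is required")
    per_step = []
    for t_logits, s_logits in zip(teacher_steps, student_steps):
        p = softmax(t_logits, temperature)  # teacher pseudo-label distribution
        q = softmax(s_logits, temperature)  # student (VLM sub-module) distribution
        per_step.append(kl_divergence(p, q))
    return sum(per_step) / len(per_step)


if __name__ == "__main__":
    teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]  # two sub-task steps
    student = [[1.0, 0.8, -0.5], [0.2, 0.9, 0.4]]
    print(stepwise_distillation_loss(teacher, student))
```

Because the supervision in this sketch is applied to each sub-module's output separately, no gradient has to flow through the program that chains the sub-modules together, which is the property that makes distillation attractive for a non-differentiable pipeline.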
