
A Stepwise Distillation Learning Strategy for Non-differentiable Visual Programming Frameworks on Visual Reasoning Tasks

Wentao Wan, Nan Kang, Zeqing Wang, Zhuojie Yang, Liang Lin, Keze Wang

TL;DR

This work tackles the non-differentiability and weak task-specific performance of Visual Programming (VProg) in visual reasoning (VR) by introducing SDVP, a Stepwise Distillation learning strategy for non-differentiable VProg. SDVP distills knowledge from small, task-specific teachers into the larger pre-trained visual models that power VProg's sub-modules, enabling targeted gains on VR tasks such as GQA and NLVRv2 while preserving cross-task generalization. The method relies on an Adapter to align interfaces, pseudo-labels from task-specific teachers, and a stepwise distillation loss to guide sub-task learning, with sequential self-distillation to mitigate forgetting. Empirical results show significant target-task improvements and evidence of anti-forgetting, supported by ablations on data size, sub-module choices, and cross-framework transfer. The approach highlights a practical path to continual learning for non-differentiable multi-task VR systems, leveraging task decomposition and distillation instead of end-to-end fine-tuning.

Abstract

Recently, Visual Programming (VProg) has emerged as a significant framework for visual reasoning (VR) tasks due to its interpretability and cross-task generality. However, even when invoking powerful pre-trained Vision-Language Models (VLMs) as visual sub-modules, the performance of VProg on specific VR tasks is markedly inferior to that of well-trained task-specific networks. Although invoking task-specific models can further enhance the performance of VProg on specific VR tasks, it greatly diminishes the cross-task generalization ability of VProg. Moreover, the non-differentiable nature of VProg prevents direct fine-tuning on specific VR tasks for further performance improvement. To address these issues, we propose SDVP, a Stepwise Distillation learning strategy for non-differentiable VProg across various VR tasks. Specifically, SDVP distills, step by step, the capabilities of existing, well-trained small task-specific models for the decomposed visual sub-tasks in VProg into the much larger VLMs invoked by the corresponding visual sub-modules. Furthermore, distilling the knowledge of small task-specific models into larger pre-trained VLMs, rather than replacing them, helps preserve the cross-task abilities of VProg. Extensive experimental results on different VProg frameworks demonstrate that SDVP obtains significant performance gains on specific VR benchmarks, i.e., GQA (+2.4%) and NLVRv2 (+6.2%) for VisProg and GQA (+6.5%) and NLVRv2 (+4.0%) for ViperGPT, while maintaining promising performance for VProg on unseen and previous VR tasks.
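To make the idea of a per-step distillation signal concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's actual loss: it assumes each sub-module call yields student logits over a small answer vocabulary, that the teacher supplies a hard pseudo-label (an index into that vocabulary), and that the per-step losses are simply averaged. The function names and the toy numbers are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def step_loss(student_logits, teacher_label):
    """Cross-entropy of the student's prediction against the teacher's pseudo-label.

    Assumption: the teacher provides a hard label (an index), not a soft distribution.
    """
    probs = softmax(student_logits)
    return -math.log(probs[teacher_label])

def stepwise_distillation_loss(steps):
    """Average the per-step losses over all sub-module calls in one program.

    `steps` is a list of (student_logits, teacher_label) pairs, one per sub-task.
    """
    return sum(step_loss(logits, label) for logits, label in steps) / len(steps)

# Hypothetical program with two sub-module calls.
# Step 1: binary answer vocabulary; the student prefers index 1, the teacher says 0.
# Step 2: three-way vocabulary; student and teacher agree on index 0.
steps = [([0.2, 1.5], 0), ([2.0, 0.1, 0.3], 0)]
loss = stepwise_distillation_loss(steps)
print(f"stepwise distillation loss: {loss:.4f}")
```

Because the signal is computed per sub-module call rather than on the final answer, nothing has to be back-propagated through the non-differentiable program execution; each sub-module can be updated from its own (input, pseudo-label) pairs.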

Paper Structure (33 sections, 7 equations, 6 figures, 7 tables)

This paper contains 33 sections, 7 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Performance analysis of VisProg [Gupta2022cvpr_VisProg], a classical VProg framework, when invoking different visual models for the four VR tasks. Note that the target VR task is NLVRv2 [nlvr]. Invoking existing models trained on NLVRv2 as the relevant visual sub-modules enables VisProg to perform better on NLVRv2, but its capabilities on other tasks decline significantly. Our SDVP, which distills the capabilities of models well-trained on NLVRv2 into the sub-modules of VisProg, allows VisProg to improve performance on NLVRv2 while essentially maintaining its capabilities on other tasks, thus preserving the cross-task generalization ability of VisProg.
  • Figure 2: Overview of our proposed SDVP pipeline. SDVP aims to fine-tune visual sub-modules to produce correct predictions for visual sub-tasks, so as to reach the correct final answer to the original question. The right part shows how SDVP works. During the execution of the program, the visual sub-module "verify_property" is called to verify whether the clothing in the image is red, and it wrongly answers "no". To rectify this, we use the teacher model to answer the same question, and then let "verify_property" learn the teacher's answer through distillation. After distillation, "verify_property" gives the correct answer "yes". Similarly, after distillation the visual sub-module "simple_query" correctly answers "shirt" instead of "dress". Because each sub-module now reasons correctly, the final step yields the correct answer "shirt" to the original question "Which kind of clothing is red?" for the given image.
  • Figure 3: Qualitative examples showing the program execution flow of the original VisProg and of VisProg after learning with our SDVP. For each example, we provide detailed intermediate outputs during execution and the final answers. Ans denotes the final answer, GT the ground truth, Ori the original VisProg, T the teacher, and SDVP denotes VisProg after learning with our SDVP.
  • Figure 4: Qualitative examples showing the program execution flow of the original ViperGPT and of ViperGPT after learning with our SDVP. For each example, we provide detailed intermediate outputs during execution and the final answers. Ans denotes the final answer, GT the ground truth, Ori the original ViperGPT, T the teacher, and SDVP denotes ViperGPT after learning with our SDVP.
  • Figure 5: Sources of errors in VisProg and ViperGPT before and after stepwise distillation on GQA with our SDVP. Here, 'Other Error' includes synonyms, acceptable answers, and dirty data.
  • ...and 1 more figure
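The walkthrough in the Figure 2 caption can be sketched as a small Python trace. This is a toy illustration, not the authors' pipeline: the student and teacher are stand-in lookup tables, the sub-question strings are hypothetical paraphrases of the caption, and `collect_distillation_targets` is an assumed helper name. It only shows how a teacher's answers to the same sub-questions become per-step pseudo-labels.

```python
def collect_distillation_targets(program, student, teacher):
    """Query student and teacher on every program step.

    The teacher's answer is kept as the pseudo-label for that step, and steps
    where the student disagrees are flagged as needing correction.
    """
    targets = []
    for module, query in program:
        student_answer = student[(module, query)]
        pseudo_label = teacher[(module, query)]
        targets.append({
            "module": module,
            "query": query,
            "student": student_answer,
            "pseudo_label": pseudo_label,
            "needs_correction": student_answer != pseudo_label,
        })
    return targets

# Hypothetical two-step program mirroring the caption's example.
program = [
    ("verify_property", "Is the clothing red?"),
    ("simple_query", "Which kind of clothing is red?"),
]

# Stand-in models: plain dictionaries instead of real vision-language models.
student = {
    ("verify_property", "Is the clothing red?"): "no",
    ("simple_query", "Which kind of clothing is red?"): "dress",
}
teacher = {
    ("verify_property", "Is the clothing red?"): "yes",
    ("simple_query", "Which kind of clothing is red?"): "shirt",
}

targets = collect_distillation_targets(program, student, teacher)
for t in targets:
    print(t["module"], "student:", t["student"], "pseudo-label:", t["pseudo_label"])
```

In this toy trace both steps are flagged for correction, matching the caption: "verify_property" should move from "no" to "yes", and "simple_query" from "dress" to "shirt".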