Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

Zaid Khan, Vijay Kumar BG, Samuel Schulter, Yun Fu, Manmohan Chandraker

TL;DR

This paper improves visual program synthesis by training an open LLM with reinforced self-training, which sidesteps the lack of large visual-program datasets. It introduces VisReP, a model-agnostic loop that derives coarse rewards from existing vision-language annotations and uses a Grow–Improve policy gradient scheme to fine-tune the LLM on visual tasks such as object detection, VQA, and image-text matching. Empirical results show substantial gains over baseline frozen LLMs and performance competitive with GPT-3.5-turbo across multiple tasks, along with analyses of data efficiency and synthesis syntax. The work highlights the practical potential of self-guided improvement of LLMs on vision-language tasks and suggests neural reward modeling as future work toward finer-grained feedback.

Abstract

Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks. Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs. Training an LLM to write better visual programs is an attractive prospect, but it is unclear how to accomplish this. No dataset of visual programs for training exists, and acquisition of a visual program dataset cannot be easily crowdsourced due to the need for expert annotators. To get around the lack of direct supervision, we explore improving the program synthesis abilities of an LLM using feedback from interactive experience. We propose a method where we exploit existing annotations for a vision-language task to improvise a coarse reward signal for that task, treat the LLM as a policy, and apply reinforced self-training to improve the visual program synthesis ability of the LLM for that task. We describe a series of experiments on object detection, compositional visual question answering, and image-text retrieval, and show that in each case, the self-trained LLM outperforms or performs on par with few-shot frozen LLMs that are an order of magnitude larger. Website: https://zaidkhan.me/ViReP
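
The loop described in the abstract (a coarse reward from existing annotations, the LLM treated as a policy, reinforced self-training) can be illustrated with a small toy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: `ToyPolicy` stands in for the LLM, `toy_execute` stands in for a visual program executor, the programs are plain strings, and the "improve" step merely re-weights candidates instead of fine-tuning a model.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical stand-in for running a visual program on an image.
TOY_OUTPUTS = {"count(find('dog'))": "2", "count(find('cat'))": "0"}


def toy_execute(program: str) -> str:
    """Return the toy program's answer, or raise if it cannot run."""
    if program not in TOY_OUTPUTS:
        raise RuntimeError("program failed to run")
    return TOY_OUTPUTS[program]


def coarse_reward(predicted: str, target: str) -> float:
    """Coarse reward: 1.0 if the program's answer matches the annotation."""
    return 1.0 if predicted.strip().lower() == target.strip().lower() else 0.0


class ToyPolicy:
    """Stand-in for the LLM: a weighted choice over candidate programs."""

    def __init__(self, candidates: Dict[str, List[str]], seed: int = 0):
        self.candidates = candidates
        self.weights = {q: [1.0] * len(p) for q, p in candidates.items()}
        self.rng = random.Random(seed)

    def sample(self, query: str) -> str:
        return self.rng.choices(
            self.candidates[query], weights=self.weights[query], k=1
        )[0]

    def improve(self, kept: List[Tuple[str, str]]) -> None:
        # A real system would fine-tune the LLM on the kept programs;
        # this toy version just up-weights them.
        for query, program in kept:
            idx = self.candidates[query].index(program)
            self.weights[query][idx] += 1.0


def grow(policy, execute, dataset, samples_per_query: int = 4):
    """Sample programs from the policy and keep those with full reward."""
    kept = []
    for query, target in dataset:
        for _ in range(samples_per_query):
            program = policy.sample(query)
            try:
                answer = execute(program)
            except Exception:
                continue  # programs that crash earn no reward
            if coarse_reward(answer, target) == 1.0:
                kept.append((query, program))
    return kept


if __name__ == "__main__":
    query = "how many dogs?"
    policy = ToyPolicy(
        {query: ["count(find('dog'))", "count(find('cat'))", "crash()"]}
    )
    data = [(query, "2")]
    for iteration in range(5):
        policy.improve(grow(policy, toy_execute, data))
        print(iteration, policy.weights[query])
```

The reward never inspects the program itself, only the answer it produces, which is why no dataset of reference programs is needed in this setup.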

Paper Structure (29 sections, 4 equations, 14 figures, 4 tables)

Figures (14)

  • Figure 1: Visual program synthesis with LLMs has been treated as a 0/n-shot task where the LLM is kept frozen. This limits opportunities for improvement. We ask whether it is possible to train an LLM to write more accurate programs. Given that there is no large-scale dataset of accurate visual programs available, we propose improving the LLM using self-training.
  • Figure 2: VisReP can be applied to improve the visual program synthesis abilities of an LLM for a vision-language task using that task's existing annotations (e.g. an object description + image + bounding boxes). A key idea is to construct a coarse reward by comparing the answer produced by a synthesized program to the ground-truth answer.
  • Figure 3: Self-training with VisReP produces qualitatively better programs. Here, we show programs written by the initial policy (on the left) and the policy after 10 iterations of self-training on GQA (on the right). In the VQA example, the initial policy does not specifically check whether the empty basket is plastic. In the object detection example, the reasoning of the initial policy is correct, but it issues a confusingly worded query to the simple_query module, which returns the wrong answer. The learned policy uses simple_query more appropriately. In the image-text matching example, the initial policy tries to use the object detector to search directly for "meat in a box" and "donuts on a plate", but this is too complicated for the object detector to localize. After self-training, the LLM policy no longer makes this mistake.
  • Figure 4: Iteratively applying VisReP allows an LLM to self-improve on almost all of GQA's $\approx$ 100 question types. The base of each bar is set to the accuracy of the initial policy (codellama-7b-instruct). A green bar indicates question types on which the policy at iteration 10 improved over the initial policy, and a red bar indicates question types on which the policy at iteration 10 was worse than the initial policy.
  • Figure 5: Supplying a small number of human-written corrections as in-context examples during training can increase the stability of the self-training process (green line). We show validation accuracy on GQA through multiple iterations of self-training with a policy instantiated from CodeLlama-7b. Without these corrections, proliferating errors cause performance to degrade in later iterations (red line). The translucent shading around each line indicates the standard deviation over 5 evaluations on the validation set.
  • ...and 9 more figures
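
The Figure 2 caption describes building a coarse reward by comparing a program's answer to the ground-truth annotation. For the bounding-box case it mentions, one plausible form of such a reward is sketched below. This is an illustrative assumption, not the paper's stated reward: the `(x1, y1, x2, y2)` box format, the function names, and the 0.5 IoU threshold are all hypothetical choices.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def box_area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def detection_reward(predicted: Box, target: Box, threshold: float = 0.5) -> float:
    """Binary reward: 1.0 when the predicted box overlaps the annotation enough.

    The threshold is an assumed value chosen for illustration.
    """
    return 1.0 if iou(predicted, target) >= threshold else 0.0
```

A binary reward of this kind is coarse in the sense that it says whether the final answer was acceptable but not which step of the program was responsible.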