TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks

Zhiruo Wang, Daniel Fried, Graham Neubig

TL;DR

TroVE introduces a training-free framework for inducing reusable function toolboxes to solve programmatic tasks with language models. It maintains a growing library, uses execution agreement to select high-quality solutions, and periodically trims low-utility tools, achieving higher accuracy and simpler solutions across math, table QA, and visual reasoning datasets. Human verification is 31% faster and 13% more accurate with TroVE than with baselines, and the induced functions reveal task- and dataset-specific specialization. The approach shrinks the tool library by 79-98% while maintaining or improving performance, indicating strong practical potential for scalable, verifiable program synthesis across domains.

Abstract

Language models (LMs) can solve tasks such as answering questions about tables or images by writing programs. However, using primitive functions often leads to verbose and error-prone programs, and higher-level functions require expert design. To enable better solutions without human labor, we ask code LMs to curate reusable high-level functions, and use them to write solutions. We present TROVE, a training-free method of inducing a verifiable and efficient toolbox of functions, by generating via using, growing, and periodically trimming the toolbox. On 11 datasets from math, table question answering, and image reasoning tasks, TROVE consistently yields simpler solutions with higher accuracy than baselines using CODELLAMA and previous methods using GPT, while using 79-98% smaller toolboxes. TROVE further enables 31% faster and 13% more accurate human verification than baselines. With the same pipeline, it creates diverse functions for varied tasks and datasets, providing insights into their individual characteristics.
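The use, grow, and trim cycle described in the abstract can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the paper's code: the `Toolbox` class, its method names, and the fixed `min_uses` threshold are assumptions chosen only to show the bookkeeping such a toolbox needs.

```python
from collections import Counter


class Toolbox:
    """Hypothetical toolbox: maps function names to source and tracks usage."""

    def __init__(self):
        self.functions = {}     # name -> source code of an induced function
        self.usage = Counter()  # name -> number of recorded uses

    def add(self, name, source):
        # Growing: store a newly induced function.
        self.functions[name] = source

    def record_use(self, name):
        # Using: count each time a solution calls a stored function.
        if name in self.functions:
            self.usage[name] += 1

    def trim(self, min_uses):
        # Trimming: drop functions used fewer than `min_uses` times.
        # The fixed threshold is an assumption made for this sketch.
        kept = {n: s for n, s in self.functions.items()
                if self.usage[n] >= min_uses}
        removed = sorted(set(self.functions) - set(kept))
        self.functions = kept
        self.usage = Counter({n: c for n, c in self.usage.items() if n in kept})
        return removed


toolbox = Toolbox()
toolbox.add("count_rows", "def count_rows(df): return len(df)")
toolbox.add("rarely_used", "def rarely_used(x): return x")
for _ in range(3):
    toolbox.record_use("count_rows")
toolbox.record_use("rarely_used")
removed = toolbox.trim(min_uses=2)
```

In this sketch, trimming is what keeps the toolbox small: a function survives only if solutions keep calling it.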

Paper Structure (49 sections, 10 figures, 9 tables)

This paper contains 49 sections, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Our TroVE induces reusable functions to produce better program solutions than the Primitive setting, without training, supervision, or iterations required by existing methods.
  • Figure 2: Function design affects solutions. Using primitive functions results in complex, error-prone solutions (middle), while using abstract functions leads to more concise and accurate solutions (right).
  • Figure 3: Example input prompt in the tabular environment. We textualize tables in markdown format in the prompt, and ask LMs to parse tables into pandas DataFrames in program solutions.
  • Figure 4: An illustration of a model inducing a composed function and using it to generate the solution at the same time.
  • Figure 5: TroVE illustration. Top: generate solutions while using and growing the toolbox. Bottom: select the best response by execution agreement. Left: periodically trim low-utility functions.
  • ...and 5 more figures
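
The selection step named in the Figure 5 caption, choosing the best response by execution agreement, can be sketched as a majority vote over execution results. This is a minimal illustration, not the authors' implementation: the function name `select_by_agreement`, the use of `None` to mark failed executions, and the shortest-program tie-break are all assumptions made for the example. Results are assumed to be hashable values.

```python
from collections import Counter


def select_by_agreement(candidates):
    """Pick one (program, result) pair from sampled candidates.

    `candidates` is a list of (program_source, execution_result) tuples.
    A result of None marks a program that failed to execute (assumption).
    """
    executed = [(p, r) for p, r in candidates if r is not None]
    if not executed:
        return None
    # Vote: the execution result produced by the most programs wins.
    votes = Counter(r for _, r in executed)
    top_result, _ = votes.most_common(1)[0]
    agreeing = [(p, r) for p, r in executed if r == top_result]
    # Assumed tie-break: prefer the shortest program among those that agree.
    return min(agreeing, key=lambda pair: len(pair[0]))


samples = [
    ("total = sum(prices)", 42),
    ("total = 0\nfor p in prices:\n    total += p", 42),
    ("total = max(prices)", 17),
    ("total = prices.sum(", None),  # did not execute
]
best = select_by_agreement(samples)
```

Here two programs agree on the result 42, so one of them is selected, and the shorter of the two is returned under the assumed tie-break.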