Program Synthesis Dialog Agents for Interactive Decision-Making

Matthew Toles, Nikhil Balwani, Rattandeep Singh, Valentina Giulia Sartori Rodriguez, Zhou Yu

TL;DR

This work tackles interactive decision-making for real-world eligibility tasks by introducing BeNYfits, a benchmark for overlapping public-benefit programs, and ProADA, a program-synthesis-based agent that generates a Python decision tool to guide dialog. BeNYfits evaluates accuracy and dialog efficiency across diverse and representative user populations, using ground-truth eligibility checkers derived from human-written requirements. ProADA outperforms strong baselines, raising F1 to 55.6 (compared with 35.7 for GPT-4o using ReAct-style chain-of-thought) while using nearly the same number of dialog turns, and benefits from human-in-the-loop usability improvements in a concurrent user study. The approach emphasizes transparency and reliability by using a generated Python tool for reasoning, reducing hallucinations and enabling traceable decision logic, while acknowledging limitations related to data sources and real-world deployment risks.

Abstract

Many real-world eligibility problems, ranging from medical diagnosis to tax planning, can be mapped to decision problems expressed in natural language, wherein a model must make a binary choice based on user features. Large-scale domains such as legal codes or frequently updated funding opportunities render human annotation (e.g., web forms or decision trees) impractical, highlighting the need for agents that can automatically assist in decision-making. Since relevant information is often only known to the user, it is crucial that these agents ask the right questions. As agents determine when to terminate a conversation, they face a trade-off between accuracy and the number of questions asked, a key metric for both user experience and cost. To evaluate this task, we propose BeNYfits, a new benchmark for determining user eligibility for multiple overlapping social benefits opportunities through interactive decision-making. Our experiments show that current language models struggle with frequent hallucinations, with GPT-4o scoring only 35.7 F1 using a ReAct-style chain-of-thought. To address this, we introduce ProADA, a novel approach that leverages program synthesis to assist in decision-making by mapping dialog planning to a code generation problem and using gaps in structured data to determine the best next action. Our agent, ProADA, improves the F1 score to 55.6 while maintaining nearly the same number of dialog turns.

Paper Structure

This paper contains 39 sections, 1 equation, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Interactive decision-making dialog loop in BeNYfits. The agent is initialized with opportunity eligibility requirements for the "Train & Earn" opportunity (simplified). The agent then asks questions to the user until the agent answers Yes to the Ready prompt, at which point it Predicts the user's eligibility. Note that the agent skips requirement 3a because youth cannot register for selective service. Similarly, it skips requirement 3c because it becomes irrelevant if the user is a former foster care youth.
  • Figure 2: Number of opportunities dependent on each household feature. For example, 53 of 82 programs rely on age to determine eligibility. Top 20 features shown.
  • Figure 3: ProADA architecture. ProADA consists of the checker tool created by the code generation module (left) and the dialog module (center). The checker tool is a Python function that determines user eligibility from a structured user representation dictionary (right). The ProADA dialog module acts as an interface between the checker tool and the user. On each dialog turn, the agent runs the checker tool on the user dictionary, which is initially empty. On a key error, the dialog module fills in a single key-value pair by asking the user a question and converting the answer to a value consistent with the checker tool logic. The dialog ends once the checker tool has returned a value for every opportunity.
  • Figure 4: Average of Representative and Diverse dataset F1 vs. dialog turns to completion for ProADA and baseline models. Legend follows the format Strategy (Code Model) - Dialog Model.
  • Figure 5: ProADA program synthesis errors
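
The loop described in the Figure 3 caption (run the checker, catch the key error, ask one question, retry) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the eligibility rules in `check_example_program`, the dictionary keys, the `ask` callback, and the `max_turns` cap are all hypothetical. In ProADA the checker is generated by a code model and the question-asking and answer-parsing step is handled by a language model, which the plain callback below only stands in for.

```python
from typing import Any, Callable, Dict, Tuple


def check_example_program(user: Dict[str, Any]) -> bool:
    """Hypothetical checker tool; the rules below are invented for illustration."""
    if not (16 <= user["age"] <= 24):
        return False
    # If this key is True, the income key is never read, so it is never asked.
    if user["is_former_foster_youth"]:
        return True
    return user["is_low_income"]


def run_dialog(
    checker: Callable[[Dict[str, Any]], bool],
    ask: Callable[[str], Any],
    max_turns: int = 20,  # hypothetical safety cap, not taken from the paper
) -> Tuple[bool, int]:
    """Run the checker on an initially empty user dict, filling one key per turn."""
    user: Dict[str, Any] = {}
    turns = 0
    while True:
        try:
            # A returned value means every feature the checker needed is known.
            return checker(user), turns
        except KeyError as err:
            if turns >= max_turns:
                raise RuntimeError("dialog exceeded max_turns") from err
            missing_key = err.args[0]
            # Stand-in for the dialog module: ask about the single missing key.
            user[missing_key] = ask(missing_key)
            turns += 1


# Scripted user for demonstration; a real system would query a person.
scripted_answers = {"age": 19, "is_former_foster_youth": True, "is_low_income": False}
asked_keys = []


def scripted_ask(key: str) -> Any:
    asked_keys.append(key)
    return scripted_answers[key]


eligible, turns = run_dialog(check_example_program, scripted_ask)
```

Because the checker only raises on keys it actually reads, questions made irrelevant by earlier answers are skipped automatically. In this sketch the income question is never asked once the scripted user reports being a former foster youth, which mirrors the requirement-skipping behavior noted in the Figure 1 caption.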