Natural Language Embedded Programs for Hybrid Language Symbolic Reasoning

Tianhua Zhang, Jiaxin Ge, Hongyin Luo, Yung-Sung Chuang, Mingye Gao, Yuan Gong, Xixin Wu, Yoon Kim, Helen Meng, James Glass

TL;DR

Natural Language Embedded Programs (NLEP) unify language-based reasoning with executable program synthesis by prompting LLMs to emit fully runnable Python code that operates on natural-language-encoded knowledge; a Python interpreter runs the code and returns the result, making the reasoning trace explicit. The approach applies task-general prompts across math, symbolic reasoning, QA, instruction following, and text classification, and on most tasks it achieves higher accuracy and better prompt efficiency than chain-of-thought (CoT) and program-of-thoughts (PoT) baselines, with GPT-4 showing the strongest gains. NLEPs are also interpretable, since the generated programs lay out the reasoning steps that the interpreter executes, and a model-free variant shows potential for fast, interpretable classification. Limitations include variable gains on GSM-Hard and reduced performance on long-form natural language outputs; future work aims to extend the technique to longer outputs and more diverse tools while addressing alignment concerns.

Abstract

How can we perform computations over natural language representations to solve tasks that require symbolic and numeric reasoning? We propose natural language embedded programs (NLEP) as a unifying framework for addressing math/symbolic reasoning, natural language understanding, and instruction following tasks. Our approach prompts a language model to generate full Python programs that define functions over data structures which contain natural language representations of structured knowledge. A Python interpreter then executes the generated code and prints the output. Despite using a task-general prompt, we find that this approach can improve upon strong baselines across a range of different tasks including math and symbolic reasoning, text classification, question answering, and instruction following. We also find that the generated programs are interpretable, since they outline the exact reasoning process followed by the program interpreter.
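
To make the description above concrete, here is a minimal sketch of what a program with this shape might look like. It is not taken from the paper: the question (which of a few recent US presidents were born on a given weekday), the record layout, and the helper names `presidents_born_on` and `answer` are illustrative assumptions. The sketch follows the pattern the abstract describes: a data structure holding natural-language knowledge, a function defined over it, and a printed natural-language response.

```python
from datetime import date

# English weekday names indexed by date.weekday() (Monday == 0).
# A fixed list avoids any dependence on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Structured knowledge with natural-language keys and values.
# (Illustrative subset chosen for this sketch, not the paper's data.)
PRESIDENTS = [
    {"name": "Bill Clinton", "birth_date": date(1946, 8, 19)},
    {"name": "George W. Bush", "birth_date": date(1946, 7, 6)},
    {"name": "Barack Obama", "birth_date": date(1961, 8, 4)},
    {"name": "Donald Trump", "birth_date": date(1946, 6, 14)},
    {"name": "Joe Biden", "birth_date": date(1942, 11, 20)},
]


def presidents_born_on(weekday, records=PRESIDENTS):
    """Select the names of records whose birth date falls on `weekday`."""
    return [r["name"] for r in records
            if WEEKDAYS[r["birth_date"].weekday()] == weekday]


def answer(weekday):
    """Turn the symbolic result into a natural-language response."""
    names = presidents_born_on(weekday)
    if not names:
        return f"None of the listed presidents was born on a {weekday}."
    return f"Listed presidents born on a {weekday}: {', '.join(names)}."


if __name__ == "__main__":
    print(answer("Monday"))
```

The weekday is computed by the interpreter rather than recalled by the language model, so each step from stored fact to final sentence can be inspected directly in the code.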

Paper Structure (23 sections, 1 equation, 6 figures, 13 tables)

Figures (6)

  • Figure 1: A generated NLEP correctly answers the given question, while ChatGPT-4 gives an incorrect answer (link). This NLEP uses the date-to-weekday conversion tool in the datetime package, constructs structured knowledge about US presidents, implements a selection function, and outputs natural language responses depending on the function output. A more detailed comparison between NLEP and the ChatGPT-4 code interpreter is shown in Figure \ref{fig:figure1}.
  • Figure 2: A decision tree structure generated within an NLEP for emotion classification, based on the task description and using an example program for SST2 as the prompt. The branching at each node is decided by a RoBERTa (Liu et al., 2019) text entailment model. This language-based decision tree generated by an NLEP outperforms GPT-3 and entailment-based multi-class prediction (Ge et al., 2023) without needing any task-specific examples (i.e., exemplars specific to the emotion classification dataset).
  • Figure 3: An NLEP generated to solve a Dyck language problem. The instruction is "Complete the rest of the sequence, making sure that the parentheses are closed properly." An example for StrategyQA is outlined in Figure \ref{fig:sqa-example}.
  • Figure 4: Automatic evaluations of NLEP against standard LLM-based generation with different models. "# NLEP > Text" is the percentage of NLEP responses that contain more tokens than the baseline. "Detail" indicates whether the evaluation metric considers details and response lengths. "Score" is the score received by NLEP divided by the baseline score (> 100 means NLEP is better). "Win", "tie", and "lose" are the percentages of evaluation cases falling into each category. "Length Bias" shows how strongly the evaluation pipeline prefers longer or shorter answers (lower means fairer; introduced in Appendix \ref{appendix:lb}).
  • Figure 5: An NLEP answering a question that requires numeric reasoning over structured knowledge. The ChatGPT-4 code interpreter (currently the advanced data analysis option) consistently prefers to answer this question in plain natural language.
  • ...and 1 more figure
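
The Dyck language task named in the Figure 3 caption lends itself to a short illustration. The sketch below is not the program shown in the paper's figure; it is one plausible stack-based solution, assuming the input is a whitespace-separated prefix built from the four bracket types `()`, `[]`, `{}`, and `<>`. The function name `complete_dyck` and the `PAIRS` table are hypothetical.

```python
# Map each opening bracket to its closing counterpart.
# (Assumed bracket alphabet for this sketch.)
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}
CLOSERS = set(PAIRS.values())


def complete_dyck(prefix: str) -> str:
    """Return the closing brackets needed to balance `prefix`.

    Whitespace is ignored. Raises ValueError if the prefix contains an
    unknown symbol or a closer that does not match the most recent opener.
    """
    stack = []  # openers that are still waiting for their closer
    for ch in prefix:
        if ch.isspace():
            continue
        if ch in PAIRS:
            stack.append(ch)
        elif ch in CLOSERS:
            if not stack or PAIRS[stack[-1]] != ch:
                raise ValueError(f"unbalanced closer {ch!r} in prefix")
            stack.pop()
        else:
            raise ValueError(f"unexpected symbol {ch!r} in prefix")
    # Close the remaining openers, innermost first.
    return " ".join(PAIRS[opener] for opener in reversed(stack))


if __name__ == "__main__":
    print(complete_dyck("[ { ( ) "))
```

Because the bracket matching is carried out by the interpreter, the result does not depend on the language model tracking nesting depth token by token.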