MIDGARD: Self-Consistency Using Minimum Description Length for Structured Commonsense Reasoning

Inderjeet Nair, Lu Wang

TL;DR

This work tackles structured commonsense reasoning via graphs generated by LLMs, addressing error propagation and single-sample limitations. It introduces MIDGARD, an MDL-guided aggregation framework that merges multiple graph samples into a single DAG by minimizing the expected description length of samples relative to a hypothesized graph, effectively promoting consistently observed edges/nodes. The approach demonstrates robust improvements across argument structure extraction, explanation graph generation, script planning, and semantic graph generation on eight benchmarks, using both GPT-3.5-turbo and Code-Llama, with DAG constraints playing a crucial role in maintaining valid graph structures. However, MIDGARD incurs higher computational cost due to multiple samples and ILP-based DAG enforcement, and its performance can depend on hyperparameters and the variability of the underlying LLM outputs; ethical considerations regarding hallucination and biased content are acknowledged.

Abstract

We study structured reasoning framed as generating a reasoning graph from natural language input using large language models (LLMs). Previous approaches have explored various prompting schemes, yet they suffer from error propagation because autoregressive, single-pass decoding lacks any error-correction capability. Additionally, relying solely on a single sample may result in the omission of true nodes and edges. To counter this, we draw inspiration from self-consistency (SC), which involves sampling a diverse set of reasoning chains and taking the majority vote as the final answer. To tackle the substantial challenge of applying SC to generated graphs, we propose MIDGARD (MInimum Description length Guided Aggregation of Reasoning in Directed acyclic graph), which leverages a Minimum Description Length (MDL)-based formulation to identify consistent properties among the different graph samples generated by an LLM. This formulation helps reject properties that appear in only a few samples, which are likely to be erroneous, while enabling the inclusion of missing elements without compromising precision. Our method outperforms the compared approaches across various structured reasoning tasks, including argument structure extraction, explanation graph generation, inferring dependency relations among actions for everyday tasks, and semantic graph generation from natural texts.

Paper Structure (35 sections, 8 equations, 15 figures, 10 tables)

Figures (15)

  • Figure 1: Comparison of MIDGARD with CoCoGen. In this example, our objective is to infer dependency relations among items in the "Action List" to achieve the specified "Objective". CoCoGen uses greedy decoding and exhibits errors in the output, e.g., "decided to run errands during a break in the rain" is not connected with "Drive to errand location and complete". In contrast, our approach MIDGARD (within the orange rectangle) aggregates relevant information across different samples, resulting in more accurate inference. For this example, our algorithm improved the performance of greedy decoding from $\mathbf{66.7}$ to $\mathbf{85.7}$ in edge $F_1$-score.
  • Figure 2: Pictorial representation of Graph Aggregation. In the figure above, the probabilities of node/edge existence in a randomly generated sample from an LLM are estimated by the normalized frequency of their occurrence in the samples. The weight of an edge or node on the right-hand side is determined by subtracting $(1 - \lambda_1)$ or $(1 - \lambda_2)$ from this probability, respectively. The optimization in Eq. \ref{eqn:mdl_generic} is equivalent to the selection of the properties in the aggregated graph such that the sum of weights is maximized. The bolded elements are selected according to this maximization.
  • Figure 3: Results for script planning on Proscript.
  • Figure 4: Performance of MIDGARD in comparison with Greedy on Essays when the number of samples from the LLM is varied. Results averaged over $5$ different random seeds.
  • Figure 5: Performance of MIDGARD in comparison with Greedy on Explagraphs.
  • ...and 10 more figures
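
The weighting rule in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `aggregate`, the representation of a sample as a `(nodes, edges)` pair of sets, and the default values of `lam_edge` and `lam_node` (standing in for $\lambda_1$ and $\lambda_2$) are assumptions made for illustration. The sketch also treats nodes and edges independently and omits the DAG constraint, which the summary above says is enforced with an ILP. Under that simplification, maximizing the sum of weights reduces to keeping every element whose weight is positive.

```python
from collections import Counter


def aggregate(samples, lam_edge=0.5, lam_node=0.5):
    """Keep nodes and edges whose weight (frequency - (1 - lambda)) is positive.

    samples: list of (nodes, edges) pairs, where nodes is a set of labels and
    edges is a set of (source, target) tuples. Hypothetical format.
    Simplification: no acyclicity or edge-endpoint constraints are enforced.
    """
    n = len(samples)
    # Count in how many samples each node / edge occurs.
    node_counts = Counter(v for nodes, _ in samples for v in set(nodes))
    edge_counts = Counter(e for _, edges in samples for e in set(edges))

    # Weight = estimated probability of occurrence minus (1 - lambda).
    node_weights = {v: c / n - (1 - lam_node) for v, c in node_counts.items()}
    edge_weights = {e: c / n - (1 - lam_edge) for e, c in edge_counts.items()}

    # Without further constraints, the sum of selected weights is maximized
    # by selecting exactly the positively weighted elements.
    kept_nodes = {v for v, w in node_weights.items() if w > 0}
    kept_edges = {e for e, w in edge_weights.items() if w > 0}
    return kept_nodes, kept_edges


if __name__ == "__main__":
    samples = [
        ({"a", "b", "c"}, {("a", "b"), ("b", "c")}),
        ({"a", "b", "c"}, {("a", "b"), ("a", "c")}),
        ({"a", "b"}, {("a", "b")}),
    ]
    print(aggregate(samples))
```

In this toy input, the edge `("a", "b")` occurs in all three samples and is kept, while `("b", "c")` and `("a", "c")` each occur once and are rejected. Raising either lambda lowers the frequency needed for inclusion.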