Cross-Refine: Improving Natural Language Explanation Generation by Learning in Tandem

Qianli Wang, Tatiana Anikina, Nils Feldhus, Simon Ostermann, Sebastian Möller, Vera Schmitt

TL;DR

Cross-Refine addresses the reliability gap in natural language explanations by introducing a two-LLM cross-refinement framework with a generator and a critic. The method leverages in-context learning and external critique to iteratively improve NLEs across multiple tasks without supervised training data. Empirical results show Cross-Refine often outperforms Self-Refine, especially with less powerful LLMs, though domain knowledge limits in medicine and cross-language challenges remain. An ablation study confirms that both the critic's feedback and its suggested explanations are crucial for effective refinement, and results on HealthFC demonstrate the potential for bilingual NLE generation.

Abstract

Natural language explanations (NLEs) are vital for elucidating the reasoning behind large language model (LLM) decisions. Many techniques have been developed to generate NLEs using LLMs. However, like humans, LLMs might not always produce optimal NLEs on the first attempt. Inspired by human learning processes, we introduce Cross-Refine, which employs role modeling by deploying two LLMs as generator and critic, respectively. The generator outputs a first NLE and then refines this initial explanation using feedback and suggestions provided by the critic. Cross-Refine does not require any supervised training data or additional training. We validate Cross-Refine across three NLP tasks using three state-of-the-art open-source LLMs through automatic and human evaluation. We select Self-Refine (Madaan et al., 2023) as the baseline, which only utilizes self-feedback to refine the explanations. Our findings from automatic evaluation and a user study indicate that Cross-Refine outperforms Self-Refine. Meanwhile, Cross-Refine can perform effectively with less powerful LLMs, whereas Self-Refine only yields strong results with ChatGPT. Additionally, we conduct an ablation study to assess the importance of feedback and suggestions. Both play an important role in refining explanations. We further evaluate Cross-Refine on a bilingual dataset in English and German.
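
The three steps described in the abstract (initial explanation, critique, refinement) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable type, the function `cross_refine`, the prompt wording, and the `use_feedback` / `use_suggestion` switches (loosely mirroring the ablation over feedback and suggestions) are all assumptions introduced here. The two stub functions stand in for real models so the sketch runs without any model access.

```python
from typing import Callable, Dict

# Hypothetical type: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]


def cross_refine(
    question: str,
    generator: LLM,
    critic: LLM,
    use_feedback: bool = True,
    use_suggestion: bool = True,
) -> Dict[str, str]:
    """One generator -> critic -> generator pass (prompt wording is assumed)."""
    # Step 1: the generator produces an initial explanation.
    initial = generator(f"Question: {question}\nAnswer and explain your reasoning.")

    # Step 2: the critic returns feedback and its own suggested explanation.
    context = f"Question: {question}\nExplanation: {initial}\n"
    feedback = critic(context + "Give feedback on this explanation.")
    suggestion = critic(context + "Write an improved explanation.")

    # Step 3: the generator refines its initial explanation using the
    # critic's output. The flags allow dropping either component.
    parts = [f"Question: {question}", f"Initial explanation: {initial}"]
    if use_feedback:
        parts.append(f"Feedback: {feedback}")
    if use_suggestion:
        parts.append(f"Suggested explanation: {suggestion}")
    parts.append("Refine the initial explanation accordingly.")
    refine_prompt = "\n".join(parts)
    refined = generator(refine_prompt)

    return {
        "initial": initial,
        "feedback": feedback,
        "suggestion": suggestion,
        "refine_prompt": refine_prompt,
        "refined": refined,
    }


# Stub models for demonstration only; real use would wrap actual LLM calls.
def stub_generator(prompt: str) -> str:
    return "refined NLE" if "Refine" in prompt else "initial NLE"


def stub_critic(prompt: str) -> str:
    return "suggested NLE" if "improved" in prompt else "feedback text"


if __name__ == "__main__":
    result = cross_refine("Where would you borrow coffee?", stub_generator, stub_critic)
    print(result["refined"])
```

Passing the same callable as both `generator` and `critic` would approximate a self-feedback setup in the spirit of the Self-Refine baseline, though the baseline's exact prompts are not given in this summary.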

Paper Structure (54 sections, 5 equations, 10 figures, 9 tables)

Figures (10)

  • Figure 1: Cross-Refine example of the question "Where would you borrow coffee if you do not have any?" from ECQA. The initial explanation by the generator has been accurately corrected and refined based on the feedback and explanations provided by the critic.
  • Figure 2: Pipeline of Cross-Refine. (1) Generator: produces an initial explanation. (2) Critic: provides feedback and a suggested explanation based on the generator's initial output. (3) Generator: utilizes the feedback and suggested explanation from the critic to refine and improve the initial explanation.
  • Figure 3: Cross-Refine example on ECQA dataset.
  • Figure 4: Cross-Refine example on eSNLI dataset.
  • Figure 5: Cross-Refine example on HealthFC dataset.
  • ...and 5 more figures