ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning

Yuwei Yin, Giuseppe Carenini

TL;DR

ARR introduces a three-step QA framework—analyze intent, retrieve relevant information, and reason step-by-step—prompted at test time to improve QA performance across diverse tasks. Through extensive experiments with open-weight LLMs, ARR consistently outperforms Direct Answer and Chain-of-Thought prompting, with intent analysis delivering the largest gains. Ablation studies and prompt-variant tests validate the contribution and robustness of each component. The framework demonstrates strong generalization across model sizes, LLM families, and generation settings, suggesting practical impact for robust, scalable QA with LLMs.

Abstract

Large language models (LLMs) have demonstrated impressive capabilities on complex evaluation benchmarks, many of which are formulated as question-answering (QA) tasks. Enhancing the performance of LLMs in QA contexts is becoming increasingly vital for advancing their development and applicability. This paper introduces ARR, an intuitive, effective, and general QA solving method that explicitly incorporates three key steps: analyzing the intent of the question, retrieving relevant information, and reasoning step by step. Notably, this paper is the first to introduce intent analysis in QA, which plays a vital role in ARR. Comprehensive evaluations across 10 diverse QA tasks demonstrate that ARR consistently outperforms the baseline methods. Ablation and case studies further validate the positive contributions of each ARR component. Furthermore, experiments involving variations in prompt design indicate that ARR maintains its effectiveness regardless of the specific prompt formulation. Additionally, extensive evaluations across various model sizes, LLM series, and generation settings solidify the effectiveness, robustness, and generalizability of ARR.
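
As a rough illustration of how the three steps might be requested at test time, the sketch below builds a multiple-choice prompt for three settings: Direct Answer, Chain-of-Thought, and ARR. The function name `build_prompt`, the prompt layout, and the trigger sentences in `TRIGGERS` are assumptions made for this example; in particular, the ARR sentence is a paraphrase of the three steps named in the abstract, not a quotation of the paper's prompt.

```python
# Hypothetical trigger sentences (illustrative wording, not quoted from the paper).
TRIGGERS = {
    "direct": "",  # Direct Answer: no reasoning trigger
    "cot": "Let's think step by step.",
    "arr": (
        "Let's analyze the intent of the question, retrieve relevant "
        "information, and reason step by step."
    ),
}


def build_prompt(question: str, options: list[str], method: str = "arr") -> str:
    """Assemble a multiple-choice QA prompt ending with the chosen trigger."""
    if method not in TRIGGERS:
        raise ValueError(f"unknown method: {method!r}")
    lines = [f"Question: {question}"]
    # Label options (A), (B), ... in the order given.
    for label, text in zip("ABCDEFGHIJ", options):
        lines.append(f"({label}) {text}")
    # rstrip() drops the trailing space when the trigger is empty.
    lines.append(f"Answer: {TRIGGERS[method]}".rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("Which gas do plants absorb?", ["Oxygen", "Carbon dioxide"]))
```

Keeping the trigger as the only difference between settings makes it easy to compare methods on the same question and options.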

Paper Structure

This paper contains 49 sections, 6 equations, 3 figures, and 22 tables.

Figures (3)

  • Figure 1: ARR motivation. To answer a question, we often need to analyze the question's intent, retrieve relevant information, and reason step by step.
  • Figure 2: Question answering with LLMs. We first obtain rationale $r_i$ by reasoning generation and then select the optimal option via evaluating the language modeling losses of different context-option combinations.
  • Figure 3: Experiments on prompt variants. The average performance (Accuracy %) of the LLaMA3-8B-Chat model on 10 QA datasets using different ARR prompt variants ("V1"–"V5").
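
The option-selection step described in the Figure 2 caption can be sketched as follows, assuming that "optimal" means the option with the lowest language-modeling loss when appended to the question and its rationale $r_i$. The names `select_option` and `toy_loss` are hypothetical, and `toy_loss` is only a word-overlap stand-in so the example runs without a model; a real setup would compute the LLM's loss on the option tokens given the context.

```python
import re
from typing import Callable, Sequence


def select_option(
    question: str,
    rationale: str,
    options: Sequence[str],
    loss_fn: Callable[[str, str], float],
) -> int:
    """Return the index of the option with the lowest loss given the context."""
    # Context = question followed by the generated rationale r_i.
    context = f"{question}\n{rationale}".strip()
    losses = [loss_fn(context, option) for option in options]
    # Assumption: lower loss means a better context-option fit.
    return min(range(len(options)), key=losses.__getitem__)


def toy_loss(context: str, option: str) -> float:
    """Stand-in for an LM loss: fraction of option words absent from the context."""
    context_words = set(re.findall(r"\w+", context.lower()))
    option_words = re.findall(r"\w+", option.lower())
    if not option_words:
        return 1.0
    missing = sum(word not in context_words for word in option_words)
    return missing / len(option_words)


if __name__ == "__main__":
    question = "What do plants need for photosynthesis?"
    rationale = "Photosynthesis requires sunlight, water, and carbon dioxide."
    options = ["sunlight", "darkness", "sand"]
    print(options[select_option(question, rationale, options, toy_loss)])
```

Passing the loss function as a parameter keeps the selection logic independent of any particular model or library.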