Large Language Models Are Self-Taught Reasoners: Enhancing LLM Applications via Tailored Problem-Solving Demonstrations

Kai Tzu-iunn Ong, Taeyoon Kwon, Jinyoung Yeo

TL;DR

Self-Taught introduces a fully zero-shot framework that automatically generates tailored demonstrations for each test instance to guide LLM reasoning. It identifies the target information, constructs high-quality pseudo problems and high-certainty solutions, and uses these tailored demonstrations to solve the target problem, reducing reliance on costly human demonstrations. Across 13 diverse QA tasks and two real-world Alzheimer's disease diagnosis datasets, Self-Taught outperforms strong baselines and demonstrates robustness to different prompting strategies and open-source LLMs, though it shows limitations in highly homogeneous clinical cases where manual CoT remains competitive. The work highlights a practical, cost-efficient path to enhance domain-specific LLM applications, with detailed ablations, human evaluations, and supplementary resources supporting adoption and extension.

Abstract

Guiding large language models with a selected set of human-authored demonstrations is a common practice for improving LLM applications. However, human effort can be costly, especially in specialized domains (e.g., clinical diagnosis), and does not guarantee optimal performance, because the skills targeted by the selected demonstrations may differ from those required by real test instances. Motivated by these issues, this paper explores the automatic creation of customized demonstrations whose target skills align with the given target instance. We present SELF-TAUGHT, a problem-solving framework that produces demonstrations that are "tailored" to the target problem and "filtered" for better quality (i.e., correctness) in a zero-shot manner. Across 15 tasks, covering multiple-choice questions from diverse domains and the diagnosis of Alzheimer's disease (AD) with real-world patients, SELF-TAUGHT outperforms strong baselines (e.g., Few-shot CoT, Plan-and-Solve, Auto-CoT). We conduct comprehensive analyses of SELF-TAUGHT, including its generalizability to existing prompting methods and different LLMs, the quality of its intermediate generations, and more.
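
The pipeline summarized above (identify the target information, generate pseudo problems, keep only high-certainty solutions, then solve the target problem with the resulting demonstrations) can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the `CERTAINTY:` output convention, the 0.8 threshold, and the names `self_taught_sketch`, `parse_certainty`, and `LLM` are all assumptions made for illustration. The `llm` argument stands in for any function that maps a prompt string to a completion string.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any function that takes a prompt and returns text.
LLM = Callable[[str], str]


def parse_certainty(text: str) -> Tuple[str, float]:
    """Split '<solution> CERTAINTY: 0.9' into (solution, 0.9).

    Returns certainty 0.0 when the marker is missing or malformed.
    The marker format is an assumption of this sketch.
    """
    head, sep, tail = text.rpartition("CERTAINTY:")
    if not sep:
        return text.strip(), 0.0
    try:
        return head.strip(), float(tail.strip())
    except ValueError:
        return head.strip(), 0.0


def self_taught_sketch(
    llm: LLM,
    problem: str,
    n_demos: int = 2,            # assumed number of demonstrations
    min_certainty: float = 0.8,  # assumed filtering threshold
    max_tries: int = 6,          # cap on generation attempts
) -> Tuple[str, List[Tuple[str, str]]]:
    # Step 1: identify what knowledge the target problem requires.
    info = llm(f"Identify the knowledge needed to solve:\n{problem}")

    # Step 2: generate pseudo problems and keep high-certainty solutions.
    demos: List[Tuple[str, str]] = []
    for _ in range(max_tries):
        if len(demos) == n_demos:
            break
        pseudo = llm(f"Write a new problem that requires this knowledge:\n{info}")
        raw = llm(f"Solve step by step, then end with 'CERTAINTY: <0-1>'.\n{pseudo}")
        solution, certainty = parse_certainty(raw)
        if certainty >= min_certainty:
            demos.append((pseudo, solution))

    # Step 3: solve the target problem using the tailored demonstrations.
    shots = "\n\n".join(f"Problem: {p}\nSolution: {s}" for p, s in demos)
    prompt = f"{shots}\n\nProblem: {problem}\nSolution:".strip()
    return llm(prompt), demos
```

No human-written demonstration appears anywhere in the loop, which is what makes the setting zero-shot. If no pseudo solution passes the filter within `max_tries`, this sketch falls back to prompting with the target problem alone.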

Paper Structure (59 sections, 7 equations, 14 figures, 12 tables)

This paper contains 59 sections, 7 equations, 14 figures, 12 tables.

Figures (14)

  • Figure 1: Empirical examples and the overview of Self-Taught. All phases are executed under a zero-shot setting.
  • Figure 2: Cost-performance comparisons (cost of ours = $1.0$).
  • Figure 3: Performance of ours with Llama-3.1-8B (accuracy).
  • Figure 4: Human evaluation of our intermediate outputs. We report the percentage of approval votes.
  • Figure 5: Demonstrations in Retrieval CoT and Ours.
  • ...and 9 more figures