
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs

Tin Nguyen, Logan Bolton, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen

TL;DR

Highlighted Chain-of-Thought Prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the question, consistently improves LLM accuracy on over 22 tasks ranging from arithmetic and reading comprehension to logical reasoning.

Abstract

An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response that mixes factual and non-factual statements is hard for humans to verify and to base accurate decisions on. To combat this problem, we propose Highlighted Chain-of-Thought Prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the question. That is, given an input question, LLMs first re-format the question to add XML tags highlighting key facts, and then generate a response with highlights over the facts referenced from the input. Compared to vanilla chain-of-thought prompting (CoT), HoT reduces the rate of hallucination and, separately, consistently improves LLM accuracy on over 22 tasks ranging from arithmetic and reading comprehension to logical reasoning. When humans are asked to verify LLM responses, highlights help time-limited participants recognize more accurately and efficiently when LLMs are correct. Yet, surprisingly, when LLMs are wrong, HoTs tend to fool users into believing that an answer is correct.
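
The two-step format described above (a tagged question, then a tagged answer) lends itself to a simple consistency check. The following is a minimal sketch, not the authors' code: it assumes tags of the form `<fact1>...</fact1>`, and the helper name `ungrounded_tags` and the sample question are hypothetical. It reports answer tags that have no counterpart in the re-formatted question.

```python
import re

# Assumed tag scheme (an illustration, not confirmed by this page):
# <fact1>...</fact1>, <fact2>...</fact2>, and so on.
TAG_RE = re.compile(r"<(fact\d+)>(.*?)</\1>", re.DOTALL)


def tag_ids(text: str) -> set[str]:
    """Return the set of tag names (e.g. 'fact1') that appear in text."""
    return {name for name, _content in TAG_RE.findall(text)}


def ungrounded_tags(question: str, answer: str) -> list[str]:
    """Hypothetical helper: tags used in the answer but absent from the question."""
    return sorted(tag_ids(answer) - tag_ids(question))


# Made-up example, not taken from the paper's datasets.
question = (
    "Sam had <fact1>5 marbles</fact1> and lost <fact2>2 of them</fact2>. "
    "How many marbles remain?"
)
answer = (
    "Sam started with <fact1>5 marbles</fact1> and lost <fact2>2</fact2>, "
    "so 5 - 2 = 3. <fact3>He also found 1</fact3>."
)

print(ungrounded_tags(question, answer))
```

A check like this only confirms that tag names line up. It does not verify that the highlighted text in the answer actually matches the fact in the question.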

Paper Structure

This paper contains 59 sections, 5 equations, 14 figures, 48 tables.

Figures (14)

  • Figure 1: CoT and HoT (ours) responses for a MATH500 question in ReasoningTrap benchmarks [jang2025reasoning], both generated by Gemini-1.5-Flash. Left: CoT misses the key constraint $b=c=d=0$, giving an incorrect answer. Right: HoT (a re-formatted question and answer) applies the key constraint $b=c=d=0$ to the expression $ab^2c^3d^4$, yielding the correct answer of 0. The full reasoning traces of both methods are provided in \ref{tab:gcot_vs_cot_math500_conditioned_math}.
  • Figure 2: 8-shot HoT examples are provided in addition to the explicit directions (HoT Instruction) (see \ref{sec:system_prompt}) to help LLMs understand the expected format. See \ref{tab:full_fewshot_prompt_example} for one entire example prompt.
  • Figure 3: LLMs generate HoT responses by wrapping XML tags around the information that the model determines is the most important. Regex and CSS are then used to visualize the highlights for user readability (see the code to convert XML tags to highlights in \ref{sec:code}).
  • Figure 4: HoT ablation study: Every component, namely repeating the question (R-Q), adding tags to only the question (T-Q), and adding tags to only the answer (T-A), independently contributes to the overall accuracy of HoT prompting ($+$). Each component also outperforms vanilla CoT (-$\times$-). The $y$-axis shows mean accuracy across 6 datasets (the detailed accuracy of each dataset is in \ref{sec:detail_prompt_variation_app}).
  • Figure 5: Left: After being finetuned via SFT on CoT examples, Qwen-2.5-1.5B incorrectly answers an adversarial question from PuzzleTrivial because it does not factor in the key fact of "permanently infertile lions". Right: In contrast, its HoT-finetuned counterpart highlights facts and answers correctly using the fact ("the lions would never reproduce").
  • ...and 9 more figures
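
The Figure 3 caption mentions using regex and CSS to turn XML tags into visible highlights. The paper's own conversion code is not reproduced on this page, so the following is only an illustrative sketch under stated assumptions: tags are taken to look like `<fact1>...</fact1>`, and the function name `to_html`, the class names, and the colors are all hypothetical.

```python
import html
import re

# Assumed tag scheme (hypothetical): <fact1>...</fact1>, <fact2>...</fact2>, ...
TAG_RE = re.compile(r"<(fact\d+)>(.*?)</\1>", re.DOTALL)

# Hypothetical stylesheet: one background color per tag name.
CSS = """
.hl { padding: 0 2px; border-radius: 3px; }
.fact1 { background: #ffe08a; }
.fact2 { background: #a8e6cf; }
"""


def to_html(text: str) -> str:
    """Convert <factN>...</factN> spans into <span class="hl factN"> elements.

    Text outside and inside the tags is HTML-escaped, so only the spans
    produced here end up as markup.
    """
    parts = []
    pos = 0
    for match in TAG_RE.finditer(text):
        # Escape the untagged text that precedes this highlight.
        parts.append(html.escape(text[pos:match.start()]))
        name, content = match.group(1), match.group(2)
        parts.append(f'<span class="hl {name}">{html.escape(content)}</span>')
        pos = match.end()
    # Escape whatever follows the last highlight.
    parts.append(html.escape(text[pos:]))
    return "".join(parts)


# Made-up input, not taken from the paper.
snippet = to_html("Tom has <fact1>3 apples</fact1> & <fact2>2 pears</fact2>.")
page = f"<style>{CSS}</style><p>{snippet}</p>"
print(page)
```

Reusing the same class name for a tag in both the question and the answer means a matching fact gets the same color in both places, which is the visual link the highlights are meant to provide.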