SIFT: Grounding LLM Reasoning in Contexts via Stickers

Zihao Zeng, Xuyao Huang, Boxiu Li, Zhijie Deng

TL;DR

The paper addresses factual drift in LLM reasoning, where the model misinterprets the context during multi-step inference. It introduces SIFT, a training-free framework that grounds reasoning in context: the model generates a Sticker from the query, produces two predictions (Sticker-only and Query+Sticker), and refines the Sticker via Forward Optimization and Inverse Generation until the predictions align. Across models from 3B to 100B+ and benchmarks such as GSM8K, MATH-500, GPQA-Diamond, and AIME2024, SIFT delivers consistent improvements. On AIME2024 it raises the pass@1 accuracy of DeepSeek-R1 from $78.33\%$ to $85.67\%$, a new state-of-the-art in the open-source community, and on MATH-500 it adds about $1.03$ percentage points over a baseline of $97.3\%$. The approach also combines well with Self-Consistency, benefits from iterative optimization, and relies on a Consensus Prediction strategy, all without additional training data. Code for SIFT is publicly available.

Abstract

This paper identifies that misinterpretation of the context can be a significant issue during the reasoning process of large language models, from smaller models like Llama3.2-3B-Instruct to cutting-edge ones like DeepSeek-R1. For example, in the phrase "10 dollars per kilo," LLMs might not recognize that "per" means "for each," leading to calculation errors. We introduce a novel, post-training approach called **Stick to the Facts (SIFT)** to tackle this. SIFT leverages increasing inference-time compute to ground LLM reasoning in contexts. At the core of SIFT lies the *Sticker*, which is generated by the model itself to explicitly emphasize the key information within the context. Given the curated Sticker, SIFT generates two predictions -- one from the original query and one from the query augmented with the Sticker. If they differ, the Sticker is sequentially refined via *forward* optimization (to better align the extracted facts with the query) and *inverse* generation (to conform with the model's inherent tendencies) for more faithful reasoning outcomes. Studies across diverse models (from 3B to 100B+) and benchmarks (e.g., GSM8K, MATH-500) reveal consistent performance improvements. Notably, SIFT improves the pass@1 accuracy of DeepSeek-R1 on AIME2024 from 78.33% to **85.67%**, establishing a new state-of-the-art in the open-source community. The code is available at https://github.com/zhijie-group/SIFT.
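The control flow described above (generate a Sticker, compare two predictions, refine the Sticker when they disagree) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the `LLM` callable, every prompt string, the function names, the `max_rounds` limit, and the final fallback to a query-only answer are all assumptions introduced here. The two predictions follow the Sticker-only and Query+Sticker pairing named in the TL;DR.

```python
from typing import Callable, Optional

# Hypothetical model interface (an assumption): prompt text in, generated text out.
LLM = Callable[[str], str]


def consensus_prediction(llm: LLM, query: str, sticker: str) -> Optional[str]:
    """Return the answer if the Sticker-only and Query+Sticker predictions agree."""
    sticker_only = llm(f"Answer using only these facts:\n{sticker}").strip()
    query_plus_sticker = llm(f"Query:\n{query}\nFacts:\n{sticker}\nAnswer:").strip()
    return sticker_only if sticker_only == query_plus_sticker else None


def forward_optimization(llm: LLM, query: str, sticker: str) -> str:
    """Ask the model to realign the extracted facts with the query."""
    return llm(f"Revise the facts so they match the query.\nQuery:\n{query}\nFacts:\n{sticker}")


def inverse_generation(llm: LLM, query: str, sticker: str) -> str:
    """Derive a new Sticker from the model's own prediction."""
    prediction = llm(f"Query:\n{query}\nFacts:\n{sticker}\nAnswer:")
    return llm(f"State the facts this answer relies on:\n{prediction}")


def sift(llm: LLM, query: str, max_rounds: int = 2) -> str:
    """Refine the Sticker until both predictions align, or give up after max_rounds."""
    sticker = llm(f"Extract the key facts from:\n{query}")
    for _ in range(max_rounds):
        answer = consensus_prediction(llm, query, sticker)
        if answer is not None:
            return answer
        sticker = forward_optimization(llm, query, sticker)
        sticker = inverse_generation(llm, query, sticker)
    answer = consensus_prediction(llm, query, sticker)
    if answer is not None:
        return answer
    # Assumed fallback when no consensus is reached: answer from the query alone.
    return llm(f"Query:\n{query}\nAnswer:").strip()


# Usage with a stub model that always answers "42", so consensus is immediate.
print(sift(lambda prompt: "42", "What is 6 * 7?"))  # prints 42
```

Exact string equality is the simplest possible agreement test. A practical version would likely need to normalize answers (for example, numeric formatting) before comparing them.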

Paper Structure

This paper contains 13 sections, 15 figures, 3 tables, 2 algorithms.

Figures (15)

  • Figure 1: Applying SIFT to DeepSeek-R1 demonstrates highly competitive reasoning performance on AIME2024, AIME2025, and MATH-500 (pass@1 accuracy). The results for o1-mini and o3-mini on AIME are referenced from ye2025aimepreview.
  • Figure 2: An example of a query and its Sticker.
  • Figure 3: Factual drift occurs during (i) Sticker generation and (ii) prediction generation from Sticker.
  • Figure 4: Self-verification occurs during DeepSeek-R1's reasoning, where the model revisits the query, focuses on key information, and paraphrases it.
  • Figure 5: Four core operations in SIFT: (i) Sticker Generation (SG), (ii) Consensus Prediction (CP), (iii) Forward Optimization (FO), (iv) Inverse Generation (IG).
  • ...and 10 more figures