Table of Contents
Fetching ...

Tiny-Critic RAG: Empowering Agentic Fallback with Parameter-Efficient Small Language Models

Yichao Wu, Penghao Liang, Yafei Xiang, Mengwei Yuan, Jianan Liu, Jing Yang, Xianyou Li, Weiran Yan

TL;DR

Tiny-Critic RAG is proposed, decoupling evaluation by deploying a parameter-efficient Small Language Model (SLM) via Low-Rank Adaptation (LoRA), Acting as a deterministic gatekeeper, Tiny-Critic employs constrained decoding and non-thinking inference modes for ultra-low latency binary routing.

Abstract

Retrieval-Augmented Generation (RAG) grounds Large Language Models (LLMs) to mitigate factual hallucinations. Recent paradigms shift from static pipelines to Modular and Agentic RAG frameworks, granting models autonomy for multi-hop reasoning or self-correction. However, current reflective RAG heavily relies on massive LLMs as universal evaluators. In high-throughput systems, executing complete forward passes for billion-parameter models merely for binary routing introduces severe computational redundancy. Furthermore, in autonomous agent scenarios, inaccurate retrieval causes models to expend excessive tokens on spurious reasoning and redundant tool calls, inflating Time-to-First-Token (TTFT) and costs. We propose Tiny-Critic RAG, decoupling evaluation by deploying a parameter-efficient Small Language Model (SLM) via Low-Rank Adaptation (LoRA). Acting as a deterministic gatekeeper, Tiny-Critic employs constrained decoding and non-thinking inference modes for ultra-low latency binary routing. Evaluations on noise-injected datasets demonstrate Tiny-Critic RAG achieves routing accuracy comparable to GPT-4o-mini while reducing latency by an order of magnitude, establishing a highly cost-effective paradigm for agent deployment.

Tiny-Critic RAG: Empowering Agentic Fallback with Parameter-Efficient Small Language Models

TL;DR

Tiny-Critic RAG is proposed, decoupling evaluation by deploying a parameter-efficient Small Language Model (SLM) via Low-Rank Adaptation (LoRA), Acting as a deterministic gatekeeper, Tiny-Critic employs constrained decoding and non-thinking inference modes for ultra-low latency binary routing.

Abstract

Retrieval-Augmented Generation (RAG) grounds Large Language Models (LLMs) to mitigate factual hallucinations. Recent paradigms shift from static pipelines to Modular and Agentic RAG frameworks, granting models autonomy for multi-hop reasoning or self-correction. However, current reflective RAG heavily relies on massive LLMs as universal evaluators. In high-throughput systems, executing complete forward passes for billion-parameter models merely for binary routing introduces severe computational redundancy. Furthermore, in autonomous agent scenarios, inaccurate retrieval causes models to expend excessive tokens on spurious reasoning and redundant tool calls, inflating Time-to-First-Token (TTFT) and costs. We propose Tiny-Critic RAG, decoupling evaluation by deploying a parameter-efficient Small Language Model (SLM) via Low-Rank Adaptation (LoRA). Acting as a deterministic gatekeeper, Tiny-Critic employs constrained decoding and non-thinking inference modes for ultra-low latency binary routing. Evaluations on noise-injected datasets demonstrate Tiny-Critic RAG achieves routing accuracy comparable to GPT-4o-mini while reducing latency by an order of magnitude, establishing a highly cost-effective paradigm for agent deployment.
Paper Structure (17 sections, 2 equations, 4 figures, 1 table)

This paper contains 17 sections, 2 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Architectural comparison highlighting the implicit cost of noise. Left: In an unprotected Agentic RAG (e.g., ReAct), adversarial noise triggers spurious multi-hop reasoning spirals, causing catastrophic token waste and high TTFT. Right: Our Tiny-Critic Agentic RAG preemptively intercepts noise ($a=0$), routing the query to a fallback tool to retrieve clean evidence ($D'$), effectively isolating the generator from hallucinations.
  • Figure 2: Robustness under $45\%$ adversarial noise. Tiny-Critic intercepts distractors, preventing catastrophic degradation.
  • Figure 3: Routing TTFT comparison. Tiny-Critic RAG achieves a 94.6% reduction in routing overhead compared to Heavy-CRAG.
  • Figure 4: Cost-Performance Pareto Frontier. Tiny-Critic optimally balances noise robustness with near-zero marginal evaluation cost.