Tiny-Critic RAG: Empowering Agentic Fallback with Parameter-Efficient Small Language Models

Yichao Wu; Penghao Liang; Yafei Xiang; Mengwei Yuan; Jianan Liu; Jing Yang; Xianyou Li; Weiran Yan

TL;DR

Tiny-Critic RAG decouples evaluation by deploying a parameter-efficient Small Language Model (SLM) fine-tuned via Low-Rank Adaptation (LoRA). Acting as a deterministic gatekeeper, Tiny-Critic employs constrained decoding and non-thinking inference modes for ultra-low-latency binary routing.

Abstract

Retrieval-Augmented Generation (RAG) grounds Large Language Models (LLMs) to mitigate factual hallucinations. Recent paradigms shift from static pipelines to Modular and Agentic RAG frameworks, granting models autonomy for multi-hop reasoning or self-correction. However, current reflective RAG heavily relies on massive LLMs as universal evaluators. In high-throughput systems, executing complete forward passes for billion-parameter models merely for binary routing introduces severe computational redundancy. Furthermore, in autonomous agent scenarios, inaccurate retrieval causes models to expend excessive tokens on spurious reasoning and redundant tool calls, inflating Time-to-First-Token (TTFT) and costs. We propose Tiny-Critic RAG, decoupling evaluation by deploying a parameter-efficient Small Language Model (SLM) via Low-Rank Adaptation (LoRA). Acting as a deterministic gatekeeper, Tiny-Critic employs constrained decoding and non-thinking inference modes for ultra-low latency binary routing. Evaluations on noise-injected datasets demonstrate Tiny-Critic RAG achieves routing accuracy comparable to GPT-4o-mini while reducing latency by an order of magnitude, establishing a highly cost-effective paradigm for agent deployment.
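
The binary routing step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the token ids, and the convention that a = 1 means "pass the retrieved context to the generator" while a = 0 means "route to the fallback tool" (following the Figure 1 caption) are assumptions made for illustration. The idea shown is constrained decoding in its simplest form: only two verdict tokens are allowed, so every other vocabulary entry is ignored and the decision reduces to comparing two logits.

```python
import math


def constrained_binary_route(logits, pass_id, fallback_id):
    """Pick a routing action from two allowed verdict tokens.

    logits      : sequence of next-token logits from the critic (hypothetical input)
    pass_id     : vocabulary id of the "context is usable" token (assumed)
    fallback_id : vocabulary id of the "context is noisy" token (assumed)

    Returns (action, p_pass), where action is 1 (pass) or 0 (fallback) and
    p_pass is the probability of "pass" renormalized over the two tokens.
    """
    # Constrained decoding: all tokens except the two verdicts are masked out,
    # so only these two logits take part in the softmax.
    l_pass = logits[pass_id]
    l_fall = logits[fallback_id]

    # Numerically stable two-way softmax.
    m = max(l_pass, l_fall)
    e_pass = math.exp(l_pass - m)
    e_fall = math.exp(l_fall - m)
    p_pass = e_pass / (e_pass + e_fall)

    # Greedy (deterministic) choice; ties go to "pass".
    action = 1 if l_pass >= l_fall else 0
    return action, p_pass


def answer(query, docs, critic_logits_fn, generate_fn, fallback_fn,
           pass_id, fallback_id):
    """Gate retrieved documents before generation (all callables are hypothetical)."""
    action, _ = constrained_binary_route(
        critic_logits_fn(query, docs), pass_id, fallback_id
    )
    if action == 0:
        # Noisy retrieval: fetch replacement evidence before generating.
        docs = fallback_fn(query)
    return generate_fn(query, docs)
```

Because the critic emits a single constrained token rather than a free-form rationale, the routing cost is one forward pass of the small model, which is the property the latency comparison in the abstract relies on.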

Paper Structure (17 sections, 2 equations, 4 figures, 1 table)

Introduction
Related Work
Evolution of Agentic RAG and Reflection
Parameter-Efficient Tuning and Lightweight Critics
Methodology
Problem Formulation and DAG Routing State Space
Low-Rank Adaptation for Boundary Formulation
Inference Acceleration via Constrained Decoding
Experimental Setup
Datasets and Adversarial Noise Injection Protocol
Baselines and Model Configuration
Evaluation Metrics
Experimental Results and Analyses
Routing Efficacy and Noise Robustness
Latency and Cost Profiling
...and 2 more sections

Figures (4)

Figure 1: Architectural comparison highlighting the implicit cost of noise. Left: In an unprotected Agentic RAG (e.g., ReAct), adversarial noise triggers spurious multi-hop reasoning spirals, causing catastrophic token waste and high TTFT. Right: Our Tiny-Critic Agentic RAG preemptively intercepts noise ($a=0$), routing the query to a fallback tool to retrieve clean evidence ($D'$), effectively isolating the generator from hallucinations.
Figure 2: Robustness under $45\%$ adversarial noise. Tiny-Critic intercepts distractors, preventing catastrophic degradation.
Figure 3: Routing TTFT comparison. Tiny-Critic RAG achieves a 94.6% reduction in routing overhead compared to Heavy-CRAG.
Figure 4: Cost-Performance Pareto Frontier. Tiny-Critic optimally balances noise robustness with near-zero marginal evaluation cost.
