
Luna-2: Scalable Single-Token Evaluation with Small Language Models

Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel, Shuai Shao, Yash Sheth

TL;DR

Luna-2 is a novel architecture that adapts decoder-only small language models (SLMs) into a deterministic evaluation model. It reliably computes complex, task-specific LLM-as-a-judge (LLMAJ) metrics at accuracy on par with or higher than LLMAJ using frontier LLMs, while drastically reducing the cost and latency of computation.

Abstract

Real-time guardrails require evaluation that is accurate, cheap, and fast, yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation. We present Luna-2, a novel architecture that adapts decoder-only small language models (SLMs) into a deterministic evaluation model to reliably compute complex task-specific LLMAJ metrics (e.g. toxicity, hallucination, tool selection quality) at accuracy on par with or higher than LLMAJ using frontier LLMs, while drastically reducing the cost and latency of computation. Each metric is implemented as a lightweight LoRA/PEFT head on top of a shared SLM backbone, enabling hundreds of specialized metrics to run concurrently on a single GPU, deployable locally next to AI systems in a privacy-preserving and latency-optimizing manner. Across content safety and hallucination benchmarks, Luna-2 matches the accuracy of state-of-the-art LLM-based evaluators while reducing inference cost by over 80x and latency by over 20x. In this paper, we outline the model architecture and training methodology, and report real-world empirical results on accuracy, latency, and throughput. In production, Luna-2 is protecting 100M+ AI sessions and processing over 100B tokens per month for our customers, with eval cost savings of over $30M annually.
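
To illustrate why single-token evaluation can be deterministic, the sketch below turns the next-token logits of two label tokens into a probability with one softmax, with no sampling and no multi-token generation. It is a minimal sketch under stated assumptions, not the Luna-2 implementation: the label tokens `"true"` and `"false"`, the function names, the 0.5 threshold, and the example logit values are all hypothetical, and the abstract does not specify how Luna-2 derives its probabilities.

```python
import math


def single_token_score(logits, positive="true", negative="false"):
    """Return P(positive) from next-token logits, renormalized over two label tokens.

    `logits` maps token strings to raw logit values (hypothetical format).
    """
    pos, neg = logits[positive], logits[negative]
    # Subtract the max before exponentiating for numerical stability.
    m = max(pos, neg)
    e_pos = math.exp(pos - m)
    e_neg = math.exp(neg - m)
    return e_pos / (e_pos + e_neg)


def flag(logits, threshold=0.5):
    """Binary guardrail decision from the single-token probability (assumed threshold)."""
    return single_token_score(logits) >= threshold


# Hypothetical logits, standing in for the output of one forward pass.
example_logits = {"true": 2.0, "false": 0.0}
score = single_token_score(example_logits)
decision = flag(example_logits)
```

Because the score is a pure function of the logits from one forward pass, identical inputs give identical outputs, in contrast to a sampled multi-token judgment.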

Paper Structure (23 sections, 6 equations, 5 figures, 5 tables, 1 algorithm)


Figures (5)

  • Figure 1: Comparison of Luna-2 with LLM-as-judge and other common baselines for evaluating guardrails. Note that the Y-axis is the F1 score, X-axis is the cost in dollars per million tokens, and the bubble size is the latency in milliseconds (shown for comparison).
  • Figure 2: Luna-2 architecture overview. A shared decoder-only backbone processes the input prompt, and metric-specific LoRA adapters produce single-token outputs with calibrated probabilities.
  • Figure 3: Domain distribution of the Context Adherence training dataset. The dataset spans multiple domains with Finance & Banking and Healthcare & Medical representing the largest portions, ensuring robust performance across diverse production environments.
  • Figure 4: ROC curve for the Prompt Injection metric. Trained on internal data and evaluated on the xTRam1 Safe-Guard Prompt Injection dataset.
  • Figure 5: ROC curve for the Context Adherence metric. Trained on internal data and evaluated on the open-source RAGBench dataset.