SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification

Kanghoon Yoon, Minsub Kim, Sungjae Lee, Joonhyung Lee, Sunghyeon Woo, Yeonjun In, Se Jung Kwon, Chanyoung Park, Dongsoo Lee

TL;DR

The paper tackles latency in autoregressive LLM inference by improving speculative decoding through a self-supervised judge verifier. SelfJudge trains verifiers from the target model’s own semantics using a semantic preservation score, quantified as a likelihood difference when substituting draft tokens, enabling automatic data generation without ground-truth labels. Empirically, SelfJudge achieves faster inference with smaller drops in task performance across GSM8K, MATH-500, LiveCodeBench, CNN/DailyMail, and MMLU, outperforming prior AutoJudge baselines in both speed and robustness. The approach generalizes beyond math and coding to open-ended NLP tasks due to its automatic data generation and a lightweight, bidirectionally informed verifier, marking a practical advancement for scalable, fast LLM inference.

Abstract

Speculative decoding accelerates LLM inference by verifying candidate tokens from a draft model against a larger target model. Recent judge decoding speeds up this process further by relaxing the verification criteria, accepting draft tokens that may exhibit minor discrepancies from the target model's output. However, existing methods rely on human annotations or on tasks with verifiable ground truths, which limits their generalizability across diverse NLP tasks. We propose SelfJudge, which trains judge verifiers via self-supervision of the target model. Our method measures semantic preservation by assessing whether token-substituted responses preserve the meaning of the original responses, enabling automatic verifier training across diverse NLP tasks. Our experiments show that SelfJudge achieves better inference-accuracy trade-offs than judge decoding baselines, offering a broadly applicable solution for faster LLM inference.
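To make the relaxed verification idea concrete, the following is a minimal sketch of one judge-style verification step under greedy decoding. It is an illustration, not the paper's implementation: the function name `accept_draft_tokens`, the per-token `judge_scores`, and the `threshold` parameter are hypothetical, and the bonus token that standard speculative decoding emits when every draft token is accepted is omitted for brevity.

```python
from typing import List, Sequence


def accept_draft_tokens(
    draft_tokens: Sequence[int],
    target_tokens: Sequence[int],
    judge_scores: Sequence[float],
    threshold: float = 0.5,
) -> List[int]:
    """Return the tokens emitted in one speculative step (hypothetical helper).

    draft_tokens:  tokens proposed by the draft model.
    target_tokens: the target model's greedy token at each drafted position,
                   computed in one forward pass over the drafted sequence.
    judge_scores:  assumed verifier score per position; higher means the
                   draft token is judged acceptable despite a mismatch.
    """
    emitted: List[int] = []
    for draft, target, score in zip(draft_tokens, target_tokens, judge_scores):
        if draft == target or score >= threshold:
            # Exact match, or a mismatch the judge considers harmless.
            emitted.append(draft)
        else:
            # First rejected token: fall back to the target's token and stop,
            # discarding the remaining draft tokens.
            emitted.append(target)
            break
    return emitted
```

With a threshold above every score, the sketch reduces to strict exact-match verification, which shows why relaxing the criterion can lengthen the accepted run per step.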

Paper Structure

This paper contains 31 sections, 10 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Inference efficiency and task performance comparison of SD methods on GSM8K (a,b) and MMLU (c,d). AutoJudge shows domain-specific limitations, performing well on mathematical reasoning but poorly on general knowledge tasks, while SelfJudge maintains consistent performance across both domains. $\gamma$ represents the number of tokens generated by the draft model per step.
  • Figure 2: The training data generation process of SelfJudge for the verifier. Our approach compares the likelihood of the replaced response with that of the original response to measure the semantic preservation score. If the semantic preservation score is higher than $\tau$, the replaced token is labeled as acceptable. After the token labeling process, we train the verifier that will be used during the inference phase for draft verification.
  • Figure 3: Speed/Performance comparison across different methods. We report the accuracy, with the corresponding average accepted length, by searching the threshold values of each method.
  • Figure 4: Performance of SelfJudge over a range of suffix length $N$. We compute the semantic score by including the likelihood computed on $N$ future tokens.
  • Figure 4: Frequently occurring token acceptances by SelfJudge on math problems (GSM8K)
  • ...and 1 more figure
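The labeling step described in the Figure 2 and Figure 4 captions can be sketched roughly as follows. This is a sketch under stated assumptions, not the paper's exact formulation: it assumes the semantic preservation score is the difference in summed log-likelihood that the target model assigns to the next $N$ tokens when conditioned on the replaced token versus the original token. The function names and the input lists of per-token log-probabilities are hypothetical; the precise definition is given in the paper's equations, which are not reproduced here.

```python
from typing import Sequence


def semantic_preservation_score(
    orig_logprobs: Sequence[float],
    replaced_logprobs: Sequence[float],
    suffix_len: int,
) -> float:
    """Assumed score: log-likelihood of the next `suffix_len` tokens after
    substituting the draft token, minus the same quantity for the original
    response. Values near zero suggest the substitution changes little."""
    orig = sum(orig_logprobs[:suffix_len])
    replaced = sum(replaced_logprobs[:suffix_len])
    return replaced - orig


def label_token(
    orig_logprobs: Sequence[float],
    replaced_logprobs: Sequence[float],
    suffix_len: int,
    tau: float,
) -> bool:
    """Label the replaced token as acceptable when the score exceeds tau."""
    score = semantic_preservation_score(orig_logprobs, replaced_logprobs, suffix_len)
    return score > tau


# Hypothetical per-token log-probabilities from the target model.
orig = [-0.1, -0.2, -0.3]
replaced = [-0.2, -0.2, -0.4]
acceptable = label_token(orig, replaced, suffix_len=3, tau=-0.5)
```

Labels produced this way need no ground-truth answers, which is what allows the verifier's training data to be generated automatically for open-ended tasks.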