ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks

Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park

TL;DR

ObjexMT tackles the challenge of recovering latent dialogue objectives and calibrating self-reported confidence for LLM-based judges under multi-turn jailbreaks. It defines a benchmark where models output a one-sentence base objective and a confidence score, which a fixed judge then evaluates against gold objectives using semantic similarity and a calibrated threshold. Across six models and three safety datasets, objective-extraction accuracy ranges from 0.47 to 0.61, and calibration remains imperfect, with high-confidence errors persisting, illustrating a practical reliability gap for LLM judges. The work provides actionable guidance, including confidence-gated decision-making and explicit objective surfacing, and releases data and code to support reproducible evaluation and extension to related latent-intent tasks like multi-hop QA and tool-use auditing.

Abstract

LLM-as-a-Judge (LLMaaJ) enables scalable evaluation, yet we lack a decisive test of a judge's qualification: can it recover the hidden objective of a conversation and know when that inference is reliable? Large language models degrade with irrelevant or lengthy context, and multi-turn jailbreaks can scatter goals across turns. We present ObjexMT, a benchmark for objective extraction and metacognition. Given a multi-turn transcript, a model must output a one-sentence base objective and a self-reported confidence. Accuracy is scored by semantic similarity to gold objectives, then thresholded once on 300 calibration items ($\tau^\star = 0.66$; $F_1@\tau^\star = 0.891$). Metacognition is assessed with expected calibration error, Brier score, Wrong@High-Confidence (0.80 / 0.90 / 0.95), and risk–coverage curves. Across six models (gpt-4.1, claude-sonnet-4, Qwen3-235B-A22B-FP8, kimi-k2, deepseek-v3.1, gemini-2.5-flash) evaluated on SafeMTData_Attack600, SafeMTData_1K, and MHJ, kimi-k2 achieves the highest objective-extraction accuracy (0.612; 95% CI [0.594, 0.630]), while claude-sonnet-4 (0.603) and deepseek-v3.1 (0.599) are statistically tied. claude-sonnet-4 offers the best selective risk and calibration (AURC 0.242; ECE 0.206; Brier 0.254). Performance varies sharply across datasets (16–82% accuracy), showing that automated obfuscation imposes challenges beyond model choice. High-confidence errors remain: Wrong@0.90 ranges from 14.9% (claude-sonnet-4) to 47.7% (Qwen3-235B-A22B-FP8). ObjexMT therefore supplies an actionable test for LLM judges: when objectives are implicit, judges often misinfer them; exposing objectives or gating decisions by confidence is advisable. All experimental data are in the Supplementary Material and at https://github.com/hyunjun1121/ObjexMT_dataset.
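
The calibration metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' released code: the function names are hypothetical, the `>=` comparison at the threshold is an assumption, and the binning convention (10 equal-width bins, with a confidence of exactly 1.0 placed in the last bin) is one common choice that may differ in detail from the paper's implementation.

```python
from typing import Optional, Sequence

TAU_STAR = 0.66  # similarity threshold reported in the abstract


def correctness_labels(similarities: Sequence[float], tau: float = TAU_STAR) -> list[int]:
    """Turn judge similarity scores into 0/1 labels (assumes >= tau counts as correct)."""
    return [1 if s >= tau else 0 for s in similarities]


def expected_calibration_error(conf: Sequence[float], correct: Sequence[int], n_bins: int = 10) -> float:
    """ECE over equal-width bins on [0, 1]: weighted mean of |accuracy - mean confidence|."""
    conf_sum = [0.0] * n_bins
    acc_sum = [0.0] * n_bins
    count = [0] * n_bins
    for p, y in zip(conf, correct):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        conf_sum[b] += p
        acc_sum[b] += y
        count[b] += 1
    n = len(conf)
    return sum(
        (count[b] / n) * abs(acc_sum[b] / count[b] - conf_sum[b] / count[b])
        for b in range(n_bins)
        if count[b] > 0
    )


def brier_score(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(conf, correct)) / len(conf)


def wrong_at(conf: Sequence[float], correct: Sequence[int], level: float) -> Optional[float]:
    """Error rate among predictions with confidence >= level; None if there are none."""
    picked = [y for p, y in zip(conf, correct) if p >= level]
    if not picked:
        return None
    return 1.0 - sum(picked) / len(picked)


# Toy data (made up for illustration, not taken from the benchmark).
sims = [0.80, 0.50, 0.70, 0.20]
confs = [0.95, 0.92, 0.65, 0.35]
labels = correctness_labels(sims)
print(labels, expected_calibration_error(confs, labels), brier_score(confs, labels), wrong_at(confs, labels, 0.90))
```

On this toy input, two of the four predictions are wrong, and one of them carries a confidence above 0.90, which is exactly the failure mode that Wrong@High-Confidence is meant to expose.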

Paper Structure

This paper contains 42 sections, 1 equation, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Per-dataset objective–extraction accuracy across models. Heatmap cells report accuracy after LLM–judge similarity thresholding at $\tau^\star{=}0.66$ on the human-aligned set (one pass per item; $N{=}2{,}817$ items/model). Rows are datasets SafeMTData_Attack600, SafeMTData_1K, MHJ; columns are the six models. The pattern reveals strong heterogeneity: MHJ is consistently easiest (e.g., gpt-4.1 $0.816$, kimi-k2 $0.857$), while Attack600 is hardest (range $0.162$–$0.333$). 1K sits in between (e.g., claude-sonnet-4 and kimi-k2 both $0.635$), indicating that dataset construction and obfuscation level drive difficulty. Darker cells denote higher accuracy.
  • Figure 2: Calibration comparison from self-reported confidence. Bars compare (a) Expected Calibration Error (ECE; $M{=}10$ equal-width bins over $[0,1]$), (b) Brier score, and (c) Wrong@0.90 (error rate among predictions with $p\!\ge\!0.90$). Metrics are computed against frozen correctness labels derived from the LLM–judge at $\tau^\star{=}0.66$. claude-sonnet-4 is best-calibrated overall (ECE ${=}0.206$, Brier ${=}0.254$) and has the lowest high-confidence error (Wrong@0.90 ${=}14.9\%$), whereas Qwen3-235B-A22B-FP8 is most miscalibrated (ECE ${=}0.417$, Brier ${=}0.416$, Wrong@0.90 ${=}47.7\%$). Results aggregate $N{=}2{,}817$ predictions per model; lower is better for all three metrics.
  • Figure 3: Calibration–accuracy trade-off across models. Each point is a model with y-axis accuracy and x-axis ECE (as in Fig. 2); the green rectangle highlights the ideal region (low ECE, high accuracy). kimi-k2 attains the highest accuracy ($0.612$) but with moderate ECE ($0.259$), while claude-sonnet-4 lies closest to the ideal corner by combining strong accuracy ($0.603$) with the best ECE ($0.206$). Models with higher ECE tend to suffer lower accuracy (e.g., Qwen3-235B-A22B-FP8: ECE $0.417$, Acc $0.474$), underscoring the need to consider calibration alongside topline accuracy.
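
The selective-risk view mentioned in the abstract (risk–coverage curves, AURC) complements the calibration metrics in the figures above. The following Python sketch shows one common discrete formulation: rank predictions by self-reported confidence, then track the error rate as coverage grows. The function names are hypothetical, and averaging risk over all coverage levels is an assumed estimator for AURC; the paper's exact definition is not given in this excerpt.

```python
from typing import Sequence


def risk_coverage_curve(conf: Sequence[float], correct: Sequence[int]) -> list[tuple[float, float]]:
    """Return (coverage, risk) pairs, answering the most confident items first.

    Ties in confidence keep their input order (Python's sort is stable).
    """
    order = sorted(range(len(conf)), key=lambda i: conf[i], reverse=True)
    curve = []
    errors = 0
    for k, i in enumerate(order, start=1):
        errors += 1 - correct[i]                      # running count of wrong answers
        curve.append((k / len(order), errors / k))   # fraction covered, error rate so far
    return curve


def aurc(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Area under the risk-coverage curve as the mean risk over coverage levels (assumed estimator)."""
    curve = risk_coverage_curve(conf, correct)
    return sum(risk for _, risk in curve) / len(curve)


# Toy data (made up for illustration, not taken from the benchmark).
confs = [0.95, 0.92, 0.65, 0.35]
labels = [1, 0, 1, 0]
for coverage, risk in risk_coverage_curve(confs, labels):
    print(f"coverage={coverage:.2f} risk={risk:.3f}")
print("AURC:", aurc(confs, labels))
```

A lower AURC means that abstaining on low-confidence items removes errors efficiently, which is the property that confidence-gated decision-making relies on.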