CSE-UOI at SemEval-2026 Task 6: A Two-Stage Heterogeneous Ensemble with Deliberative Complexity Gating for Political Evasion Detection

Christos Tzouvaras, Konstantinos Skianis, Athanasios Voulodimos

Abstract

This paper describes our system for SemEval-2026 Task 6, which classifies the clarity of responses in political interviews into three categories: Clear Reply, Ambivalent, and Clear Non-Reply. We propose a heterogeneous ensemble of two large language models (LLMs) that combines self-consistency (SC) with weighted voting, together with a novel post-hoc correction mechanism, Deliberative Complexity Gating (DCG). DCG uses cross-model behavioral signals and exploits the finding that an LLM response-length proxy correlates strongly with sample ambiguity. To further examine mechanisms for improving ambiguity detection, we also evaluated multi-agent debate as an alternative strategy for increasing deliberative capacity. Unlike DCG, which adaptively gates reasoning using cross-model behavioral signals, debate increases the agent count without increasing model diversity. Our solution achieved a Macro-F1 score of 0.85 on the evaluation set, securing 3rd place.
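
As an illustration of the first stage described above, the following is a minimal sketch of self-consistency followed by weighted voting across two models. It is not the authors' implementation: the function names, the weights `w_a` and `w_b`, and the use of the agreement fraction as a confidence score are all assumptions introduced here, since the abstract does not specify them.

```python
from collections import Counter

CLASSES = ("Clear Reply", "Ambivalent", "Clear Non-Reply")

def self_consistency(samples):
    """Majority label over k sampled predictions and its agreement fraction.

    Ties are broken by first occurrence (Counter.most_common behaviour).
    """
    counts = Counter(samples)
    label, n = counts.most_common(1)[0]
    return label, n / len(samples)

def weighted_vote(label_a, conf_a, label_b, conf_b, w_a=0.6, w_b=0.4):
    """Combine two model votes; w_a and w_b are hypothetical weights."""
    scores = {c: 0.0 for c in CLASSES}
    scores[label_a] += w_a * conf_a
    scores[label_b] += w_b * conf_b
    return max(CLASSES, key=lambda c: scores[c])

# Hypothetical k=5 samples from two different models for one example.
model_a_samples = ["Ambivalent"] * 4 + ["Clear Reply"]
model_b_samples = ["Clear Reply"] * 3 + ["Ambivalent"] * 2

label_a, conf_a = self_consistency(model_a_samples)
label_b, conf_b = self_consistency(model_b_samples)
final_label = weighted_vote(label_a, conf_a, label_b, conf_b)
```

With unequal weights, the more heavily weighted model wins a disagreement unless its self-consistency is much lower than the other model's, which is one plausible reading of weighted voting over SC outputs.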

Paper Structure (28 sections, 6 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: Task label taxonomy from Thomas et al. (2024), with evasion labels mapped to the three clarity classes.
  • Figure 2: Architecture of our two-stage pipeline. Stage 1 runs Grok and Gemini with $k=5$ self-consistency (SC), maps evasion to clarity, and applies asymmetric weighted voting. Stage 2 (DCG) uses Gemini response length and Grok consistency to gate uncertain cases.
  • Figure A1: DCG sensitivity to percentile choice. Macro-F1 as $\theta_1$ moves from the 5th to the 95th percentile on Development and Evaluation.
  • Figure A2: Gemini response length by gold clarity class. Ambivalent samples are generally longer than clear classes on both splits.
  • Figure A3: Agreement $\times$ consistency heatmap (post-DCG). Accuracy by Grok--Gemini agreement and Grok self-consistency bins on Development and Evaluation sets. The dashed red outline highlights disagreement with medium/low-consistency cells as a diagnostic focus region.
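
The captions for Figure 2 and Figure A1 indicate that DCG gates on Gemini response length and Grok self-consistency, with a threshold $\theta_1$ chosen as a percentile. A minimal sketch of such a gate follows. The exact decision rule is not given in this excerpt, so everything below is an assumption: that $\theta_1$ is a percentile of response lengths, that the gate reassigns a sample to Ambivalent when the response is long and consistency is low, and the names `percentile`, `dcg_gate`, and `min_consistency`.

```python
def percentile(values, q):
    """Linear-interpolation percentile of values, with q in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def dcg_gate(label, gemini_len, grok_consistency, theta_1, min_consistency=0.8):
    """Hypothetical gate: long response plus low consistency -> Ambivalent.

    Both conditions and the min_consistency value are assumptions,
    not the rule used in the paper.
    """
    if gemini_len > theta_1 and grok_consistency < min_consistency:
        return "Ambivalent"
    return label

# Made-up response lengths (in characters) used to set the threshold.
dev_lengths = [100, 200, 300, 400, 500]
theta_1 = percentile(dev_lengths, 75)

gated = dcg_gate("Clear Reply", gemini_len=450, grok_consistency=0.6, theta_1=theta_1)
kept = dcg_gate("Clear Reply", gemini_len=450, grok_consistency=1.0, theta_1=theta_1)
```

Sweeping `q` from 5 to 95 and recomputing Macro-F1 at each value would reproduce the kind of sensitivity analysis that Figure A1 describes.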