CSE-UOI at SemEval-2026 Task 6: A Two-Stage Heterogeneous Ensemble with Deliberative Complexity Gating for Political Evasion Detection

Christos Tzouvaras; Konstantinos Skianis; Athanasios Voulodimos

Abstract

This paper describes our system for SemEval-2026 Task 6, which classifies the clarity of responses in political interviews into three categories: Clear Reply, Ambivalent, and Clear Non-Reply. We propose a heterogeneous ensemble of two large language models (LLMs) that combines self-consistency (SC) with weighted voting, together with a novel post-hoc correction mechanism, Deliberative Complexity Gating (DCG). DCG uses cross-model behavioral signals and exploits the finding that LLM response length, used as a proxy, correlates strongly with sample ambiguity. To further examine mechanisms for improving ambiguity detection, we also evaluated multi-agent debate as an alternative strategy for increasing deliberative capacity. Unlike DCG, which adaptively gates reasoning using cross-model behavioral signals, debate increases the number of agents without increasing model diversity. Our solution achieved a Macro-F1 score of 0.85 on the evaluation set, securing 3rd place.

Paper Structure (28 sections, 6 equations, 5 figures, 9 tables)

Introduction
Background
Task Setup
Related Work
System Overview
Dual-Model Ensemble
Deliberative Complexity Gating (DCG)
Experimental Setup
Implementation and Dependencies
Inference Hyperparameters
Prompting Strategy
Evaluation Measures
Results
Evaluation
Ablation Study
...and 13 more sections

Figures (5)

Figure 1: Task label taxonomy from Thomas et al. (2024) (evasion labels mapped to 3 clarity classes).
Figure 2: Architecture of our two-stage pipeline. Stage 1 runs Grok and Gemini with $k=5$ self-consistency (SC), maps evasion to clarity, and applies asymmetric weighted voting. Stage 2 (DCG) uses Gemini response length and Grok consistency to gate uncertain cases.
Figure A1: DCG sensitivity to percentile choice. Macro-F1 as $\theta_1$ moves from the 5th to the 95th percentile on Development and Evaluation.
Figure A2: Gemini response length by gold clarity class. Ambivalent samples are generally longer than clear classes on both splits.
Figure A3: Agreement $\times$ consistency heatmap (post-DCG). Accuracy by Grok--Gemini agreement and Grok self-consistency bins on Development and Evaluation sets. The dashed red outline highlights disagreement with medium/low-consistency cells as a diagnostic focus region.
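
The two-stage pipeline summarized in the abstract and in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the vote weights, the consistency cutoff, the nearest-rank percentile used for $\theta_1$, and the rule that a gated sample is relabeled as Ambivalent are all assumptions made for illustration, and every function name is hypothetical. The sketch also starts from clarity labels directly and omits the evasion-to-clarity mapping.

```python
import math
from collections import Counter

LABELS = ("Clear Reply", "Ambivalent", "Clear Non-Reply")


def self_consistency(samples):
    """Majority label over k sampled outputs, plus the share that agree with it."""
    label, count = Counter(samples).most_common(1)[0]
    return label, count / len(samples)


def weighted_vote(grok_label, gemini_label, w_grok=0.6, w_gemini=0.4):
    """Asymmetric weighted vote between the two models.

    The weights are placeholders (assumption), not values from the paper.
    """
    scores = {label: 0.0 for label in LABELS}
    scores[grok_label] += w_grok
    scores[gemini_label] += w_gemini
    return max(scores, key=scores.get)


def percentile(values, q):
    """Nearest-rank percentile; one possible way to derive theta_1 (assumption)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def dcg(stage1_label, gemini_length, grok_consistency, theta_1,
        min_consistency=0.8):
    """Hypothetical gate: a long Gemini response combined with low Grok
    self-consistency is treated as a sign of ambiguity.

    Both the cutoff and the relabeling action are assumptions.
    """
    if gemini_length > theta_1 and grok_consistency < min_consistency:
        return "Ambivalent"
    return stage1_label


# Toy example with k = 5 samples per model.
grok_label, grok_cons = self_consistency(
    ["Clear Reply", "Clear Reply", "Clear Reply", "Ambivalent", "Ambivalent"]
)
gemini_label, _ = self_consistency(["Ambivalent"] * 5)
stage1 = weighted_vote(grok_label, gemini_label)

# theta_1 taken from hypothetical Gemini response lengths on a development set.
theta_1 = percentile([100, 200, 300, 400, 500], 80)
final = dcg(stage1, gemini_length=450, grok_consistency=grok_cons,
            theta_1=theta_1)
```

In this toy run the Stage 1 vote follows the more heavily weighted model, and the gate then overrides it because the response is longer than the threshold while self-consistency is low. The paper's sensitivity analysis (Figure A1) varies the percentile behind $\theta_1$, which corresponds to the `q` argument here.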
