DecepGPT: Schema-Driven Deception Detection with Multicultural Datasets and Robust Multimodal Learning

Jiajian Huang, Dongliang Zhu, Zitong YU, Hui Ma, Jiayu Zhang, Chunmei Zhu, Xiaochun Cao

Abstract

Multimodal deception detection aims to identify deceptive behavior by analyzing audiovisual cues for forensics and security. In these high-stakes settings, investigators need verifiable evidence connecting audiovisual cues to final decisions, along with reliable generalization across domains and cultural contexts. However, existing benchmarks provide only binary labels without intermediate reasoning cues. Datasets are also small and cover few scenarios, which encourages shortcut learning. We address these issues through three contributions. First, we construct reasoning datasets by augmenting existing benchmarks with structured cue-level descriptions and reasoning chains, enabling models to output auditable reports. Second, we release T4-Deception, a multicultural dataset based on the unified "To Tell The Truth" television format implemented across four countries. With 1695 samples, it is the largest non-laboratory deception detection dataset. Third, we propose two modules for robust learning under small-data conditions. Stabilized Individuality-Commonality Synergy (SICS) refines multimodal representations by combining learnable global priors with sample-adaptive residuals, followed by a polarity-aware adjustment that bidirectionally recalibrates the representations. Distilled Modality Consistency (DMC) aligns modality-specific predictions with the fused multimodal predictions via knowledge distillation to prevent unimodal shortcut learning. Experiments on three established benchmarks and our new dataset show that our method achieves state-of-the-art performance in both in-domain and cross-domain scenarios, and transfers better across diverse cultural contexts. The datasets and code will be released.
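
As a rough illustration of the idea behind DMC, the sketch below penalizes disagreement between each unimodal prediction and the fused prediction. It is not the paper's loss: the temperature-scaled KL divergence, the treatment of the fused prediction as the teacher, the averaging over the two modalities, and the names `softmax`, `kl_divergence`, and `consistency_loss` are all assumptions made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(fused_logits, video_logits, audio_logits, temperature=2.0):
    """Assumed form: mean KL from the fused (teacher) distribution
    to each unimodal (student) distribution, scaled by T^2."""
    teacher = softmax(fused_logits, temperature)
    loss = 0.0
    for student_logits in (video_logits, audio_logits):
        student = softmax(student_logits, temperature)
        loss += kl_divergence(teacher, student)
    return (temperature ** 2) * loss / 2


# Hypothetical two-class logits (truthful, deceptive).
agree = consistency_loss([2.0, -2.0], [2.0, -2.0], [2.0, -2.0])
conflict = consistency_loss([2.0, -2.0], [2.0, -2.0], [-2.0, 2.0])
print(agree, conflict)
```

Under these assumptions the loss is zero when both modalities agree with the fused prediction and grows when one modality confidently contradicts it, which is the behavior a consistency term of this kind is meant to discourage.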

Paper Structure (22 sections, 10 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Comparison between traditional and our auditable method. (a) Traditional methods map behavioral signals directly to binary labels, providing no explanation for the decision. (b) Our method generates structured reports with explicit audiovisual cues and reasoning, creating an audit trail from raw data to final prediction.
  • Figure 2: Overview of the reasoning dataset generation pipeline. The pipeline adopts a Human-in-the-Loop (HITL) framework to ensure high-quality, auditable structured reports. It begins with AI-driven audiovisual cue extraction, followed by human-guided rectification of hallucinations. A reasoning assistant then synthesizes these cues into forensic judgments. The data is further enriched through semantic augmentation and a multi-tiered filtering stage (comprising AI, rule-based, and CLIP-similarity checks) to produce the final high-fidelity benchmark.
  • Figure 3: Dataset statistics of T4-Deception. We illustrate: (a) the distribution of identities, where each of the 565 identities is shared by one truthful and two deceptive participants; (b) the balanced gender distribution; and (c) numerous short-term deceptive segments with an average duration of 3.65s.
  • Figure 4: Overview of our auditable audiovisual deception detection framework. A video encoder and an audio encoder extract modality features, followed by a fusion module that produces a robust representation. Inside the encoder/fusion stage, we integrate two mechanisms: (1) Stabilized Individuality-Commonality Synergy (SICS) that combines a shared baseline with a sample-specific residual via gated fusion (with a light stability regularizer); (2) Distilled Modality Consistency (DMC) that discourages unimodal dominance by penalizing high-confidence cross-modal conflict through agreement regularization on modality-specific predictive distributions. A report generator then produces a single-line, schema-constrained report (Video Cues; Audio Cues; Reasoning; Prediction), which serves as a standardized audit artifact.
  • Figure 5: Projector gradient dynamics analysis during training.
  • ...and 4 more figures
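
The Figure 4 caption describes a single-line, schema-constrained report with four fields (Video Cues; Audio Cues; Reasoning; Prediction). A minimal sketch of how such a report could be checked before being stored as an audit artifact is given below. The exact serialization is not specified in the caption, so the `Field: value; Field: value` layout, the label set `truthful`/`deceptive`, the function name `parse_report`, and the example cue text are all assumptions.

```python
import re

# Field names come from the Figure 4 caption; everything else is assumed.
FIELDS = ("Video Cues", "Audio Cues", "Reasoning", "Prediction")
LABELS = {"truthful", "deceptive"}  # hypothetical label vocabulary

# Require the four fields in schema order, separated by semicolons.
PATTERN = r"\s*;\s*".join(
    re.escape(name) + r"\s*:\s*(.+?)" for name in FIELDS
)


def parse_report(line):
    """Parse a single-line report into a dict, or raise ValueError."""
    text = line.strip()
    if "\n" in text:
        raise ValueError("report must be a single line")
    match = re.fullmatch(PATTERN, text)
    if match is None:
        raise ValueError("report does not follow the four-field schema")
    report = {name: value.strip() for name, value in zip(FIELDS, match.groups())}
    label = report["Prediction"].lower().rstrip(".")
    if label not in LABELS:
        raise ValueError("unknown prediction label: " + report["Prediction"])
    report["Prediction"] = label
    return report


# Illustrative report text, not taken from the dataset.
example = (
    "Video Cues: frequent gaze aversion; lip pressing; "
    "Audio Cues: long pauses; "
    "Reasoning: cues conflict with the stated claim; "
    "Prediction: Deceptive"
)
print(parse_report(example))
```

Because the field values are matched non-greedily up to the next field name, semicolons inside a cue description do not break the parse; a report that omits or reorders a field is rejected rather than silently accepted.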