Can Deception Detection Go Deeper? Dataset, Evaluation, and Benchmark for Deception Reasoning
Kang Chen, Zheng Lian, Haiyang Sun, Rui Liu, Jiangyan Yi, Bin Liu, Jianhua Tao
TL;DR
This paper introduces deception reasoning as an extension of deception detection that focuses on objective, evidence-backed explanations of why a statement may be a lie. It presents a data-generation pipeline that uses legal instruments to synthesize open-ended deception-reasoning dialogues with GPT-4, complemented by a two-stage process of target-content and action extraction. Four evaluation metrics (accuracy, completeness, logic, and depth) are proposed and applied through automatic and human evaluators to benchmark large language models. The results reveal progress in Chinese LLM reasoning and indicate that synthetic dialogue is feasible as a cost-effective dataset; the naturalness of the generated dialogues is also assessed. The work provides a practical benchmark for LLM reasoning in high-stakes interrogation scenarios and lays the groundwork for extending to multimodal deception reasoning with real dialogues.
Abstract
Deception detection has attracted increasing attention due to its importance in real-world scenarios. Its main goal is to detect deceptive behaviors from multimodal cues such as gestures, facial expressions, and prosody. However, these cues are usually subjective and depend on personal habits. Therefore, we extend deception detection to deception reasoning, which further provides objective evidence to support subjective judgment. Specifically, we provide potential lies and basic facts and then analyze why a given statement may be a lie by combining factual inconsistencies with the intent behind them. Compared with deception detection, this task is more applicable to real-world scenarios. For example, in an interrogation, the police should judge whether a person is lying based on solid evidence. This paper presents our initial attempts at this task, including constructing a dataset and defining evaluation metrics. This task can also serve as a benchmark for evaluating the complex reasoning capability of large language models. Our code and data are provided in the supplementary material.
