Can Deception Detection Go Deeper? Dataset, Evaluation, and Benchmark for Deception Reasoning

Kang Chen, Zheng Lian, Haiyang Sun, Rui Liu, Jiangyan Yi, Bin Liu, Jianhua Tao

TL;DR

This paper introduces deception reasoning as an extension of deception detection that focuses on objective, evidence-backed explanations for potential lies. It presents a data-generation pipeline that uses legal instruments to synthesize open-ended deception-reasoning dialogues via GPT-4, complemented by a two-stage extraction of target content and actions. Four evaluation metrics (accuracy, completeness, logic, and depth) are proposed and applied through automatic and human evaluators to benchmark large language models. The results reveal progress in Chinese LLM reasoning and suggest that synthetic dialogue is a feasible, cost-effective data source; dialogue naturalness is also assessed. The work provides a practical benchmark for LLM reasoning in high-stakes interrogation scenarios and lays the groundwork for extending deception reasoning to multimodal settings with real dialogues.

Abstract

Deception detection has attracted increasing attention due to its importance in real-world scenarios. Its main goal is to detect deceptive behaviors from multimodal cues such as gestures, facial expressions, and prosody. However, these cues are usually subjective and tied to personal habits. Therefore, we extend deception detection to deception reasoning, which further provides objective evidence to support subjective judgment. Specifically, given a potential lie and the basic facts, we analyze why the statement may be a lie by combining factual inconsistencies with the intent behind them. Compared with deception detection, this task is more applicable to real-world scenarios. For example, in interrogation, the police should judge whether a person is lying based on solid evidence. This paper presents our initial attempts at this task, including constructing a dataset and defining evaluation metrics. This task can also serve as a benchmark for evaluating the complex reasoning capability of large language models. Our code and data are provided in the supplementary material.

Paper Structure (21 sections, 5 figures, 10 tables)

Figures (5)

  • Figure 1: Pipeline of dialogue generation based on legal instruments.
  • Figure 2: Distribution of lengths after selection (the length refers to the number of Chinese characters).
  • Figure 3: Example of time masking process.
  • Figure 4: Generated dialogue, potential lie (in the red box), and reasoning results using the examples in Table 1.
  • Figure 5: Distribution of target content length, number of actions, and dialogue turns.