Domain-Independent Deception: A New Taxonomy and Linguistic Analysis

Rakesh M. Verma, Nachum Dershowitz, Victor Zeng, Dainis Boumber, Xuting Liu

TL;DR

This paper tackles deception detection across multiple domains by introducing a formal, domain-agnostic framework. It defines deception with a probabilistic, exposure-based criterion and presents a multi-dimensional taxonomy that includes agents, stratagems, goals, and exposure, along with dimensions like motivation and modality. Through linguistic cue analysis across real-world deception datasets and extensive deep-learning experiments with transformer models, the authors find both universal and domain-specific signals, highlighting substantial cross-domain transfer potential but also limitations in generalization without diverse, domain-spanning data. The work also provides guidelines for rigorous systematic reviews and critiques past claims of universal linguistic cues, aiming to stabilize the field and pave the way for robust, domain-independent deception detectors.

Abstract

Internet-based economies and societies are drowning in deceptive attacks. These attacks take many forms, such as fake news, phishing, and job scams, which we call "domains of deception." Machine-learning and natural-language-processing researchers have been attempting to ameliorate this precarious situation by designing domain-specific detectors. Only a few recent works have considered domain-independent deception. We collect these disparate threads of research and investigate domain-independent deception. First, we provide a new computational definition of deception and break down deception into a new taxonomy. Then, we analyze the debate on linguistic cues for deception and supply guidelines for systematic reviews. Finally, we investigate common linguistic features and give evidence for knowledge transfer across different forms of deception.

Paper Structure (40 sections, 1 equation, 8 figures, 11 tables)

Figures (8)

  • Figure 1: The proposed deception taxonomy -- the full manipulation (or stratagem) and motivation subtrees are not shown.
  • Figure 2: Graph showing the 17 papers as vertices. There is an edge between papers that have a common author. The thickness and color reflect the number of common authors. $K_3$ and $K_4$ cliques are visible.
  • Figure 3: Random Forest performance for the five feature types: linguistic (ling), function words (fw), POS tags of function words (fw-pos), combination of the three (all), and unigram TF-IDF (baseline); F -- fake news; J -- job scams; P -- phishing; Pr -- product reviews; Ps -- political statements.
  • Figure 4: Model used for deep learning experiments.
  • Figure 5: Pairwise F1 score scatter matrix of converged combined models. Outliers are excluded.
  • ...and 3 more figures

Theorems & Definitions (4)

  • Definition 1: Preliminary
  • Definition 2: Deception
  • Definition 3: Computational Deception -- Formalized
  • Definition 4: Computational Deception -- Quantified