Model Attribution in LLM-Generated Disinformation: A Domain Generalization Approach with Supervised Contrastive Learning

Alimohammad Beigi, Zhen Tan, Nivedh Mudiam, Canyu Chen, Kai Shu, Huan Liu

TL;DR

The paper tackles the challenge of attributing LLM-generated disinformation to its source across varied prompting methods. It reframes the problem as domain generalization and introduces SCLBERT, a supervised-contrastive framework that learns domain-invariant representations to identify the originating LLM even on unseen prompts. Through extensive experiments on the LLMFake dataset with three prompting methods and three LLMs, SCLBERT outperforms strong baselines, especially in out-of-domain settings, and shows clearer, more clustered latent representations. The work demonstrates that domain-invariant, signature-based attribution is feasible and robust, with practical implications for tracing disinformation and improving detection systems; it also outlines avenues for scalability, robustness to adversarial prompting, and interpretability.

Abstract

Model attribution for LLM-generated disinformation poses a significant challenge in understanding its origins and mitigating its spread. The task is especially difficult because modern large language models (LLMs) produce disinformation with human-like quality. In addition, the diversity of prompting methods used to generate disinformation complicates accurate source attribution: these methods introduce domain-specific features that can mask the fundamental characteristics of the models. In this paper, we frame model attribution as a domain generalization problem, where each prompting method represents a unique domain. We argue that an effective attribution model must be invariant to these domain-specific features while still identifying the originating model across all scenarios, reflecting real-world detection challenges. To address this, we introduce a novel approach based on Supervised Contrastive Learning. The method is designed to make the model robust to variations in prompts and to focus on distinguishing between different source LLMs. We evaluate our model through rigorous experiments involving three common prompting methods ("open-ended", "rewriting", and "paraphrasing") and three advanced LLMs (Llama 2, ChatGPT, and Vicuna). Our results demonstrate the effectiveness of our approach in model attribution tasks, achieving state-of-the-art performance across diverse and unseen datasets.
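The abstract does not spell out the loss function, so the following is only a minimal sketch of a supervised contrastive objective in its commonly used form, not the authors' implementation. It assumes that labels are source-LLM ids, that embeddings come from some sentence encoder, and that every other sample in the batch with the same label counts as a positive regardless of which prompting method produced it. The function name `supervised_contrastive_loss`, the NumPy implementation, and the default `temperature` are all illustrative choices.

```python
import numpy as np


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Sketch of a supervised contrastive loss over one batch.

    embeddings: (N, D) array of sentence embeddings (assumed input).
    labels:     (N,) array of source-LLM ids (assumed labeling scheme).
    Returns the mean loss over anchors that have at least one positive.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    y = np.asarray(labels)
    n = z.shape[0]

    # L2-normalize so dot products become cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature

    not_self = ~np.eye(n, dtype=bool)
    # Positives: same source LLM, excluding the anchor itself.
    pos = (y[:, None] == y[None, :]) & not_self

    pos_count = pos.sum(axis=1)
    has_pos = pos_count > 0
    if not has_pos.any():
        # No anchor has a positive in this batch; nothing to contrast.
        return 0.0

    # Numerically stable log-softmax over all samples except the anchor.
    sim_masked = np.where(not_self, sim, -np.inf)
    row_max = sim_masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(
        np.exp(sim_masked - row_max).sum(axis=1, keepdims=True)
    )
    log_prob = sim - log_denom

    # Average log-probability assigned to the positives of each anchor.
    pos_log_prob_sum = np.where(pos, log_prob, 0.0).sum(axis=1)
    mean_log_prob_pos = pos_log_prob_sum[has_pos] / pos_count[has_pos]
    return float(-mean_log_prob_pos.mean())
```

Under these assumptions, a batch that mixes texts from several prompting methods would pull together embeddings that share a source LLM and push apart those that do not, which is one plausible way to encourage the prompt-invariant representations the abstract describes. In practice such a term is often combined with a standard classification loss, but whether and how the paper does so is not stated in this section.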

Paper Structure (18 sections, 5 equations, 4 figures, 1 table, 1 algorithm)

Figures (4)

  • Figure 1: A schematic diagram of model attribution in LLM-generated disinformation.
  • Figure 2: SCL Architecture.
  • Figure 3: t-SNE visualization of 256-dimensional sentence embeddings $z$ for each model. The models are trained using open-ended (O) and rewriting (R) as source domains and paraphrasing (P) as the target domain. For the visualization, 500 examples are sampled from each domain.
  • Figure 4: Investigating the Capacity of ChatGPT 4 for Attributing Disinformation Origin: An Analysis Using In-Context Learning Examples. This figure illustrates ChatGPT 4's potential in identifying the source of LLM-generated disinformation by analyzing the distinct writing styles of various language models through in-context learning. Examples showcase both successful and unsuccessful attributions, highlighting the method's reliance on the quality and representativeness of the input examples.