Large Language Model Sourcing: A Survey

Liang Pang, Jia Gu, Sunhao Dai, Zihao Wei, Zenghao Duan, Kangxi Wu, Zhiyi Yin, Jun Xu, Huawei Shen, Xueqi Cheng

TL;DR

This survey introduces a unified four‑dimensional provenance framework for LLMs (Model, Model Structure, Training Data, External Data) and a dual‑paradigm taxonomy (prior‑based vs posterior‑based) to enable proactive traceability and retrospective attribution across the full lifecycle of generated content. It surveys methods, datasets, and evaluation protocols for each dimension, detailing white‑box and black‑box approaches as well as watermarking and circuit‑level analyses that reveal how architecture and data shape outputs. The work highlights the trade‑offs between explicit, marker‑based attribution and flexible, post‑hoc inference, and it outlines methodological and governance challenges toward trustworthy, accountable AI systems. The practical impact lies in guiding developers and regulators to implement robust provenance mechanisms—ranging from watermarks and embedded markers to evidence retrieval and citation protocols—that bolster transparency, safety, and fairness in real‑world LLM deployments.

Abstract

Due to the black-box nature of large language models (LLMs) and the realism of their generated content, issues such as hallucinations, bias, unfairness, and copyright infringement have become significant. In this context, sourcing information from multiple perspectives is essential. This survey presents a systematic investigation organized around four interrelated dimensions: Model Sourcing, Model Structure Sourcing, Training Data Sourcing, and External Data Sourcing. Moreover, a unified dual-paradigm taxonomy is proposed that classifies existing sourcing methods into prior-based (proactive traceability embedding) and posterior-based (retrospective inference) approaches. Traceability across these dimensions enhances the transparency, accountability, and trustworthiness of LLM deployment in real-world applications.

Paper Structure

This paper contains 48 sections, 12 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Four sourcing dimensions in LLMs: from the model side, outputs derive from the specific model (model sourcing) or its internal architecture and mechanisms (structure sourcing); from the data side, outputs trace to training samples (training data sourcing) or to external corpora (external data sourcing).
  • Figure 2: A taxonomy of large language model sourcing methodologies.
  • Figure 3: Methods for model sourcing are divided into two categories: posterior-based and prior-based. Posterior-based sourcing comprises two approaches: white box and black box. Prior-based sourcing is divided into parameter-embedded, generative sampling, and sampling-based watermark approaches.
  • Figure 4: The methods for structural sourcing can be categorized into two types: posterior-based and prior-based. Within posterior-based structural sourcing, there are two approaches: single-modular and connection-modular. For prior-based structural sourcing, the approaches are divided into model-level and layer-level.
  • Figure 5: Training data sourcing methods are categorized into two types: posterior-based and prior-based. Posterior-based training data sourcing includes two approaches: white box and black box. Prior-based training data sourcing is divided into watermark-based and proxy model methods.
  • ...and 1 more figure