Large Language Model Sourcing: A Survey

Liang Pang; Jia Gu; Sunhao Dai; Zihao Wei; Zenghao Duan; Kangxi Wu; Zhiyi Yin; Jun Xu; Huawei Shen; Xueqi Cheng

TL;DR

This survey introduces a unified four-dimensional provenance framework for LLMs (Model, Model Structure, Training Data, External Data) and a dual-paradigm taxonomy (prior-based vs. posterior-based) that supports proactive traceability and retrospective attribution across the full lifecycle of generated content. It surveys methods, datasets, and evaluation protocols for each dimension, covering white-box and black-box approaches as well as watermarking and circuit-level analyses that reveal how architecture and data shape outputs. The work highlights the trade-offs between explicit, marker-based attribution and flexible, post-hoc inference, and it outlines the methodological and governance challenges on the path toward trustworthy, accountable AI systems. Its practical value lies in guiding developers and regulators to implement robust provenance mechanisms, ranging from watermarks and embedded markers to evidence retrieval and citation protocols, that bolster transparency, safety, and fairness in real-world LLM deployments.

Abstract

Due to the black-box nature of large language models (LLMs) and the realism of their generated content, issues such as hallucinations, bias, unfairness, and copyright infringement have become significant. In this context, sourcing information from multiple perspectives is essential. This survey presents a systematic investigation organized around four interrelated dimensions: Model Sourcing, Model Structure Sourcing, Training Data Sourcing, and External Data Sourcing. Moreover, a unified dual-paradigm taxonomy is proposed that classifies existing sourcing methods into prior-based (proactive traceability embedding) and posterior-based (retrospective inference) approaches. Traceability across these dimensions enhances the transparency, accountability, and trustworthiness of LLM deployment in real-world applications.
