A Perplexity and Menger Curvature-Based Approach for Similarity Evaluation of Large Language Models

Yuantao Zhang, Zhankui Yang

TL;DR

This work tackles the problem of detecting copyright infringement and copying among Large Language Models by proposing a geometry-informed similarity metric that combines perplexity-change profiles with Menger curvature. By defining $\Delta\text{PPL}(w_i)$ around each word and connecting model differences to curvature, the authors derive an upper bound on similarity that aligns with observed relationships across related models. In preliminary and main experiments on open-source LLMs across Wikipedia, medical, and legal domains, the method matches or outperforms baselines such as Similarity Approximation and Jensen-Shannon Divergence while remaining robust and scalable, including in simulated copying scenarios where noise is added to model parameters. The approach yields practical copying-detection thresholds and applies across domains, suggesting a viable, domain-general tool for preserving originality and integrity in LLM deployments. (Code is available at the provided GitHub repository.)

Abstract

The rise of Large Language Models (LLMs) has brought about concerns regarding copyright infringement and unethical practices in data and model usage. For instance, slight modifications to existing LLMs may be used to falsely claim the development of new models, leading to issues of model copying and violations of ownership rights. This paper addresses these challenges by introducing a novel metric for quantifying LLM similarity, which leverages perplexity curves and differences in Menger curvature. Comprehensive experiments validate the performance of our methodology, demonstrating its superiority over baseline methods and its ability to generalize across diverse models and domains. Furthermore, we highlight the capability of our approach in detecting model replication through simulations, emphasizing its potential to preserve the originality and integrity of LLMs. Code is available at https://github.com/zyttt-coder/LLM_similarity.

Paper Structure

This paper contains 21 sections, 19 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Perplexity curve of gpt2 on the text "Turkish (Türkçe) is a language officially spoken in Turkey and Northern Cyprus." Each point represents the perplexity of the sequence from word index 0 to its respective word index.
  • Figure 2: Perplexity curves of Llama7B, Vicuna7B, and gpt-neo-125M on the text "Associations between age and gray matter volume in anatomical brain networks in middle-aged to older adults."
  • Figure 3: Similarity between noised or fine-tuned models and the base LLM. The base LLMs for the left and right subfigures are Llama7B and Pythia-6.9b, respectively.
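
The perplexity curves in Figures 1 and 2 and the Menger curvature named in the title can be illustrated with a minimal sketch. It assumes per-token log-probabilities are already available (the values below are hypothetical, not model outputs) and uses the standard Menger curvature of three points, 4 × triangle area divided by the product of the side lengths. The function names `perplexity_curve`, `menger_curvature`, and `curvature_profile` are illustrative only; the authors' exact definition of $\Delta\text{PPL}(w_i)$ and the way they turn curvature into a similarity score are not given in this summary and are not reproduced here.

```python
import math


def perplexity_curve(token_logprobs):
    """Prefix perplexity: point i is exp(-mean log-prob) over positions 0..i."""
    curve = []
    running_sum = 0.0
    for count, logprob in enumerate(token_logprobs, start=1):
        running_sum += logprob
        curve.append(math.exp(-running_sum / count))
    return curve


def menger_curvature(p, q, r):
    """Menger curvature of three 2-D points: 4 * area / (|pq| * |qr| * |pr|)."""
    (x1, y1), (x2, y2), (x3, y3) = p, q, r
    # The cross product gives twice the triangle area.
    twice_area = abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1))
    side_product = math.dist(p, q) * math.dist(q, r) * math.dist(p, r)
    if side_product == 0.0:
        return 0.0  # coincident points: treat curvature as zero (an assumption)
    return 2.0 * twice_area / side_product


def curvature_profile(curve):
    """Curvature at each interior point of the curve, using (index, value) pairs."""
    points = [(float(i), v) for i, v in enumerate(curve)]
    return [
        menger_curvature(points[i - 1], points[i], points[i + 1])
        for i in range(1, len(points) - 1)
    ]


if __name__ == "__main__":
    # Hypothetical log-probabilities for a six-position sequence.
    logprobs = [-2.3, -0.7, -1.9, -0.4, -3.1, -0.9]
    ppl = perplexity_curve(logprobs)
    print([round(v, 3) for v in ppl])
    print([round(k, 3) for k in curvature_profile(ppl)])
```

Two models could then be compared by computing such profiles on the same text and measuring how far apart they are; which distance to use, and any threshold for flagging copying, would have to follow the paper itself.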