A Perplexity and Menger Curvature-Based Approach for Similarity Evaluation of Large Language Models
Yuantao Zhang, Zhankui Yang
TL;DR
This work tackles the problem of detecting copyright infringement and copying among Large Language Models by proposing a geometry-informed similarity metric that combines perplexity-change profiles with Menger curvature. By defining $\Delta\text{PPL}(w_i)$ around each word and connecting model differences to curvature, the authors derive an upper bound on similarity that aligns with observed relationships across related models. In preliminary and main experiments on open-source LLMs across Wikipedia, medical, and legal domains, the method outperforms or matches baselines such as Similarity Approximation and Jensen-Shannon Divergence while offering robustness and scalability, including in simulated copying scenarios with noise-added parameters. The approach yields practical copying-detection thresholds and demonstrates broad applicability, suggesting a viable, domain-general tool for preserving originality and integrity in LLM deployments. (Code is available at the provided GitHub repository.)
Abstract
The rise of Large Language Models (LLMs) has brought about concerns regarding copyright infringement and unethical practices in data and model usage. For instance, slight modifications to existing LLMs may be used to falsely claim the development of new models, leading to issues of model copying and violations of ownership rights. This paper addresses these challenges by introducing a novel metric for quantifying LLM similarity, which leverages perplexity curves and differences in Menger curvature. Comprehensive experiments validate the performance of our methodology, demonstrating its superiority over baseline methods and its ability to generalize across diverse models and domains. Furthermore, we highlight the capability of our approach in detecting model replication through simulations, emphasizing its potential to preserve the originality and integrity of LLMs. Code is available at https://github.com/zyttt-coder/LLM_similarity.
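For readers unfamiliar with the geometric ingredient, the following is a minimal sketch of Menger curvature, which for three points equals four times the area of the triangle they form divided by the product of its side lengths (the reciprocal of the circumradius). The function names `menger_curvature` and `curvature_profile`, and the choice to treat a sequence of per-word values as planar points `(i, value_i)`, are assumptions made for illustration; the abstract does not specify how the paper builds its perplexity curves or turns curvature differences into a similarity score.

```python
import math


def menger_curvature(p, q, r):
    """Menger curvature of three 2D points: 4 * area / (|pq| * |qr| * |pr|)."""
    (x1, y1), (x2, y2), (x3, y3) = p, q, r
    # The cross product gives twice the signed triangle area.
    cross = (x2 - x1) * (y3 - y1) - (y2 - y1) * (x3 - x1)
    area = abs(cross) / 2.0
    denom = math.dist(p, q) * math.dist(q, r) * math.dist(p, r)
    if denom == 0.0:
        # Coincident points: curvature is undefined; return 0.0 by convention here.
        return 0.0
    return 4.0 * area / denom


def curvature_profile(values):
    """Curvature at each interior index, treating values as points (i, values[i]).

    Hypothetical helper: the (index, value) embedding is an assumption,
    not the paper's stated construction.
    """
    return [
        menger_curvature((i - 1, values[i - 1]), (i, values[i]), (i + 1, values[i + 1]))
        for i in range(1, len(values) - 1)
    ]
```

Three points on a unit circle give a curvature of 1, and collinear points give 0, which is a quick sanity check that the formula behaves as the reciprocal of the circumradius. Comparing two models would then amount to comparing their curvature profiles on the same text, under whatever distance the full method defines.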
