Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator
Peiwen Yuan, Yiwei Li, Shaoxiong Feng, Xinglin Wang, Yueqi Zhang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
TL;DR
This work formalizes self-bias in LLM-generated benchmarks, the inflated performance of a model evaluated on a benchmark it generated itself, and decomposes it into three sub-biases arising from question domain, language style, and wrong labels. It introduces Silencer, a framework that combines sample-level mitigations (Attribute Integration, Cross Paraphrase, and Label Calibration) with a bias-neutralizing ensemble at the benchmark level to suppress self-bias. Experiments across multiple tasks and models show that self-bias drops substantially and that agreement with human-annotated benchmarks improves, with the average Pearson correlation rising from 0.655 to 0.833. The approach generalizes well and offers practical guidance on using multiple generators to produce more reliable benchmarks for evaluating LLMs.
Abstract
LLM-as-Benchmark-Generator methods have been widely studied as a supplement to human annotators for scalable evaluation, but the potential biases within this paradigm remain underexplored. In this work, we systematically define and validate self-bias, the phenomenon of inflated performance in models evaluated on their self-generated benchmarks, and attribute it to sub-biases arising from question domain, language style, and wrong labels. On this basis, we propose Silencer, a general framework that leverages the heterogeneity between multiple generators at both the sample and benchmark levels to neutralize bias and generate high-quality, self-bias-silenced benchmarks. Experimental results across various settings demonstrate that Silencer suppresses self-bias to near zero and significantly improves the evaluation effectiveness of the generated benchmarks (raising the average Pearson correlation with high-quality human-annotated benchmarks from 0.655 to 0.833), while also exhibiting strong generalizability.
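The two quantities the abstract refers to can be illustrated with a minimal Python sketch. It assumes that evaluation effectiveness is measured as the Pearson correlation between model scores on a generated benchmark and on a human-annotated one, as the abstract states. The `self_bias_gap` function is only an illustrative proxy for self-bias (a generator's accuracy on its own benchmark minus its mean accuracy on benchmarks from other generators) and is not the paper's formal definition. All function names and accuracy values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def self_bias_gap(scores, generator):
    """Illustrative proxy only (an assumption, not the paper's definition).

    scores[g][m] is the accuracy of model m on the benchmark generated by g.
    Returns the generator's accuracy on its own benchmark minus its mean
    accuracy on the benchmarks produced by the other generators.
    """
    own = scores[generator][generator]
    others = [scores[g][generator] for g in scores if g != generator]
    if not others:
        raise ValueError("need at least one other generator for comparison")
    return own - sum(others) / len(others)


# Hypothetical accuracies, invented for illustration only.
human_scores = [0.60, 0.70, 0.80]      # models A, B, C on a human-annotated benchmark
generated_scores = [0.90, 0.72, 0.81]  # same models on a benchmark generated by A
effectiveness = pearson(generated_scores, human_scores)

scores = {
    "A": {"A": 0.90, "B": 0.72},  # benchmark generated by A
    "B": {"A": 0.62, "B": 0.85},  # benchmark generated by B
}
gap_a = self_bias_gap(scores, "A")
```

In this toy data, model A's inflated score on its own benchmark distorts the ranking relative to the human-annotated benchmark, which is the kind of effect a lower Pearson correlation would reflect.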
