Evaluating Gender Bias of Pre-trained Language Models in Natural Language Inference by Considering All Labels
Panatchakorn Anantaprayoon, Masahiro Kaneko, Naoaki Okazaki
TL;DR
This work addresses a limitation of existing gender bias evaluation in NLI, which relies on a single label, by introducing NLI-CoAL, a method that uses all three NLI labels (entailment, contradiction, neutral). It defines three evaluation data groups (PS, AS, NS) and a bias score based on the corresponding label output of each group, $s = \frac{e_p + c_a + (1 - n_n)}{3}$, and compares this score with a baseline that uses only neutral outputs. The authors construct evaluation datasets in English, Japanese, and Chinese, and introduce a meta-evaluation showing that NLI-CoAL distinguishes biased incorrect inferences from non-biased incorrect inferences better than the baseline. Experiments on English, Japanese, and Chinese PLMs reveal language-specific bias tendencies, confirm that the measure is compatible across languages, and point to potential gaps in Chinese NLI learning. Overall, NLI-CoAL offers a more accurate, task-specific, and multilingual approach to measuring gender bias in NLI models, with practical implications for fairer NLP systems.
Abstract
Discriminatory gender biases have been found in Pre-trained Language Models (PLMs) for multiple languages. In Natural Language Inference (NLI), existing bias evaluation methods have focused on the prediction results of one specific label out of three labels, such as neutral. However, such evaluation methods can be inaccurate since unique biased inferences are associated with unique prediction labels. Addressing this limitation, we propose a bias evaluation method for PLMs, called NLI-CoAL, which considers all three labels of the NLI task. First, we create three evaluation data groups that represent different types of biases. Then, we define a bias measure based on the corresponding label output of each data group. In the experiments, we introduce a meta-evaluation technique for NLI bias measures and use it to confirm that our bias measure can distinguish biased incorrect inferences from non-biased incorrect inferences better than the baseline, resulting in a more accurate bias evaluation. We create the datasets in English, Japanese, and Chinese, and successfully validate the compatibility of our bias measure across multiple languages. Lastly, we observe the bias tendencies in PLMs of different languages. To our knowledge, we are the first to construct evaluation datasets and measure PLMs' bias in NLI for Japanese and Chinese.
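To make the bias score $s$ from the TL;DR concrete, the following is a minimal sketch of how it could be computed from a model's predicted labels. It rests on an assumed reading of the symbols that this page does not spell out: $e_p$ is taken as the fraction of entailment predictions on the PS group, $c_a$ as the fraction of contradiction predictions on the AS group, and $n_n$ as the fraction of neutral predictions on the NS group. The function names `label_ratio` and `bias_score` and the toy predictions are hypothetical and are not taken from the authors' code or data.

```python
from typing import Sequence


def label_ratio(predictions: Sequence[str], label: str) -> float:
    """Return the fraction of predictions equal to `label`."""
    if not predictions:
        raise ValueError("predictions must be non-empty")
    return sum(p == label for p in predictions) / len(predictions)


def bias_score(
    ps_preds: Sequence[str],
    as_preds: Sequence[str],
    ns_preds: Sequence[str],
) -> float:
    """Compute s = (e_p + c_a + (1 - n_n)) / 3.

    Assumed reading of the symbols (not stated on this page):
      e_p: entailment ratio on the PS group
      c_a: contradiction ratio on the AS group
      n_n: neutral ratio on the NS group
    """
    e_p = label_ratio(ps_preds, "entailment")
    c_a = label_ratio(as_preds, "contradiction")
    n_n = label_ratio(ns_preds, "neutral")
    return (e_p + c_a + (1.0 - n_n)) / 3.0


# Toy predictions, invented purely for illustration.
ps = ["entailment", "entailment", "neutral", "neutral"]        # e_p = 0.50
as_ = ["contradiction", "neutral", "neutral", "neutral"]       # c_a = 0.25
ns = ["neutral", "neutral", "neutral", "entailment"]           # n_n = 0.75

print(round(bias_score(ps, as_, ns), 3))  # 0.333
```

Under this reading, a model that predicts neutral on every item scores 0, and a score of 1 requires entailment on all PS items, contradiction on all AS items, and no neutral predictions on NS items.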
