Dice Loss for Data-imbalanced NLP Tasks

Xiaoya Li, Xiaofei Sun, Yuxian Meng, Junjun Liang, Fei Wu, Jiwei Li

TL;DR

The paper tackles data imbalance in NLP by replacing standard cross-entropy with dice-based losses (Sørensen–Dice and Tversky) and introducing a self-adjusting weight that downplays easy negatives. The authors provide theoretical justification linking training objectives to F1 and demonstrate substantial improvements across POS tagging, NER, MRC, and PI, including state-of-the-art results on several benchmarks. Ablation studies show the approach is most beneficial under high imbalance, while it is not advantageous for plain accuracy-focused tasks like SST. Overall, the method offers a practical, architecture-agnostic way to align training with F1-oriented evaluation in imbalanced NLP settings and highlights the importance of hyperparameter choices in Tversky-based formulations.

Abstract

Many NLP tasks, such as tagging and machine reading comprehension, face a severe data-imbalance issue: negative examples significantly outnumber positive examples, and the huge number of background examples (or easy-negative examples) overwhelms the training. The most commonly used cross-entropy (CE) criterion is actually an accuracy-oriented objective, and thus creates a discrepancy between training and test: at training time, each training instance contributes equally to the objective function, while at test time the F1 score is more concerned with positive examples. In this paper, we propose to use dice loss in place of the standard cross-entropy objective for data-imbalanced NLP tasks. Dice loss is based on the Sørensen–Dice coefficient or the Tversky index, which attaches similar importance to false positives and false negatives, and is more immune to the data-imbalance issue. To further alleviate the dominating influence of easy-negative examples in training, we propose to associate training examples with dynamically adjusted weights that deemphasize easy-negative examples. Theoretical analysis shows that this strategy narrows the gap between the F1 score in evaluation and the dice loss in training. With the proposed training objective, we observe a significant performance boost on a wide range of data-imbalanced NLP tasks. Notably, we achieve SOTA results on CTB5, CTB6 and UD1.4 for the part-of-speech tagging task; SOTA results on CoNLL03, OntoNotes5.0, MSRA and OntoNotes4.0 for the named entity recognition task; along with competitive results on the tasks of machine reading comprehension and paraphrase identification.
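
To make the accuracy-versus-F1 discrepancy described above concrete, the following is a minimal sketch of a soft Sørensen–Dice loss for binary labels. It is not the authors' implementation: it omits the self-adjusting weights and the Tversky variant, and the function names (`soft_dice_loss`, `accuracy`), the `smooth` constant, and the toy data are assumptions chosen only for illustration.

```python
def soft_dice_loss(probs, labels, smooth=1.0):
    """Return 1 - soft Dice coefficient for binary labels.

    probs  : predicted probabilities of the positive class
    labels : gold labels (0 or 1)
    smooth : hypothetical smoothing constant that avoids division by zero
    """
    # Soft "overlap" between predictions and gold positives.
    intersection = sum(p * y for p, y in zip(probs, labels))
    # Total predicted mass plus total gold positives.
    denominator = sum(probs) + sum(labels)
    dice = (2.0 * intersection + smooth) / (denominator + smooth)
    return 1.0 - dice


def accuracy(probs, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(a == b) for a, b in zip(preds, labels)) / len(labels)


# Toy imbalanced batch (assumed data): 1 positive, 99 negatives.
labels = [1] + [0] * 99

# A model that predicts "negative" for everything.
all_negative = [0.01] * 100
# A model that also finds the single positive example.
finds_positive = [0.9] + [0.01] * 99

print("accuracy :", accuracy(all_negative, labels), accuracy(finds_positive, labels))
print("dice loss:", soft_dice_loss(all_negative, labels), soft_dice_loss(finds_positive, labels))
```

In this toy batch the all-negative model already scores 99% accuracy, so an accuracy-oriented objective has little left to gain from the one positive example, whereas the dice loss of the two models differs widely because it is driven by the overlap with the positive class.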

Paper Structure

This paper contains 39 sections, 12 equations, 10 tables.