JuniperLiu at CoMeDi Shared Task: Models as Annotators in Lexical Semantics Disagreements

Zhu Liu, Zhen Hu, Ying Liu

TL;DR

The paper tackles disagreements in lexical semantics by framing Subtask 1 as estimating the mean $\mu$ of a judgment distribution and Subtask 2 as estimating the variance $\sigma^2$, treating each system as a virtual annotator. It combines threshold-based labeling with anisotropy removal and an MLP-based regressor for disagreement, leveraging model ensembling to capture annotator diversity. Results show that anisotropy removal and high-layer representations boost Subtask 1, while STD-based scores on continuous relatedness correlate with human disagreement for Subtask 2, with language-specific ensembling providing additional gains. The work offers a practical framework for simulating annotator diversity in multilingual lexical semantics and provides code at https://github.com/RyanLiut/CoMeDi_Solution.

Abstract

We present the results of our system for the CoMeDi Shared Task, which predicts majority votes (Subtask 1) and annotator disagreements (Subtask 2). Our approach combines model ensemble strategies with MLP-based and threshold-based methods trained on pretrained language models. Treating individual models as virtual annotators, we simulate the annotation process by designing aggregation measures that incorporate continuous relatedness scores and discrete classification labels to capture both majority and disagreement. Additionally, we employ anisotropy removal techniques to enhance performance. Experimental results demonstrate the effectiveness of our methods, particularly for Subtask 2. Notably, we find that the standard deviation of continuous relatedness scores across different model manipulations correlates more strongly with human disagreement annotations than metrics computed on aggregated discrete labels. The code will be published at https://github.com/RyanLiut/CoMeDi_Solution.
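The "models as virtual annotators" idea can be illustrated with a small sketch. It is not the authors' implementation: the function name `aggregate_virtual_annotators`, the threshold values, and the four-point label scale are assumptions made for illustration. Each row holds the continuous relatedness scores one virtual annotator assigns to a set of usage pairs. The mean score is mapped to a discrete label through thresholds (a stand-in for the Subtask 1 prediction), and the standard deviation across annotators serves as the disagreement score (a stand-in for Subtask 2).

```python
import numpy as np

def aggregate_virtual_annotators(scores, thresholds):
    """Aggregate relatedness scores from several virtual annotators.

    scores: array-like of shape (n_annotators, n_pairs), continuous scores.
    thresholds: increasing cut points (hypothetical values) that split the
        score range into len(thresholds) + 1 ordinal labels, starting at 1.
    Returns (labels, disagreement), each of shape (n_pairs,).
    """
    scores = np.asarray(scores, dtype=float)
    # Mean over annotators: estimate of the centre of the judgment distribution.
    mean_score = scores.mean(axis=0)
    # np.digitize gives the 0-based bin index of each mean; shift to labels 1..K.
    labels = np.digitize(mean_score, thresholds) + 1
    # Standard deviation over annotators: proxy for annotator disagreement.
    disagreement = scores.std(axis=0)
    return labels, disagreement

# Three virtual annotators scoring two usage pairs (made-up numbers).
scores = [[0.90, 0.20],
          [0.80, 0.60],
          [0.85, 0.10]]
labels, disagreement = aggregate_virtual_annotators(scores, [0.25, 0.5, 0.75])
print(labels, disagreement)
```

In this toy input the annotators nearly agree on the first pair and diverge on the second, so the second pair receives the larger disagreement score.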

Paper Structure

This paper contains 28 sections, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Performance of different types of anisotropy removal as the layer index increases. Layer 0 is the input embedding. "abtt" means all-but-the-top.
  • Figure 2: Performance of different models as the layer index increases. The optimal result (Layer 25) for Llama-7B and its standardized version are shown as the upper bound.
  • Figure 3: Performance of three types of measures across 500 random runs.
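
The Figure 1 caption names all-but-the-top ("abtt") as one form of anisotropy removal. A minimal sketch of that general technique follows, assuming its usual formulation: centre the embeddings, then project out the top principal directions. The function name `all_but_the_top` and the default `n_components=2` are illustrative choices, not values taken from the paper.

```python
import numpy as np

def all_but_the_top(embeddings, n_components=2):
    """Remove the common mean and the top principal directions.

    embeddings: array-like of shape (n_vectors, dim).
    n_components: number of dominant directions to project out
        (illustrative default, not a value from the paper).
    """
    X = np.asarray(embeddings, dtype=float)
    # Step 1: subtract the mean vector shared by all embeddings.
    X_centered = X - X.mean(axis=0, keepdims=True)
    # Step 2: the rows of vt are the principal directions of the centred data.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    top = vt[:n_components]
    # Step 3: subtract each vector's projection onto the top directions.
    return X_centered - X_centered @ top.T @ top

# Toy embeddings with a large shared offset (made-up data).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8)) + 5.0
Y = all_but_the_top(X, n_components=2)
print(Y.shape)
```

After this step the vectors have zero mean and no component along the removed directions, which is what lets cosine similarity reflect differences between usages rather than a direction shared by all embeddings.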