Making Sentence Embeddings Robust to User-Generated Content

Lydia Nishimwe, Benoît Sagot, Rachel Bawden

TL;DR

This work tackles the brittleness of sentence embeddings when faced with user-generated content (UGC) by introducing RoLASER, a teacher–student English encoder that aligns standard and non-standard sentences in the embedding space. Trained with standard English data and synthetic UGC, RoLASER reduces the standard–UGC distance without harming standard-data performance, achieving notable gains on natural and artificial UGC benchmarks. The approach yields up to $2\times$ better xSIM on natural UGC and up to $11\times$ improvements on artificial UGC, with pronounced benefits for character-level perturbations; extrinsic evaluations show comparable or improved performance on standard tasks and clear advantages on UGC data. Overall, RoLASER enables more robust, cross-linguistic and cross-modal NLP involving UGC, and the findings suggest fruitful directions for extending the technique to more languages and architectures.

Abstract

NLP models are known to perform poorly on user-generated content (UGC), mainly because it exhibits considerable lexical variation and deviates from the standard texts on which most of these models were trained. In this work, we focus on the robustness of LASER, a sentence embedding model, to UGC data. We evaluate this robustness by LASER's ability to represent non-standard sentences and their standard counterparts close to each other in the embedding space. Inspired by previous works extending LASER to other languages and modalities, we propose RoLASER, a robust English encoder trained using a teacher-student approach to reduce the distances between the representations of standard and UGC sentences. We show that, with training only on standard and synthetic UGC-like data, RoLASER significantly improves LASER's robustness to both natural and artificial UGC data, achieving up to $2\times$ and $11\times$ better scores, respectively. We also perform a fine-grained analysis on artificial UGC data and find that our model greatly outperforms LASER on its most challenging UGC phenomena, such as keyboard typos and social media abbreviations. Evaluation on downstream tasks shows that RoLASER performs comparably to or better than LASER on standard data, while consistently outperforming it on UGC data.
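
The teacher-student idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear "student", the random feature vectors, and the choice of mean squared error as the distillation loss are all assumptions made for illustration, and every name below is hypothetical. The sketch only shows the principle that a frozen teacher embedding of the standard sentence serves as the target for the student's embedding of the UGC sentence.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two 1-D vectors (0 means same direction)."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def distillation_loss(student_emb, teacher_emb):
    """Mean squared error between student and teacher embeddings (assumed loss)."""
    return float(np.mean((student_emb - teacher_emb) ** 2))

rng = np.random.default_rng(0)
emb_dim, feat_dim = 8, 16

# Frozen teacher output for the standard sentence (toy stand-in, never updated).
teacher_std = rng.normal(size=emb_dim)

# Hypothetical unit-norm feature vector for the UGC version of the same sentence.
x_ugc = rng.normal(size=feat_dim)
x_ugc /= np.linalg.norm(x_ugc)

# Toy student: a single linear layer with small random initial weights.
W = 0.1 * rng.normal(size=(emb_dim, feat_dim))

before = cosine_distance(W @ x_ugc, teacher_std)

# Plain gradient descent on the MSE loss with respect to the student weights.
learning_rate = 1.0
for _ in range(100):
    diff = W @ x_ugc - teacher_std                 # student minus teacher
    grad_W = (2.0 / emb_dim) * np.outer(diff, x_ugc)
    W -= learning_rate * grad_W

after = cosine_distance(W @ x_ugc, teacher_std)
final_loss = distillation_loss(W @ x_ugc, teacher_std)
print(f"cosine distance before: {before:.4f}, after: {after:.6f}")
```

In this toy setting only the student is updated, so the teacher's embedding space stays fixed and the UGC representation is pulled towards the standard one.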

Paper Structure

This paper contains 33 sections, 1 equation, 6 figures, and 9 tables.

Figures (6)

  • Figure 1: Teacher-Student approach.
  • Figure 2: Visualisation of the first 2 principal components of the LASER space. The points represent the embeddings of a UGC sentence from RoCS-MT, its standardised version (std), and its translations into other languages (tra).
  • Figure 3: Distribution of transformations obtained by applying mix_all on 2M training sentences.
  • Figure 4: Visualisation of UGC phenomena of the FLORES$\dagger$ devtest by their type and token ratios. The data point labels indicate TTR ratios. All ratios are with respect to the standard English text.
  • Figure 5: Quantiles of average pairwise cosine distance on FLORES devtest for all 199 xx$\rightarrow$English language pairs.
  • ...and 1 more figure