RETVec: Resilient and Efficient Text Vectorizer

Elie Bursztein, Marina Zhang, Owen Vallis, Xinyu Jia, Alexey Kurakin

TL;DR

RETVec introduces a multilingual, robust text vectorizer that fuses a novel UTF-8 character encoder with a compact embedding model trained via pair-wise metric learning. The design emphasizes resilience to typos and character-level adversarial attacks while maintaining speed and low memory usage suitable for on-device deployment. Across speed, classification, typo resilience, adversarial attacks, and pre-training experiments, RETVec consistently matches or outperforms traditional vectorizers like SentencePiece, BPE, and fastText, with notable gains in multilingual and adversarial contexts. The work demonstrates practical benefits for real-world NLP systems and suggests promising avenues for integrating RETVec into smaller language models and generation pipelines, albeit with challenges for decoder-only generation that warrant future research.

Abstract

This paper describes RETVec, an efficient, resilient, and multilingual text vectorizer designed for neural-based text processing. RETVec combines a novel character encoding with an optional small embedding model to embed words into a 256-dimensional vector space. The RETVec embedding model is pre-trained using pair-wise metric learning to be robust against typos and character-level adversarial attacks. In this paper, we evaluate and compare RETVec to state-of-the-art vectorizers and word embeddings on popular model architectures and datasets. These comparisons demonstrate that RETVec leads to competitive, multilingual models that are significantly more resilient to typos and adversarial text attacks. RETVec is available under the Apache 2 license at https://github.com/google-research/retvec.
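To make the character-encoding stage described in the abstract concrete, here is a minimal sketch under stated assumptions; it is not the authors' implementation. It maps each word to a fixed-size binary matrix by writing the Unicode code point of each character as a bit vector. The 16-character limit follows the default cited in the Figure 1 caption; the 24-bit width, the little-endian bit order, the zero padding, and the names `encode_word` and `hamming` are assumptions chosen for illustration. The embedding model that would project this matrix into the 256-dimensional space is omitted.

```python
from typing import List

CLEN = 16  # characters kept per word (default cited in the Figure 1 caption)
BITS = 24  # assumed width; any Unicode code point (max 0x10FFFF) fits in 24 bits


def encode_word(word: str, clen: int = CLEN, bits: int = BITS) -> List[List[int]]:
    """Return a clen x bits binary matrix for one word (hypothetical helper)."""
    rows = []
    for ch in word[:clen]:  # truncate words longer than clen characters
        cp = ord(ch)  # Unicode code point of the character
        rows.append([(cp >> i) & 1 for i in range(bits)])  # least significant bit first
    # Zero-pad short words so every word has the same shape.
    rows.extend([[0] * bits for _ in range(clen - len(rows))])
    return rows


def hamming(a: List[List[int]], b: List[List[int]]) -> int:
    """Count the bit positions where two encoded words differ."""
    return sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


if __name__ == "__main__":
    clean = encode_word("hello")
    typo = encode_word("hellp")
    print(len(clean), len(clean[0]))  # matrix shape: clen rows, bits columns
    print(hamming(clean, typo))       # number of differing bits after a one-character typo
```

Because the encoding is positional and has no vocabulary, a single-character substitution changes only one row of the matrix, and every UTF-8 string has a valid representation. Resilience to typos such as insertions or deletions, which shift later rows, is what the metric-learned embedding model is meant to provide.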

Paper Structure (55 sections, 7 figures, 21 tables)

Figures (7)

  • Figure 1: RETVec architecture overview. The output shape of each layer is shown in parentheses. clen is the number of characters used per word (16 by default). The batch and word sequence length dimensions are omitted.
  • Figure 2: Cosine similarity distributions of RETVec embeddings for 1000 pairs of augmented and non-augmented versions of words, shown for selected languages. The 'Random' language refers to randomly generated UTF-8 strings.
  • Figure 3: Classification performance on Multilingual Amazon Reviews broken down by language.
  • Figure 4: Comparison of various tokenizers' resilience against mixed random typos (left) and common human typos (right) when training classification models from scratch.
  • Figure 5: Comparison of RETVec resilience against various types of typos. RETVec-raw on the left, RETVec on the right.
  • ...and 2 more figures