Self-StrAE at SemEval-2024 Task 1: Making Self-Structuring AutoEncoders Learn More With Less

Mattia Opper, N. Siddharth

TL;DR

This paper addresses data- and parameter-efficient learning of semantic embeddings by improving Self-StrAE, a self-structuring autoencoder that builds hierarchical representations. It demonstrates two concrete enhancements: incorporating discrete leaf reconstruction via cross-entropy and dramatically increasing the number of independent channels (with smaller channel size) to reduce non-embedding parameters, achieving strong performance with as little as 10M pretraining tokens. The CECO objective (combining cross-entropy leaf reconstruction and contrastive node-level signals) consistently outperforms single-objective variants across English, Spanish, and Afrikaans, including cross-language transfer to related tasks. Overall, the work highlights the potential of explicit hierarchical inductive bias for resource-efficient NLP, with practical implications for low-resource languages and scalable semantic representations.
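As a rough illustration of what a combined objective of this kind could look like, the sketch below adds a cross-entropy term over leaf (token) predictions to an InfoNCE-style contrastive term over node embeddings. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the temperature `tau`, the use of cosine similarity inside the contrastive term, and the equal weighting of the two terms are all hypothetical choices made for illustration.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of index `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive(enc_nodes, dec_nodes, tau=0.2):
    """InfoNCE-style term: encoder node i should be most similar to
    decoder node i, with the other decoder nodes acting as negatives.
    (tau is an assumed temperature, not a value from the paper.)"""
    total = 0.0
    for i, e in enumerate(enc_nodes):
        sims = [cosine(e, d) / tau for d in dec_nodes]
        total += cross_entropy(sims, i)
    return total / len(enc_nodes)

def ceco(leaf_logits, leaf_targets, enc_nodes, dec_nodes, tau=0.2):
    """Assumed combination: mean leaf cross-entropy plus the contrastive
    node term, weighted equally (the weighting is an assumption)."""
    ce = sum(cross_entropy(l, t) for l, t in zip(leaf_logits, leaf_targets))
    ce /= len(leaf_targets)
    return ce + contrastive(enc_nodes, dec_nodes, tau)

# Toy usage: two leaves over a 3-word vocabulary, two internal nodes.
loss = ceco(
    leaf_logits=[[2.0, 0.1, 0.1], [0.1, 2.0, 0.1]],
    leaf_targets=[0, 1],
    enc_nodes=[[1.0, 0.0], [0.0, 1.0]],
    dec_nodes=[[0.9, 0.1], [0.1, 0.9]],
)
```

The point of the sketch is only that the two signals act on different parts of the structure: the discrete term on the leaves, the contrastive term on the node embeddings.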

Abstract

This paper presents two simple improvements to the Self-Structuring AutoEncoder (Self-StrAE). Firstly, we show that including reconstruction to the vocabulary as an auxiliary objective improves representation quality. Secondly, we demonstrate that increasing the number of independent channels leads to significant improvements in embedding quality, while simultaneously reducing the number of parameters. Surprisingly, we demonstrate that this trend can be followed to the extreme, even to the point of reducing the total number of non-embedding parameters to seven. Our system can be pre-trained from scratch with as little as 10M tokens of input data, and proves effective across English, Spanish and Afrikaans.
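To make the parameter arithmetic concrete, here is one way a count of seven could arise. The abstract does not spell out the layer shapes, so everything in this sketch is an assumption: a composition function that maps two concatenated child channels (size 2u) to one parent channel (size u) with a bias, a decomposition function that maps a parent channel (size u) back to two children (size 2u) with a bias, and weights shared across all channels so that the count depends only on the channel size u.

```python
def non_embedding_params(channel_size):
    """Parameter count under the assumed layer shapes (hypothetical):
    compose:   linear 2u -> u with bias
    decompose: linear u -> 2u with bias
    Weights are assumed to be shared across channels."""
    u = channel_size
    compose = 2 * u * u + u        # weight (2u x u) plus bias (u)
    decompose = u * 2 * u + 2 * u  # weight (u x 2u) plus bias (2u)
    return compose + decompose

# Shrinking the channel size shrinks the count quadratically.
for u in (8, 4, 2, 1):
    print(u, non_embedding_params(u))
```

Under these assumptions the total is 4u² + 3u, which equals seven when each channel has size one, consistent with the figure quoted in the abstract.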

Paper Structure (14 sections, 3 equations, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Self-StrAE forward pass. Red lines indicate cosine similarity between adjacent nodes. Shared colours indicate shared parameters.
  • Figure 2: Uniformity and Alignment plot for contrastive, cross entropy and CECO pre-training objectives. Results taken across four random seeds. Lower is better for both measures.
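The two measures named in the Figure 2 caption can be sketched as follows, assuming the plot uses the commonly used definitions for unit-normalised embeddings: alignment as the mean distance between positive pairs, and uniformity as the log of the mean Gaussian potential over all pairs. The caption does not give the exact formulation, so the exponents `alpha=2` and `t=2` and the helper names are assumptions.

```python
import math
from itertools import combinations

def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def alignment(positive_pairs, alpha=2):
    """Mean distance (raised to alpha) between embeddings of positive pairs.
    Lower means paired items sit closer together."""
    vals = [sq_dist(normalize(x), normalize(y)) ** (alpha / 2)
            for x, y in positive_pairs]
    return sum(vals) / len(vals)

def uniformity(points, t=2):
    """Log of the mean Gaussian potential over all distinct pairs.
    Lower means the embeddings are spread more evenly over the sphere."""
    pts = [normalize(p) for p in points]
    vals = [math.exp(-t * sq_dist(a, b)) for a, b in combinations(pts, 2)]
    return math.log(sum(vals) / len(vals))

# Collapsed embeddings score 0 on uniformity; spread-out ones score lower.
collapsed = uniformity([[1.0, 0.0], [1.0, 0.0]])
spread = uniformity([[1.0, 0.0], [-1.0, 0.0]])
```

With these definitions lower is better for both quantities, which matches the caption.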