Self-StrAE at SemEval-2024 Task 1: Making Self-Structuring AutoEncoders Learn More With Less
Mattia Opper, N. Siddharth
TL;DR
This paper addresses data- and parameter-efficient learning of semantic embeddings by improving Self-StrAE, a self-structuring autoencoder that builds hierarchical representations. It demonstrates two concrete enhancements: adding discrete leaf reconstruction via cross-entropy, and sharply increasing the number of independent channels (with a smaller size per channel), which reduces the number of non-embedding parameters. The resulting system achieves strong performance with as few as 10M pretraining tokens. The CECO objective (combining cross-entropy leaf reconstruction with contrastive node-level signals) consistently outperforms single-objective variants across English, Spanish, and Afrikaans, including cross-language transfer to related tasks. Overall, the work highlights the potential of an explicit hierarchical inductive bias for resource-efficient NLP, with practical implications for low-resource languages and scalable semantic representations.
Abstract
This paper presents two simple improvements to the Self-Structuring AutoEncoder (Self-StrAE). Firstly, we show that including reconstruction to the vocabulary as an auxiliary objective improves representation quality. Secondly, we demonstrate that increasing the number of independent channels leads to significant improvements in embedding quality, while simultaneously reducing the number of parameters. Surprisingly, we demonstrate that this trend can be followed to the extreme, even to the point of reducing the total number of non-embedding parameters to seven. Our system can be pre-trained from scratch with as little as 10M tokens of input data, and proves effective across English, Spanish and Afrikaans.
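
To make the idea of pairing vocabulary reconstruction with a contrastive signal concrete, the following is a minimal sketch and not the authors' implementation. It assumes a generic InfoNCE-style contrastive term in which each encoder node embedding is matched to the decoder embedding at the same index, plus a standard cross-entropy over leaf tokens. The function names, the temperature value, and the equal weighting of the two terms are all illustrative assumptions that do not come from the text above.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def leaf_cross_entropy(logits, token_ids):
    # logits: (n_leaves, vocab_size); token_ids: (n_leaves,) gold token indices.
    log_probs = log_softmax(logits, axis=1)
    return -log_probs[np.arange(len(token_ids)), token_ids].mean()

def node_contrastive(enc, dec, temperature=0.1):
    # enc, dec: (n_nodes, dim). Assumption: the positive for encoder node i
    # is decoder node i; every other decoder node acts as a negative.
    enc = enc / np.linalg.norm(enc, axis=1, keepdims=True)
    dec = dec / np.linalg.norm(dec, axis=1, keepdims=True)
    sims = enc @ dec.T / temperature
    return -np.diag(log_softmax(sims, axis=1)).mean()

def combined_loss(logits, token_ids, enc, dec, temperature=0.1):
    # Assumption: the two terms are simply summed with equal weight.
    return leaf_cross_entropy(logits, token_ids) + node_contrastive(enc, dec, temperature)
```

With uniform logits the cross-entropy term equals the log of the vocabulary size, and the contrastive term approaches zero as matched node pairs become more similar to each other than to any other node.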
