Text Simplification with Sentence Embeddings
Matthew Shardlow
TL;DR
This work investigates whether sentence embeddings, specifically SONAR, can support text simplification by learning a transformation in embedding space that maps complex representations to simple ones. The authors decode the transformed embeddings to generate simplified text and compare a small MLP-based embedding-space transform against Seq2Seq and LLM baselines, finding competitive results in a compact learning setup. They demonstrate that reconstruction preserves complexity levels and extend the approach to unseen datasets and cross-lingual targets using a multilingual embedding space, with mixed success across German and Spanish. The study suggests that embedding-space transformations are a promising avenue for lightweight, adaptable text simplification and other NLG tasks, while highlighting data quality, language transfer, and architectural limitations as areas for future work.
Abstract
Sentence embeddings can be decoded to give approximations of the original texts used to create them. We explore this effect in the context of text simplification, demonstrating that reconstructed text embeddings preserve complexity levels. We experiment with a small feed-forward neural network that learns a transformation between sentence embeddings representing high-complexity and low-complexity texts. We compare against Seq2Seq and LLM-based approaches, showing encouraging results in our much smaller learning setting. Finally, we demonstrate the applicability of our transformation to an unseen simplification dataset (MedEASI), as well as to datasets in languages outside the training data (ES, DE). We conclude that learning transformations in sentence embedding space is a promising direction for future research and has the potential to unlock small but powerful models for text simplification and other natural language generation tasks.
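To make the core idea concrete, the following is a minimal sketch of a small feed-forward network trained to map "complex" sentence embeddings to "simple" ones. It is an illustration under stated assumptions, not the setup used in this work: random vectors stand in for real sentence embeddings, the encoding and decoding steps are omitted entirely, and the dimensions, learning rate, and function names (`init_mlp`, `forward`, `train_step`) are hypothetical choices made for the example.

```python
import numpy as np

def init_mlp(dim, hidden, rng):
    """Initialise a one-hidden-layer MLP mapping dim -> hidden -> dim."""
    return {
        "W1": rng.normal(0.0, 1.0 / np.sqrt(dim), (dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, dim)),
        "b2": np.zeros(dim),
    }

def forward(params, x):
    """Return the predicted embeddings and the hidden ReLU activations."""
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"], h

def mse(pred, target):
    """Mean squared error over all embedding components."""
    return float(np.mean((pred - target) ** 2))

def train_step(params, x, y, lr):
    """One full-batch gradient descent step; returns the loss before the update."""
    pred, h = forward(params, x)
    grad_pred = 2.0 * (pred - y) / pred.size      # d(MSE)/d(pred)
    grad_W2 = h.T @ grad_pred
    grad_b2 = grad_pred.sum(axis=0)
    grad_h = grad_pred @ params["W2"].T
    grad_h[h <= 0.0] = 0.0                        # ReLU passes gradient only where active
    grad_W1 = x.T @ grad_h
    grad_b1 = grad_h.sum(axis=0)
    params["W1"] -= lr * grad_W1
    params["b1"] -= lr * grad_b1
    params["W2"] -= lr * grad_W2
    params["b2"] -= lr * grad_b2
    return mse(pred, y)

# ASSUMPTION: synthetic stand-ins for paired sentence embeddings.
# In a real pipeline these would come from a sentence encoder applied to
# aligned complex/simple sentence pairs; here the "simple" vectors are a
# fixed random linear map of the "complex" ones plus a constant shift.
rng = np.random.default_rng(0)
dim, hidden, n_pairs = 16, 32, 256                # toy sizes, not real embedding sizes
complex_emb = rng.normal(size=(n_pairs, dim))
true_map = rng.normal(0.0, 1.0 / np.sqrt(dim), (dim, dim))
simple_emb = complex_emb @ true_map + 0.1

params = init_mlp(dim, hidden, rng)
losses = [train_step(params, complex_emb, simple_emb, lr=0.2) for _ in range(300)]

# The transformed embeddings would then be passed to an embedding decoder
# to produce simplified text; that step is outside this sketch.
simplified, _ = forward(params, complex_emb)
final_loss = mse(simplified, simple_emb)
print(f"initial loss {losses[0]:.4f}, final loss {final_loss:.4f}")
```

The point of the design is that only the small mapping network is trained, while the encoder and decoder stay fixed, which is what keeps the learning setting compact.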
