One Sentence, Two Embeddings: Contrastive Learning of Explicit and Implicit Semantic Representations
Kohei Oda, Po-Min Chuang, Kiyoaki Shirai, Natthawut Kertkeidkachorn
TL;DR
This work addresses a limitation of single-vector sentence embeddings by introducing DualCSE, which learns two representations per sentence: an explicit-semantic vector $\mathbf{r}$ and an implicit-semantic vector $\mathbf{u}$. Both are trained in a shared space on the INLI dataset with a contrastive objective that models explicit, implicit, and cross relations. The method is instantiated in two encoder designs (Cross-encoder and Bi-encoder) and evaluated on Recognizing Textual Entailment (RTE) and Estimating Implicitness Score (EIS), where it improves on strong SimCSE baselines and is competitive with LLMs. Ablation and retrieval analyses show the contribution of each loss term and confirm the practical value of retrieving sentences by explicit versus implicit semantics. Because the approach relies on INLI for training, future work includes covering more diverse domains and integrating with large language models to broaden applicability and robustness.
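To make the idea of a contrastive objective over two embeddings concrete, the sketch below implements a generic InfoNCE loss with in-batch negatives in NumPy and sums two such terms. It is an illustration under stated assumptions, not the paper's exact objective: the function names, the temperature value, the random stand-in vectors, and the way the two terms are combined (an explicit term pairing $\mathbf{r}$ with an explicit hypothesis, an implicit term pairing $\mathbf{u}$ with an implied hypothesis) are all hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(anchors, positives, temperature=0.05):
    """Generic InfoNCE with in-batch negatives.

    Row i of `anchors` is treated as matching row i of `positives`;
    every other row in the batch serves as a negative.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature                       # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Random stand-ins for encoder outputs (assumption: batch of 4, dimension 8).
rng = np.random.default_rng(0)
r_premise = rng.normal(size=(4, 8))       # explicit embeddings r of premises
u_premise = rng.normal(size=(4, 8))       # implicit embeddings u of premises
r_explicit_hyp = rng.normal(size=(4, 8))  # embeddings of explicit hypotheses
r_implied_hyp = rng.normal(size=(4, 8))   # embeddings of implied hypotheses

# Hypothetical combination: one explicit term plus one implicit term.
loss = info_nce(r_premise, r_explicit_hyp) + info_nce(u_premise, r_implied_hyp)
print(f"toy loss: {loss:.3f}")
```

Because both terms score similarities in the same space, gradients from each term would shape a single shared geometry, which is what allows $\mathbf{r}$ and $\mathbf{u}$ vectors to be compared directly.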
Abstract
Sentence embedding methods have made remarkable progress, yet they still struggle to capture the implicit semantics within sentences. This can be attributed to an inherent limitation of conventional sentence embedding methods: they assign only a single vector to each sentence. To overcome this limitation, we propose DualCSE, a sentence embedding method that assigns two embeddings to each sentence: one representing the explicit semantics and the other representing the implicit semantics. These embeddings coexist in a shared space, enabling the selection of the desired semantics for specific purposes such as information retrieval and text classification. Experimental results demonstrate that DualCSE can effectively encode both explicit and implicit meanings and improve downstream task performance.
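As a rough illustration of selecting the desired semantics at retrieval time, the sketch below keeps two vectors per sentence and ranks a corpus by cosine similarity against whichever bank the caller chooses. All vectors are hand-written toy values rather than model outputs, and the names `cosine_top_k` and `search` are hypothetical helpers, not part of any released DualCSE code.

```python
import numpy as np

def cosine_top_k(query, matrix, k=1):
    """Return (index, score) pairs for the k rows of `matrix` most similar to `query`."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = m @ q
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy corpus: each sentence has one explicit and one implicit vector
# living in the same 3-dimensional space (made-up values).
sentences = ["s0", "s1", "s2"]
explicit_vecs = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0],
                          [0.0, 0.0, 1.0]])
implicit_vecs = np.array([[0.0, 1.0, 0.0],
                          [0.0, 0.0, 1.0],
                          [1.0, 0.0, 0.0]])

def search(query, semantics="explicit", k=1):
    """Rank sentences by their explicit or implicit embedding, as requested."""
    bank = explicit_vecs if semantics == "explicit" else implicit_vecs
    return [sentences[i] for i, _ in cosine_top_k(query, bank, k)]

query = np.array([0.9, 0.1, 0.0])
print(search(query, "explicit"))  # ['s0']
print(search(query, "implicit"))  # ['s2']
```

The same query returns different sentences depending on the bank searched, which is the behavior a shared space for both embedding types is meant to enable.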
