Subspace Representations for Soft Set Operations and Sentence Similarities

Yoichi Ishibashi, Sho Yokoi, Katsuhito Sudoh, Satoshi Nakamura

TL;DR

This work proposes representing word sets as linear subspaces within pretrained embedding spaces, grounded in quantum logic, to enable robust set operations (union, intersection, complement) and soft membership. By replacing hard, binary membership with a subspace indicator function and extending sentence similarity metrics (BERTScore) to SubspaceBERTScore, the approach captures richer semantic structure and improves performance on semantic textual similarity and set retrieval tasks. The method requires no additional training and leverages existing embeddings, offering a principled way to manipulate word sets and measure their similarity. Empirical results across STS benchmarks and a set-retrieval dataset show consistent gains over traditional vector-based and fuzzy-set methods, highlighting practical implications for NLP tasks involving concept sets and sentence understanding.

Abstract

In the field of natural language processing (NLP), continuous vector representations are crucial for capturing the semantic meanings of individual words. Yet, when it comes to the representations of sets of words, the conventional vector-based approaches often struggle with expressiveness and lack the essential set operations such as union, intersection, and complement. Inspired by quantum logic, we realize the representation of word sets and corresponding set operations within pre-trained word embedding spaces. By grounding our approach in the linear subspaces, we enable efficient computation of various set operations and facilitate the soft computation of membership functions within continuous spaces. Moreover, we allow for the computation of the F-score directly within word vectors, thereby establishing a direct link to the assessment of sentence similarity. In experiments with widely-used pre-trained embeddings and benchmarks, we show that our subspace-based set operations consistently outperform vector-based ones in both sentence similarity and set retrieval tasks.
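The idea of a word set as a linear subspace with soft membership can be illustrated with a small sketch. This is not the authors' implementation. It assumes that the subspace is the span of the word vectors, that soft membership is the norm of a vector's orthogonal projection onto the subspace divided by the vector's own norm, that union is the span of both sets together, and that complement is the orthogonal complement. The function names and the toy 3-dimensional vectors are hypothetical. Intersection is omitted because it needs more linear algebra than fits here.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already inside the current span
            basis.append([wi / norm for wi in w])
    return basis

def soft_membership(v, basis):
    """Assumed membership degree: ||projection of v|| / ||v||, in [0, 1]."""
    norm_v = math.sqrt(dot(v, v))
    if norm_v == 0.0:
        return 0.0
    proj_sq = sum(dot(v, q) ** 2 for q in basis)
    return min(1.0, math.sqrt(proj_sq) / norm_v)

def union_basis(vectors_a, vectors_b):
    """Assumed union: the span of both word sets together."""
    return orthonormal_basis(list(vectors_a) + list(vectors_b))

def complement_membership(v, basis):
    """Assumed complement: membership in the orthogonal complement."""
    m = soft_membership(v, basis)
    return math.sqrt(max(0.0, 1.0 - m * m))

# Toy 3-d "embeddings", made up for illustration only.
word_set = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # spans the x-y plane
basis = orthonormal_basis(word_set)

inside = soft_membership([1.0, 1.0, 0.0], basis)   # vector lies in the plane
partial = soft_membership([1.0, 0.0, 1.0], basis)  # vector is partly in the plane
outside = soft_membership([0.0, 0.0, 1.0], basis)  # vector is orthogonal to the plane
print(inside, partial, outside)
```

Unlike a binary indicator, the membership value varies continuously between 0 and 1, so a word that is not in the set can still count as partially belonging to it.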

Paper Structure (38 sections, 14 equations, 2 figures, 6 tables, 6 algorithms)

Figures (2)

  • Figure 1: Superiority of subspace representations: Our subspace representation (blue) surpasses the traditional vector set representation (gray) in both text similarity and text concept set retrieval tasks.
  • Figure 2: Comparison between the proposed SubspaceBERTScore and BERTScore. We visualize the alignment process between the word "royalty" and the words in the sentence $B$. SubspaceBERTScore represents $B$ as the subspace $\mathbb{S}_{\mathit{B}}$ and calculates the similarity (canonical angle) between $\mathbb{S}_{\mathit{B}}$ and the "royalty" vector ($\boldsymbol{a}_{4}$). Our approach provides a "softer" alignment, capturing the overall semantic context of the sentence. BERTScore, by contrast, adopts a "harder" alignment strategy, selecting only the word from the sentence with the maximum cosine similarity.
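
The contrast in Figure 2 between "hard" and "soft" alignment can be sketched as follows. This is an illustrative approximation, not the authors' code. It assumes that the hard score averages each token's maximum cosine similarity to the other sentence's tokens (as in BERTScore's greedy matching), that the soft score averages the cosine of the angle between each token vector and the subspace spanned by the other sentence's tokens, and that both are combined with a standard F-score. All function names and the toy vectors are hypothetical.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

def subspace_similarity(v, basis):
    """Cosine of the angle between v and the subspace (projection norm ratio)."""
    proj_sq = sum(dot(v, q) ** 2 for q in basis)
    return min(1.0, math.sqrt(proj_sq) / math.sqrt(dot(v, v)))

def hard_score(tokens_a, tokens_b):
    """Hard alignment: each token in A keeps only its best match in B."""
    return sum(max(cosine(a, b) for b in tokens_b) for a in tokens_a) / len(tokens_a)

def soft_score(tokens_a, tokens_b):
    """Soft alignment: each token in A is compared with the subspace spanned by B."""
    basis_b = orthonormal_basis(tokens_b)
    return sum(subspace_similarity(a, basis_b) for a in tokens_a) / len(tokens_a)

def f_score(p, r):
    return 0.0 if p + r == 0.0 else 2.0 * p * r / (p + r)

# Toy vectors, made up for illustration: one token in A, two tokens in B.
sentence_a = [[1.0, 0.0, 1.0]]
sentence_b = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

hard = hard_score(sentence_a, sentence_b)
soft = soft_score(sentence_a, sentence_b)
soft_f = f_score(soft_score(sentence_b, sentence_a), soft)
print(hard, soft, soft_f)
```

In this toy case the token in A is a combination of the two tokens in B, so the hard score sees only a partial match with either token, while the soft score recognizes that the token lies entirely within the span of B.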