Span-Aggregatable, Contextualized Word Embeddings for Effective Phrase Mining

Eyal Orbach, Lev Haikin, Nelly David, Avi Faizakof

TL;DR

This work tackles the challenge of extracting paraphrase-like phrases embedded in noisy contexts, arguing that a single sentence-level dense representation is insufficient for sub-sentence phrase retrieval. It introduces SLICE, a Span-aggregated Late Interaction Contextualized Embeddings model, which produces contextualized token embeddings that can be aggregated to represent arbitrary spans. The model is trained with a modified loss $L = -\lambda \mathrm{sim}_{true} + \log\left(e^{\lambda \mathrm{sim}_{true}} + e^{\lambda \mathrm{sim}_{false}}\right)$ that favors true-span similarity over false-span similarity. The authors also provide a dataset variant, STS-B-Context, for evaluating phrase-in-context similarity, and compare three inference setups, showing that SLICE offers better or competitive performance with manageable compute compared to span-wide or full-context baselines. Overall, the approach advances practical phrase mining by enabling efficient, span-aware dense representations, with implications for retrieval in real-world corpora such as legal and contact-center data.
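To make the loss in the TL;DR concrete, here is a minimal sketch that evaluates it for a single pair of similarity scores. The function name `span_contrastive_loss` and the choice of passing plain floats are assumptions made for illustration; the summary does not say how the similarities are computed or what value of $\lambda$ the authors use, so both are left to the caller.

```python
import math


def span_contrastive_loss(sim_true: float, sim_false: float, lam: float) -> float:
    """Evaluate L = -lam*sim_true + log(exp(lam*sim_true) + exp(lam*sim_false)).

    sim_true:  similarity between the query and the correct span
    sim_false: similarity between the query and an incorrect span
    lam:       scaling factor (lambda); its value here is up to the caller
    """
    a = lam * sim_true
    b = lam * sim_false
    # Log-sum-exp with the maximum factored out, to avoid overflow for large lam.
    m = max(a, b)
    log_sum = m + math.log(math.exp(a - m) + math.exp(b - m))
    return -a + log_sum


# Illustrative call with hypothetical scores: a well-separated pair gives a small loss.
print(span_contrastive_loss(sim_true=0.9, sim_false=0.1, lam=10.0))
```

Algebraically this is the two-way softmax cross-entropy with the true span as the target, so the loss equals $\log 2$ when the two similarities are equal and shrinks toward zero as the true-span similarity pulls ahead.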

Abstract

Dense vector representations for sentences have made significant progress in recent years, as can be seen on sentence similarity tasks. Real-world phrase retrieval applications, on the other hand, still encounter challenges in using dense representations effectively. We show that when target phrases reside inside noisy context, representing the full sentence with a single dense vector is not sufficient for effective phrase retrieval. We therefore look into the notion of representing multiple sub-sentence, consecutive word spans, each with its own dense vector. We show that this technique is much more effective for phrase mining, yet requires considerable compute to obtain useful span representations. Accordingly, we make an argument for contextualized word/token embeddings that can be aggregated for arbitrary word spans while maintaining the span's semantic meaning. We introduce a modification to the common contrastive loss used for sentence embeddings that encourages word embeddings to have this property. To demonstrate the effect of this method, we present a dataset based on the STS-B dataset with additional generated text, which requires finding the best-matching paraphrase residing in a larger context and reporting its degree of similarity to the original phrase. We demonstrate on this dataset how our proposed method can achieve better results without a significant increase in compute.
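The idea of aggregating token embeddings into span vectors can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the aggregation is a plain mean over token vectors and that spans are scored against a query vector by cosine similarity, neither of which the abstract specifies. The helper names (`prefix_sums`, `span_mean`, `best_span`) and the toy vectors are hypothetical. The point it shows is the compute argument: once per-token embeddings exist, prefix sums give the vector for any span without another encoder pass.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def prefix_sums(token_vecs: Sequence[Sequence[float]]) -> List[Vector]:
    """sums[i] holds the element-wise sum of the first i token vectors."""
    dim = len(token_vecs[0])
    sums: List[Vector] = [[0.0] * dim]
    for vec in token_vecs:
        sums.append([s + v for s, v in zip(sums[-1], vec)])
    return sums


def span_mean(sums: List[Vector], start: int, end: int) -> Vector:
    """Mean-pooled vector for tokens in [start, end), in O(dim) time."""
    n = end - start
    return [(e - s) / n for s, e in zip(sums[start], sums[end])]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_span(query: Sequence[float],
              token_vecs: Sequence[Sequence[float]],
              max_len: int) -> Tuple[float, int, int]:
    """Return (score, start, end) of the span most similar to the query."""
    sums = prefix_sums(token_vecs)
    best = (-math.inf, 0, 0)
    for start in range(len(token_vecs)):
        for end in range(start + 1, min(start + max_len, len(token_vecs)) + 1):
            score = cosine(query, span_mean(sums, start, end))
            if score > best[0]:
                best = (score, start, end)
    return best


# Toy 2-d "token embeddings": the middle two tokens point in the query's direction.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
score, start, end = best_span([0.0, 1.0], tokens, max_len=3)
print(score, start, end)
```

In this sketch the encoder cost is paid once per sentence, while the span enumeration touches only cached vectors. Encoding every candidate span separately would instead require one encoder call per span, which is the compute overhead the abstract refers to.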

Paper Structure

This paper contains 12 sections, 5 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: Examples from STS-B-Context. Target phrases are in bold for readability; the actual dataset does not include any such markers.