Estimating Text Similarity based on Semantic Concept Embeddings

Tim vor der Brück, Marc Pouly

TL;DR

The paper addresses a limitation of surface-based word embeddings: they handle ambiguous, context-dependent words poorly when used for semantic similarity estimation. It introduces Semantic Concept Embeddings (CE) derived from MultiNet semantic networks (SN), which are built from German texts with the Wocadi parser. Random walks over the SN are used to train the CEs alongside Word2Vec, so that CEs and word embeddings are comparable. The approach is applied to market segmentation by assigning contest participants to youth milieus; combining CE with Word2Vec improves accuracy over the baselines. The main limitation is the reliance on Wocadi; future work aims to explore freely available parsers and to preserve inner SN nodes for richer representations.

Abstract

Due to their ease of use and high accuracy, Word2Vec (W2V) word embeddings enjoy great success in the semantic representation of words, sentences, and whole documents as well as for semantic similarity estimation. However, they have the shortcoming that they are directly extracted from a surface representation, which does not adequately represent human thought processes and also performs poorly for highly ambiguous words. Therefore, we propose Semantic Concept Embeddings (CE) based on the MultiNet Semantic Network (SN) formalism, which addresses both shortcomings. The evaluation on a marketing target group distribution task showed that the accuracy of predicted target groups can be increased by combining traditional word embeddings with semantic CEs.
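
The idea of combining word embeddings with CEs for similarity estimation can be illustrated with a minimal sketch. The summary does not state how the two estimates are combined, so the centroid representation, the weighted average, and the names `centroid`, `combined_similarity`, and `weight` are all assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors (assumed text representation)."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def combined_similarity(words_a, words_b, concepts_a, concepts_b, weight=0.5):
    """Weighted average of a word-embedding and a concept-embedding similarity.

    The weighting scheme is a hypothetical choice, not the paper's confirmed method.
    """
    sim_words = cosine(centroid(words_a), centroid(words_b))
    sim_concepts = cosine(centroid(concepts_a), centroid(concepts_b))
    return weight * sim_words + (1.0 - weight) * sim_concepts
```

With `weight=1.0` the estimate falls back to the word-embedding similarity alone, which makes it easy to compare the combined estimate against a word-only baseline.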

Paper Structure

This paper contains 8 sections, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Example for a random walk (yesterday.1.1 temp$^{-1}$ obj prop modp* red.1.1) in the SN.
  • Figure 2: Scatterplot between similarity estimate based on word embeddings (x-axis) and semantic CEs (y-axis).
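
The random walk in Figure 1 can be illustrated with a small sketch. The toy network below, its node names other than yesterday.1.1 and red.1.1, its relation labels, and the helpers `random_walk` and `generate_walks` are hypothetical; this is not the paper's implementation. It only shows how walks over a labeled graph yield node sequences that a Word2Vec-style trainer could consume as "sentences".

```python
import random

# Toy semantic network: node -> list of (relation label, neighbour node).
# Structure and labels are illustrative assumptions, not taken from the paper.
TOY_SN = {
    "yesterday.1.1": [("temp^-1", "buy.1.1")],
    "buy.1.1": [("obj", "car.1.1"), ("temp", "yesterday.1.1")],
    "car.1.1": [("prop", "red.1.1"), ("obj^-1", "buy.1.1")],
    "red.1.1": [("prop^-1", "car.1.1")],
}

def random_walk(graph, start, length, rng):
    """Return the nodes visited by a random walk containing at most `length` nodes."""
    walk = [start]
    while len(walk) < length:
        edges = graph.get(walk[-1], [])
        if not edges:  # dead end: stop early
            break
        _relation, next_node = rng.choice(edges)
        walk.append(next_node)
    return walk

def generate_walks(graph, walks_per_node, length, seed=0):
    """Start `walks_per_node` walks from every node, using a seeded generator."""
    rng = random.Random(seed)
    return [
        random_walk(graph, node, length, rng)
        for node in graph
        for _ in range(walks_per_node)
    ]

walks = generate_walks(TOY_SN, walks_per_node=2, length=4)
```

Each element of `walks` is a list of concept identifiers, so the collection has the same shape as a tokenized text corpus.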