Estimating Text Similarity based on Semantic Concept Embeddings
Tim vor der Brück, Marc Pouly
TL;DR
The paper addresses a limitation of surface-based word embeddings in semantic similarity estimation: they handle ambiguous, context-dependent words poorly. It introduces Semantic Concept Embeddings (CE) derived from MultiNet semantic networks (SN), which are built from German texts with the Wocadi parser. Random walks are used to train CE alongside Word2Vec so that concept and word embeddings are comparable. The approach is applied to market segmentation by assigning contest participants to youth milieus, where combining CE with Word2Vec improves accuracy over the baselines. One limitation is the reliance on Wocadi; future work aims to explore freely available parsers and to preserve inner SN nodes for richer representations.
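To make the random-walk idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy graph in which concept nodes are linked both to each other and to surface words, so that each walk yields a mixed sequence of concepts and words that a Word2Vec-style trainer could consume. The graph contents, the node labels, and the names `random_walk` and `build_corpus` are all hypothetical.

```python
import random

# Hypothetical toy network: concept nodes (labels ending in ".1.1") are
# connected to related concepts and to the surface words that express them.
GRAPH = {
    "bank.1.1": ["geld.1.1", "Bank"],
    "geld.1.1": ["bank.1.1", "Geld"],
    "Bank": ["bank.1.1"],
    "Geld": ["geld.1.1"],
}


def random_walk(graph, start, length, rng):
    """Return a uniform random walk of up to `length` nodes from `start`."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph[walk[-1]]
        if not neighbours:  # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbours))
    return walk


def build_corpus(graph, walks_per_node, length, seed=0):
    """Start `walks_per_node` walks at every node and collect them."""
    rng = random.Random(seed)
    corpus = []
    for node in sorted(graph):
        for _ in range(walks_per_node):
            corpus.append(random_walk(graph, node, length, rng))
    return corpus


# Each walk is a "sentence" mixing concept IDs and words (assumed setup).
corpus = build_corpus(GRAPH, walks_per_node=2, length=5)
```

Because concepts and words appear in the same sequences, an embedding model trained on such a corpus would place both in one vector space, which is the property the summary describes as making CE and word embeddings comparable.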
Abstract
Due to their ease of use and high accuracy, Word2Vec (W2V) word embeddings enjoy great success in the semantic representation of words, sentences, and whole documents as well as for semantic similarity estimation. However, they have the shortcoming that they are directly extracted from a surface representation, which does not adequately represent human thought processes and also performs poorly for highly ambiguous words. Therefore, we propose Semantic Concept Embeddings (CE) based on the MultiNet Semantic Network (SN) formalism, which addresses both shortcomings. The evaluation on a marketing target group distribution task showed that the accuracy of predicted target groups can be increased by combining traditional word embeddings with semantic CEs.
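The abstract does not state how the two embedding types are combined. As an illustration only, the sketch below assumes one simple option: represent each text by the centroid of its vectors, compute the cosine similarity separately for word embeddings and concept embeddings, and take a weighted average. The function names and the `weight` parameter are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_similarity(w2v_a, w2v_b, ce_a, ce_b, weight=0.5):
    """Weighted average of word-level and concept-level text similarity.

    `weight` is an assumed mixing parameter: 1.0 uses only the word
    embeddings, 0.0 uses only the concept embeddings.
    """
    sim_w2v = cosine(centroid(w2v_a), centroid(w2v_b))
    sim_ce = cosine(centroid(ce_a), centroid(ce_b))
    return weight * sim_w2v + (1.0 - weight) * sim_ce
```

In a target group assignment setting, such a score could be computed between a participant's text and a description of each group, with the highest-scoring group chosen.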
