GuideWalk: A Novel Graph-Based Word Embedding for Enhanced Text Classification

Sarmad N. Mohammed, Semra Gündüç

TL;DR

This work proposes a new text embedding approach, namely the Guided Transition Probability Matrix (GTPM) model, which uses the graph structure of sentences to capture different types of information from text data, such as syntactic, semantic, and hidden content.

Abstract

One of the central problems of computer science and machine learning is extracting information efficiently from large-scale, heterogeneous data. Text data, with its syntax, semantics, and even hidden information content, holds a special place among these data types. Processing text data requires embedding, a method of translating the content of the text into numeric vectors. A suitable embedding algorithm is the starting point for obtaining the full information content of the text data. In this work, a new text embedding approach, namely the Guided Transition Probability Matrix (GTPM) model, is proposed. The model uses the graph structure of sentences to capture different types of information from text data, such as syntactic, semantic, and hidden content. Using random walks on a weighted word graph, GTPM calculates transition probabilities to derive text embedding vectors. The proposed method is tested on real-world data sets and compared with eight well-known and successful embedding algorithms. GTPM shows significantly better classification performance on binary and multi-class datasets than these algorithms. Additionally, the proposed method demonstrates superior robustness: with limited (only $10\%$) training data, its performance declines by $8\%$, compared to $15\%$ to $20\%$ for baseline methods.
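To make the idea of deriving vectors from random walks on a weighted word graph concrete, here is a minimal Python sketch. It is not the authors' GTPM implementation: the guiding mechanism is not described in this summary, so the walk below is a plain weighted random walk, and the function name `transition_rows`, the toy adjacency `adj`, and the choice to use empirical visit frequencies as each word's vector are all assumptions made for illustration. The parameters `walk_length` and `walks_per_node` mirror the two quantities varied in Figure 5.

```python
import random

def transition_rows(adj, walk_length=5, walks_per_node=200, seed=0):
    """Estimate, for every start word, how often walks visit each other word.

    adj maps word -> {neighbor: edge_weight}. Assumption: an unguided,
    weight-proportional random walk; GTPM's guidance step is not modeled.
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    index = {n: i for i, n in enumerate(nodes)}
    rows = {}
    for start in nodes:
        counts = [0.0] * len(nodes)
        for _ in range(walks_per_node):
            current = start
            for _ in range(walk_length):
                neighbors = list(adj[current])
                if not neighbors:
                    break  # dead end: stop this walk early
                weights = [adj[current][n] for n in neighbors]
                # Move to a neighbor with probability proportional to edge weight.
                current = rng.choices(neighbors, weights=weights, k=1)[0]
                counts[index[current]] += 1
        total = sum(counts)
        # Normalize visit counts so each row is a probability vector.
        rows[start] = [c / total if total else 0.0 for c in counts]
    return nodes, rows

# Hypothetical toy graph with normalized edge weights.
adj = {
    "w1": {"w2": 0.5},
    "w2": {"w1": 0.5, "w3": 1.0},
    "w3": {"w2": 1.0},
}
nodes, rows = transition_rows(adj)
```

Each entry of `rows` is a probability vector over the vocabulary and could serve as a word-level embedding in this simplified setting; how the paper aggregates such vectors into text embeddings is not stated here.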

Paper Structure (16 sections, 6 equations, 6 figures, 5 tables)

This paper contains 16 sections, 6 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Generation of the word graph over four iterations. At time step 1, the first document is introduced, and a graph with three words is generated. At time step 2, a second document is introduced, which adds two new nodes, $w_4$ and $w_5$, increases the edge value between $w_2$ and $w_3$, and so on. In the last time step, all edges are normalized by the maximum edge value.
  • Figure 2: Degree distribution of words in the generated word-graph using the Reuters dataset.
  • Figure 3: Representation of the study.
  • Figure 4: Visualization of 2D representations for Reuters dataset.
  • Figure 5: Test accuracy (Micro-F1 $\%$) with different walk lengths ($a$) and number of walks per node ($b$).
  • ...and 1 more figures
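
The incremental graph construction described in the Figure 1 caption can be sketched as follows. This is an illustrative reading rather than the paper's code: the caption does not say which word pairs receive an edge, so linking adjacent words within a document is an assumption, as are the function name `build_word_graph` and the toy documents.

```python
from collections import defaultdict

def build_word_graph(documents):
    """Add documents one at a time, then normalize edges by the maximum weight."""
    weights = defaultdict(float)
    for doc in documents:
        tokens = doc.split()
        # Assumption: an undirected edge links each pair of adjacent words.
        for a, b in zip(tokens, tokens[1:]):
            if a != b:
                # New words create new nodes; repeated pairs raise the edge value.
                weights[tuple(sorted((a, b)))] += 1.0
    max_weight = max(weights.values(), default=1.0)
    # Final step: divide every edge by the largest edge value.
    return {edge: w / max_weight for edge, w in weights.items()}

# Hypothetical documents echoing the caption: the second one introduces
# w4 and w5 and strengthens the w2-w3 edge.
docs = ["w1 w2 w3", "w2 w3 w4 w5"]
graph = build_word_graph(docs)
```

In this toy run the `w2`-`w3` edge is seen twice and ends up with weight 1.0 after normalization, while every other edge has weight 0.5.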