Hierarchical thematic classification of major conference proceedings

Arsentii Kuzmin, Alexander Aduenko, Vadim Strijov

TL;DR

The paper tackles hierarchical text classification with a fixed expert tree under partial labeling. It introduces a weighted hierarchical similarity framework that combines word-entropy based word importance with branch-level ranking, and a Bayesian variational EM approach to estimate parameters and topic probabilities. Key contributions include a word-weighting scheme $\lambda_m = 1 + {\boldsymbol{\alpha}}^{T} {\boldsymbol{\iota}}_m$ with $\iota_{m,\ell} = \log(1 + H^{\ell}(w_m))$, branch-specific weight vectors $\boldsymbol{\theta}_k$, and a joint model $p(\mathbf{Z}, \boldsymbol{\theta}, \boldsymbol{\alpha}, \mathbf{m}, \mathbf{V})$ with priors; an EM algorithm and a variational bound enable scalable inference of leaf-topic probabilities. Empirical results on EURO conference abstracts and industrial websites show that the proposed hSim method achieves competitive or superior ranking quality (AUCH) compared to hierarchical Naive Bayes and SVM baselines, supporting practical use for building thematic models of large, tree-structured corpora. The approach offers a principled, interpretable framework for hierarchical topic ranking with partial supervision and can generalize to other tree-like structures and domains.
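The word-weighting formula above can be illustrated with a minimal sketch. The summary does not define $H^{\ell}(w_m)$, so the code assumes it is the Shannon entropy of word $w_m$'s occurrence distribution over the clusters at level $\ell$. The function names, the count layout, and the values of $\boldsymbol{\alpha}$ are hypothetical and chosen only for illustration; in the paper $\boldsymbol{\alpha}$ is estimated, not fixed by hand.

```python
import math


def word_entropy(counts):
    """Shannon entropy (natural log) of a word's distribution over clusters.

    ASSUMPTION: `counts[c]` is how often the word occurs in cluster c
    at one level of the tree; the source does not spell this out.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log(p) for p in probs)


def word_weight(counts_per_level, alpha):
    """lambda_m = 1 + alpha^T iota_m, with iota_{m,l} = log(1 + H^l(w_m)).

    `counts_per_level[l]` holds the word's cluster counts at level l,
    and `alpha[l]` is the coefficient for that level.
    """
    iota = [math.log(1.0 + word_entropy(c)) for c in counts_per_level]
    return 1.0 + sum(a * i for a, i in zip(alpha, iota))


# Illustrative alpha (hypothetical values, one coefficient per level).
alpha = [-0.5, -0.5]

# A word concentrated in a single cluster at both levels has zero entropy,
# so its weight stays at the baseline value of 1.
focused = word_weight([[5, 0, 0], [5, 0]], alpha)

# A word spread evenly over clusters has positive entropy at both levels,
# so with negative alpha its weight drops below 1.
diffuse = word_weight([[2, 2, 2], [3, 3]], alpha)
```

With negative coefficients, words spread evenly across clusters are down-weighted relative to words concentrated in one cluster, which matches the intuition that high-entropy words carry little topical information.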

Abstract

In this paper, we develop a decision support system for hierarchical text classification. We consider text collections with a fixed hierarchical structure of topics given by experts in the form of a tree. The system sorts the topics by relevance to a given document, and an expert then chooses one of the most relevant topics to finish the classification. We propose a weighted hierarchical similarity function to calculate topic relevance. The function calculates the similarity between a document and a tree branch. The weights in this function determine word importance, and we use the entropy of words to estimate them. Based on the proposed similarity function, we formulate a joint probabilistic model of hierarchical thematic classification over the document topics, parameters, and hyperparameters. Variational Bayesian inference gives a closed-form EM algorithm, which estimates the parameters and calculates the probability of a topic for a given document. Compared to hierarchical multiclass SVM, hierarchical PLSA with adaptive regularization, and hierarchical naive Bayes, the weighted hierarchical similarity function achieves higher ranking accuracy on a collection of abstracts from the major conference EURO and on a collection of industrial company websites.
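The ranking step described in the abstract can be sketched as follows. This is not the paper's exact similarity function, which the abstract does not give. As a stand-in, the sketch assumes that the score of a branch is a sum over tree levels of a per-level weight times a word-weighted dot product between the document and that level's cluster centroid. All names (`weighted_dot`, `branch_score`, `rank_topics`) and all numbers are hypothetical.

```python
def weighted_dot(doc, centroid, lam):
    """Word-weighted dot product: sum_m lam[m] * doc[m] * centroid[m]."""
    return sum(l * d * c for l, d, c in zip(lam, doc, centroid))


def branch_score(doc, branch_centroids, theta, lam):
    """Score of one root-to-leaf branch.

    ASSUMPTION: the score is a weighted sum over levels, where
    `theta[l]` weights the similarity to the cluster at level l.
    """
    return sum(
        t * weighted_dot(doc, c, lam)
        for t, c in zip(theta, branch_centroids)
    )


def rank_topics(doc, branches, thetas, lam):
    """Return leaf-topic ids sorted from most to least relevant."""
    scores = {
        k: branch_score(doc, branches[k], thetas[k], lam)
        for k in branches
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy example: 3-word vocabulary, two branches with two levels each.
doc = [1.0, 0.0, 1.0]            # bag-of-words vector of the document
lam = [1.0, 1.0, 2.0]            # word weights (third word counts double)
branches = {
    "A": [[0.5, 0.5, 0.0], [1.0, 0.0, 0.0]],  # [upper-level, leaf] centroids
    "B": [[0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],
}
thetas = {"A": [0.3, 0.7], "B": [0.3, 0.7]}   # per-level weights per branch

ranking = rank_topics(doc, branches, thetas, lam)
```

In a decision support setting, an expert would be shown the first few entries of `ranking` and would pick the final topic from them.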

Paper Structure (16 sections, 44 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Structure of EURO conference.
  • Figure 2: Basic notation in hierarchical structure of the collection.
  • Figure 3: A branch with the number $k$ of a cluster hierarchy. The value of $\theta_k^\ell$ denotes the weight of the cluster $c_{\ell, k}$ in the branch.
  • Figure 4: Values of $\tilde{g} = -\ln g({\mathbf{x}})$ for two-dimensional ${\mathbf{x}}$.
  • Figure 5: Example of parameters convergence.
  • ...and 3 more figures