Semi-supervised Predictive Clustering Trees for (Hierarchical) Multi-label Classification

Jurica Levatić; Michelangelo Ceci; Dragi Kocev; Sašo Džeroski

TL;DR

The paper addresses semi-supervised learning for structured outputs, focusing on multi-label classification (MLC) and hierarchical multi-label classification (HMLC). It proposes semi-supervised predictive clustering trees (SSL-PCTs), a feature-weighted variant (SSL-PCT-FR), and random forest ensembles (SSL-RF, SSL-RF-FR). These methods incorporate unlabeled data through a variance function that is a weighted combination of variance in the target space and variance in the descriptive space; a tunable parameter $w$ sets the degree of supervision. Across 24 diverse datasets, SSL-PCTs often outperform their supervised counterparts. The single-tree models remain interpretable, though training time increases. The optimal degree of supervision varies by dataset and is selected via internal cross-validation. Overall, the work shows that exploiting unlabeled data in both the descriptive and target spaces can improve predictive performance on complex prediction tasks while preserving interpretability, and it offers practical guidelines for choosing the supervision level and feature weighting.
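To make the role of $w$ concrete, here is a minimal sketch of a weighted impurity of this kind. It is an illustration under stated assumptions, not the authors' implementation: the function names (`gini`, `variance`, `ssl_impurity`) are hypothetical, labels are assumed binary and scored with the Gini index, descriptive attributes are assumed numeric and scored with variance, and any normalization of the individual terms is omitted.

```python
def gini(column):
    """Gini impurity of one binary label column (values 0 or 1)."""
    p = sum(column) / len(column)
    return 2.0 * p * (1.0 - p)


def variance(column):
    """Population variance of one numeric attribute column."""
    mean = sum(column) / len(column)
    return sum((v - mean) ** 2 for v in column) / len(column)


def ssl_impurity(X, Y, w):
    """Weighted impurity over the target and descriptive spaces.

    X: list of numeric attribute rows, one per example (labeled or not).
    Y: list of binary label rows; None marks an unlabeled example.
    w: degree of supervision in [0, 1] (assumed meaning: w = 1 uses
       only the labels, w = 0 uses only the descriptive attributes).
    """
    # Target part: average Gini over labels, labeled examples only.
    labeled = [y for y in Y if y is not None]
    target = 0.0
    if labeled:
        target = sum(gini(col) for col in zip(*labeled)) / len(labeled[0])

    # Descriptive part: average variance over attributes, all examples.
    descriptive = sum(variance(col) for col in zip(*X)) / len(X[0])

    return w * target + (1.0 - w) * descriptive


# Toy usage: two labeled and two unlabeled examples, one attribute.
X = [[0.0], [0.0], [0.0], [2.0]]
Y = [[1, 0], [0, 0], None, None]
print(ssl_impurity(X, Y, w=0.5))  # 0.5
```

In this sketch, setting `w=1.0` ignores the descriptive attributes and setting `w=0.0` ignores the labels, so intermediate values trade off the two sources of information when candidate splits are scored.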

Abstract

Semi-supervised learning (SSL) is a common approach to learning predictive models using not only labeled examples, but also unlabeled examples. While SSL for the simple tasks of classification and regression has received a lot of attention from the research community, it has not been properly investigated for complex prediction tasks with structurally dependent variables. This is the case for multi-label classification and hierarchical multi-label classification tasks, which may require additional information, possibly coming from the underlying distribution in the descriptive space provided by unlabeled examples, to better face the challenging task of simultaneously predicting multiple class labels. In this paper, we investigate this aspect and propose a (hierarchical) multi-label classification method based on semi-supervised learning of predictive clustering trees. We also extend the method towards ensemble learning and propose a method based on the random forest approach. Extensive experimental evaluation conducted on 24 datasets shows significant advantages of the proposed method and its extension with respect to their supervised counterparts. Moreover, the method preserves interpretability and reduces the time complexity of classical tree-based models.

Paper Structure (26 sections, 11 equations, 7 figures, 11 tables)

Introduction
Related Work and Motivations
Background: Predictive clustering trees
PCTs for multi-label classification
PCTs for Hierarchical multi-label classification
Semi-supervised PCT learning for MLC and HMLC
Task definition
Semi-Supervised Multi-label classification
Semi-Supervised Hierarchical Multi-label classification
Tree learning
Semi-supervised PCTs with Feature weighting
Semi-supervised random forests
Computational complexity
Experimental design
Data description
...and 11 more sections

Figures (7)

Figure 1: Semi-supervised learning in multi-label classification. Filled circles represent labeled examples, while empty circles represent unlabeled examples. Letters represent class labels.
Figure 2: Predictive performance of the supervised and semi-supervised methods on the multi-label classification datasets.
Figure 3: Predictive performance of the supervised and semi-supervised methods on the hierarchical multi-label classification datasets.
Figure 4: Influence of parameter $w$ on SSL-PCT (red line) and SSL-RF (orange line) methods. The results refer to 4 datasets with different types of structured outputs: Emotions (MLC), Genbase (MLC), Danish farms (HMLC), and ImCLEF07A (HMLC). The $w$ values selected by the internal cross-validation algorithm and used in the experiments are marked with colored dots.
Figure 5: Supervised and semi-supervised predictive clustering trees obtained for the Emotions dataset with 100 labeled examples.
...and 2 more figures
