A Consistent Lebesgue Measure for Multi-label Learning

Kaan Demir, Bach Nguyen, Bing Xue, Mengjie Zhang

TL;DR

The paper addresses the inconsistent guidance that arises when a learner must satisfy several non-differentiable multi-label losses at once. It introduces CLML, a Consistent Lebesgue Measure-based Multi-label Learner that directly optimises a Lebesgue-measure objective over multiple losses. It proves consistency under a Bayes risk framework and reports state-of-the-art results on nine datasets using a simple feedforward architecture, without label graphs, perturbation-based conditioning, or semantic embeddings. The approach uses Monte Carlo estimation of Lebesgue contributions and CMA-ES optimisation to navigate the non-convex, non-differentiable loss landscape, and its analysis exposes inconsistencies between the surrogate and the desired loss functions. The work highlights the importance of optimisation consistency in multi-label learning and offers a scalable, surrogate-free way to balance multiple, potentially conflicting objectives.
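To make the Monte Carlo idea concrete, here is a minimal sketch of estimating a Lebesgue measure and per-model Lebesgue contributions by uniform sampling. It is not the authors' implementation. It assumes every loss lies in [0, 1], that lower is better, and that the reference vector is the all-ones vector. The names `dominated_by` and `mc_lebesgue` and the three example loss vectors are hypothetical.

```python
import random


def dominated_by(loss, point):
    """True if the loss vector is no worse than `point` in every coordinate."""
    return all(l <= p for l, p in zip(loss, point))


def mc_lebesgue(losses, n_samples=20000, seed=0):
    """Estimate the covered volume and each model's exclusive share.

    Assumption: losses live in the unit cube and the reference vector is
    (1, ..., 1), so uniform samples in [0, 1]^d cover the whole region.
    """
    rng = random.Random(seed)
    dim = len(losses[0])
    covered = 0
    exclusive = [0] * len(losses)
    for _ in range(n_samples):
        z = [rng.random() for _ in range(dim)]
        owners = [i for i, loss in enumerate(losses) if dominated_by(loss, z)]
        if owners:
            covered += 1
            # A sample counts toward a contribution only if exactly one
            # model covers it (the non-overlapping volume).
            if len(owners) == 1:
                exclusive[owners[0]] += 1
    return covered / n_samples, [c / n_samples for c in exclusive]


# Hypothetical loss vectors (L1, L2, L3) for three candidate models.
# The third model is dominated by both others, so it covers nothing uniquely.
losses = [(0.2, 0.6, 0.5), (0.5, 0.3, 0.4), (0.6, 0.7, 0.8)]
total, contrib = mc_lebesgue(losses)
# Exact covered volume for comparison: 0.16 + 0.21 - 0.10 = 0.27
print(total, contrib)
```

The estimate converges at the usual Monte Carlo rate, so the sample count trades accuracy against cost per evaluation.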

Abstract

Multi-label loss functions are usually non-differentiable, requiring surrogate loss functions for gradient-based optimisation. The consistency of surrogate loss functions is not proven and is exacerbated by the conflicting nature of multi-label loss functions. To directly learn from multiple related, yet potentially conflicting multi-label loss functions, we propose a Consistent Lebesgue Measure-based Multi-label Learner (CLML) and prove that CLML can achieve theoretical consistency under a Bayes risk framework. Empirical evidence supports our theory by demonstrating that: (1) CLML can consistently achieve state-of-the-art results; (2) the primary performance factor is the Lebesgue measure design, as CLML optimises a simpler feedforward model without additional label graph, perturbation-based conditioning, or semantic embeddings; and (3) an analysis of the results not only distinguishes CLML's effectiveness but also highlights inconsistencies between the surrogate and the desired loss functions.

Paper Structure

This paper contains 28 sections, 4 theorems, 19 equations, 9 figures, 7 tables.

Key Result

Theorem 4.3

$\psi$ can only be multi-label consistent w.r.t. $\mathcal{L}$ iff it holds for any sequence of $f^{(n)}$ that:

Figures (9)

  • Figure 1: The overall proposed approach of CLML is outlined as follows. (a) illustrates the representation of $f$. (b) illustrates the contribution of each $f^i$ toward the improvement over all three loss functions $\boldsymbol{\mathcal{L}}(f^i) = (\mathcal{L}_1(f^i),\mathcal{L}_2(f^i),\mathcal{L}_3(f^i))$, which is quantified as the non-overlapping volume of space that $\boldsymbol{\mathcal{L}}(f^i)$ uniquely covers over a set of $W$ models $F=\cup_{i=1}^{W}\{f^i\}$, and a reference vector $R=\{1\}^3$. (c) illustrates the overall Lebesgue measure over $F$, which is the aggregate volume of all $f^i\in F$.
  • Figure 1: Medians of the geometric means for all methods across all datasets.
  • Figure 2: Bonferroni-Dunn test critical difference plots. A crossbar is drawn between CLML and any method if their difference in average ranking is less than the critical difference ($CD=2.686$ with $K=7$ methods and $T=9$ datasets obtained from a studentised range table).
  • Figure 3: The training curves of CLML plotted against $\mathcal{L}_1(f(\textbf{X}),\textbf{Y})$, $\mathcal{L}_2(f(\textbf{X}),\textbf{Y})$, and $\mathcal{L}_3(f(\textbf{X}),\textbf{Y})$. The colour represents the averaged binary cross-entropy loss $\mathcal{L}_4(f(\textbf{X}),\textbf{Y})$, which is tracked independently during the optimisation process. The red line shows the moving average trajectory of CLML. A zoom-in plot is presented at the top right of each subplot to highlight the area of convergence.
  • Figure 4: Results of CLML's incumbent solution on the emotions dataset before training (after 1 epoch) and after training (all epochs). The number of bins for the distributions and the calibration plots are $B_D=50$ and $B_C=10$, respectively.
  • ...and 4 more figures
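
The critical difference quoted in the Figure 2 caption can be reproduced with a short calculation. The sketch below assumes a significance level of 0.05, which the caption does not state, and replaces the table lookup with the Bonferroni-adjusted normal quantile commonly used for the Bonferroni-Dunn test. The function name `bonferroni_dunn_cd` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def bonferroni_dunn_cd(k, t, alpha=0.05):
    """Critical difference in average rank for k methods over t datasets.

    Assumption: alpha = 0.05 and a two-sided normal quantile adjusted for
    the k - 1 comparisons against a single control method.
    """
    q = NormalDist().inv_cdf(1 - alpha / (2 * (k - 1)))
    return q * sqrt(k * (k + 1) / (6 * t))


# K = 7 methods and T = 9 datasets, as given in the Figure 2 caption.
cd = bonferroni_dunn_cd(k=7, t=9)
print(round(cd, 3))
```

Under these assumptions the result agrees with the reported $CD=2.686$ to within rounding, which suggests the caption's value corresponds to a 0.05 significance level.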

Theorems & Definitions (10)

  • Definition 4.1: Conditional Risk
  • Definition 4.2: Bayes Predictors
  • Theorem 4.3: Multi-label Consistency
  • Definition 4.4: Pareto optimal set
  • Theorem 4.5: A Consistent Lebesgue Measure
  • proof: Proof of Theorem \ref{theoremconsistency}
  • Definition 6.1: Metric Risk
  • Corollary 6.2: Below-bounded and Interval
  • Lemma 6.3: The Lebesgue Contribution Equals Lebesgue Improvement
  • proof