Zero-Shot Hierarchical Classification on the Common Procurement Vocabulary Taxonomy

Federico Moiraghi; Matteo Palmonari; Davide Allavena; Federico Morando

TL;DR

The paper tackles the classification of public tenders within the European CPV taxonomy, a large, imbalanced hierarchical label space in which a label need not be a leaf node. It introduces a Hierarchical Cross-Encoder (HCE), built on a pre-trained language model, that performs zero-shot classification by directly comparing contract descriptions and metadata with short label descriptions while exploring the taxonomy top-down. In experiments on Italian tender data, HCE matches or exceeds baseline hierarchical approaches, shows robust zero-shot performance on unseen labels, and improves further when multiple top candidates are considered, at the cost of slower inference. The work highlights practical implications for access to tender data, fraud detection, and knowledge-base completion, and outlines future work on faster inference and tighter integration with existing systems.

Abstract

Classifying public tenders is useful both for companies invited to participate and for detecting fraudulent activity. To facilitate the task for participants and public administrations alike, the European Union introduced a common taxonomy (Common Procurement Vocabulary, CPV), which is mandatory for tenders of certain importance; however, contracts for which a CPV label is mandatory are a minority of all public administration activity. Classifying over a real-world taxonomy introduces difficulties that cannot be ignored. In particular, some fine-grained classes have few (if any) observations in the training set, while other classes are far more frequent (even thousands of times) than the average. To overcome these difficulties, we present a zero-shot approach, based on a pre-trained language model, that relies only on label descriptions and respects the label taxonomy. To train the proposed model, we used industrial data from contrattipubblici.org, a service by SpazioDati s.r.l. that collects public contracts stipulated in Italy over the last 25 years. Results show that the proposed model classifies low-frequency classes better than three different baselines and can also predict never-seen classes.
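
The top-down, description-driven exploration of the taxonomy can be illustrated with a minimal Python sketch. It is not the authors' implementation: the taxonomy codes (`A`, `A1`, ...), the label descriptions, the function names, and the stopping threshold are all hypothetical, and a simple token-overlap function (`overlap_score`) stands in for the pre-trained cross-encoder that would score a (document, label description) pair in practice.

```python
from typing import Callable, Dict, List

# Hypothetical toy taxonomy (NOT real CPV codes): parent -> children.
CHILDREN: Dict[str, List[str]] = {
    "ROOT": ["A", "B"],
    "A": ["A1", "A2"],
    "A1": [],
    "A2": [],
    "B": [],
}

# Hypothetical short label descriptions, one per label.
DESCRIPTIONS: Dict[str, str] = {
    "A": "construction work",
    "A1": "reservoir construction work",
    "A2": "road construction work",
    "B": "office furniture",
}


def overlap_score(document: str, label_description: str) -> float:
    """Stand-in scorer (assumption): fraction of label tokens found in the document.

    A real system would replace this with a language-model score.
    """
    doc_tokens = set(document.lower().split())
    label_tokens = label_description.lower().split()
    return sum(token in doc_tokens for token in label_tokens) / len(label_tokens)


def classify_top_down(
    document: str,
    score: Callable[[str, str], float] = overlap_score,
    threshold: float = 0.75,  # hypothetical stopping threshold
) -> List[str]:
    """Walk the taxonomy from the root, following the best-scoring child.

    The walk stops early when no child scores at least `threshold`,
    so the predicted label may be an internal node rather than a leaf.
    """
    path: List[str] = []
    node = "ROOT"
    while CHILDREN[node]:
        best = max(CHILDREN[node], key=lambda c: score(document, DESCRIPTIONS[c]))
        if score(document, DESCRIPTIONS[best]) < threshold:
            break  # no child is convincing: keep the current (possibly non-leaf) label
        path.append(best)
        node = best
    return path


print(classify_top_down("construction work for a new reservoir"))
print(classify_top_down("general construction work"))
```

Because labels are matched only through their descriptions, a label never observed in training can still be predicted as long as it has a description. The early stop reflects the fact that a CPV label need not be a leaf; the threshold rule used here is only one possible stopping strategy.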

Paper Structure (22 sections, 2 equations, 2 figures, 7 tables)

Introduction
Related Works
Hierarchical Classification
Error Propagation.
Stopping Strategy.
Sampling Strategy.
CPV Prediction
CPV Taxonomy & the Dataset
Hierarchical Classification with Cross-encoder
Encoding and Interpretation
Training
Hierarchical Inference
Experiments
Data pre-processing
Baselines
...and 7 more sections

Figures (2)

Figure 1: An example of an input document (above), composed of the main field "object" and other metadata, and a portion of the taxonomy used for classification (below). Each label (numerical code) has a canonical description in 24 different languages. Notice "reservoy" instead of "reservoir": such errors are quite common.
Figure 2: Some descriptive metrics about both the taxonomy and the dataset: frequency of the labels in the training data (top left, logarithmic scale); IRlBP distribution (top right, logarithmic scale); number of children per node (middle left, logarithmic scale); cumulative number of classes per depth (middle right); number of words per label description (bottom).
