Informed deep hierarchical classification: a non-standard analysis inspired approach
Lorenzo Fiaschi, Marco Cococcioni
TL;DR
The paper reframes hierarchical classification (HC) as a lexicographic multi-objective problem and embeds this structure in a deep network inspired by non-standard analysis (NSA), the LH-DNN. By introducing projection-based non-interference via a non-standard loss and leveraging the transfer principle, LH-DNNs learn lexicographically: coarse-level accuracy is prioritized while finer levels are refined. Empirical results on CIFAR10, CIFAR100, and Fashion-MNIST show that LH-DNNs achieve comparable or superior accuracy and substantially higher hierarchy coherency than the B-CNN baseline, with far fewer parameters and shorter training times. For real-world HC tasks, this suggests faster convergence, better adherence to hierarchy constraints, and more efficient resource usage.
Abstract
This work proposes a novel approach to the deep hierarchical classification task, i.e., the problem of classifying data according to multiple labels organized in a rigid parent-child structure. It consists of a multi-output deep neural network equipped with specific projection operators placed before each output layer. The design of this architecture, called lexicographic hybrid deep neural network (LH-DNN), was made possible by combining tools from different and quite distant research fields: lexicographic multi-objective optimization, non-standard analysis, and deep learning. To assess the efficacy of the approach, the resulting network is compared against the B-CNN, a convolutional neural network tailored for hierarchical classification tasks, on the CIFAR10 and CIFAR100 benchmarks (where the B-CNN was originally and recently proposed before being adopted and tuned for multiple real-world applications) and on Fashion-MNIST. The evidence shows that an LH-DNN can achieve comparable if not superior performance, especially in learning the hierarchical relations, despite a drastic reduction in learning parameters, training epochs, and computational time, and without the need for ad-hoc weighting values in the loss function.
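To make the idea of a projection operator placed before an output layer more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that non-interference is obtained by projecting the shared features onto the null space of the coarse-level output weights; the names `null_space_projector`, `W_coarse`, `h`, and `h_fine`, as well as the layer sizes, are hypothetical and chosen only for illustration.

```python
import numpy as np

def null_space_projector(W):
    """Return the orthogonal projector P onto the null space of W, so W @ P == 0."""
    # P = I - W^+ W, where W^+ is the Moore-Penrose pseudoinverse of W.
    return np.eye(W.shape[1]) - np.linalg.pinv(W) @ W

rng = np.random.default_rng(0)

# Hypothetical coarse-level output layer: 3 coarse classes, 8 shared features.
W_coarse = rng.standard_normal((3, 8))
P = null_space_projector(W_coarse)

# Hypothetical shared feature vector produced by the network body.
h = rng.standard_normal(8)

# Features handed to the finer level: the component the coarse head cannot see.
h_fine = P @ h

# The coarse head's response to the projected features is numerically zero,
# so changes made along h_fine do not alter the coarse-level logits.
coarse_leak = W_coarse @ h_fine
```

Under this assumption, whatever the finer level learns from `h_fine` leaves the coarse-level logits unchanged, which is one way to read "prioritizing coarse-level accuracy while refining finer levels". The operator actually used in the LH-DNN may differ.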
