Lightweight Conceptual Dictionary Learning for Text Classification Using Information Compression
Li Wan, Tansu Alpcan, Margreta Kuijper, Emanuele Viterbo
TL;DR
The paper tackles text classification with a lightweight, interpretable approach by coupling LZW-driven dictionary learning with discriminative refinement guided by label information. It introduces a minimum-distortion-longest-match sparse coding (MDLM) step and a discriminative-power-maximization (DPM) criterion, framed within an information-bottleneck analysis and evaluated using a novel IPAR metric. Across six benchmark datasets, the method achieves competitive accuracy on limited-vocabulary tasks using only a fraction of the parameters of deep models, while highlighting limitations on diverse vocabularies due to the repetitiveness assumptions of LZW. This work advances practical, interpretable text representations suitable for resource-constrained settings and provides a principled information-theoretic lens for understanding representation compression and task relevance.
Abstract
We propose a novel, lightweight supervised dictionary learning framework for text classification based on data compression and representation. This two-phase algorithm first employs the Lempel-Ziv-Welch (LZW) algorithm to construct a dictionary from text datasets, focusing on the conceptual significance of dictionary elements. The dictionary is then refined using label data, optimizing dictionary atoms to enhance discriminative power based on mutual information and class distribution. This process generates discriminative numerical representations, facilitating the training of simple classifiers such as SVMs and neural networks. We evaluate our algorithm's information-theoretic performance using information bottleneck principles and introduce the information plane area rank (IPAR) as a novel metric to quantify it. Tested on six benchmark text datasets, our algorithm closely matches top-performing models on limited-vocabulary datasets, deviating by only about 2% while using just 10% of their parameters. However, it falls short on diverse-vocabulary datasets, likely due to the LZW algorithm's constraints with low-repetition data. This contrast highlights its efficiency and limitations across different dataset types.
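
To make the first phase concrete, the following is a minimal sketch of textbook LZW dictionary construction over a sequence of symbols. It is an illustration only, not the authors' implementation: the function name `lzw_dictionary`, the choice to seed the dictionary with the symbols observed in the input, and the use of tuples as phrase keys are all assumptions made here. The label-guided refinement of the second phase is not shown.

```python
def lzw_dictionary(symbols):
    """Build an LZW phrase dictionary from a sequence of symbols.

    `symbols` may be a string (character-level) or a list of tokens
    (word-level); which granularity the paper uses is not assumed here.
    Returns (codes, dictionary), where `dictionary` maps each learned
    phrase (a tuple of symbols) to an integer code.
    """
    symbols = list(symbols)
    # Assumption: seed the dictionary with every distinct symbol in the input.
    dictionary = {(s,): i for i, s in enumerate(sorted(set(symbols)))}
    codes = []
    current = ()
    for s in symbols:
        candidate = current + (s,)
        if candidate in dictionary:
            # Keep extending the current phrase (longest match so far).
            current = candidate
        else:
            # Emit the code of the longest known phrase, then learn the
            # new, one-symbol-longer phrase as a dictionary entry.
            codes.append(dictionary[current])
            dictionary[candidate] = len(dictionary)
            current = (s,)
    if current:
        codes.append(dictionary[current])
    return codes, dictionary


if __name__ == "__main__":
    tokens = "the cat sat on the cat mat".split()
    codes, phrases = lzw_dictionary(tokens)
    # Multi-symbol phrases are the candidate dictionary atoms.
    print([p for p in phrases if len(p) > 1])
```

Because new phrases are only learned when a previously seen phrase recurs, the dictionary grows meaningfully only on repetitive input, which is consistent with the abstract's remark about low-repetition data.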
