Learning Using Generated Privileged Information by Text-to-Image Diffusion Models
Rafael-Edy Menadil, Mariana-Iuliana Georgescu, Radu Tudor Ionescu
TL;DR
The paper addresses the challenge of using privileged information for text classification when such information is not readily available. It introduces Learning Using Generated Privileged Information (LUGPI), which generates an artificial visual representation for each text sample with a text-to-image diffusion model, trains multimodal teachers on the resulting text-image pairs, and distills their knowledge into a unimodal text student, so inference cost does not increase. Empirical results on four data sets show that the approach improves over a text-only baseline and in some cases even surpasses the multimodal teachers, supporting the usefulness of synthetic privileged data. Because the test-time cost is unchanged, the method is practical for NLP systems that can benefit from cross-modal guidance during training.
Abstract
Learning Using Privileged Information is a particular type of knowledge distillation where the teacher model benefits from an additional data representation during training, called privileged information, improving the student model, which does not see the extra representation. However, privileged information is rarely available in practice. To this end, we propose a text classification framework that harnesses text-to-image diffusion models to generate artificial privileged information. The generated images and the original text samples are further used to train multimodal teacher models based on state-of-the-art transformer-based architectures. Finally, the knowledge from multimodal teachers is distilled into a text-based (unimodal) student. Hence, by employing a generative model to produce synthetic data as privileged information, we guide the training of the student model. Our framework, called Learning Using Generated Privileged Information (LUGPI), yields noticeable performance gains on four text classification data sets, demonstrating its potential in text classification without any additional cost during inference.
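The distillation step described above can be illustrated with a small sketch. The abstract does not state which distillation objective LUGPI uses, so the code below assumes the standard temperature-scaled formulation (hard-label cross-entropy plus a KL term toward the teacher's softened outputs). The function names and the `temperature` and `alpha` parameters are hypothetical, chosen only for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Assumed objective (not confirmed by the paper's abstract):
    alpha * CE(student, label)
    + (1 - alpha) * T^2 * KL(teacher_T || student_T).
    """
    # Hard-label term: cross-entropy of the text-only student.
    p_student = softmax(student_logits)
    cross_entropy = -math.log(p_student[label])

    # Soft-label term: KL divergence between softened distributions.
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = sum(t * (math.log(t) - math.log(s))
             for t, s in zip(q_teacher, q_student) if t > 0.0)

    return alpha * cross_entropy + (1.0 - alpha) * temperature ** 2 * kl


# Hypothetical logits for one two-class text sample.
student = [1.0, 2.0]   # produced from the text alone
teacher = [3.0, -1.0]  # produced from the text plus its generated image
print(distillation_loss(student, teacher, label=1))
```

In a LUGPI-style setup, the teacher logits would come from a multimodal model that sees both the text and the generated image during training, while the student receives only the text. At test time only the student is run, so no image generation is needed.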
