Feature-to-Image Data Augmentation: Improving Model Feature Extraction with Cluster-Guided Synthetic Samples
Yasaman Haghbin, Hadi Moradi, Reshad Hosseini
TL;DR
This work tackles data scarcity in healthcare by introducing FICAug, a two-stage augmentation framework that operates in feature space and then reconstructs samples in the image domain for CNN training. It clusters latent features, generates class-pure synthetic samples via Gaussian sampling, and maps them back to realistic images using GANimation-based reconstruction, followed by fine-tuning on real data. On a Parkinson's disease facial-expression dataset, image-space CNN training with FICAug achieves a cross-validation accuracy of 88.63% and a test accuracy of 94.00%, outperforming several baselines. The results indicate that structured, cluster-aware augmentation combined with image-domain reconstruction can enhance representation learning when labeled data are limited, and the approach is potentially generalizable to other small-data domains.
Abstract
One of the growing trends in machine learning is the use of data generation techniques, because the performance of machine learning models depends heavily on the amount of training data. However, in many real-world applications, particularly in medical and low-resource domains, collecting large datasets is difficult because of resource constraints, and models trained on the resulting small datasets tend to overfit and generalize poorly. This study introduces FICAug, a novel feature-to-image data augmentation framework designed to improve model generalization under limited-data conditions by generating structured synthetic samples. FICAug first operates in the feature space, where the original data are clustered using the k-means algorithm. Within pure-label clusters, synthetic data are generated through Gaussian sampling to increase diversity while maintaining label consistency. These synthetic features are then projected back into the image domain using a generative neural network, and a convolutional neural network is trained on the reconstructed images to learn enhanced representations. Experimental results demonstrate that FICAug significantly improves classification accuracy. In feature space, it achieved a cross-validation accuracy of 84.09%, while training a ResNet-18 model on the reconstructed images further raised this to 88.63%, illustrating the effectiveness of the proposed framework in extracting new, task-relevant features.
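
The feature-space stage described above (k-means clustering, then Gaussian sampling inside pure-label clusters) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `kmeans` and `augment_pure_clusters`, the parameters `k` and `per_cluster`, the toy data, and the choice of a diagonal Gaussian fitted per cluster are all assumptions made for illustration. The image-domain reconstruction and CNN training stages are not covered.

```python
import numpy as np


def kmeans(features, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (illustrative); returns one cluster index per row."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest center (squared Euclidean distance).
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster became empty.
        new_centers = np.array([
            features[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign


def augment_pure_clusters(features, labels, k=4, per_cluster=10, seed=0):
    """Sample synthetic features only from clusters whose members share one label.

    Assumption: each pure cluster is modeled as a diagonal Gaussian
    (per-dimension mean and standard deviation of its members).
    """
    rng = np.random.default_rng(seed)
    assign = kmeans(features, k, seed=seed)
    synth_x, synth_y = [], []
    for j in range(k):
        mask = assign == j
        members, member_labels = features[mask], labels[mask]
        # Skip empty, single-point, or mixed-label clusters.
        if len(members) < 2 or len(np.unique(member_labels)) != 1:
            continue
        mean, std = members.mean(axis=0), members.std(axis=0)
        samples = rng.normal(mean, std, size=(per_cluster, features.shape[1]))
        synth_x.append(samples)
        # Every synthetic sample inherits the single label of its cluster.
        synth_y.append(np.full(per_cluster, member_labels[0]))
    if not synth_x:
        return np.empty((0, features.shape[1])), np.empty(0, dtype=labels.dtype)
    return np.vstack(synth_x), np.concatenate(synth_y)


# Toy usage: two well-separated groups of 5-dimensional "features".
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(100, 1, (20, 5))])
labs = np.array([0] * 20 + [1] * 20)
x_syn, y_syn = augment_pure_clusters(feats, labs, k=2, per_cluster=10)
print(x_syn.shape, y_syn.shape)
```

Restricting sampling to pure-label clusters is what keeps the synthetic labels consistent: a mixed cluster gives no single label to assign, so the sketch skips it rather than guessing.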
