Unsupervised Generative Feature Transformation via Graph Contrastive Pre-training and Multi-objective Fine-tuning
Wangyang Ying, Dongjie Wang, Xuanming Hu, Yuanchun Zhou, Charu C. Aggarwal, Yanjie Fu
TL;DR
This work introduces NEAT, a label-free framework for unsupervised generative feature transformation by uniting graph-based representation, contrastive pretraining, and sequential generation. It defines a measurement-pretrain-finetune paradigm where feature-set utility is gauged via Mean Discounted Cumulative Gain (MDCG), feature-set embeddings are learned through graph contrastive learning on feature-feature graphs, and optimal transformed feature sequences are generated via an encoder-decoder-evaluator model guided by gradient-based optimization. The approach demonstrates strong empirical performance across 23 datasets, improves transformation quality over a range of baselines, and exhibits robustness and efficiency in both memory and convergence. NEAT thus offers a scalable, interpretable, and task-agnostic pathway to discover informative, non-linear feature transformations without labeled data, with potential applications across science and industry.
Abstract
Feature transformation derives a new feature set from the original features to make data more useful for AI models. In many scientific domains, such as material performance screening, feature transformation can model material formula interactions and compositions and reveal performance drivers, but supervised labels come only from expensive and lengthy experiments. This motivates the Unsupervised Feature Transformation Learning (UFTL) problem. Prior approaches, such as manual transformation, search guided by supervised feedback, and PCA, either rely on domain knowledge or expensive supervised feedback, suffer from a large search space, or overlook non-linear feature-feature interactions. UFTL therefore poses a major challenge for existing methods: how can a new unsupervised paradigm capture complex feature interactions while avoiding a large search space? To fill this gap, we connect graph, contrastive, and generative learning to develop a measurement-pretrain-finetune paradigm for UFTL. For unsupervised feature set utility measurement, we propose a feature value consistency preservation perspective and develop an unsupervised metric, modeled on mean discounted cumulative gain, to evaluate feature set utility. For unsupervised feature set representation pretraining, we regard a feature set as a feature-feature interaction graph and develop an unsupervised graph contrastive learning encoder that embeds feature sets into vectors. For generative transformation finetuning, we regard a feature set as a feature cross sequence and feature transformation as sequential generation. We develop a deep generative feature transformation model that coordinates the pretrained feature set encoder with gradient information extracted from a feature set utility evaluator to optimize a transformed feature generator.
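To make the two views of a feature set concrete, the following is a minimal sketch, not the authors' implementation. It assumes that edges in the feature-feature graph are weighted by absolute Pearson correlation, and that a feature cross is written as a postfix token sequence over column names such as "f0". The function names, the operator vocabulary, and the threshold parameter are all hypothetical choices made for illustration.

```python
import numpy as np

def feature_graph(X, threshold=0.5):
    """Adjacency matrix of a feature-feature graph (nodes are columns of X).

    Assumption: edge weight is the absolute Pearson correlation, and edges
    weaker than `threshold` are dropped. The paper may define edges differently.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(corr, 0.0)  # no self-loops
    return np.where(corr >= threshold, corr, 0.0)

# Hypothetical operator vocabulary for the token sequences.
BINARY_OPS = {"+": np.add, "-": np.subtract, "*": np.multiply}
UNARY_OPS = {"square": np.square, "abs": np.abs}

def eval_feature_cross(tokens, X):
    """Evaluate one transformed feature given as a postfix token sequence.

    Tokens are operator names or column references of the form "f<index>".
    """
    stack = []
    for tok in tokens:
        if tok in BINARY_OPS:
            right, left = stack.pop(), stack.pop()
            stack.append(BINARY_OPS[tok](left, right))
        elif tok in UNARY_OPS:
            stack.append(UNARY_OPS[tok](stack.pop()))
        else:
            stack.append(X[:, int(tok[1:])])  # "f0" -> column 0
    if len(stack) != 1:
        raise ValueError("malformed token sequence")
    return stack[0]

# Toy data: column 1 is exactly twice column 0; column 2 is weakly related.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 1.0],
              [3.0, 6.0, 2.0]])
A = feature_graph(X, threshold=0.9)
new_feature = eval_feature_cross(["f0", "f1", "+", "square"], X)  # (f0 + f1)^2
```

Under these assumptions, a sequence generator only has to emit tokens from a small vocabulary, which is one way to read the abstract's framing of feature transformation as sequential generation. The adjacency matrix is the kind of input a graph encoder could consume.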
