A Novel GAN Approach to Augment Limited Tabular Data for Short-Term Substance Use Prediction

Nguyen Thach; Patrick Habecker; Bergen Johnston; Lillianna Cervantes; Anika Eisenbraun; Alex Mason; Kimberly Tyler; Bilal Khan; Hau Chan

TL;DR

The paper tackles short-term substance use prediction from high-dimensional, low-sample-size (HDLSS) longitudinal survey data with embedded skip logic. It introduces a CTGAN-based GAN augmented with an auxiliary classifier, global feature selection, and explicit skip-logic enforcement. The method generates synthetic tabular data that respect skip constraints and improve classification models on two tasks: (A) whether usage increases and (B) the ordinal frequency of use within the next 12 months, for marijuana, meth, amphetamines, and cocaine. Key contributions include handling skip logic during data generation, addressing high dimensionality with HDLSS-aware learning, and AUROC improvements of up to $13.4\%$ on Problem A and $15.8\%$ on Problem B, with only modest overhead for skip-logic enforcement. These results suggest that synthetic data augmentation has practical value in resource-constrained public health settings, where it can support more accurate targeting of intervention and support services for persons who use drugs (PWUDs).

Abstract

Substance use is a global issue that negatively impacts millions of persons who use drugs (PWUDs). In practice, identifying vulnerable PWUDs for efficient allocation of appropriate resources is challenging due to their complex use patterns (e.g., their tendency to change usage within months) and the high cost of collecting PWUD-focused substance use data. As a result, there has been a paucity of machine learning models for accurately predicting the short-term substance use behaviors of PWUDs. In this paper, using longitudinal survey data of 258 PWUDs in the U.S. Great Plains collected by our team, we design a novel GAN that handles high-dimensional low-sample-size tabular data and survey skip logic. The GAN augments existing data to improve classification models' predictions of (A) whether the PWUDs would increase usage and (B) at which ordinal frequency they would use a particular drug within the next 12 months. Our evaluation results show that classification models trained on data augmented by our proposed GAN improve their predictive performance (AUROC) by up to 13.4% in Problem (A) and 15.8% in Problem (B) for usage of marijuana, meth, amphetamines, and cocaine, and that our GAN outperforms state-of-the-art generative models.

Paper Structure (32 sections, 2 equations, 3 figures, 2 tables, 1 algorithm)

Introduction
Our Approach and Associated Challenges.
Our Contributions.
Problem Description
Background
Our Tabular Data.
Definition 1 (Skip Logic).
Definition 2 (Tabular Data Generation).
Definition 3 (GAN and Its Extensions).
Problem Formulation
Our Proposed GAN
Overview
Conditional Generator
Proper Conditional Generation in HDLSS Setting.
Auxiliary Classifier
...and 17 more sections

Figures (3)

Figure 1: Skip Logic: Respondents who answer "No" to TB3 are automatically directed to TB5 without being asked TB4.
Figure 2: Workflow for evaluating the efficacy of generative models. The "Classification models" blocks refer to the same set of classifiers listed in the definition of compatibility. Each of these blocks takes a table as input for training the classifiers and outputs their predictions on $\mathbf{T}_{test}$. The blue dashed arrows represent the computation of the respective scores/metrics.
Figure 3: Efficacy of considered generative models in Problem (A) (top) and Problem (B) (bottom). Each column considers one drug. The red dashed lines in the utility subfigure mark the average AUROC of classification models trained on $\mathbf{T}_{train}$ without data augmentation.
