Latent Feature Mining for Predictive Model Enhancement with Large Language Models

Bingxuan Li, Pengyi Shi, Amy Ward

TL;DR

This work proposes FLAME (Faithful Latent Feature Mining for Predictive Model Enhancement), a framework that leverages large language models (LLMs) to augment observed features with latent features and enhance the predictive power of ML models in downstream tasks.

Abstract

Predictive modeling often faces challenges due to limited data availability and quality, especially in domains where collected features are weakly correlated with outcomes and where additional feature collection is constrained by ethical or practical difficulties. Traditional machine learning (ML) models struggle to incorporate unobserved yet critical factors. In this work, we introduce an effective approach to formulate latent feature mining as text-to-text propositional logical reasoning. We propose FLAME (Faithful Latent Feature Mining for Predictive Model Enhancement), a framework that leverages large language models (LLMs) to augment observed features with latent features and enhance the predictive power of ML models in downstream tasks. Our framework is generalizable across various domains with necessary domain-specific adaptation, as it is designed to incorporate contextual information unique to each area, ensuring effective transfer to different areas facing similar data availability challenges. We validate our framework with two case studies: (1) the criminal justice system, a domain characterized by limited and ethically challenging data collection; (2) the healthcare domain, where patient privacy concerns and the complexity of medical data limit comprehensive feature collection. Our results show that inferred latent features align well with ground truth labels and significantly enhance the downstream classifier.

Paper Structure (41 sections, 1 theorem, 10 equations, 13 figures, 5 tables)

Key Result

Lemma 1

The in-sample log-loss always satisfies $\mathcal{L}^{\text{in}}(\tilde{D}, \tilde{\beta}^*) \leq \mathcal{L}^{\text{in}}(D, \beta^*)$. When the added features are non-informative, there exist instances such that the out-of-sample log-loss satisfies $\mathcal{L}^{\text{out}}(\tilde{D}, \tilde{\beta}^*) > \mathcal{L}^{\text{out}}(D, \beta^*)$.
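
The in-sample half of Lemma 1 can be illustrated with a small experiment. The sketch below is not the authors' code: it assumes the tilde quantities refer to the feature-augmented dataset and its fitted coefficients, uses a synthetic dataset with one informative feature and ten pure-noise columns, and approximates the optimal coefficients with plain gradient descent. The helper names (`log_loss`, `fit`, `make_data`) and all constants are made up for illustration. Because the base solution padded with zeros is a valid augmented model, warm-starting from it and taking loss-decreasing steps cannot raise the training loss. Whether the held-out loss rises depends on the sample, which is why the lemma only claims that such instances exist.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_loss(rows, labels, weights):
    """Mean logistic log-loss of a linear model on (rows, labels)."""
    total = 0.0
    for x, y in zip(rows, labels):
        z = sum(w * v for w, v in zip(weights, x))
        # Loss is log(1 + exp(-z)) for y = 1 and log(1 + exp(z)) for y = 0.
        m = -z if y == 1 else z
        total += max(m, 0.0) + math.log1p(math.exp(-abs(m)))
    return total / len(rows)


def fit(rows, labels, init, steps=300):
    """Gradient descent from `init` with a step small enough to never raise the loss."""
    w = list(init)
    n = len(rows)
    # The gradient of the mean log-loss is L-Lipschitz with L <= max ||x||^2 / 4,
    # so a step of 4 / max ||x||^2 is at most 1 / L.
    step = 4.0 / max(sum(v * v for v in x) for x in rows)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wj * v for wj, v in zip(w, x))) - y
            for j, v in enumerate(x):
                grad[j] += err * v / n
        w = [wj - step * g for wj, g in zip(w, grad)]
    return w


def make_data(n, n_noise, rng):
    """Synthetic data: returns (base rows, augmented rows, labels)."""
    base, aug, labels = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1 if rng.random() < sigmoid(1.5 * x) else 0
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        base.append([1.0, x])          # intercept + informative feature
        aug.append([1.0, x] + noise)   # same columns plus non-informative ones
        labels.append(y)
    return base, aug, labels


rng = random.Random(0)
train_base, train_aug, y_train = make_data(60, 10, rng)
test_base, test_aug, y_test = make_data(2000, 10, rng)

beta = fit(train_base, y_train, [0.0, 0.0])
# Warm start: the base solution padded with zeros is a feasible augmented model.
beta_tilde = fit(train_aug, y_train, beta + [0.0] * 10)

in_base = log_loss(train_base, y_train, beta)
in_aug = log_loss(train_aug, y_train, beta_tilde)
out_base = log_loss(test_base, y_test, beta)
out_aug = log_loss(test_aug, y_test, beta_tilde)

print(f"in-sample : base={in_base:.4f} augmented={in_aug:.4f}")
print(f"out-sample: base={out_base:.4f} augmented={out_aug:.4f}")
```

Comparing the two printed rows shows how a training-loss improvement from non-informative columns relates to the held-out loss on this particular synthetic draw. Changing the seed or the number of noise columns changes the size of the gap.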

Figures (13)

  • Figure 1: The real-world example illustrating the motivation of FLAME, a framework to augment observed features collected in given datasets with latent features.
  • Figure 2: Example of latent feature mining through chain of reasoning. The latent feature "Supports Likely Needed" ($Z$) is inferred from the observed input features ($X$) via intermediate predicates ($O$), and is then used alongside $X$ to improve the prediction for outcome ($Y$).
  • Figure 3: Overview of latent feature inference framework.
  • Figure 4: Risk level prediction results: (a) model accuracy; (b) per-category F1 scores. LR - logistic regression; MLP - multilayer perceptron (neural network); RF - random forest; GBT - gradient boosting trees.
  • Figure 5: Outcome prediction results: (a) model performance with and without the inferred latent features (program requirements); (b) feature importance plot. LR - logistic regression; MLP - multilayer perceptron (neural network); GBT - gradient boosting trees.
  • ...and 8 more figures
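
The chain of reasoning in the Figure 2 caption (observed features $X$, intermediate predicates $O$, latent feature $Z$, then $X$ and $Z$ together for predicting $Y$) can be sketched as a small data flow. This is only an illustration under stated assumptions: per the abstract, FLAME performs the inference with an LLM, whereas here fixed hand-written rules stand in for it. The feature names (`employed`, `housing`), the predicate names, and the rule itself are all hypothetical; only the latent feature name "Supports Likely Needed" comes from the caption.

```python
from typing import Dict

# Hypothetical observed features X for one individual (names are illustrative only).
Observed = Dict[str, object]


def derive_predicates(x: Observed) -> Dict[str, bool]:
    """Step X -> O: turn raw features into intermediate true/false predicates.

    Hand-written rules stand in for the LLM reasoning step here.
    """
    return {
        "has_unstable_income": not x["employed"],
        "has_unstable_housing": x["housing"] != "stable",
    }


def infer_latent(o: Dict[str, bool]) -> str:
    """Step O -> Z: combine the predicates with a propositional rule (assumed)."""
    needed = o["has_unstable_income"] or o["has_unstable_housing"]
    return "yes" if needed else "no"


def augment(x: Observed) -> Observed:
    """Return X with the inferred latent feature Z appended for a downstream model."""
    z = infer_latent(derive_predicates(x))
    return {**x, "supports_likely_needed": z}


example = {"age": 29, "employed": False, "housing": "stable"}
augmented = augment(example)
print(augmented)
```

The augmented record keeps every original column and adds one new one, so any existing classifier can be retrained on it without other changes.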

Theorems & Definitions (1)

  • Lemma 1