Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models

Joseph Lee, Shu Yang, Jae Young Baik, Xiaoxi Liu, Zhen Tan, Dawei Li, Zixuan Wen, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Li Shen

TL;DR

This work introduces FREEFORM, a knowledge-driven framework that leverages large language models to perform feature selection and engineering on high-dimensional genotype data, addressing challenges of dimensionality and small sample sizes. The approach combines relevance filtering, self-consistent hierarchical and sequential forward selection, and automated feature engineering to create an ensemble of models trained on LLM-generated features, evaluated on Genomic Ancestry and Hereditary Hearing Loss datasets. Results show that LLM-driven feature selection and engineering outperform data-driven baselines in low-shot scenarios, with GPT-4o often delivering the best performance and the method remaining model-agnostic and cost-efficient. The framework demonstrates robust knowledge of genetic variants, achieves competitive AUC gains in few-shot regimes, and is released as open-source, offering a scalable, interpretable pathway for variant-level analysis in genomics.

Abstract

Predicting phenotypes with complex genetic bases from a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are used for this task, yet the high-dimensional nature of genotype data makes analysis and prediction difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and their success in processing complex biomedical concepts, we set out to examine the ability of LLMs to perform feature selection and engineering for tabular genotype data with a novel knowledge-driven framework. We develop FREEFORM (Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling), designed with chain-of-thought and ensembling principles, to select and engineer features using the intrinsic knowledge of LLMs. Evaluated on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, this framework outperforms several data-driven methods, particularly in low-shot regimes. FREEFORM is available as an open-source framework on GitHub: https://github.com/PennShenLab/FREEFORM.

Paper Structure

This paper contains 14 sections, 3 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of the FREEFORM framework. The pipeline consists of two parts: (1) LLM-driven feature selection takes $d$ variants and selects $d'$ of them; (2) given the selected features, LLMs generate sets of engineered features, which are used to build an ensemble of classifiers.
  • Figure 2: Example of Detailed Instructions
  • Figure 3: Evaluation of Feature Selection on Ancestry and Hearing Loss
  • Figure 4: Evaluation of Feature Engineering on Ancestry and Hearing Loss
  • Figure 5: Comparison of different LLMs on their knowledge of the SNP rs671 in relation to genomic ancestry. Red text indicates a hallucination, which was observed only for the Llama 2 7B model.
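
The two-part pipeline described in the Figure 1 caption can be illustrated with a minimal, dependency-free sketch. It is not the FREEFORM implementation: the relevance scores stand in for LLM judgments and are made up, the variant IDs other than rs671 are hypothetical, the 0/1/2 genotype encoding is assumed, the three engineered feature functions are invented examples, and both the nearest-centroid learner and the majority vote are assumptions chosen only to keep the example small.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Dict[str, int]                      # variant id -> allele count 0/1/2 (assumed encoding)
FeatureFn = Callable[[Sample], List[float]]  # one engineered feature set
Member = Tuple[FeatureFn, Dict[str, List[float]]]


def select_features(variants: Sequence[str],
                    relevance: Callable[[str], float],
                    d_prime: int) -> List[str]:
    """Part 1: keep the d' highest-scoring of the d variants.
    `relevance` is a placeholder for an LLM judgment, not a real API."""
    return sorted(variants, key=relevance, reverse=True)[:d_prime]


def fit_centroids(X: List[List[float]], y: List[str]) -> Dict[str, List[float]]:
    """Toy nearest-centroid learner: the mean feature vector per class."""
    sums: Dict[str, List[float]] = {}
    counts = Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict_centroid(centroids: Dict[str, List[float]], row: List[float]) -> str:
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def dist(center: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(center, row))
    return min(centroids, key=lambda lab: dist(centroids[lab]))


def train_ensemble(samples: List[Sample], labels: List[str],
                   feature_fns: List[FeatureFn]) -> List[Member]:
    """Part 2: train one classifier per engineered feature set."""
    return [(fn, fit_centroids([fn(s) for s in samples], labels)) for fn in feature_fns]


def ensemble_predict(ensemble: List[Member], sample: Sample) -> str:
    """Combine the members by majority vote (an assumed aggregation rule)."""
    votes = [predict_centroid(centroids, fn(sample)) for fn, centroids in ensemble]
    return Counter(votes).most_common(1)[0][0]


# Made-up relevance scores standing in for LLM output; only rs671 appears in the source.
toy_scores = {"rs671": 0.9, "rs_toy_1": 0.7, "rs_toy_2": 0.2, "rs_toy_3": 0.1}
selected = select_features(list(toy_scores), lambda v: toy_scores[v], d_prime=2)

# Hypothetical engineered feature sets built on the selected variants.
feature_fns: List[FeatureFn] = [
    lambda s: [float(s[v]) for v in selected],            # raw genotypes
    lambda s: [float(sum(s[v] for v in selected))],       # total allele burden
    lambda s: [float(any(s[v] > 0 for v in selected))],   # carrier indicator
]

# Toy training data with placeholder class labels.
train_samples: List[Sample] = [
    {"rs671": 0, "rs_toy_1": 0, "rs_toy_2": 1, "rs_toy_3": 0},
    {"rs671": 0, "rs_toy_1": 0, "rs_toy_2": 0, "rs_toy_3": 2},
    {"rs671": 2, "rs_toy_1": 1, "rs_toy_2": 1, "rs_toy_3": 0},
    {"rs671": 1, "rs_toy_1": 2, "rs_toy_2": 0, "rs_toy_3": 1},
]
train_labels = ["A", "A", "B", "B"]

ensemble = train_ensemble(train_samples, train_labels, feature_fns)
query = {"rs671": 2, "rs_toy_1": 2, "rs_toy_2": 0, "rs_toy_3": 0}
print(selected, ensemble_predict(ensemble, query))
```

Using an odd number of ensemble members avoids ties in a two-class vote; in a real setting the placeholder scoring function and feature functions would be replaced by parsed LLM responses.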