Embedding-Aware Feature Discovery: Bridging Latent Representations and Interpretable Features in Event Sequences

Artem Sakhno, Ivan Sergeev, Alexey Shestov, Omar Zoloev, Elizaveta Kovtun, Gleb Gusev, Andrey Savchenko, Maksim Makarenko

Abstract

Industrial financial systems operate on temporal event sequences such as transactions, user actions, and system logs. While recent research emphasizes representation learning and large language models, production systems continue to rely heavily on handcrafted statistical features due to their interpretability, robustness under limited supervision, and strict latency constraints. This creates a persistent disconnect between learned embeddings and feature-based pipelines. We introduce Embedding-Aware Feature Discovery (EAFD), a unified framework that bridges this gap by coupling pretrained event-sequence embeddings with a self-reflective LLM-driven feature generation agent. EAFD iteratively discovers, evaluates, and refines features directly from raw event sequences using two complementary criteria: alignment, which explains information already encoded in embeddings, and complementarity, which identifies predictive signals missing from them. Across both open-source and industrial transaction benchmarks, EAFD consistently outperforms embedding-only and feature-based baselines, achieving relative gains of up to $+5.8\%$ over state-of-the-art pretrained embeddings, resulting in new state-of-the-art performance across event-sequence datasets.

Paper Structure (19 sections, 3 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Blind spots of embeddings. Coefficient of determination ($R^2$) for feature reconstruction from the CoLES, NTP, and LLM4ES embedding representations on the Rosbank dataset. The values highlight systematic representational blind spots across embeddings.
  • Figure 2: Overview of Embedding-Aware Feature Discovery (EAFD). (a) Latent embedding pipeline mapping event sequences to continuous user representations. (b) EAFD agent loop: an LLM-based generator proposes interpretable features from raw sequences, evaluated by embedding–feature alignment and downstream utility, with reflective updates guiding iteration. (c) Feature outcomes: aligned features recover information encoded in the embedding, while complementary features capture predictive factors missing from it. (d) Embedding–feature space decomposition into latent, interpretable, and blind-spot regions. (e) Deployment and refinement: discovered features improve downstream tasks and support targeted encoder refinement (e.g., coverage, robustness, privacy).
  • Figure 3: EAFD iteration dynamics. Rosbank validation ROC-AUC and feature composition across EAFD iterations.
  • Figure 4: Task-adaptive distribution of discovered feature types. Feature category proportions generated by EAFD for age, gender, and regression tasks.
  • Figure 5: Selective feature erasure in embeddings. Top panels show $R^2$ reconstruction performance before and after erasing mcc (a) and trx_amount (b) feature categories, while the bottom panels report the corresponding relative performance change ($\Delta R^2$).
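
The reconstruction $R^2$ reported in Figures 1 and 5 can be illustrated with a small probe. The sketch below is not the authors' implementation: it assumes a ridge-regularised linear probe, a simple train/held-out split, and synthetic data, and every function and variable name in it is hypothetical. It fits the probe from embeddings to a single feature and reports held-out $R^2$; a high value suggests the feature is aligned with the embedding, while a value near zero suggests a blind spot where a feature might carry complementary signal.

```python
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ridge_probe(embeddings, feature, ridge=1e-3):
    """Return probe weights (bias first) from the ridge normal equations."""
    rows = [[1.0] + list(e) for e in embeddings]
    d = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    for i in range(1, d):  # the bias term is left unpenalised
        gram[i][i] += ridge
    rhs = [sum(r[i] * y for r, y in zip(rows, feature)) for i in range(d)]
    return solve_linear_system(gram, rhs)


def predict(weights, embeddings):
    """Apply the linear probe to each embedding vector."""
    return [weights[0] + sum(w * x for w, x in zip(weights[1:], e))
            for e in embeddings]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def reconstruction_r2(train_emb, train_feat, test_emb, test_feat, ridge=1e-3):
    """Held-out R^2 of reconstructing one feature from embeddings."""
    weights = fit_ridge_probe(train_emb, train_feat, ridge)
    return r2_score(test_feat, predict(weights, test_emb))


# Synthetic stand-ins for user embeddings and two candidate features.
rng = random.Random(0)
emb = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(400)]
# A feature the embedding encodes (linear in its coordinates, plus noise).
aligned_feat = [2.0 * e[0] - e[1] + rng.gauss(0, 0.1) for e in emb]
# A feature independent of the embedding: a synthetic "blind spot".
blind_feat = [rng.gauss(0, 1) for _ in emb]

train, test = slice(0, 300), slice(300, 400)
aligned_r2 = reconstruction_r2(emb[train], aligned_feat[train],
                               emb[test], aligned_feat[test])
blind_r2 = reconstruction_r2(emb[train], blind_feat[train],
                             emb[test], blind_feat[test])
print(f"aligned feature R^2: {aligned_r2:.3f}")
print(f"blind-spot feature R^2: {blind_r2:.3f}")
```

A linear probe is chosen here only for brevity; a nonlinear probe would give a less conservative estimate of what an embedding encodes, and low $R^2$ alone does not show that a feature is predictive of the downstream target.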