SAE-FiRE: Enhancing Earnings Surprise Predictions Through Sparse Autoencoder Feature Selection

Huopu Zhang, Yanguang Liu, Miao Zhang, Zirui He, Mengnan Du

TL;DR

This work tackles the challenge of predicting earnings surprises from long, noisy financial documents by introducing SAE-FiRE, a framework that leverages Sparse Autoencoders to extract sparse, interpretable representations from frozen LLM residual activations. It combines two feature-selection strategies—ANOVA F-tests and tree-based importance—to identify the most discriminative SAE dimensions before training a logistic regression classifier, achieving robust performance across three diverse financial datasets. The approach outperforms strong baselines, including zero-/few-shot prompting and long-document models, and offers interpretable insights by mapping top SAE features to human-readable concepts. The results suggest that targeted, noise-filtered latent features can enhance generalization in financial text analytics and point to future extensions into multimodal and cross-lingual tasks.

Abstract

Predicting earnings surprises from financial documents, such as earnings conference calls, regulatory filings, and financial news, has become increasingly important in financial economics. However, these financial documents present significant analytical challenges, typically containing over 5,000 words with substantial redundancy and industry-specific terminology that creates obstacles for language models. In this work, we propose the SAE-FiRE (Sparse Autoencoder for Financial Representation Enhancement) framework to address these limitations by extracting key information while eliminating redundancy. SAE-FiRE employs Sparse Autoencoders (SAEs) to decompose dense neural representations from large language models into interpretable sparse components, then applies statistical feature selection methods, including ANOVA F-tests and tree-based importance scoring, to identify the top-k most discriminative dimensions for classification. By systematically filtering out noise that might otherwise lead to overfitting, we enable more robust and generalizable predictions. Experimental results across three financial datasets demonstrate that SAE-FiRE significantly outperforms baseline approaches.
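
The ANOVA F-test selection step mentioned above can be illustrated with a minimal sketch. It assumes each document has already been reduced to a fixed-length vector of SAE activations; the function names `anova_f_scores` and `top_k_features` and the toy data are hypothetical and are not taken from the paper. The tree-based importance score and the logistic regression classifier are not shown.

```python
from statistics import fmean


def anova_f_scores(X, y):
    """One-way ANOVA F statistic for each column of X, grouped by label y.

    X: list of rows (one row per document, one column per SAE feature).
    y: list of class labels, same length as X.
    Assumes at least two classes and more documents than classes.
    """
    n, d = len(X), len(X[0])
    classes = sorted(set(y))
    df_between = len(classes) - 1
    df_within = n - len(classes)
    scores = []
    for j in range(d):
        column = [row[j] for row in X]
        grand_mean = fmean(column)
        ss_between = 0.0
        ss_within = 0.0
        for c in classes:
            values = [row[j] for row, label in zip(X, y) if label == c]
            group_mean = fmean(values)
            ss_between += len(values) * (group_mean - grand_mean) ** 2
            ss_within += sum((v - group_mean) ** 2 for v in values)
        if ss_within == 0.0:
            # No within-class variance: constant features carry no signal.
            scores.append(float("inf") if ss_between > 0 else 0.0)
        else:
            scores.append((ss_between / df_between) / (ss_within / df_within))
    return scores


def top_k_features(scores, k):
    """Indices of the k highest-scoring features, best first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Toy data (hypothetical): feature 0 separates the classes,
# feature 1 has equal class means, feature 2 is constant.
X = [[0.0, 1.0, 5.0], [0.1, 3.0, 5.0], [0.0, 2.0, 5.0],
     [2.0, 1.0, 5.0], [2.1, 3.0, 5.0], [1.9, 2.0, 5.0]]
y = [0, 0, 0, 1, 1, 1]

scores = anova_f_scores(X, y)
selected = top_k_features(scores, k=1)
X_selected = [[row[j] for j in selected] for row in X]
```

Only the columns in `selected` would then be passed to the downstream classifier, which is how the filtering step discards noisy dimensions before training.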

Paper Structure

This paper contains 40 sections, 6 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: Diagram of our SAE‐based earnings‐surprise prediction pipeline. (a) extracting and pooling token-level SAE activations into a document representation, (b) selecting the most predictive features using statistical ranking methods, and (c) training a linear classifier on the filtered feature set.
  • Figure 2: Weighted F1, AUC, and Accuracy at different numbers of selected features. (Gemma2-9B 131K)
  • Figure 3: Weighted F1, AUC, and Accuracy at different numbers of selected features. (Gemma2-9B 131K)
  • Figure 4: Weighted F1, AUC, and Accuracy at different numbers of selected features (Gemma2-2B 16K)
  • Figure 5: ROC Curves of our model and baseline models
  • ...and 10 more figures
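
Step (a) in the Figure 1 caption, pooling token-level SAE activations into one document representation, might look like the sketch below. The caption does not say which pooling operator is used, so mean pooling as the default is an assumption, and the function name `pool_token_activations` is hypothetical.

```python
def pool_token_activations(token_acts, how="mean"):
    """Collapse a [num_tokens x num_features] activation matrix to one vector.

    token_acts: list of per-token SAE activation rows (non-empty).
    how: "mean", "max", or "sum"; the paper's actual choice is not
         stated in this summary, so "mean" is only an assumed default.
    """
    columns = list(zip(*token_acts))  # transpose: one tuple per feature
    if how == "mean":
        return [sum(col) / len(col) for col in columns]
    if how == "max":
        return [max(col) for col in columns]
    if how == "sum":
        return [sum(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {how}")


# Three tokens, three SAE features (toy values, mostly zero as in a sparse code).
token_acts = [[0.0, 2.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 4.0, 0.0]]

doc_vector = pool_token_activations(token_acts)
```

Whatever operator is used, the result has one entry per SAE feature regardless of document length, which is what lets documents of over 5,000 words feed a fixed-size linear classifier.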