Two Heads are Better than One: Distilling Large Language Model Features Into Small Models with Feature Decomposition and Mixture

Tianhao Fu, Xinxin Xu, Weichen Xu, Jue Chen, Ruilong Ren, Bowen Deng, Xinyu Zhao, Jian Cao, Xixin Cao

TL;DR

The paper tackles the latency bottleneck of deploying LLMs for market making by introducing a mechanistic probe and a two-stage distillation framework. It first analyzes LLM features with a Normalized Fluorescent Probe, revealing layer- and data-type specialization, then presents Orthogonal Feature Decomposition Distillation (OFDD) to decouple features along the layer, task, and data axes, complemented by a Hájek Projection-based Mixture-of-Experts (Hájek-MoE) for adaptive fusion. The authors demonstrate that specialized lightweight models can learn distinct LLM features, achieving higher profitability, better risk management, and lower latency than RL-based and traditional KD methods on multiple futures datasets. This approach offers a practical path to real-time, energy-efficient use of large language models in finance, with robust performance under extreme conditions and across varying market regimes. The key contribution is an integrated framework that combines mechanistic feature analysis with multi-axis distillation and kernel-based expert fusion to distill rich LLM knowledge into compact, responsive trading agents.

Abstract

Market making (MM) through Reinforcement Learning (RL) has attracted significant attention in financial trading. With the development of Large Language Models (LLMs), a growing number of attempts are being made to apply LLMs to finance. Even a simple, direct application of an LLM as an agent performs strongly. However, such methods are hindered by slow inference, and most current research has not studied LLM distillation for this specific task. To address this, we first propose the normalized fluorescent probe to study the mechanism of the LLM's features. Based on the observations from this investigation, we propose Cooperative Market Making (CMM), a novel framework that decouples LLM features across three orthogonal dimensions: layer, task, and data. Various student models collaboratively learn simple LLM features along different dimensions, with each model responsible for a distinct feature to achieve knowledge distillation. Furthermore, CMM introduces a Hájek-MoE to integrate the outputs of the student models by investigating the contribution of each model in a common feature space generated by a kernel function. Extensive experimental results on four real-world market datasets demonstrate the superiority of CMM over current distillation methods and RL-based market-making strategies.
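
The fusion step described above, where student outputs are combined according to each model's contribution in a kernel-generated feature space, can be illustrated with a small sketch. This is not the paper's Hájek projection: the RBF kernel, the agreement-based score, and the names `rbf_kernel`, `confidence_scores`, and `fuse` are all assumptions chosen for illustration. It only shows the general shape of kernel-scored, confidence-weighted aggregation.

```python
import math
from typing import List


def rbf_kernel(x: List[float], y: List[float], gamma: float = 1.0) -> float:
    """RBF similarity between two feature vectors (kernel choice is an assumption)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def confidence_scores(features: List[List[float]], gamma: float = 1.0) -> List[float]:
    """Score each student by its mean kernel similarity to the other students.

    Hypothetical scoring rule: a student whose feature agrees with the others
    in the kernel space receives a higher weight. Scores are normalized to sum to 1.
    """
    n = len(features)
    if n < 2:
        raise ValueError("need at least two student models")
    raw = []
    for i in range(n):
        sims = [rbf_kernel(features[i], features[j], gamma) for j in range(n) if j != i]
        raw.append(sum(sims) / len(sims))
    total = sum(raw)
    if total == 0.0:
        # All similarities underflowed to zero: fall back to uniform weights.
        return [1.0 / n] * n
    return [r / total for r in raw]


def fuse(predictions: List[List[float]], scores: List[float]) -> List[float]:
    """Aggregate per-student predictions as a score-weighted sum."""
    dim = len(predictions[0])
    return [sum(s * p[d] for s, p in zip(scores, predictions)) for d in range(dim)]


# Three hypothetical students: two agree in feature space, one is an outlier.
features = [[0.0, 0.0], [0.0, 0.0], [10.0, 10.0]]
predictions = [[1.0], [3.0], [100.0]]
scores = confidence_scores(features)
fused = fuse(predictions, scores)
```

In this toy setup the outlier student receives a weight close to zero, so the fused prediction is dominated by the two students that agree with each other.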

Paper Structure

This paper contains 26 sections, 3 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of the market-making workflow. A standard market-making algorithm analyzes historical market data and outputs future ordering strategies. We run a simple experiment in which an LLM, prompted with the input, directly predicts the future mid-price, spread, and volume; these predictions are used to construct future orders via classic arithmetic sequences of price and volume. We find that the LLM-based approach surpasses the performance of traditional RL algorithms. Furthermore, with our proposed distillation method, the small model shows further significant improvements and could be used in a real-time scenario.
  • Figure 2: Overview of the CMM Framework. Left: LLM Feature Decomposition and Distillation. The complex feature space of an LLM is decomposed across three dimensions: layer, task, and data. These three variables yield various types of features, each of which is learned by a specialized small model, so that a collection of smaller models effectively represents the comprehensive LLM feature space. Right: Inference with Hájek-MoE. Hájek-MoE employs a kernel function to project the output and feature of each small model into a shared feature space to obtain each model's confidence score. The final prediction is computed by aggregating each model's output, weighted by these scores.
  • Figure 3: Progressive feature decomposition visualized with our probe. With stronger decoupling conditions, the LLM features exhibit clearer separation between clusters. Furthermore, a specialization across model depth is observed: shallow layers prioritize mid-price prediction, middle layers focus on the spread, and deep layers are geared towards total volume.
  • Figure 4: Tree diagram of Orthogonal Feature Decomposition Distillation. By varying three decoupling variables, complex LLM features are decomposed into simpler components and distilled into specialized small models.
  • Figure 5: Feature Decomposition Analysis
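
The order-construction step mentioned in the Figure 1 caption, turning a predicted mid-price, spread, and volume into orders via arithmetic sequences, can be sketched as follows. The caption does not give the exact rule, so the details here are assumptions: the total volume is split evenly between the bid and ask sides, prices step away from the best quote by a fixed `price_step`, and per-level volumes grow by a fixed `volume_step`. The names `Order`, `arithmetic_sequence`, and `build_ladder` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Order:
    side: str      # "bid" or "ask"
    price: float
    volume: float


def arithmetic_sequence(first: float, step: float, n: int) -> List[float]:
    """Return the first n terms of an arithmetic sequence."""
    return [first + k * step for k in range(n)]


def build_ladder(mid: float, spread: float, total_volume: float,
                 levels: int = 3, price_step: float = 1.0,
                 volume_step: float = 0.0) -> List[Order]:
    """Build symmetric bid/ask ladders from predicted mid-price, spread, and volume.

    Assumption: half of total_volume goes to each side, and the first-level
    volume is chosen so the per-side arithmetic sequence sums to that half.
    """
    if levels < 1:
        raise ValueError("levels must be at least 1")
    side_volume = total_volume / 2.0
    # Sum of the sequence: levels * first + volume_step * levels * (levels - 1) / 2
    first_volume = (side_volume - volume_step * levels * (levels - 1) / 2.0) / levels
    if first_volume <= 0:
        raise ValueError("volume_step is too large for the requested total volume")
    volumes = arithmetic_sequence(first_volume, volume_step, levels)
    bid_prices = arithmetic_sequence(mid - spread / 2.0, -price_step, levels)
    ask_prices = arithmetic_sequence(mid + spread / 2.0, price_step, levels)
    orders = [Order("bid", p, v) for p, v in zip(bid_prices, volumes)]
    orders += [Order("ask", p, v) for p, v in zip(ask_prices, volumes)]
    return orders


# Hypothetical predictions: mid-price 100, spread 2, total volume 60.
ladder = build_ladder(100.0, 2.0, 60.0, levels=3, price_step=1.0, volume_step=5.0)
```

With these inputs the bids sit at 99, 98, and 97 and the asks at 101, 102, and 103, with volumes 5, 10, and 15 on each side, so each side carries half of the predicted total volume.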