Meta-Learning Reinforcement Learning for Crypto-Return Prediction

Junqiao Wang, Zhaoyang Guan, Guanyu Liu, Tianze Xia, Xianzhi Li, Shuo Yin, Xinyuan Song, Chuhan Cheng, Tianyu Shi, Alex Lee

TL;DR

Meta-RL-Crypto introduces a self-improving, transformer-based trading agent that unifies meta-learning with reinforcement learning in a triple-loop architecture (Actor, Judge, Meta-Judge) to predict crypto returns using multimodal data. The framework combines on-chain and off-chain signals with a multi-objective reward design and a Generalized Preference-based Reinforcement Optimization loop to continually refine both policy and evaluation criteria without human supervision. It demonstrates superior performance and interpretability across BTC, ETH, and SOL under multiple market regimes, outperforming strong LLM and financial AI baselines. The work advances practical AI for finance by delivering a robust, self-adaptive, and interpretable crypto trading system suitable for fast-changing markets.
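
The TL;DR mentions a multi-objective reward design feeding a preference-based optimization loop. The summary does not give the reward components or how they are combined, so the sketch below is only an illustration under stated assumptions: the three reward dimensions and their weights are hypothetical, and the weighted sum is a common scalarization swapped in for whatever the paper actually uses. It shows how a multi-dimensional Judge score can be collapsed into a (chosen, rejected) pair, which is the usual input to preference-optimization objectives.

```python
from dataclasses import dataclass

# Hypothetical reward dimensions and weights for illustration only; the summary
# does not list the paper's actual reward components.
WEIGHTS: dict[str, float] = {"return_accuracy": 0.5, "risk_control": 0.3, "rationale_quality": 0.2}


@dataclass
class ScoredForecast:
    text: str
    rewards: dict[str, float]  # one Judge score per reward dimension, assumed in [0, 1]


def scalarize(rewards: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Collapse a multi-dimensional reward vector into one number via a weighted sum."""
    return sum(w * rewards[name] for name, w in weights.items())


def preference_pair(candidates: list[ScoredForecast]) -> tuple[ScoredForecast, ScoredForecast]:
    """Return (chosen, rejected): the highest- and lowest-scoring candidates."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a preference pair")
    ranked = sorted(candidates, key=lambda c: scalarize(c.rewards), reverse=True)
    return ranked[0], ranked[-1]


# Usage with made-up scores for two candidate forecasts.
cands = [
    ScoredForecast("BTC +1.2% next day", {"return_accuracy": 0.8, "risk_control": 0.6, "rationale_quality": 0.7}),
    ScoredForecast("BTC -3.0% next day", {"return_accuracy": 0.2, "risk_control": 0.4, "rationale_quality": 0.5}),
]
chosen, rejected = preference_pair(cands)
print(chosen.text, "|", rejected.text)  # prints: BTC +1.2% next day | BTC -3.0% next day
```

A weighted sum keeps the example short; a Pareto-style comparison across dimensions would be an alternative that avoids choosing weights up front.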

Abstract

Predicting cryptocurrency returns is notoriously difficult: price movements are driven by a fast-shifting blend of on-chain activity, news flow, and social sentiment, while labeled training data are scarce and expensive. In this paper, we present Meta-RL-Crypto, a transformer-based architecture that unifies meta-learning and reinforcement learning (RL) to create a fully self-improving trading agent. Starting from a vanilla instruction-tuned LLM, the agent iteratively alternates among three roles (actor, judge, and meta-judge) in a closed-loop architecture. This learning process requires no additional human supervision: drawing on multimodal market inputs and internal preference feedback, the agent continuously refines both its trading policy and its evaluation criteria. Experiments across diverse market regimes show that Meta-RL-Crypto performs well on technical indicators in the real market and outperforms other LLM-based baselines.
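
The abstract describes one model alternating among actor, judge, and meta-judge roles in a closed loop without human labels. The sketch below illustrates one way such a loop could produce training signal for both the policy and the evaluator; it is not the authors' implementation. Assumptions: the shared model is a single hypothetical `generate(role, prompt)` callable, the Judge emits a numeric score as text, the Meta-Judge answers "A" or "B" when comparing two judgments, and `toy_generate` is a seeded random stub standing in for the LLM.

```python
import random
from dataclasses import dataclass
from typing import Callable

# generate(role, prompt) -> text. One callable stands in for the single shared LLM,
# switched between roles only by the role tag (an assumption for this sketch).
Generate = Callable[[str, str], str]


@dataclass
class IterationData:
    actor_pairs: list[tuple[str, str]]  # (chosen forecast, rejected forecast)
    judge_pairs: list[tuple[str, str]]  # (preferred judgment, dispreferred judgment)


def self_improvement_step(generate: Generate, market_prompt: str, n_forecasts: int = 4) -> IterationData:
    # Actor role: sample several candidate forecasts for the same market state.
    forecasts = [generate("actor", market_prompt) for _ in range(n_forecasts)]
    judge_pairs: list[tuple[str, str]] = []
    scores: list[float] = []
    for forecast in forecasts:
        # Judge role: two independent judgments (numeric scores as text).
        j_a, j_b = generate("judge", forecast), generate("judge", forecast)
        # Meta-Judge role: pick the better judgment; expected to answer "A" or "B".
        verdict = generate("meta_judge", f"Forecast: {forecast}\nA: {j_a}\nB: {j_b}")
        preferred, other = (j_a, j_b) if verdict.strip().upper().startswith("A") else (j_b, j_a)
        judge_pairs.append((preferred, other))
        scores.append(float(preferred))  # use the Meta-Judge's preferred score
    # Actor pair: best vs. worst forecast under the preferred judgments (skip ties).
    best = max(range(n_forecasts), key=lambda i: scores[i])
    worst = min(range(n_forecasts), key=lambda i: scores[i])
    actor_pairs = [(forecasts[best], forecasts[worst])] if scores[best] > scores[worst] else []
    return IterationData(actor_pairs, judge_pairs)


# Hypothetical toy stand-in for the shared model, seeded for reproducibility.
rng = random.Random(0)


def toy_generate(role: str, prompt: str) -> str:
    if role == "actor":
        return f"BTC next-day return {rng.uniform(-3, 3):+.2f}%"
    if role == "judge":
        return str(rng.randint(1, 10))
    return rng.choice(["A", "B"])  # meta_judge


data = self_improvement_step(toy_generate, "sentiment=+0.62; active_addresses=+4%")
print(len(data.judge_pairs), len(data.actor_pairs))
```

In a full system, `actor_pairs` would update the model's forecasting behavior and `judge_pairs` its judging behavior, which is one plausible reading of how both the policy and the evaluation criteria get refined; the optimization step itself is omitted here.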

Paper Structure

This paper contains 21 sections, 7 equations, 2 figures, and 6 tables.

Figures (2)

  • Figure 1: Overall Architecture of Meta-RL-Crypto. The system consists of a shared LLM that cyclically adopts the roles of Actor, Judge, and Meta-Judge. Market signals (on-chain metrics, off-chain news, sentiment scores) are encoded into structured prompts used by the Actor to generate forecasts. Each prediction is then scored by the Judge using a multi-dimensional reward vector, which the Meta-Judge uses to enforce preference consistency and evaluate the Judge itself.
  • Figure 2: Meta-RL-Crypto Architecture. The diagram illustrates the cyclical roles of Actor, Judge, and Meta-Judge and shows how data flows through the system. Each role contributes to improving the model's performance: the Actor generates forecasts, the Judge evaluates them, and the Meta-Judge refines those evaluations.
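
Figure 1 states that on-chain metrics, off-chain news, and sentiment scores are encoded into structured prompts for the Actor. The paper's actual prompt template is not given in this summary, so the following is a hypothetical sketch: the `MarketSnapshot` fields, metric names, section layout, and task wording are all assumptions, and the example values are made up. It only shows the general idea of rendering heterogeneous signals into a fixed-layout text prompt.

```python
from dataclasses import dataclass, field


@dataclass
class MarketSnapshot:
    asset: str
    date: str
    onchain: dict[str, float]                           # hypothetical on-chain metric names
    headlines: list[str] = field(default_factory=list)  # off-chain news
    sentiment: float = 0.0                              # aggregate score, assumed in [-1, 1]


def build_actor_prompt(snap: MarketSnapshot, max_headlines: int = 3) -> str:
    """Render one market snapshot as a fixed-layout text prompt for the Actor."""
    # Sort metrics so the same snapshot always yields the same prompt text.
    onchain = "\n".join(f"- {name}: {value:,.2f}" for name, value in sorted(snap.onchain.items())) or "- (none)"
    news = "\n".join(f"- {h}" for h in snap.headlines[:max_headlines]) or "- (none)"
    return (
        f"Asset: {snap.asset}\n"
        f"Date: {snap.date}\n"
        f"On-chain metrics:\n{onchain}\n"
        f"News:\n{news}\n"
        f"Sentiment score: {snap.sentiment:+.2f}\n"
        "Task: forecast the next-day return (%) and explain the reasoning."
    )


# Usage with made-up example values.
snap = MarketSnapshot(
    asset="BTC",
    date="2024-01-01",
    onchain={"active_addresses": 812345.0, "exchange_netflow": -1520.5},
    headlines=["Example headline 1", "Example headline 2"],
    sentiment=0.35,
)
prompt = build_actor_prompt(snap)
print(prompt)
```

Keeping a fixed section order makes the Actor's input predictable across days and assets, which also makes it easier for the Judge to refer back to specific signals when scoring a forecast.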