LiveThinking: Enabling Real-Time Efficient Reasoning for AI-Powered Livestreaming via Reinforcement Learning

Yuhan Sun, Zhiwei Huang, Wanqing Cui, Shaopan Xiong, Yazhi Guo, Meiguang Jin, Junfeng Ma

TL;DR

LiveThinking targets real-time, high-quality reasoning for AI-powered livestreaming with a two-stage pipeline. The first stage distills a 670B teacher into a 30B MoE student; the second applies reinforcement learning (GRPO) to compress reasoning trajectories so that responses arrive with sub-second latency. The process preserves correctness and helpfulness while cutting computation and decoding cost, achieves state-of-the-art results on the industrial Tblive-E-Commerce QA benchmark and the public MuSiQue benchmark, and delivers measurable impact in Taobao Live. The work identifies and mitigates the verbose reasoning inherited from distillation, demonstrates the efficiency of the MoE architecture, and provides a practical deployment blueprint for low-latency, high-quality conversational AI in time-sensitive settings. Overall, LiveThinking offers a generalizable paradigm for deploying capable yet efficient reasoning models in interactive, latency-constrained domains, with commercial benefits such as increased GMV and engagement.

Abstract

In AI-powered e-commerce livestreaming, digital avatars require real-time responses to drive engagement, a task for which high-latency Large Reasoning Models (LRMs) are ill-suited. We introduce LiveThinking, a practical two-stage optimization framework to bridge this gap. First, we address computational cost by distilling a 670B teacher LRM into a lightweight 30B Mixture-of-Experts (MoE) model (3B active) using Rejection Sampling Fine-Tuning (RFT). This reduces deployment overhead but preserves the teacher's verbose reasoning, causing latency. To solve this, our second stage employs reinforcement learning with Group Relative Policy Optimization (GRPO) to compress the model's reasoning path, guided by a multi-objective reward function balancing correctness, helpfulness, and brevity. LiveThinking achieves a 30-fold reduction in computational cost, enabling sub-second latency. In real-world application on Taobao Live, it improved response correctness by 3.3% and helpfulness by 21.8%. Tested by hundreds of thousands of viewers, our system led to a statistically significant increase in Gross Merchandise Volume (GMV), demonstrating its effectiveness in enhancing user experience and commercial performance in live, interactive settings.
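To make the second stage more concrete, the following is a minimal sketch of how a multi-objective reward could feed GRPO's group-relative advantage. It is illustrative only: the linear brevity term, the token budget, the weights, and all function names are assumptions rather than the paper's actual reward design. The one part taken from GRPO itself is the normalization of each sampled response's reward against the mean and standard deviation of its group.

```python
from statistics import mean, pstdev


def brevity_reward(num_tokens, budget):
    """Hypothetical length term: falls linearly from 1.0 (empty trace) to 0.0 at `budget` tokens."""
    return max(0.0, 1.0 - num_tokens / budget)


def combined_reward(correct, helpful, num_tokens, budget=200, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of correctness, helpfulness, and brevity (weights are assumed, not from the paper)."""
    w_c, w_h, w_b = weights
    return w_c * correct + w_h * helpful + w_b * brevity_reward(num_tokens, budget)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward minus the group mean, divided by the group std.

    Population std is used here; `eps` guards against a zero-variance group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four sampled responses to the same question:
# (correctness score, helpfulness score, reasoning tokens)
group = [(1.0, 1.0, 50), (1.0, 1.0, 150), (0.0, 1.0, 50), (1.0, 0.0, 100)]
rewards = [combined_reward(c, h, n) for c, h, n in group]
advantages = group_relative_advantages(rewards)

for sample, r, a in zip(group, rewards, advantages):
    print(sample, round(r, 3), round(a, 3))
```

In this toy group, the two responses that are both correct and helpful receive positive advantages, and the shorter of the two scores higher. This is the kind of pressure toward concise reasoning that the abstract describes.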

Paper Structure

This paper contains 39 sections, 10 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Conversational e-commerce assistance within AI-powered livestreaming. An icon marks the AI's response. Green text indicates high correctness; orange text indicates high helpfulness.
  • Figure 2: The proposed two-stage methodology consists of: (1) an initial knowledge distillation stage using RFT, followed by (2) a second stage dedicated to optimizing the efficiency of the reasoning path via GRPO.
  • Figure 3: Reward Function for Reasoning Length Optimization
  • Figure 4: Performance Evaluation across Length Ratio
  • Figure 5: RFT Training Eval Loss Curve of Different Models
  • ...and 2 more figures