Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection

Zhanhe Lei, Zhongyuan Wang, Jikang Cheng, Baojin Huang, Yuhong Yang, Zhen Han, Chao Liang, Dengpan Ye

Abstract

Standard supervised training for deepfake detection treats all samples with uniform importance, which can be suboptimal for learning robust and generalizable features. In this work, we propose a novel Tutor-Student Reinforcement Learning (TSRL) framework to dynamically optimize the training curriculum. Our method models the training process as a Markov Decision Process where a "Tutor" agent learns to guide a "Student" (the deepfake detector). The Tutor, implemented as a Proximal Policy Optimization (PPO) agent, observes a rich state representation for each training sample, encapsulating not only its visual features but also its historical learning dynamics, such as EMA loss and forgetting counts. Based on this state, the Tutor takes an action by assigning a continuous weight (0-1) to the sample's loss, thereby dynamically re-weighting the training batch. The Tutor is rewarded based on the Student's immediate performance change, specifically rewarding transitions from incorrect to correct predictions. This strategy encourages the Tutor to learn a curriculum that prioritizes high-value samples, such as hard-but-learnable examples, leading to a more efficient and effective training process. We demonstrate that this adaptive curriculum improves the Student's generalization capabilities against unseen manipulation techniques compared to traditional training methods. Code is available at https://github.com/wannac1/TSRL.
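The bookkeeping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class and function names, the EMA decay of 0.9, the batch-mean normalization of the weighted loss, and the +1/0 reward values are all assumptions made for illustration, and the PPO Tutor itself is omitted (the weights are simply passed in).

```python
from dataclasses import dataclass


@dataclass
class SampleStats:
    """Learning dynamics tracked for one training sample (hypothetical layout)."""
    ema_loss: float = 0.0
    seen: bool = False
    prev_correct: bool = False
    forget_count: int = 0


class SampleHistory:
    """Tracks the EMA loss and forgetting count of each sample across visits."""

    def __init__(self, ema_decay=0.9):
        self.ema_decay = ema_decay  # assumed smoothing factor, not from the paper
        self.stats = {}

    def update(self, sample_id, loss, correct):
        s = self.stats.setdefault(sample_id, SampleStats())
        if s.seen:
            # Blend the new loss into the running average.
            s.ema_loss = self.ema_decay * s.ema_loss + (1.0 - self.ema_decay) * loss
        else:
            # First visit: initialise the EMA with the raw loss.
            s.ema_loss = loss
        # A forgetting event is a correct -> incorrect transition.
        if s.prev_correct and not correct:
            s.forget_count += 1
        s.prev_correct = correct
        s.seen = True
        return s


def weighted_batch_loss(losses, weights):
    """Re-weight per-sample losses with Tutor weights clipped to [0, 1].

    Dividing by the batch size is an assumption; the abstract does not
    state how the weighted loss is normalised.
    """
    clipped = [min(1.0, max(0.0, w)) for w in weights]
    return sum(w * l for w, l in zip(clipped, losses)) / len(losses)


def transition_rewards(correct_before, correct_after):
    """Give +1 for each incorrect -> correct transition and 0 otherwise.

    The reward magnitudes are assumed; the abstract only says that such
    transitions are rewarded.
    """
    return [1.0 if (not b and a) else 0.0
            for b, a in zip(correct_before, correct_after)]
```

In a training loop under these assumptions, the Student's per-sample losses would be computed first, the Tutor would map each sample's state (features plus `SampleStats`) to a weight, and `transition_rewards` would be evaluated after the Student's update to produce the Tutor's reward signal.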

Paper Structure (28 sections, 4 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Comparison of hard sample (Historical EMA Loss > 0.7) reduction efficiency during training. Our TSRL framework (red) demonstrates a significantly faster convergence in resolving difficult samples compared to the traditional baseline (blue), as highlighted by the logarithmic scale.
  • Figure 2: A simplified overview of our proposed Tutor-Student Reinforcement Learning (TSRL) framework. The Tutor (RL Agent) learns a policy to dynamically assign weights to training samples (e.g., up-weighting "Hard" samples, down-weighting "Easy" samples) to optimize the Student (Detector) for generalization.
  • Figure 3: The Tutor-Student Reinforcement Learning (TSRL) Framework
  • Figure 4: Visual comparison of the average AUC (on DF40) for the standard CORE baseline, CORE with a static Curriculum Learning (CL) heuristic, and the full CORE + TSRL framework. Our dynamic TSRL approach shows a significant performance gain over both other methods.
  • Figure 5: UMAP visualization of feature spaces. We present two comparative visualizations. (a) Fake vs Real: Visualization by class (Green: Real, Red: Fake). The Baseline model (left) exhibits a single manifold with heavy class overlap, indicating a confused feature space. In contrast, our TSRL framework (right) learns a perfectly disentangled representation, cleanly separating all Real samples (green arc) from all Fake samples (red arc and cloud). (b) Fake Only: Visualization of only Fake samples, colored by difficulty (Blue/Cyan: Easy, Purple/Magenta: Hard). The Baseline (left) shows a single, continuous arc with easy (blue/cyan) samples mixed throughout. Our TSRL model (right) again demonstrates superior structure, partitioning the Fake samples into two distinct clusters: a separate cloud of "Easy Fakes" (blue/cyan) and a primary arc of "Hard Fakes" (purple).
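The hard-sample criterion from the Figure 1 caption (historical EMA loss above 0.7) can be expressed as a short sketch. Only the 0.7 threshold comes from the caption; the function names and the per-epoch list layout are hypothetical, and the way the paper aggregates counts for its plot is not stated here.

```python
HARD_EMA_THRESHOLD = 0.7  # threshold quoted in the Figure 1 caption


def count_hard_samples(ema_losses, threshold=HARD_EMA_THRESHOLD):
    """Count samples whose historical EMA loss is strictly above the threshold."""
    return sum(1 for value in ema_losses if value > threshold)


def hard_sample_curve(ema_losses_per_epoch, threshold=HARD_EMA_THRESHOLD):
    """Return the hard-sample count for each epoch (one list of EMA losses per epoch)."""
    return [count_hard_samples(epoch, threshold) for epoch in ema_losses_per_epoch]
```

Plotting such a curve for two training runs on a logarithmic y-axis would give a comparison of the kind Figure 1 describes.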