Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts

Yuejiang Liu, Alexandre Alahi

TL;DR

The paper addresses weak-to-strong generalization under large capability gaps by proposing Co-Supervised Learning (CSL), a hierarchical mixture of experts that uses multiple fixed weak teachers to supervise a strong student. It introduces an EM-like framework with teacher assignment and noise reduction, enabling the student to benefit from specialized supervision while rejecting noisy annotations through teacher-student and local-global consistency. Empirical results on OpenAI's weak-to-strong benchmark, ImageNet, and DomainNet show that CSL with multiple specialists and denoising consistently improves performance gap recovery by substantial margins (e.g., over 15% on ImageNet and up to 17% on DomainNet) compared to single-teacher baselines. The work demonstrates a practical pathway to align powerful models using diverse, imperfect supervision and highlights its potential to extend beyond vision tasks in future research.

Abstract

Steering the behavior of a strong model pre-trained on internet-scale data can be difficult due to the scarcity of competent supervisors. Recent studies reveal that, despite supervisory noise, a strong student model may surpass its weak teacher when fine-tuned on specific objectives. Yet the effectiveness of such weak-to-strong generalization remains limited, especially in the presence of large capability gaps. In this paper, we propose to address this challenge by harnessing a diverse set of specialized teachers, instead of a single generalist one, that collectively supervise the strong student. Our approach resembles the classical hierarchical mixture of experts, with two components tailored for co-supervision: (i) we progressively alternate student training and teacher assignment, leveraging the growth of the strong student to identify plausible supervision; (ii) we conservatively enforce teacher-student and local-global consistency, leveraging their dependencies to reject potential annotation noise. We validate the proposed method through visual recognition tasks on the OpenAI weak-to-strong benchmark and additional multi-domain datasets. Our code is available at https://github.com/yuejiangliu/csl.
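The two components named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `assign_teachers` and `filter_consistent`, the use of hard teacher labels, and the fixed probability `threshold` are all assumptions made for illustration. The sketch only shows the general idea of using the current student's predictions as a proxy for the target when choosing among weak teachers, then discarding pseudo-labels the student finds implausible.

```python
from typing import List, Optional


def assign_teachers(student_probs: List[List[float]],
                    teacher_labels: List[List[int]]) -> List[int]:
    """Pick, per sample, the teacher whose label the current student rates highest.

    student_probs[i][c]: student's probability of class c for sample i.
    teacher_labels[m][i]: hard label proposed by (hypothetical) teacher m for sample i.
    Ties are resolved in favor of the lowest teacher index.
    """
    chosen = []
    for i, probs in enumerate(student_probs):
        # Score each teacher by the student's probability of that teacher's label.
        scores = [probs[labels[i]] for labels in teacher_labels]
        chosen.append(max(range(len(scores)), key=scores.__getitem__))
    return chosen


def filter_consistent(student_probs: List[List[float]],
                      teacher_labels: List[List[int]],
                      chosen: List[int],
                      threshold: float = 0.5) -> List[Optional[int]]:
    """Keep a pseudo-label only if the student gives it at least `threshold` probability.

    The threshold rule is an assumed stand-in for the paper's consistency criterion.
    Rejected samples are marked with None.
    """
    kept: List[Optional[int]] = []
    for i, m in enumerate(chosen):
        label = teacher_labels[m][i]
        kept.append(label if student_probs[i][label] >= threshold else None)
    return kept


# Toy data: 3 samples, 3 classes, 2 hypothetical teachers.
student_probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.4, 0.35, 0.25]]
teacher_labels = [[0, 1, 1], [2, 2, 0]]

chosen = assign_teachers(student_probs, teacher_labels)
kept = filter_consistent(student_probs, teacher_labels, chosen)
print(chosen, kept)
```

In a full alternating procedure, the student would be fine-tuned on the kept labels and the assignment step repeated with the updated student; that training loop is omitted here.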

Paper Structure (23 sections, 8 equations, 8 figures, 1 algorithm)

Figures (8)

  • Figure 1: Illustration of co-supervised learning for weak-to-strong generalization. We revisit the hierarchical mixture-of-experts method in the context of superalignment, and present an approach that leverages multiple weak supervisors with different specializations to collectively supervise a strong student model.
  • Figure 2: Effectiveness of vanilla weak-to-strong generalization. The performance gap recovery (PGR) is notable when the performance of the supervisor is close to the ceiling performance of the strong student (0.74), but limited when the supervisor lags behind.
  • Figure 3: An example of two-level hierarchical weak supervisors. A generalist supervisor $\pi_0$ is first branched into two specialists $\{\pi_{11}, \pi_{12}\}$ and further branched into three $\{\pi_{21}, \pi_{22}, \pi_{23}\}$. While each specialist focuses on only a segment of the problem domain, the combined expertise at each level ensures domain coverage.
  • Figure 4: Illustration of the alternating teacher assignment and student training processes. The output from the latest student serves as a proxy for the target, guiding the selection of the most appropriate weak supervisor. The chosen supervisor is then utilized to enhance the fine-tuning of the strong student.
  • Figure 5: Progressive noise reduction via teacher-student consistency ($\pi_{km} \rightarrow \theta_{km}$) and local-global consistency ($\theta_{km} \rightarrow \theta_{k}$).
  • ...and 3 more figures
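
The performance gap recovery (PGR) metric mentioned in Figure 2 is commonly defined in the weak-to-strong generalization setting as the fraction of the gap between the weak supervisor and the strong ceiling that the weakly supervised student recovers. A minimal sketch follows, assuming that standard definition; the function name `performance_gap_recovery` and the weak and student accuracies in the example are hypothetical, and only the ceiling value 0.74 comes from the caption above.

```python
def performance_gap_recovery(weak_acc: float,
                             weak_to_strong_acc: float,
                             ceiling_acc: float) -> float:
    """PGR = (weak-to-strong - weak) / (ceiling - weak).

    A value of 1.0 means the student trained on weak labels matches the
    strong ceiling; 0.0 means it does no better than its weak supervisor.
    """
    gap = ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("ceiling accuracy must exceed weak supervisor accuracy")
    return (weak_to_strong_acc - weak_acc) / gap


# Hypothetical accuracies; only the 0.74 ceiling is taken from Figure 2.
pgr = performance_gap_recovery(weak_acc=0.54,
                               weak_to_strong_acc=0.64,
                               ceiling_acc=0.74)
print(f"PGR = {pgr:.2f}")
```

Because the denominator is the distance from the weak supervisor to the ceiling, the same absolute accuracy gain yields a smaller PGR when the supervisor is much weaker than the student, which makes the large-gap regime described in Figure 2 the harder one to score well on.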