Learning from Teaching Regularization: Generalizable Correlations Should be Easy to Imitate

Can Jin, Tong Che, Hongwu Peng, Yiyuan Li, Dimitris N. Metaxas, Marco Pavone

TL;DR

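Learning from Teaching (LoT) is a regularization technique in which auxiliary student learners are trained by the main model and, in turn, provide feedback that steers it toward generalizable, easily imitable correlations. Experiments in Computer Vision, Natural Language Processing, and Reinforcement Learning show significant benefits compared to training models on the original dataset.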

Abstract

Generalization remains a central challenge in machine learning. In this work, we propose Learning from Teaching (LoT), a novel regularization technique that enhances the generalization of deep neural networks. Inspired by the human ability to capture concise and abstract patterns, we hypothesize that generalizable correlations are easier to imitate. LoT operationalizes this hypothesis by pairing the main model with auxiliary student learners: the students are trained by the main model and, in turn, provide feedback that helps the main model capture more generalizable and imitable correlations. Experiments across several domains, including Computer Vision and Natural Language Processing, and with methodologies such as Reinforcement Learning, demonstrate that LoT brings significant benefits compared to training models on the original dataset. The results suggest that LoT is effective and efficient at identifying generalizable information at the right scales while discarding spurious data correlations, making it a valuable addition to current machine learning. Code is available at https://github.com/jincan333/LoT.
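
The feedback loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the student imitates the main (teacher) model by minimizing a KL divergence between their output distributions (the figure captions report KL-divergence student losses), and that the teacher adds the same imitation gap to its task loss, weighted by a coefficient `alpha`. The function names and the direction of each KL term are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cross_entropy(probs, label, eps=1e-12):
    """Negative log-likelihood of the true label."""
    return -math.log(probs[label] + eps)

def student_imitation_loss(teacher_logits, student_logits):
    """Assumed student objective: match the teacher's predictive distribution."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))

def lot_teacher_loss(teacher_logits, student_logits, label, alpha=1.0):
    """Assumed teacher objective: task loss plus alpha times the imitation gap.

    A teacher whose predictions are hard for the student to imitate
    pays a larger regularization penalty.
    """
    t = softmax(teacher_logits)
    s = softmax(student_logits)
    return cross_entropy(t, label) + alpha * kl_divergence(s, t)
```

In this sketch, setting `alpha=0` recovers ordinary supervised training of the teacher, and the penalty vanishes when the student reproduces the teacher's distribution exactly.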

Paper Structure (43 sections, 4 equations, 5 figures, 13 tables, 2 algorithms)

Figures (5)

  • Figure 1: Training and test KL-divergence losses of student models in LoT using ViT-B/16 and ViT-L/16 on CIFAR-100 with different teacher models. The sophisticated students achieve lower losses than the deceptive students given the same computational budget.
  • Figure 2: Episodic return of the teacher agent under LoT and under Teacher-only training on four Atari games (averaged over ten runs). LoT achieves higher returns than Teacher-only on all four games.
  • Figure 3: Test accuracy of teacher models in LoT and Teacher-only using ViT-B/16 and ViT-L/16 on CIFAR-100. LoT achieves higher test accuracy with fewer training steps.
  • Figure 4: Effects of the regularization coefficient $\alpha$ (left) and the student steps ratio $N$ (right). $\alpha=1$ yields the lowest test perplexity for the teacher model, and moderate values of the student steps ratio, such as $N=4$ and $N=5$, benefit the teacher model the most.
  • Figure 5: Training and test KL-divergence losses of student models in LoT using ResNet-18 and ResNet-50 on CIFAR-100 with different teacher models.
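
The student steps ratio $N$ in Figure 4 can be pictured as an interleaved update schedule. The sketch below assumes that $N$ means the number of student updates performed for each teacher update; the captions do not define it, so this reading, along with the hypothetical name `lot_schedule`, is an illustration only.

```python
def lot_schedule(teacher_steps, n_ratio):
    """Yield (role, step_index) pairs for an interleaved schedule.

    Assumption: each teacher update is followed by n_ratio student updates.
    """
    for i in range(teacher_steps):
        yield ("teacher", i)
        for j in range(n_ratio):
            yield ("student", i * n_ratio + j)

# Example: two teacher updates with a student steps ratio of 3.
schedule = list(lot_schedule(teacher_steps=2, n_ratio=3))
```

Under this reading, a larger $N$ gives the students more updates per teacher update, at a proportionally higher training cost.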