Class Prototypes based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos

Rohit Gupta, Anirban Roy, Claire Christensen, Sujeong Kim, Sarah Gerard, Madeline Cincebeaux, Ajay Divakaran, Todd Grindal, Mubarak Shah

TL;DR

The paper addresses automatic detection of fine-grained educational content in child-targeted videos under a multilabel setting. It introduces a class-prototype-based supervised contrastive learning framework that learns a prototype $cp_k$ for each class and optimizes a loss $L_{mlc}$, integrated with a multimodal transformer that fuses visual frames and ASR text. On the new APPROVE dataset (193 hours, 19 classes: 7 literacy, 11 math, plus background), the approach outperforms strong baselines, and it also generalizes to a YouTube-8M subset and COIN. The intended application is filtering online video for early learners using multimodal representations and prototype-guided classification.

Abstract

The recent growth in the consumption of online media by children during early childhood necessitates data-driven tools that enable educators to filter for appropriate educational content for young learners. This paper presents an approach for detecting educational content in online videos. We focus on two widely used educational content classes: literacy and math. For each class, we choose prominent codes (sub-classes) based on the Common Core Standards. For example, literacy codes include 'letter names' and 'letter sounds', and math codes include 'counting' and 'sorting'. We pose this as a fine-grained multilabel classification problem, as videos can contain multiple types of educational content and the content classes can be visually similar (e.g., 'letter names' vs 'letter sounds'). We propose a novel class prototypes based supervised contrastive learning approach that can handle fine-grained samples associated with multiple labels. We learn a class prototype for each class, and a loss function is employed to minimize the distances between a class prototype and the samples from the class. Similarly, distances between a class prototype and the samples from other classes are maximized. As the alignment between visual and audio cues is crucial for effective comprehension, we consider a multimodal transformer network to capture the interaction between visual and audio cues in videos while learning the embedding for videos. For evaluation, we present a dataset, APPROVE, employing educational videos from YouTube labeled with fine-grained education classes by education researchers. APPROVE consists of 193 hours of expert-annotated videos with 19 classes. The proposed approach outperforms strong baselines on APPROVE and other benchmarks such as YouTube-8M and COIN. The dataset is available at https://github.com/rohit-gupta/MMContrast/tree/main/APPROVE
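
The pull/push behaviour described in the abstract can be illustrated with a small sketch. This is not the paper's $L_{mlc}$, whose exact form is not given on this page; it is one plausible InfoNCE-style instantiation written for illustration. The function names, the cosine similarity, and the `temperature` parameter are all assumptions. For each class a sample is labeled with, the sample is pulled toward that class prototype and pushed away from the prototypes of the classes it is not labeled with.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length; a zero vector is returned unchanged.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def prototype_contrastive_loss(embedding, prototypes, labels, temperature=0.1):
    """Hypothetical multi-label prototype contrastive loss for one sample.

    embedding:  list of floats, the sample's embedding
    prototypes: list of vectors, one per class
    labels:     multi-hot list (1 = class present, 0 = class absent)
    """
    z = l2_normalize(embedding)
    # Temperature-scaled cosine similarity to every class prototype.
    logits = [dot(z, l2_normalize(p)) / temperature for p in prototypes]
    positives = [k for k, y in enumerate(labels) if y == 1]
    negatives = [k for k, y in enumerate(labels) if y == 0]
    if not positives:
        return 0.0  # assumption: samples without labels contribute nothing

    total = 0.0
    for k in positives:
        # Contrast each positive prototype only against the negative ones,
        # so that two positive prototypes never compete with each other.
        terms = [logits[k]] + [logits[n] for n in negatives]
        m = max(terms)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(t - m) for t in terms))
        total += log_denominator - logits[k]
    return total / len(positives)


if __name__ == "__main__":
    # Toy 2-D prototypes for three hypothetical classes.
    protos = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    print(prototype_contrastive_loss([1.0, 1.0], protos, [1, 1, 0]))
    print(prototype_contrastive_loss([1.0, 1.0], protos, [0, 0, 1]))
```

In this toy setup, a sample lying between the first two prototypes gets a small loss when labeled with both of those classes and a large loss when labeled only with the third class, which is the behaviour the abstract describes. In actual training the prototypes would be learnable parameters updated together with the encoder.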

Paper Structure

This paper contains 26 sections, 5 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Sample video frames from the APPROVE dataset. Videos belong to the (a) literacy classes, (b) math classes, and (c) background. Background videos do not contain educational content but share visual similarities with educational videos. The videos are labeled with fine-grained sub-classes, e.g., letter names vs letter sounds.
  • Figure 2: Frequency of the classes in APPROVE. Math codes are in Orange and literacy codes in Blue.
  • Figure 3: Distribution of the number of labels per video.
  • Figure 4: Contrastive learning operates on the feature space by bringing the representations of similar samples close and pushing distinct samples apart. Prior work in (a) Supervised Contrastive Learning (SupCon) trains the network by treating instances from the same class as positive pairs and instances from different classes as negative pairs. This approach doesn't generalize to multi-label classification tasks, as some instance pairs have partially overlapping labels. We propose the use of class prototypes to enable (b) Multi-Label Prototypes Contrastive Learning. Each sample and the class prototypes corresponding to the labels associated with the sample are treated as positive pairs. Similarly, negative pairs are determined based on the missing class labels. Prototypes are represented by stars ($\star$) and inputs as circles ($\circ$) colored with all their relevant labels. We discuss strategies for initializing and learning the label prototypes in the section on prototypes.
  • Figure 5: Multi-Modal Classification Network. A text encoder encodes the ASR text from the video, while an image encoder produces tokens representing each frame of the video. Unimodal pre-training is first carried out on the text and image encoders separately, using the multi-label contrastive loss with shared prototypes to align the representations across both modalities. This is followed by joint end-to-end learning of the whole multi-modal network, including the fusion encoder, which applies multi-head self-attention within and across the modalities. The prototypes are further refined during the multi-modal training phase.
  • ...and 7 more figures
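
The pairing rule in the Figure 4 caption can be made concrete with a short sketch. The helper names below are hypothetical and only illustrate the caption's argument: instance-to-instance pairing becomes ambiguous when two videos share some but not all labels, whereas sample-to-prototype pairing is always well defined. The class names are the example codes mentioned in the abstract.

```python
def instance_pair_relation(labels_a, labels_b):
    """Classify an instance pair the way instance-level supervised
    contrastive learning would need to (labels are sets of class names)."""
    if labels_a == labels_b:
        return "positive"
    if not (labels_a & labels_b):
        return "negative"
    return "ambiguous"  # partially overlapping labels: neither clearly positive nor negative


def prototype_pairs(labels, all_classes):
    """Pair one sample with class prototypes: prototypes of its own labels
    are positives, prototypes of the missing labels are negatives."""
    positives = [c for c in all_classes if c in labels]
    negatives = [c for c in all_classes if c not in labels]
    return positives, negatives


if __name__ == "__main__":
    classes = ["letter names", "letter sounds", "counting", "sorting"]
    video_a = {"counting"}
    video_b = {"counting", "sorting"}
    print(instance_pair_relation(video_a, video_b))
    print(prototype_pairs(video_b, classes))
```

Every sample gets an unambiguous set of positive and negative prototypes regardless of how its labels overlap with those of other samples, which is why the prototype formulation extends to the multi-label case.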