Multiple Instance Learning for Cheating Detection and Localization in Online Examinations

Yemeng Liu; Jing Ren; Jianshuo Xu; Xiaomei Bai; Roopdeep Kaur; Feng Xia

TL;DR

This work tackles cheating detection in online examinations by framing it as a weakly supervised video anomaly problem. It introduces CHEESE, a framework that couples a MIL-based label generator with a multi-modal feature encoder and a spatio-temporal graph module to detect and localize cheating events using cues from eye gaze, head pose, facial actions, body pose, and background. The key contributions are (i) a continuous sub-bag MIL labeling strategy, (ii) a self-guided attention-enhanced encoder, (iii) a dual-graph spatio-temporal module incorporating temporal consistency and feature similarity, and (iv) comprehensive experiments across UCF-Crime, ShanghaiTech, and OEP showing strong performance and real-time feasibility. The approach demonstrates actionable detection and localization capabilities with practical relevance for online proctoring, and points to future work in expanding multi-modal data and mitigating pseudo-label noise to further improve robustness and accuracy.

Abstract

The spread of coronavirus disease 2019 (COVID-19) has moved many courses and exams online. The cheating detection model in an exam invigilation system plays a pivotal role in guaranteeing the fairness of remote examinations. However, cheating behavior is rare, and most existing work does not comprehensively consider features such as head posture, gaze angle, body posture, and background information when detecting it. In this paper, we present CHEESE, a CHEating detection framework via multiplE inStancE learning. The framework consists of a label generator that implements weak supervision and a feature encoder that learns discriminative features. In addition, the framework combines body posture and background features extracted by 3D convolution with eye gaze, head posture, and facial features captured by OpenFace 2.0. These features are concatenated and fed into the spatio-temporal graph module, which analyzes spatio-temporal changes in video clips to detect cheating behaviors. Our experiments on three datasets, UCF-Crime, ShanghaiTech, and Online Exam Proctoring (OEP), demonstrate the effectiveness of our method compared with state-of-the-art approaches; it obtains a frame-level AUC score of 87.58% on the OEP dataset.
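
As a rough illustration of what a frame-level AUC measures, the sketch below repeats each clip-level anomaly score over the frames of that clip and then computes AUC as the probability that a cheating frame outscores a normal frame. The helper names, the scores, the labels, and the choice of 4 frames per clip are all hypothetical and are not taken from the paper; this is a minimal sketch, not the authors' evaluation code.

```python
def expand_to_frames(clip_scores, frames_per_clip):
    """Repeat each clip-level score once for every frame in that clip."""
    return [s for s in clip_scores for _ in range(frames_per_clip)]


def roc_auc(labels, scores):
    """AUC as the chance that a random positive frame outranks a random
    negative frame, with ties counted as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative frame")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: 3 clips of 4 frames each (assumed clip length).
clip_scores = [0.1, 0.8, 0.4]
frame_scores = expand_to_frames(clip_scores, frames_per_clip=4)

# Hypothetical ground truth: only the last two frames of clip 2 are cheating.
frame_labels = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0]

auc = roc_auc(frame_labels, frame_scores)
```

The pairwise count is quadratic in the number of frames, which is fine for a toy example; a rank-based implementation would be the usual choice for full-length videos.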

Paper Structure (24 sections, 13 equations, 7 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Video Anomaly Detection
Online Proctoring
Multiple Instance Learning
Graph Convolutional Neural Network
Problem Definition
Methodology
Multiple Instance Learning for Label Generation
The Composition of Feature Encoder
Self-Guided Attention Module
Multi-modal Feature Fusion
Spatio-temporal Graph Module
Loss Function
Experiments
...and 9 more sections

Figures (7)

Figure 1: The flow chart of the proposed CHEESE framework, which consists of a multiple instance label generator $G$ and a feature encoder $FE$ followed by a spatio-temporal graph module. We use the feature extractor $E$ to provide clip-level features for the label generator and apply the clip-level labels $Y^a =\left \{y_i^a\right \}$ and $Y^n$ to train the feature encoder in the second stage. Specifically, videos are divided into positive and negative bags according to the video-level label $Y=0/1$. $y_i^a$ is generated by the generator, and $Y^n$ can be derived directly from the video-level label ($Y=0$).
Figure 2: The structure of our label generator. After the features are given, we exploit continuous sampling to obtain sub-bags. Each sub-bag contains the features of T consecutive clips.
Figure 3: The structure of our dual-branch attention-enhanced feature encoder. $F_4$, $F_5$, $F^*$ and $A^*$ are feature maps. $M_1$ and $M_2$ are two encoding modules built from convolutional layers. $L_1$ and $L_2$ are cross-entropy loss functions. GAP is the global average pooling operation, and Avg denotes channel-level average pooling.
Figure 4: The variations of AUC for different values of the multiple detector $K$ in self-guided attention module on OEP dataset using I3D.
Figure 5: The variations in AUC on OEP dataset using I3D by changing the total number of consecutive instances in a sub-bag $T$.
...and 2 more figures
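
The continuous sampling described in the Figure 2 caption can be sketched as a sliding window over a video's clip-level features, so that each sub-bag holds T consecutive clips. The function name, the `stride` parameter, and the handling of videos shorter than T are assumptions made for illustration; the paper may choose window start positions differently.

```python
def continuous_sub_bags(clip_features, T, stride=1):
    """Split a sequence of clip-level features into sub-bags of T
    consecutive clips (hypothetical helper, sliding-window assumption)."""
    if T <= 0 or stride <= 0:
        raise ValueError("T and stride must be positive")
    n = len(clip_features)
    # Assumption: a video with fewer than T clips yields no sub-bags.
    return [clip_features[s:s + T] for s in range(0, n - T + 1, stride)]


# Stand-in "features": clip indices 0..5 instead of real feature vectors.
clips = list(range(6))
bags = continuous_sub_bags(clips, T=3)
# bags == [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]
```

Keeping the clips in each sub-bag contiguous preserves temporal context, which is the property the caption emphasizes; a larger `stride` would trade overlap between sub-bags for fewer of them.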
