FlatFormer: A Flat Transformer Knowledge Tracing Model Based on Cognitive Bias Injection

Xiao-li Xia, Hou-biao Li

TL;DR

Knowledge Tracing (KT) models face a persistent trade-off between predictive accuracy and computational efficiency. The authors introduce FlatFormer, a flat Transformer augmented with two lightweight cognitive injections (a session-aware input embedding and a precomputed power-law forgetting bias in attention) that emulate hierarchical cognitive dynamics without extra architectural complexity. Across four large KT datasets, FlatFormer achieves state-of-the-art or near-state-of-the-art performance with far fewer parameters and faster inference than heavyweight baselines, supporting the information-injection paradigm. Ablation and robustness analyses indicate that both injections contribute meaningfully and that the model is robust to sequence length and hyperparameter choices, which matters for real-time deployment in intelligent tutoring systems (ITS).

Abstract

Knowledge Tracing (KT) models face a critical "Performance-Complexity Trap": capturing complex cognitive dynamics such as learning sessions and memory decay typically requires deep hierarchical architectures, which incur prohibitive computational costs for real-time deployment. To resolve this, we propose FlatFormer, a streamlined architecture based on the novel design paradigm of "Information Injection over Structural Stacking." Unlike parameter-heavy hierarchical models, FlatFormer uses a standard flat Transformer augmented with two lightweight injection mechanisms: (i) a hybrid input encoding strategy combining learnable session identifiers with fixed sinusoidal step embeddings; and (ii) a pre-computed power-law bias integrated directly into the attention logits to explicitly model the forgetting curve. Extensive experiments on four large-scale datasets (e.g., EdNet, Junyi) show that FlatFormer achieves state-of-the-art performance. For example, on EdNet, FlatFormer improves absolute AUC by 8.3% over the strongest hierarchical baseline (HiTSKT) while using less than 15% of its parameters and running inference about three times faster. These results validate that high cognitive fidelity does not necessitate architectural complexity.
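
To make mechanism (ii) concrete, the following is a minimal NumPy sketch of a pre-computed power-law bias added to causal attention logits. The abstract does not give the exact functional form, so the bias `-alpha * log(1 + dt)` used here is an assumption; it was chosen because, after the softmax, it scales each attention weight by `(1 + dt)^(-alpha)`, which is a power-law decay in elapsed time. The function names, the `alpha` parameter, and the use of raw timestamps are hypothetical and are not taken from the paper.

```python
import numpy as np

def power_law_bias(timestamps, alpha=0.5):
    """Additive bias b[i, j] = -alpha * log(1 + (t_i - t_j)) for j <= i.

    Assumed form (not from the paper). Future positions (j > i) get -inf
    so that the softmax assigns them zero weight (causal masking).
    """
    t = np.asarray(timestamps, dtype=np.float64)
    n = t.shape[0]
    # Elapsed time from key j to query i, clipped so it is never negative.
    delta = np.clip(t[:, None] - t[None, :], 0.0, None)
    bias = -alpha * np.log1p(delta)
    causal = np.tril(np.ones((n, n), dtype=bool))
    return np.where(causal, bias, -np.inf)

def biased_attention_weights(q, k, bias):
    """Softmax over scaled dot-product logits plus the pre-computed bias."""
    logits = q @ k.T / np.sqrt(q.shape[-1]) + bias
    # Subtract the row maximum for numerical stability; the diagonal is
    # always finite, so every row has a finite maximum.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

# Usage: three interactions at times 0, 1 and 3 (arbitrary units).
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))
k = rng.normal(size=(3, 4))
bias = power_law_bias([0.0, 1.0, 3.0], alpha=1.0)
weights = biased_attention_weights(q, k, bias)
```

Because the bias depends only on the timestamps, it can be computed once per sequence and reused across layers and heads, which is consistent with the "pre-computed" description in the abstract.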

Paper Structure

This paper contains 56 sections, 15 equations, 14 figures, 5 tables, 2 algorithms.

Figures (14)

  • Figure 1: Conceptual comparison between simplified sequence assumptions and hierarchical cognitive processes. (a) Cognitive dynamics illustrating intra-session interactions ($\tau_t$) and inter-session consolidation ($s_t$). (b) Comparison of hierarchical architectures (left) with the proposed FlatFormer framework (right).
  • Figure 2: High-level architecture of FlatFormer. Unlike hierarchical approaches, FlatFormer utilizes a standard flat encoder injected with (1) Session Features at the input level and (2) a Forgetting Bias at the attention level to efficiently model cognitive processes.
  • Figure 3: The FlatFormer Model Architecture. This diagram illustrates the overall workflow, highlighting the two key injection points: (i) Session-Awareness at the Input Layer (Injection-i), and (ii) Forgetting Bias within the Attention Layer (Injection-ii). The model processes raw interactions to predict future student performance.
  • Figure 4: The Architecture of Injection-i. The model constructs a session-aware input representation by aggregating content embeddings ($E_{content}$), a learnable session ID embedding ($P_{session}$), and a fixed frequency-based step encoding ($P_{step}$). This design directly addresses sessional blindness while enabling infinite extrapolation.
  • Figure 5: Architecture of the FlatFormer Encoder Block. This diagram illustrates the internal mechanism of the $l$-th layer, specifically highlighting the Injection-ii process where the Power-Law Forgetting Bias is additively injected into the MHSA logits to resolve "forgetting blindness". The block consists of a Causal MHSA module followed by a Position-wise FFN.
  • ...and 9 more figures
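
The input-level injection described in the Figure 4 caption can be sketched as follows. This is a minimal NumPy illustration, assuming that "aggregating" means element-wise summation of $E_{content}$, $P_{session}$, and $P_{step}$, that sessions are split wherever the gap between consecutive timestamps exceeds a threshold, and that the step encoding is the standard sinusoidal encoding applied to the within-session step index. The `gap` threshold, the function names, and the random table standing in for learned session embeddings are all hypothetical; the paper's actual segmentation rule and aggregation may differ.

```python
import numpy as np

def session_and_step_ids(timestamps, gap):
    """Assign a session index and a within-session step index to each event.

    Assumption: a new session starts when the time since the previous
    interaction exceeds `gap`.
    """
    t = np.asarray(timestamps, dtype=np.float64)
    is_start = np.concatenate([[True], np.diff(t) > gap])
    session_ids = np.cumsum(is_start) - 1
    idx = np.arange(t.shape[0])
    # Index of the most recent session start at or before each position.
    last_start = np.maximum.accumulate(np.where(is_start, idx, 0))
    return session_ids, idx - last_start

def sinusoidal_encoding(positions, d_model):
    """Fixed sinusoidal encoding; d_model is assumed to be even."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    pos = np.asarray(positions, dtype=np.float64)[:, None]
    i = np.arange(0, d_model, 2, dtype=np.float64)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    enc = np.zeros((pos.shape[0], d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def session_aware_input(e_content, session_table, session_ids, step_ids):
    """Sum content embedding, session-ID embedding, and step encoding."""
    p_session = session_table[session_ids]   # learnable lookup in a real model
    p_step = sinusoidal_encoding(step_ids, e_content.shape[1])
    return e_content + p_session + p_step

# Usage: five interactions; the fourth starts a new session (gap > 3600).
rng = np.random.default_rng(0)
d_model = 8
session_ids, step_ids = session_and_step_ids([0, 10, 20, 5000, 5010], gap=3600)
e_content = rng.normal(size=(5, d_model))
session_table = rng.normal(size=(4, d_model))  # stand-in for learned weights
x = session_aware_input(e_content, session_table, session_ids, step_ids)
```

Since the step encoding is a fixed function of the step index, it can be evaluated for step counts never seen during training, which is one plausible reading of the extrapolation claim in the caption. The session table in this sketch is finite, so unseen session indices would need separate handling.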