UniEDU: A Unified Language and Vision Assistant for Education Applications

Zhendong Chu, Jian Xie, Shen Wang, Zichao Wang, Qingsong Wen

TL;DR

UniEDU addresses the challenge of integrating multimodal educational data across tasks by compressing long interaction histories into compact tokens and unifying four education tasks under a single generative framework. It combines a Profile Encoder with a Language Model that conditions on task instructions to produce outputs, enabling efficient long-context processing with about a 3x reduction in VRAM compared with baselines. Across knowledge recommendation, knowledge tracing, time cost prediction, and user answer prediction, UniEDU achieves strong performance, with notable gains in recommendation and cost-prediction accuracy and competitive tracing results. The approach supports real-world deployment by reducing computational overhead while maintaining competitive accuracy, advancing scalable, personalized education tools.
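
To make the idea of compressing a long interaction history into a fixed number of compact tokens concrete, here is a minimal sketch. It is not UniEDU's Profile Encoder, whose internals this summary does not describe: the function name `compress_history`, the plain-list vector representation, and the use of chunked mean pooling in place of a learned encoder are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def compress_history(history: List[Vector], num_tokens: int) -> List[Vector]:
    """Compress interaction embeddings into at most `num_tokens` vectors.

    Hypothetical stand-in for a learned profile encoder: the history is
    split into contiguous chunks and each chunk is mean-pooled.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    if not history:
        return []

    n = len(history)
    k = min(num_tokens, n)  # never produce more tokens than interactions
    dim = len(history[0])

    compressed: List[Vector] = []
    for i in range(k):
        # Integer chunk boundaries; every chunk is non-empty because k <= n.
        start = i * n // k
        end = (i + 1) * n // k
        chunk = history[start:end]
        pooled = [sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)]
        compressed.append(pooled)
    return compressed


# Eight toy 2-dimensional interaction embeddings compressed into two tokens.
toy_history = [[float(i), 1.0] for i in range(8)]
toy_compressed = compress_history(toy_history, num_tokens=2)
```

In this sketch the language model would see only the two pooled vectors plus the task instruction rather than all eight interactions, so the context length stops growing with the length of the history. That is the general property the efficiency claim above relies on, whatever encoder is actually used.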

Abstract

Education materials for K-12 students often combine multiple modalities, such as text and images, which makes it difficult for models to fully capture the nuanced information they contain. In this paper, we propose UniEDU, a unified language and vision assistant designed for various educational applications, including knowledge recommendation, knowledge tracing, time cost prediction, and user answer prediction, all within a single model. Unlike conventional task-specific models, UniEDU offers a unified solution that excels across multiple educational tasks while maintaining strong generalization capabilities. Its adaptability makes it well-suited for real-world deployment in diverse learning environments. Furthermore, UniEDU is optimized for industry-scale deployment: it significantly reduces computational overhead, achieving approximately a 300% increase in efficiency, while maintaining competitive performance with minimal degradation compared to fully fine-tuned models. This work represents a significant step toward creating versatile AI systems tailored to the evolving demands of education.

Paper Structure

This paper contains 17 sections, 8 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: The architecture of UniEDU. The profile encoder processes history interactions with multimodal information, while the language model integrates compressed history interactions and task instructions to generate the output.
  • Figure 2: Performance comparison of seven models on the Knowledge Recommendation task.
  • Figure 3: Performance comparison of UniEDU and baseline models on Knowledge Tracing, Time Cost Prediction, and User Answer Prediction.
  • Figure 4: Performance of UniEDU-5B with different numbers of compression tokens. The red dashed line indicates Qwen2-VL-2B with full fine-tuning.