Efficient Multi-Task Inferencing: Model Merging with Gromov-Wasserstein Feature Alignment

Luyang Fang, Ehsan Latif, Haoran Lu, Yifan Zhou, Ping Ma, Xiaoming Zhai

TL;DR

This work tackles the high storage and maintenance costs of deploying separate neural networks for multiple automated scoring tasks by proposing GW-SMM, a model-merging framework guided by the Gromov-Wasserstein distance. GW-SMM extracts response-feature representations from the task-specific models, compares their feature spaces through optimal transport, and uses the resulting distances to build a merging plan; the selected models are then fused into a shared backbone while each task keeps its own classification head. Empirical results on NGSS-aligned middle-school tasks show that GW-SMM outperforms both human-knowledge and GPT-o1-based merging approaches in micro F1, macro F1, exact match, and per-label accuracy, and can reduce storage by up to a factor of three. The method offers practical efficiency gains for scalable automated scoring systems; the authors acknowledge a remaining gap to the pre-merge baseline and suggest future enhancements such as adaptive layer-wise fusion.

Abstract

Automatic scoring of student responses enhances efficiency in education, but deploying a separate neural network for each task increases storage demands, maintenance efforts, and redundant computations. To address these challenges, this paper introduces the Gromov-Wasserstein Scoring Model Merging (GW-SMM) method, which merges models based on feature distribution similarities measured via the Gromov-Wasserstein distance. Our approach begins by extracting features from student responses using individual models, capturing both item-specific context and unique learned representations. The Gromov-Wasserstein distance then quantifies the similarity between these feature distributions, identifying the most compatible models for merging. Models exhibiting the smallest pairwise distances, typically in pairs or trios, are merged by combining only the shared layers preceding the classification head. This strategy results in a unified feature extractor while preserving separate classification heads for item-specific scoring. We validated our approach against human expert knowledge and a GPT-o1-based merging method. GW-SMM consistently outperformed both, achieving a higher micro F1 score, macro F1 score, exact match accuracy, and per-label accuracy. The improvements in micro F1 and per-label accuracy were statistically significant compared to GPT-o1-based merging (p=0.04, p=0.01). Additionally, GW-SMM reduced storage requirements by half without compromising much accuracy, demonstrating its computational efficiency alongside reliable scoring performance.
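The selection-and-merge step described in the abstract can be illustrated with a small sketch. It assumes the pairwise Gromov-Wasserstein distances have already been computed and stored in a matrix; the greedy pairing rule, the use of plain weight averaging for the shared layers, and all names (`greedy_pairs`, `merge_backbones`, the toy `models` and `dist`) are illustrative assumptions, not the paper's stated procedure.

```python
import numpy as np

def greedy_pairs(dist):
    """Pair models by ascending pairwise distance; each model is used at most once.
    (Assumption: a simple greedy rule, restricted to pairs for brevity.)"""
    n = dist.shape[0]
    candidates = sorted((dist[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    used, pairs = set(), []
    for _, i, j in candidates:
        if i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
    return pairs

def merge_backbones(models, group):
    """Average the shared-layer weights of the models in `group` (an assumed
    fusion rule) and keep every classification head separate."""
    names = models[group[0]]["backbone"]
    shared = {name: np.mean([models[k]["backbone"][name] for k in group], axis=0)
              for name in names}
    heads = {k: models[k]["head"] for k in group}
    return {"backbone": shared, "heads": heads}

# Toy setup: four hypothetical task models with random weights.
rng = np.random.default_rng(0)
models = [{"backbone": {"W": rng.normal(size=(4, 4))},
           "head": {"W": rng.normal(size=(4, 2))}} for _ in range(4)]

# Hypothetical, precomputed pairwise distances between model feature distributions.
dist = np.array([[0.0, 0.1, 0.9, 0.8],
                 [0.1, 0.0, 0.7, 0.6],
                 [0.9, 0.7, 0.0, 0.2],
                 [0.8, 0.6, 0.2, 0.0]])

pairs = greedy_pairs(dist)
merged = [merge_backbones(models, g) for g in pairs]
print(pairs)                                                     # [(0, 1), (2, 3)]
print(len(models), "backbones before,", len(merged), "after")    # 4 backbones before, 2 after
```

In this toy case the four backbones collapse into two while all four heads survive, which mirrors the halving of storage reported in the abstract when the classification heads are small relative to the shared layers.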

Paper Structure

This paper contains 12 sections, 10 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Workflow of the developed GW-SMM method.
  • Figure 2: Illustrative Multi-label Task: Gas-Filled Balloons
  • Figure 3: Heatmaps of the similarity scores of each method.
  • Figure 4: Performance before and after merging under three merging methods: GW-SMM (ours), human knowledge, and GPT-o1.