Deep Learning Framework Testing via Heuristic Guidance Based on Multiple Model Measurements
Yinglong Zou, Juan Zhai, Chunrong Fang, Yanzhou Mu, Jiawei Liu, Zhenyu Chen
TL;DR
The paper addresses the inefficiency and incomplete guidance of existing DL framework testing by introducing DLMMM, the first method to fuse multiple model measurements (bug-detection effectiveness, operator combination variety, and model execution time) through a CRITIC-based mechanism that drives a mutation-based, DAG-driven test-input generation pipeline. The paper shows that higher operator variety improves bug detection but increases execution time, which motivates a measurement fusion that balances these factors. In experiments on TensorFlow, PyTorch, and MindSpore, DLMMM detects more crashes and more NaN & inconsistency bugs than four strong baselines, while improving testing efficiency and producing more diverse test inputs. The work provides a practical, open-source framework for more thorough and time-efficient DL framework testing with data-driven guidance.
Abstract
Deep learning frameworks serve as the foundation for developing and deploying deep learning applications. To enhance the quality of deep learning frameworks, researchers have proposed numerous testing methods that use deep learning models as test inputs. However, existing methods predominantly rely on a model's bug detection effectiveness as the heuristic indicator, which leads to three critical limitations. First, existing methods fail to quantitatively measure a model's operator combination variety, potentially missing critical operator combinations that could trigger framework bugs. Second, existing methods neither measure model execution time nor use it for heuristic guidance, so they omit numerous models with the potential to detect more framework bugs within limited testing time. Third, existing methods overlook the correlation between different model measurements, relying on single-indicator heuristic guidance without considering the trade-offs among them. To overcome these limitations, we propose DLMMM, the first deep learning framework testing method to include multiple model measurements in heuristic guidance and to fuse these measurements to balance their trade-offs. DLMMM first quantitatively measures a model's bug detection performance, operator combination variety, and execution time. DLMMM then fuses these measurements based on their correlation to balance their trade-offs. To further enhance testing effectiveness, DLMMM designs multi-level heuristic guidance for test input model generation. We apply DLMMM to test three widely used deep learning frameworks: TensorFlow, PyTorch, and MindSpore. The experimental results show that DLMMM outperforms state-of-the-art methods in effectiveness and efficiency.
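
As a rough illustration of the correlation-based fusion described above, the sketch below applies the standard CRITIC weighting scheme (the TL;DR names CRITIC as the fusion mechanism) to three per-model measurements. The function names, the sample values, and the decision to treat execution time as a cost are assumptions made for illustration only; the paper's exact formulation may differ.

```python
import numpy as np


def critic_weights(scores, benefit):
    """Standard CRITIC weights for a (models x measurements) score matrix.

    scores  : 2-D array, one row per candidate model, one column per measurement.
    benefit : bool per column; True if larger is better, False if it is a cost.
    Assumes every column varies across models (a constant column has no
    defined correlation).
    """
    scores = np.asarray(scores, dtype=float)
    benefit = np.asarray(benefit, dtype=bool)

    # Min-max normalise each column to [0, 1]; flip cost columns so that
    # a higher normalised value is always better.
    lo, hi = scores.min(axis=0), scores.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    norm = (scores - lo) / span
    norm = np.where(benefit, norm, 1.0 - norm)

    # Contrast intensity: sample standard deviation of each measurement.
    std = norm.std(axis=0, ddof=1)

    # Conflict: sum of (1 - Pearson correlation) against every measurement.
    corr = np.corrcoef(norm, rowvar=False)
    conflict = (1.0 - corr).sum(axis=0)

    # Information content, normalised so that the weights sum to 1.
    info = std * conflict
    return info / info.sum(), norm


# Hypothetical measurements for four candidate models. Columns:
# bug-detection score, operator combination variety, execution time (s).
measurements = np.array([
    [0.2, 3.0, 1.0],
    [0.5, 5.0, 2.0],
    [0.9, 9.0, 4.5],
    [0.4, 4.0, 1.5],
])
is_benefit = [True, True, False]  # execution time is treated as a cost

weights, normalised = critic_weights(measurements, is_benefit)
fused = normalised @ weights  # one fused score per candidate model

print("weights:", weights)
print("fused scores:", fused)
```

In this toy data, bug detection and variety rise together while execution time rises with both, so the flipped time column conflicts with the other two and receives the largest weight. A test generator could then prefer the candidate models with the highest fused score.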
