DSDL: Data Set Description Language for Bridging Modalities and Tasks in AI Data
Bin Wang, Linke Ouyang, Fan Wu, Wenchang Ning, Xiao Han, Zhiyuan Zhao, Jiahui Peng, Yiying Jiang, Dahua Lin, Conghui He
TL;DR
DSDL introduces a Data Set Description Language that unifies the description of multimodal and multitask AI datasets to reduce data handling complexity. Built on YAML/JSON, it keeps structured dataset descriptions separate from large unstructured content, which is referenced through object locators, and it provides a generic, portable, and extensible type system with libraries and templates. Key contributions include a formal core architecture, class domains, parametric struct/classes, unstructured object classes, and a library/import mechanism, all supported by tooling to publish, visualize, and train on datasets. The framework aims to streamline data dissemination and preprocessing, enabling more efficient AI development and data sharing across institutions and projects.
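The separation of structured descriptions from unstructured content can be illustrated with a minimal Python sketch. It is not DSDL syntax: the field names (`task`, `class_domain`, `samples`, `image`, `label`), the `resolve_samples` helper, and the `/data/pets` media root are all hypothetical, chosen only to show how a small JSON description might reference large media files through relative locators that are resolved at load time.

```python
import json
from pathlib import PurePosixPath

# Hypothetical structured description. The field names are illustrative
# assumptions and do not reproduce the actual DSDL schema.
DESCRIPTION = json.loads("""
{
  "task": "image_classification",
  "class_domain": ["cat", "dog"],
  "samples": [
    {"image": "images/0001.jpg", "label": "cat"},
    {"image": "images/0002.jpg", "label": "dog"}
  ]
}
""")


def resolve_samples(description, media_root):
    """Resolve relative object locators against a media root and map
    each label to its index in the class domain."""
    classes = description["class_domain"]
    resolved = []
    for sample in description["samples"]:
        if sample["label"] not in classes:
            raise ValueError(f"unknown label: {sample['label']}")
        resolved.append({
            # The description stores only a locator; the media stays elsewhere.
            "image": str(PurePosixPath(media_root) / sample["image"]),
            "label_index": classes.index(sample["label"]),
        })
    return resolved


# The media root is an assumed location; moving the media files only
# requires changing this argument, not the description itself.
samples = resolve_samples(DESCRIPTION, "/data/pets")
```

Because the description holds only locators, the same file could be paired with a different storage location without being edited, which is the portability idea this sketch is meant to convey.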
Abstract
In the era of artificial intelligence, the diversity of data modalities and annotation formats often makes data unusable as-is: it must be understood and converted before researchers or developers with different needs can use it. To tackle this problem, this article introduces a framework called Data Set Description Language (DSDL), which aims to simplify dataset processing by providing a unified standard for AI datasets. DSDL adheres to three basic practical principles: it is generic, portable, and extensible. It uses a unified standard to express data of different modalities and structures, facilitates the dissemination of AI data, and extends easily to new modalities and tasks. The standardized specifications of DSDL reduce the workload for users in data dissemination, processing, and usage. To further improve user convenience, we provide predefined DSDL templates for various tasks, convert mainstream datasets to comply with DSDL specifications, and provide comprehensive documentation and DSDL tools. These efforts aim to simplify the use of AI data, thereby improving the efficiency of AI development.
