UniDGF: A Unified Detection-to-Generation Framework for Hierarchical Object Visual Recognition
Xinyu Nan, Lingtao Mao, Huangyu Dai, Zexin Zheng, Xinyu Sun, Zihan Liang, Ben Chen, Yuqing Ding, Chenyi Lei, Wenwu Ou, Han Li
TL;DR
UniDGF unifies object detection with hierarchical, generative prediction of categories and attributes. A YOLO-based detector localizes objects, ROI features are extracted for each detection, and a BART-based generator produces a coarse-to-fine sequence of hierarchical category and attribute tokens, including property-conditioned attribute values. A two-stage data labeling pipeline builds the Products7417 dataset, which supplies large-vocabulary, fine-grained annotations. On open-source and proprietary e-commerce datasets, UniDGF improves both semantic prediction and end-to-end detection over embedding-based, retrieval-based, and multimodal LLM baselines, while keeping inference coherent and unified for large-scale, attribute-rich tasks.
Abstract
Achieving visual semantic understanding requires a unified framework that simultaneously handles object detection, category prediction, and attribute recognition. However, current advanced approaches rely on global similarity and struggle to capture fine-grained category distinctions and category-specific attribute diversity, especially in large-scale e-commerce scenarios. To overcome these challenges, we introduce a detection-guided generative framework that predicts hierarchical category and attribute tokens. For each detected object, we extract refined ROI-level features and employ a BART-based generator to produce semantic tokens in a coarse-to-fine sequence covering category hierarchies and property-value pairs, with support for property-conditioned attribute recognition. Experiments on both large-scale proprietary e-commerce datasets and open-source datasets demonstrate that our approach significantly outperforms existing similarity-based pipelines and multi-stage classification systems, achieving stronger fine-grained recognition and more coherent unified inference.
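To make the coarse-to-fine target sequence concrete, the following is a minimal sketch of how one detected object's label might be flattened into tokens for a sequence generator and parsed back. It is an illustration under stated assumptions, not the authors' implementation: the separator tokens `<prop>` and `<val>`, the `ObjectLabel` type, the helper names, and the example labels are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical separator tokens; the paper's actual vocabulary is not specified here.
SEP_PROP = "<prop>"
SEP_VAL = "<val>"


@dataclass
class ObjectLabel:
    """Label for one detected object (hypothetical structure)."""
    category_path: List[str]                      # ordered coarse to fine
    attributes: Dict[str, str] = field(default_factory=dict)  # property -> value


def to_target_sequence(label: ObjectLabel) -> List[str]:
    """Flatten a label: category hierarchy first, then property-value pairs."""
    tokens = list(label.category_path)
    for prop, value in label.attributes.items():
        tokens += [SEP_PROP, prop, SEP_VAL, value]
    return tokens


def from_target_sequence(tokens: List[str]) -> ObjectLabel:
    """Parse a generated token sequence back into a structured label."""
    split = tokens.index(SEP_PROP) if SEP_PROP in tokens else len(tokens)
    path = tokens[:split]
    attrs: Dict[str, str] = {}
    i = split
    # Each attribute occupies four tokens: <prop> name <val> value.
    while i + 3 < len(tokens) and tokens[i] == SEP_PROP and tokens[i + 2] == SEP_VAL:
        attrs[tokens[i + 1]] = tokens[i + 3]
        i += 4
    return ObjectLabel(path, attrs)


def conditioned_prefix(category_path: List[str], prop: str) -> List[str]:
    """Decoder prefix for property-conditioned recognition: the generator
    would be asked to continue this prefix with the value of `prop`."""
    return list(category_path) + [SEP_PROP, prop, SEP_VAL]


label = ObjectLabel(["Apparel", "Shoes", "Sneakers"], {"color": "white"})
sequence = to_target_sequence(label)
```

In this sketch the category tokens always precede the attribute tokens, so each attribute value is generated with the full category path already in the decoder context. That ordering is one way to realize category-specific attribute prediction in an autoregressive generator.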
