CoVis: A Collaborative Framework for Fine-grained Graphic Visual Understanding
Xiaoyu Deng, Zhengjian Kang, Xintao Li, Yongzhe Zhang, Tianmin Guo
TL;DR
CoVis tackles the limitation that image interpretation is biased by an observer's personal background and by the information cocoon. It integrates a cascaded dual-layer segmentation pipeline (coarse segmentation with FastSAM followed by fine-grained segmentation with U-Net) with an LLM-based content generator guided by prompt engineering to produce rich, multi-dimensional visual descriptions. Quantitative segmentation metrics and qualitative human evaluations show that CoVis outperforms baselines and generalizes across datasets, delivering more comprehensive visual analytics than general-purpose LLMs. The approach has practical implications for CSCW and accessibility, with future work aimed at personalizing content generation to user preferences.
Abstract
Graphic visual content helps communicate information and inspire divergent thinking. However, interpreting visual content currently depends mainly on each observer's personal knowledge and background, which limits the quality and efficiency of information acquisition and understanding. To improve the quality and efficiency of visual information transmission and to keep observers from being limited by the information cocoon, we propose CoVis, a collaborative framework for fine-grained visual understanding. The framework couples a cascaded dual-layer segmentation network with a large language model (LLM) based content generator to extract as much knowledge as possible from an image. It then generates visual analytics for the image, helping observers comprehend it from a more holistic perspective. Quantitative experiments and qualitative experiments with 32 human participants indicate that CoVis outperforms current methods in feature extraction and generates more comprehensive and detailed visual descriptions than current general-purpose large models.
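The data flow described above (coarse segmentation, then fine-grained segmentation within each coarse region, then a prompt handed to a text generator) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `Region`, `coarse_segment`, `fine_segment`, `build_prompt`, and `describe` are hypothetical, the two segmentation stages are toy stand-ins rather than FastSAM or U-Net, and `generate` is any caller-supplied function rather than a specific LLM API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel intensities
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


@dataclass
class Region:
    """A coarse region plus the fine-grained parts found inside it."""
    box: Box
    parts: List[Box] = field(default_factory=list)


def coarse_segment(image: Image) -> List[Region]:
    """Toy stand-in for the coarse stage: split the image into left and right halves."""
    height, width = len(image), len(image[0])
    mid = width // 2
    return [Region((0, 0, height, mid)), Region((0, mid, height, width))]


def fine_segment(image: Image, region: Region, threshold: int = 128) -> List[Box]:
    """Toy stand-in for the fine stage: one 1x1 box per bright pixel in the region."""
    top, left, bottom, right = region.box
    return [
        (r, c, r + 1, c + 1)
        for r in range(top, bottom)
        for c in range(left, right)
        if image[r][c] >= threshold
    ]


def build_prompt(regions: List[Region]) -> str:
    """Turn the segmentation result into a text prompt (hypothetical template)."""
    lines = ["Describe the image using the segmented regions below."]
    for i, region in enumerate(regions):
        lines.append(f"Region {i}: box={region.box}, fine parts={len(region.parts)}")
    return "\n".join(lines)


def describe(image: Image, generate: Callable[[str], str]) -> str:
    """Cascade: coarse regions, fine parts per region, then text generation."""
    regions = coarse_segment(image)
    for region in regions:
        region.parts = fine_segment(image, region)
    return generate(build_prompt(regions))
```

Calling `describe(image, generate)` with any function that maps a prompt string to text shows the cascade end to end. In a real system the two stand-in stages would be replaced by trained segmentation models and `generate` by a call to an LLM.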
