CodeV: Issue Resolving with Visual Data
Linhao Zhang, Daoguang Zan, Quanshun Yang, Zhirong Huang, Dong Chen, Bo Shen, Tianyu Liu, Yongshun Gong, Pengjie Huang, Xudong Lu, Guangtai Liang, Lizhen Cui, Qianxiang Wang
TL;DR
GitHub issue resolving has largely relied on textual signals, neglecting visual data that can convey crucial context. CodeV is a two-phase multimodal framework: it first converts the visuals in an issue into fine-grained descriptions and a structured summary, then passes this enriched representation to an LLM to generate a patch. The authors also introduce Visual SWE-bench, a 133-instance benchmark spanning 11 repositories, for evaluating visual issue resolving. They report strong gains over text-only baselines, including substantial relative gains over Agentless, with performance that remains robust across Vision-Language Model sizes. The work highlights the practical value of visual data in software repair and offers a standardized benchmark for future multimodal approaches to code-related tasks.
Abstract
Large Language Models (LLMs) have advanced rapidly in recent years, with their applications in software engineering expanding to more complex repository-level tasks. GitHub issue resolving is a key challenge among these tasks. While recent approaches have made progress on this task, they focus on the textual data within issues and neglect visual data. However, this visual data is crucial for resolving issues, as it conveys additional knowledge that text alone cannot. We propose CodeV, the first approach to leverage visual data to enhance the issue-resolving capabilities of LLMs. CodeV resolves each issue in two phases: data processing and patch generation. To evaluate CodeV, we construct a benchmark for visual issue resolving, namely Visual SWE-bench. Through extensive experiments, we demonstrate the effectiveness of CodeV and provide valuable insights into leveraging visual data to resolve GitHub issues.
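The two-phase flow described above (data processing, then patch generation) can be pictured with a minimal sketch. This is an illustration only, not the authors' implementation: the names `Issue`, `EnrichedIssue`, `process_issue`, `build_patch_prompt`, and `resolve_issue` are hypothetical, and the vision-language model and LLM are passed in as plain callables so that no particular model API is assumed. The prompt layout is likewise an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Issue:
    """A GitHub issue: its text plus the URLs of any embedded images (hypothetical type)."""
    text: str
    image_urls: List[str] = field(default_factory=list)


@dataclass
class EnrichedIssue:
    """Issue text together with the textual knowledge extracted from its visuals."""
    text: str
    descriptions: List[str]
    summary: str


def process_issue(
    issue: Issue,
    describe_image: Callable[[str], str],
    summarize: Callable[[str, List[str]], str],
) -> EnrichedIssue:
    """Phase 1 (data processing): describe each image, then summarize the issue."""
    # Assumption: one fine-grained description is produced per image.
    descriptions = [describe_image(url) for url in issue.image_urls]
    # Assumption: the structured summary is built from the issue text and descriptions.
    summary = summarize(issue.text, descriptions)
    return EnrichedIssue(issue.text, descriptions, summary)


def build_patch_prompt(enriched: EnrichedIssue) -> str:
    """Flatten the enriched issue into one text prompt (layout is illustrative)."""
    parts = ["## Issue", enriched.text]
    for index, description in enumerate(enriched.descriptions, start=1):
        parts += [f"## Image {index} description", description]
    parts += ["## Structured summary", enriched.summary]
    return "\n".join(parts)


def resolve_issue(
    issue: Issue,
    describe_image: Callable[[str], str],
    summarize: Callable[[str, List[str]], str],
    generate_patch: Callable[[str], str],
) -> str:
    """Phase 2 (patch generation): hand the enriched prompt to a patch generator."""
    enriched = process_issue(issue, describe_image, summarize)
    return generate_patch(build_patch_prompt(enriched))
```

In practice, `describe_image` and `summarize` would wrap calls to a vision-language model and `generate_patch` would wrap a text-only issue-resolving pipeline; keeping them as injected callables makes the two phases easy to test with stubs and to swap between model sizes.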
