VisualCoder: Guiding Large Language Models in Code Execution with Fine-grained Multimodal Chain-of-Thought Reasoning
Cuong Chi Le, Hoang-Chau Truong-Vinh, Huy Nhat Phan, Dung Duy Le, Tien N. Nguyen, Nghi D. Q. Bui
TL;DR
VisualCoder targets the weakness of LLMs at code execution reasoning by pairing source code with a visual Control Flow Graph (CFG) and a Reference CoT mechanism that explicitly links each line of code to its CFG node. This grounding mitigates the cascading errors seen in naive multimodal CoT prompts and improves dynamic program understanding, error detection, and fault localization across multiple tasks and models. Empirical results show that CFG images outperform text-based CFGs, and that combining CFGs with Reference CoT yields robust gains, particularly when paired with Multimodal-CoT. The work demonstrates the practical potential of multimodal reasoning for software debugging and analysis, and highlights improved alignment between textual and visual execution cues.
Abstract
Predicting program behavior and reasoning about code execution remain significant challenges in software engineering, particularly for large language models (LLMs) designed for code analysis. While these models excel at understanding static syntax, they often struggle with dynamic reasoning tasks. We introduce VisualCoder, a simple yet effective approach that enhances code reasoning by integrating multimodal Chain-of-Thought (CoT) reasoning with a visual Control Flow Graph (CFG). By aligning code snippets with their corresponding CFGs, VisualCoder provides deeper insights into execution flows. We address challenges in multimodal CoT integration through a reference mechanism, ensuring consistency between code and its execution path, thereby improving performance in program behavior prediction, error detection, and output generation.
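To make the reference mechanism concrete, the following is a minimal sketch rather than the authors' implementation. It assumes one CFG node per source line and a hand-written edge list. The function names (`build_cfg_dot`, `build_reference_prompt`) and the prompt wording are hypothetical and not taken from the paper. The first function emits Graphviz DOT text that could be rendered to an image with a separate tool, and the second builds a prompt that asks the model to name the matching CFG node before reasoning about each line.

```python
def build_cfg_dot(source: str, edges: list[tuple[int, int]]) -> str:
    """Render a CFG as Graphviz DOT text, one node per non-empty source line.

    Assumption: edges are supplied by hand as (from_line, to_line) pairs;
    a real pipeline would derive them from the program automatically.
    """
    parts = ["digraph CFG {"]
    for number, text in enumerate(source.splitlines(), start=1):
        if text.strip():
            # Prefix each label with its line number so that a node in the
            # image can be matched to a line in the numbered code listing.
            label = f"L{number}: {text.strip()}"
            label = label.replace("\\", "\\\\").replace('"', '\\"')
            parts.append(f'  n{number} [shape=box, label="{label}"];')
    for src, dst in edges:
        parts.append(f"  n{src} -> n{dst};")
    parts.append("}")
    return "\n".join(parts)


def build_reference_prompt(source: str) -> str:
    """Build a reference-style CoT prompt (hypothetical wording)."""
    numbered = "\n".join(
        f"{number}: {text}"
        for number, text in enumerate(source.splitlines(), start=1)
    )
    return (
        "You are given a code snippet and an image of its control flow graph (CFG).\n"
        "Trace the execution step by step. For every line you execute, first name "
        "the CFG node that corresponds to that line, then describe how the program "
        "state changes. Report the line where a runtime error occurs, if any.\n\n"
        f"Code:\n{numbered}\n"
    )


# Example: line 4 divides by zero once the branch on line 3 has run.
snippet = "x = 3\nif x > 2:\n    x = x - 3\nprint(10 / x)"
edges = [(1, 2), (2, 3), (2, 4), (3, 4)]

dot = build_cfg_dot(snippet, edges)
prompt = build_reference_prompt(snippet)
print(dot)
print(prompt)
```

In this sketch the DOT text would be rendered to an image and sent to a multimodal model together with the prompt. The shared line numbers give the model an explicit anchor between the textual code and the visual graph.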
