
Benchmarking PhD-Level Coding in 3D Geometric Computer Vision

Wenyi Li, Renkai Luo, Yue Yu, Huan-ang Gao, Mingju Gao, Li Yuan, Chaoyou Fu, Hao Zhao

Abstract

AI-assisted coding has rapidly reshaped software practice and research workflows, yet today's models still struggle to produce correct code for complex 3D geometric vision. If models could reliably write such code, research practice in our community would change substantially. To measure progress toward that goal, we introduce GeoCodeBench, a PhD-level benchmark that evaluates coding for 3D vision. Each problem is a fill-in-the-function implementation task curated from representative papers at recent venues: we first let a tool propose candidate functions from official repositories, then perform careful human screening to select core 3D geometric components. For every target, we generate diverse, edge-case unit tests, enabling fully automatic, reproducible scoring. We evaluate eight representative open- and closed-source models to reflect the current ecosystem. The best model, GPT-5, attains only a 36.6% pass rate, revealing a large gap between current capabilities and dependable 3D scientific coding. GeoCodeBench organizes tasks into a two-level hierarchy: General 3D capability (geometric transformations and mechanics/optics formulation) and Research capability (novel algorithm implementation and geometric logic routing). Scores are positively correlated across these axes, but research-oriented tasks are markedly harder. Context ablations further show that "more paper text" is not always better: truncating the paper at the Method section statistically significantly outperforms full-paper inputs, highlighting unresolved challenges in long-context scientific comprehension. Together, these findings position GeoCodeBench as a rigorous testbed for advancing from generic coding to trustworthy 3D geometric vision coding.

Paper Structure

This paper contains 40 sections, 1 equation, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Left: We categorize coding tasks into a two-level hierarchy. Top-right: Example of a representative task and code generated by state-of-the-art models such as GPT-5 and Kimi-K2-Instruct, both achieving full unit-test correctness despite using different mathematically valid solutions. Bottom-right: Overall pass rates of eight leading open- and closed-source LLMs. The best model, GPT-5, reaches 36.6%.
  • Figure 2: The Benchmark Curation and Evaluation Pipeline of GeoCodeBench.
  • Figure 3: Comparison of General and Research Capability.
  • Figure 4: Case Study: Consistent Failure Across LLMs on a Simple Function. The function forward_event approximates “event accumulation” using the logarithmic intensity difference derived from two event-camera frames. Despite its brevity and simplicity, all tested LLMs failed.
  • Figure 5: Case Study: Creative Correctness. The function compute_epipolar_distance requires calculating the symmetric epipolar distance between corresponding image points $\mathbf{p}_1$ and $\mathbf{p}_2$ given $T_{21}$ and $\mathbf{K}$. GPT-5 uses the Fundamental Matrix ($\mathbf{F}$) method ($\mathbf{l}_2 = \mathbf{F}\mathbf{p}_1$), operating directly on pixel coordinates. DeepSeek-R1, conversely, first transforms the inputs to normalized coordinates ($\mathbf{x}_1, \mathbf{x}_2$) and then applies the Essential Matrix ($\mathbf{E}$) method ($\mathbf{l}'_2 = \mathbf{E}\mathbf{x}_1$). Both approaches are mathematically equivalent ($\mathbf{F} = \mathbf{K}^{-T} \mathbf{E} \mathbf{K}^{-1}$) and yield the correct final distance, thus demonstrating Creative Correctness: models select distinct, valid pathways to satisfy the required 3D geometry constraint.
  • ...and 11 more figures
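The equivalence highlighted in Figure 5 can be checked numerically. The sketch below is illustrative, not the benchmark's reference solution: the intrinsics K, the relative pose (R, t), the test point, and the helper names (`skew`, `sym_epipolar_distance`) are all assumptions for the demo, and it adopts the convention $X_2 = R X_1 + t$ so that $E = [t]_\times R$ and $F = K^{-T} E K^{-1}$. It verifies that the pixel-coordinate residual $\mathbf{p}_2^\top F \mathbf{p}_1$ equals the normalized-coordinate residual $\mathbf{x}_2^\top E \mathbf{x}_1$, i.e. that the two pathways in the figure agree.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix so that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def sym_epipolar_distance(p1h, p2h, F):
    """Symmetric epipolar distance for homogeneous pixel points (illustrative helper)."""
    l2 = F @ p1h              # epipolar line of p1 in image 2
    l1 = F.T @ p2h            # epipolar line of p2 in image 1
    r = float(p2h @ F @ p1h)  # epipolar residual (same scalar either way)
    return abs(r) * (1.0 / np.hypot(l2[0], l2[1]) + 1.0 / np.hypot(l1[0], l1[1]))

# Synthetic two-view geometry (assumed values; convention X2 = R @ X1 + t)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
theta = 0.05
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.1, 0.02, 0.0])

E = skew(t) @ R                                  # essential matrix
F = np.linalg.inv(K).T @ E @ np.linalg.inv(K)    # F = K^{-T} E K^{-1}

# Project one 3D point into both views to get a true correspondence
X1 = np.array([0.2, -0.1, 2.0])
X2 = R @ X1 + t
p1h = K @ (X1 / X1[2])        # homogeneous pixel coordinates, image 1
p2h = K @ (X2 / X2[2])        # homogeneous pixel coordinates, image 2

# Pathway 1 (pixels + F) vs pathway 2 (normalized coords + E):
# the residuals agree because p = K x, so p2^T F p1 = x2^T E x1.
x1h = np.linalg.inv(K) @ p1h
x2h = np.linalg.inv(K) @ p2h
res_F = float(p2h @ F @ p1h)
res_E = float(x2h @ E @ x1h)
```

For a true correspondence both residuals vanish (up to float noise) and the symmetric distance is effectively zero, while a point pushed off its epipolar line produces a clearly nonzero distance along either pathway.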