NePTune: A Neuro-Pythonic Framework for Tunable Compositional Reasoning on Vision-Language

Danial Kamali, Parisa Kordjamshidi

TL;DR

NePTune is a neuro-symbolic framework that unifies Python-based imperative reasoning with soft, differentiable logic to answer vision-language queries under perceptual uncertainty. An LLM generates an executable Python program for each query, and a two-tier perceptual module grounds the program's predicates, so perception stays decoupled from reasoning while the system still supports zero-shot generalization and domain adaptation. Empirical results on CLEVR, CLEVR-Humans, and real-world REG benchmarks show significant improvements over strong baselines, robust performance under domain shifts, and potential for neuro-symbolic fine-tuning. The work highlights the value of hybrid execution, which combines probabilistic grounding with programmable control, for robust compositional reasoning in vision-language tasks.

Abstract

Modern Vision-Language Models (VLMs) have achieved impressive performance in various tasks, yet they often struggle with compositional reasoning: the ability to decompose and recombine concepts to solve novel problems. While neuro-symbolic approaches offer a promising direction, they are typically constrained by crisp logical execution or predefined predicates, which limit flexibility. In this work, we introduce NePTune, a neuro-symbolic framework that overcomes these limitations through a hybrid execution model that integrates the perception capabilities of foundation vision models with the compositional expressiveness of symbolic reasoning. NePTune dynamically translates natural language queries into executable Python programs that blend imperative control flow with soft logic operators capable of reasoning over VLM-generated uncertainty. NePTune operates in a training-free manner, and its modular design decouples perception from reasoning, yet its differentiable operations still support fine-tuning. We evaluate NePTune on multiple visual reasoning benchmarks and various domains, including adversarial tests, and demonstrate a significant improvement over strong base models, as well as effective compositional generalization and adaptation in novel environments.
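To make "soft logic operators over VLM-generated uncertainty" concrete, the following is a minimal sketch of how a query such as "Is there a brown object behind every red sphere?" could be scored with graded truth values. Everything here is an assumption made for illustration: the `Obj` class, the product-based operators, the sigmoid `behind` relation, and the hand-written scores are hypothetical and are not taken from the paper, whose actual operators may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class Obj:
    color: dict   # hypothetical VLM scores, e.g. {"red": 0.9}
    shape: dict   # hypothetical VLM scores, e.g. {"sphere": 0.95}
    depth: float  # assumed convention: larger means farther from the camera


def soft_and(*ps):
    """Product t-norm (an assumed choice of soft conjunction)."""
    return math.prod(ps)


def soft_not(p):
    return 1.0 - p


def soft_or(*ps):
    """Probabilistic sum; with no arguments it returns 0.0 (nothing is true)."""
    return 1.0 - math.prod(1.0 - p for p in ps)


def behind(a, b, k=4.0):
    """Soft 'a is behind b': a sigmoid of the depth difference (assumed form)."""
    return 1.0 / (1.0 + math.exp(-k * (a.depth - b.depth)))


def brown_behind_every_red_sphere(objs):
    """Soft truth of: for all x, red_sphere(x) -> exists y, brown(y) and behind(y, x)."""
    truth = 1.0
    for x in objs:  # imperative loop plays the role of the universal quantifier
        is_red_sphere = soft_and(x.color.get("red", 0.0), x.shape.get("sphere", 0.0))
        witnesses = [
            soft_and(y.color.get("brown", 0.0), behind(y, x))
            for y in objs
            if y is not x
        ]
        exists = soft_or(*witnesses)
        # Implication a -> b written as (not a) or b.
        truth *= soft_or(soft_not(is_red_sphere), exists)
    return truth


# Hand-written scores standing in for VLM outputs.
scene = [
    Obj(color={"red": 0.9, "brown": 0.05}, shape={"sphere": 0.95}, depth=2.0),
    Obj(color={"brown": 0.9, "red": 0.05}, shape={"sphere": 0.05}, depth=5.0),
    Obj(color={"red": 0.02, "brown": 0.03}, shape={"sphere": 0.1}, depth=1.0),
]
score = brown_behind_every_red_sphere(scene)
answer = "yes" if score > 0.5 else "no"
```

Because every operation is a product, a difference, or a sigmoid, the final score stays differentiable with respect to the perception scores, which is the property that would allow fine-tuning through the program. One caveat of the product-based universal quantifier used in this sketch is that the score shrinks as the number of objects grows, so a practical system might prefer a different aggregation.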

Paper Structure

This paper contains 42 sections, 2 equations, 7 figures, 16 tables.

Figures (7)

  • Figure 1: A natural language query ("Is there a brown object behind every red sphere?") is decomposed into symbolic concepts, such as $red$ and $sphere$. These concepts are then composed to enable explicit reasoning over objects and their relations. This illustrates how complex queries can be mapped into structured logical forms.
  • Figure 2: NePTune overview. Given an image and a query, the (1) LLM-based Program Generation converts the natural language query to a Pythonic program. Then (2) Perceptual Grounding extracts the object bounding boxes. The (3) Symbolic Executor then runs the Python code to reason over concepts extracted from the VLM using both soft composition and imperative logic to derive the final answer.
  • Figure 3: Qualitative examples of NePTune on the RefGTA dataset. Green boxes indicate objects detected by Grounding DINO, blue boxes show objects selected by the VLM (InternVL2.5-8B), and red boxes highlight the final selections made by NePTune.
  • Figure 4: Qualitative examples of NePTune compared to ViperGPT and NeSyCoCo on CLEVR-Humans.
  • Figure 5: Qualitative example of wrong NePTune program generation. Mistakes are highlighted with red boxes.
  • ...and 2 more figures
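
The three stages named in the Figure 2 caption (program generation, perceptual grounding, symbolic execution) can be sketched as a small pipeline. This is only an illustrative skeleton under stated assumptions: `generate_program`, `ground_objects`, `vlm_score`, and `execute` are hypothetical names, and each one is a stub that stands in for an LLM, an object detector, and a VLM respectively. None of these signatures come from the paper.

```python
def generate_program(query):
    """Stage 1 stub: a real system would prompt an LLM with the query.
    Here a fixed program is returned for the one query this sketch supports."""
    assert query == "How many red objects are there?"
    return (
        "def run(objects, score):\n"
        "    count = 0\n"
        "    for obj in objects:\n"
        "        if score(obj, 'red') > 0.5:\n"
        "            count += 1\n"
        "    return count\n"
    )


def ground_objects(image):
    """Stage 2 stub: a real system would run a detector and return boxes.
    Here three made-up boxes are returned as (x1, y1, x2, y2) tuples."""
    return [(10, 10, 50, 50), (60, 20, 90, 70), (5, 80, 40, 120)]


# Made-up concept scores standing in for VLM outputs, keyed by box.
_FAKE_SCORES = {
    ((10, 10, 50, 50), "red"): 0.92,
    ((60, 20, 90, 70), "red"): 0.08,
    ((5, 80, 40, 120), "red"): 0.71,
}


def vlm_score(obj, concept):
    """Stub for querying a VLM about one concept on one detected object."""
    return _FAKE_SCORES.get((obj, concept), 0.0)


def execute(program_src, objects, score):
    """Stage 3: run the generated Python source and call its `run` entry point.
    Note: plain exec() is not a sandbox; untrusted programs need isolation."""
    namespace = {}
    exec(program_src, namespace)
    return namespace["run"](objects, score)


program = generate_program("How many red objects are there?")
objects = ground_objects(image=None)  # no real image is needed for the stubs
answer = execute(program, objects, vlm_score)
```

The point of the layout is the decoupling: the generated program only sees a list of objects and a scoring callback, so the detector or VLM behind `ground_objects` and `vlm_score` could be swapped without touching the reasoning code.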