
Open-World Visual Reasoning by a Neuro-Symbolic Program of Zero-Shot Symbols

Gertjan Burghouts, Fieke Hillerström, Erwin Walraven, Michael van Bekkum, Frank Ruis, Joris Sijs, Jelle van Mil, Judith Dijk

TL;DR

Open-world visual reasoning requires locating configurations of objects that may not have been seen during training. The paper specifies a configuration in first-order logic and uses a neuro-symbolic program to match it against probabilistic object proposals obtained zero-shot from language-vision foundation models (e.g., CLIP): the models ground the symbols, and the program validates the spatial relations between them. It introduces multi-scale, open-world inference to handle varying distances and object proposals, and demonstrates zero-shot localization of configurations such as 'tool on floor' and 'leaking pipe'. In the ROC analysis for the abandoned-tool task, the neuro-symbolic program is more effective than alternative combinations of tool and floor, and the ablations indicate that spatial information and multi-scale reasoning both help. The approach can be deployed for robotic inspection without task-specific training; the remaining errors stem mainly from biases in the symbol proposals of the language-vision model, for which prompt-tuning is suggested as a potential remedy.

Abstract

We consider the problem of finding spatial configurations of multiple objects in images, e.g., a mobile inspection robot is tasked to localize abandoned tools on the floor. We define the spatial configuration of objects by first-order logic in terms of relations and attributes. A neuro-symbolic program matches the logic formulas to probabilistic object proposals for the given image, which are provided by language-vision models that are queried for the symbols. This work is the first to combine neuro-symbolic programming (reasoning) and language-vision models (learning) to find spatial configurations of objects in images in an open-world setting. We show its effectiveness by finding abandoned tools on floors and leaking pipes. We find that most prediction errors are due to biases in the language-vision model.
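
The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Proposal` class, the box-based `on` relation, and the choice of product for conjunction and max for the existential quantifier are all assumptions made for illustration, and the probabilities are made-up stand-ins for what a language-vision model would return.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Proposal:
    """Hypothetical object proposal: a box plus zero-shot symbol probabilities."""
    box: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), y grows downward
    probs: dict[str, float]  # symbol -> probability (stand-in for language-vision output)


def on(a: Proposal, b: Proposal) -> float:
    """Assumed soft relation: fraction of a's width overlapping b horizontally,
    counted only if a's bottom edge lies within b's vertical extent."""
    ax0, _, ax1, ay1 = a.box
    bx0, by0, bx1, by1 = b.box
    if not (by0 <= ay1 <= by1):
        return 0.0
    overlap = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    width = ax1 - ax0
    return overlap / width if width > 0 else 0.0


def tool_on_floor(proposals: list[Proposal]) -> float:
    """Score the formula: exists t, f such that tool(t) and floor(f) and on(t, f).
    Conjunction is a product and the existential is a max (assumed semantics)."""
    best = 0.0
    for t, f in product(proposals, repeat=2):
        if t is f:
            continue  # a proposal is not paired with itself
        score = t.probs.get("tool", 0.0) * f.probs.get("floor", 0.0) * on(t, f)
        best = max(best, score)
    return best


# Toy proposals with invented probabilities, for illustration only.
proposals = [
    Proposal((40, 60, 60, 80), {"tool": 0.9, "floor": 0.05}),   # small box low in the image
    Proposal((0, 50, 100, 100), {"tool": 0.1, "floor": 0.8}),   # wide box covering the lower half
    Proposal((10, 0, 30, 20), {"tool": 0.7, "floor": 0.1}),     # small box near the top
]
print(round(tool_on_floor(proposals), 3))
```

Because the symbols are looked up by name in `probs`, swapping in a different configuration (for example a hypothetical `leak` and `pipe` pair) only requires changing the queried symbols and the relation, which is the flexibility the open-world setting relies on.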

Paper Structure

This paper contains 11 sections, 9 equations, and 5 figures.

Figures (5)

  • Figure 1: Finding spatial configurations of object categories in an open world. A configuration is specified by first-order logic. The symbols relate to (possibly novel) objects that are extracted from images in a zero-shot, probabilistic manner. A neuro-symbolic program validates hypotheses.
  • Figure 2: Tool on floor: good predictions.
  • Figure 3: Tool on floor: errors.
  • Figure 4: Leaking pipe.
  • Figure 5: ROC curves for abandoned tool on floor. The neuro-symbolic program is more effective than alternative combinations of tool and floor. Spatial information and multi-scale reasoning are helpful.
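
For readers unfamiliar with how ROC curves such as those in Figure 5 are obtained from per-image scores, the following is a minimal sketch. The scores and labels are invented toy values, not results from the paper, and `roc_points` and `auc` are hypothetical helper names; the sketch assumes both positive and negative examples are present.

```python
def roc_points(scores, labels):
    """Return (false-positive rate, true-positive rate) pairs,
    sweeping the decision threshold from the highest score to the lowest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all samples tied at this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data: 1 = configuration present, 0 = absent (invented for illustration).
labels = [1, 1, 0, 1, 0, 0]
method_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]  # hypothetical scores of one method
method_b = [0.9, 0.4, 0.7, 0.6, 0.8, 0.1]  # hypothetical scores of a weaker baseline

print(auc(roc_points(method_a, labels)))
print(auc(roc_points(method_b, labels)))
```

A curve that lies closer to the top-left corner, and hence a larger area, indicates that the scores separate positive and negative images better across all thresholds.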