Solving Vision Tasks with Simple Photoreceptors Instead of Cameras

Andrei Atanov; Jiawei Fu; Rishubh Singh; Isabella Yu; Andrew Spielberg; Amir Zamir

TL;DR

This work questions the camera-centric paradigm by showing that extremely simple photoreceptor sensors, when strategically designed, can achieve competitive performance on active vision tasks. It introduces a computational model for photoreceptors, a design space with seven parameters per sensor, and three design strategies (random, intuitive, computational). A joint design-control optimization framework trains a design policy alongside a generalist control policy, enabling automatic discovery of effective sensor layouts that often approach or match camera baselines while using far less sensory bandwidth. The study validates its findings on visual navigation, continuous control, and real-world deployment, and compares computational designs against designs based on human intuition. Overall, the results suggest that lightweight sensing may complement or substitute traditional cameras in resource-constrained robotics settings.

Abstract

A de facto standard in solving computer vision problems is to use a common high-resolution camera and choose its placement on an agent (i.e., position and orientation) based on human intuition. On the other hand, extremely simple and well-designed visual sensors found throughout nature allow many organisms to perform diverse, complex behaviors. In this work, motivated by these examples, we raise the following questions: 1. How effective are simple visual sensors in solving vision tasks? 2. What role does their design play in their effectiveness? We explore simple sensors with resolutions as low as one-by-one pixel, representing a single photoreceptor. First, we demonstrate that just a few photoreceptors can be enough to solve many tasks, such as visual navigation and continuous control, reasonably well, with performance comparable to that of a high-resolution camera. Second, we show that the design of these simple visual sensors plays a crucial role in their ability to provide useful information and successfully solve these tasks. To find a well-performing design, we present a computational design optimization algorithm and evaluate its effectiveness across different tasks and domains, showing promising results. Finally, we perform a human survey to evaluate the effectiveness of intuitive designs devised manually by humans, showing that the computationally found design is among the best designs in most cases.

Paper Structure

This paper contains 36 sections, 4 equations, 28 figures, and 3 tables.

Introduction
Related Work
The Photoreceptor Sensor: Computational Model and Design Space
Computational Model of a Photoreceptor
Design of Visual Sensors
Simple Photoreceptors are Effective Visual Sensors
Experimental Setting
Photoreceptors Achieve Performance Close to a Camera
Visual Sensors Design Optimization
Design is Important for the Effectiveness of Photoreceptors
Computational Design via Joint Optimization
Design Optimization Experiments
Intuitive Designs
Do designs transfer between tasks?
Evaluation in the Real World
...and 21 more sections

Figures (28)

Figure 1: Extremely simple photoreceptor sensors can solve vision tasks reasonably well, comparable to a high-resolution camera. Top: We use photoreceptor sensors with a resolution as low as $1 \times 1$, whose dimensionality is 16384 times lower than that of a $128\times 128$ camera sensor (for visualization purposes, the displayed grid in the figure underestimates this factor). Bottom: We find that even a handful of well-placed photoreceptors can provide sufficient information to solve some vision tasks with reasonably good performance, significantly higher than that of a blind agent and similar to that of a more complex camera sensor. Our evaluation suite consists of eight vision-based active tasks, including visual navigation using scans of real buildings from the MatterPort3D dataset [chang_matterport3d_2017] and continuous control tasks from the DeepMind Control suite [tassa_deepmind_2018].
Figure 2: Simple photoreceptor sensor for active vision tasks. Left: The design space of visual sensors (PR or camera). We vary the extrinsic (position and orientation) and intrinsic (field of view) parameters for each sensor (either a single PR or a $B\times B$ grid with shared extrinsic parameters). We constrain the position of a sensor to the agent's body. Center: To implement the PR sensor computationally in common simulators, we render a camera view (e.g., using a pinhole camera model) with the corresponding design parameters and average the signal spatially. For a grid sensor, we split the image into equal patches and average each of them spatially to get readings for the corresponding $B^2$ PR sensors. Right: Finally, we pass observations from all sensors (along with GPS+Compass for navigation tasks) through a Transformer encoder to predict the action $a$ that optimizes a task-specific reward function.
Figure 3: Photoreceptors are effective visual sensors for navigation tasks. We compare the performance of agents trained with different visual sensors (a varied number of photoreceptors, a camera, or no visual sensor, i.e., an intelligent blind agent) on visual navigation tasks. When scaling the number of PRs, we use configurations of $K\in\{2,4\}$ grids of sizes $4\times 4$ and $8\times 8$. In all cases, including the camera, we report the best design found by our design optimization method (see \ref{sec:dopt}). For the camera baseline, we report performance both with the same shallow 3-layer Transformer encoder used for PRs (for a fair comparison) and with a ResNet-50 backbone, a default choice in the literature ("gold standard"). Even with a handful of photoreceptors, PR agents significantly outperform blind agents and achieve performance close to or better than that of the camera agent with the same shallow encoder, getting close to the gold standard.
Figure 4: Best photoreceptor design visualizations. We visualize the best-performing photoreceptor designs for the PointGoalNav (left) and TargetNav (right) tasks. These are computational designs found by the proposed design optimization method (see \ref{sec:dopt}). Both designs contain a total of 128 PRs in the configuration of $K=2$ grids of size $8\times 8$. While the depicted designs might appear unintuitively irregular, they both result in good performance, as seen in \ref{fig:cam-vs-pr-nav}, and improve upon the random design initialization (see \ref{fig:dopt-perf}). In addition, \ref{fig:pr-spread} shows that a random, uninformed design statistically does not lead to similarly high performance. Therefore, there is a specific structure to these designs, albeit one that is hard to understand intuitively.
Figure 5: Left: In PointGoalNav, photoreceptors enable collision avoidance and efficient trajectories. For each agent, we plot the trajectories from two episodes in unseen test scenes. The red dots denote actions that result in a collision. We find that the PR agent can avoid collisions and choose an efficient trajectory similar to that of the camera agent. Right: In TargetNav, photoreceptors enable efficient exploration and target detection. We plot 50 trajectories for each agent in an unseen test scene. PR agents are able to explore novel scenes efficiently (compared to the blind agent, note the widely spread dark points, which indicate early steps in the episode) and successfully find the target in most cases, approaching the performance of the camera agent.
...and 23 more figures
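
The Figure 2 caption describes the photoreceptor model as spatial averaging of a rendered camera view, with a $B\times B$ grid sensor obtained by averaging equal image patches. The following is a minimal NumPy sketch of that averaging step only. The function name `photoreceptor_readings`, the random stand-in for a rendered view, and the $128\times 128$ RGB input size are illustrative assumptions, not the authors' implementation; rendering and the design parameters are out of scope here.

```python
import numpy as np


def photoreceptor_readings(image: np.ndarray, grid_size: int) -> np.ndarray:
    """Average an (H, W, C) rendered view into a (B, B, C) grid of readings.

    Hypothetical helper: grid_size is B in the caption's B x B grid, and
    grid_size == 1 corresponds to a single photoreceptor.
    """
    height, width, channels = image.shape
    if height % grid_size or width % grid_size:
        raise ValueError("image sides must be divisible by grid_size")
    patch_h, patch_w = height // grid_size, width // grid_size
    # Split rows and columns into (grid index, offset inside the patch).
    patches = image.reshape(grid_size, patch_h, grid_size, patch_w, channels)
    # Averaging over the two offset axes leaves one reading per patch.
    return patches.mean(axis=(1, 3))


# Assumed stand-in for a rendered 128 x 128 RGB camera view.
rng = np.random.default_rng(0)
view = rng.random((128, 128, 3))

single_pr = photoreceptor_readings(view, grid_size=1)  # shape (1, 1, 3)
grid_pr = photoreceptor_readings(view, grid_size=8)    # shape (8, 8, 3)

# Pixel-count ratio between a 128 x 128 camera and a 1 x 1 photoreceptor.
reduction_factor = (128 * 128) // (1 * 1)
# Total photoreceptors in K = 2 grids of size 8 x 8, as in Figure 4.
total_prs = 2 * 8 * 8
```

With these assumptions, `reduction_factor` reproduces the 16384 factor quoted in Figure 1, and `total_prs` matches the 128 PRs of the $K=2$, $8\times 8$ configuration in Figure 4.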
