Multi-Part Object Representations via Graph Structures and Co-Part Discovery
Alex Foo, Wynne Hsu, Mong Li Lee
TL;DR
The paper introduces ECO-Net, an Explicit Compositional Network that represents multi-part objects as graphs of parts and learns to discover object wholes via a co-part object discovery algorithm. A memory module stores recurring objects to support downstream tasks, enabling robust occlusion handling and out-of-distribution generalization. Through extensive experiments on simulated, realistic, and real-world datasets, ECO-Net outperforms state-of-the-art methods in object discovery, occlusion-aware perception, and generalization, and its object representations improve downstream property prediction. The approach demonstrates the value of explicit part-whole structure for robust, interpretable object-centric perception in complex scenes.
Abstract
Discovering object-centric representations from images can significantly enhance the robustness, sample efficiency and generalizability of vision models. Works on images with multi-part objects typically follow an implicit object representation approach, which fails to recognize these learned objects in occluded or out-of-distribution contexts. This is due to the assumption that object part-whole relations are implicitly encoded into the representations through indirect training objectives. We address this limitation by proposing a novel method that leverages explicit graph representations for parts and by presenting a co-part object discovery algorithm. We then introduce three benchmarks to evaluate the robustness of object-centric methods in recognizing multi-part objects within occluded and out-of-distribution settings. Experimental results on simulated, realistic, and real-world images show marked improvements in the quality of discovered objects compared to state-of-the-art methods, as well as accurate recognition of multi-part objects in occluded and out-of-distribution contexts. We also show that the discovered object-centric representations can more accurately predict key object properties in a downstream task, highlighting the potential of our method to advance the field of object-centric representations.
