Joint Neural Networks for One-shot Object Recognition and Detection

Camilo J. Vargas, Qianni Zhang, Ebroul Izquierdo

TL;DR

The paper tackles one-shot object recognition and detection when the training and test classes do not overlap. It introduces Joint Neural Networks (JNN), which learn pairwise matching through joint convolutional layers across two input branches and produce a similarity score $p(x^{(1)},x^{(2)})$ trained with a Binary Cross Entropy loss. Recognition uses an AlexNet-based backbone, while detection adopts a Darknet19/YOLO2-inspired one-shot detector with an $S\times S$ grid and anchor boxes. On MiniImageNet, JNN achieves $61.41\%$ accuracy for one-shot recognition, and on COCO→VOC one-shot detection it reaches $47.1\%$ mAP, demonstrating competitive performance without support-set fine-tuning and highlighting the approach’s scalability to unseen classes.

Abstract

This paper presents a novel joint neural networks approach to address the challenging one-shot object recognition and detection tasks. Inspired by Siamese neural networks and state-of-the-art multi-box detection approaches, the joint neural networks are able to perform object recognition and detection for categories that remain unseen during the training process. Following the one-shot object recognition/detection constraints, the training and testing datasets do not contain overlapping classes; in other words, all the test classes remain unseen during training. The joint networks architecture is able to effectively compare pairs of images via stacked convolutional layers of the query and target inputs, recognising patterns of the same input query category without relying on previous training around this category. The proposed approach achieves 61.41% accuracy for one-shot object recognition on the MiniImageNet dataset and 47.1% mAP for one-shot object detection when trained on the COCO dataset and tested using the Pascal VOC dataset. Code available at https://github.com/cjvargasc/JNN_recog and https://github.com/cjvargasc/JNN_detection/
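The pairwise matching objective described above can be illustrated with a minimal sketch. It assumes the network outputs a similarity score in (0, 1) for each query/target pair and that the label is 1 when both images share a category and 0 otherwise. The function names `bce_loss` and `one_shot_predict` are hypothetical, and choosing the candidate class with the highest score at test time is an assumption for illustration, not a procedure confirmed by the text.

```python
import math


def bce_loss(p, y, eps=1e-7):
    """Binary cross-entropy for a single image pair.

    p: predicted similarity score in (0, 1)
    y: 1 if the pair shares a category, 0 otherwise
    eps: clamp to keep log() finite (an illustrative choice)
    """
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def one_shot_predict(scores):
    """Return the candidate class whose single example scored highest.

    scores: dict mapping class label -> similarity score for the query
    (assumed decision rule, not taken from the paper)
    """
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # A confident correct match is penalised less than a confident miss.
    print(bce_loss(0.9, 1), bce_loss(0.1, 1))
    print(one_shot_predict({"cat": 0.2, "dog": 0.7, "car": 0.4}))
```

Because the loss is defined on pairs rather than on class indices, the same trained scorer can in principle be applied to categories that never appeared during training, which is the property the abstract emphasises.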

Paper Structure (10 sections, 9 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Proposed joint layer structure.
  • Figure 2: Proposed AlexNet-based recognition architecture.
  • Figure 3: Anchor window definition in YOLO2, where $(b_x, b_y)$ and $(b_w, b_h)$ define the center coordinates and size of the anchor window, respectively.
  • Figure 4: JNN DarkNet19-based detection architecture.
  • Figure 5: Subset sample from the MiniImageNet (row 1), QMUL-OpenLogo (row 2), Pascal VOC (row 3), and COCO (row 4) datasets.
  • ...and 3 more figures
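
The anchor window quantities named in the Figure 3 caption can be made concrete with a short sketch of the standard YOLO2 box decoding. The raw outputs `t_x, t_y, t_w, t_h`, the grid cell offsets `c_x, c_y`, and the anchor priors `p_w, p_h` follow the original YOLO2 formulation and do not appear in the text above. Whether the JNN detector uses exactly this parameterisation is an assumption here, and the function name `decode_box` is hypothetical.

```python
import math


def sigmoid(t):
    """Logistic function, used to keep the center inside its grid cell."""
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Map raw network outputs to an anchor window in grid units.

    (c_x, c_y): top-left corner of the grid cell (assumed notation)
    (p_w, p_h): width and height of the anchor prior (assumed notation)
    Returns (b_x, b_y, b_w, b_h): center coordinates and size.
    """
    b_x = sigmoid(t_x) + c_x   # center x, offset within the cell
    b_y = sigmoid(t_y) + c_y   # center y, offset within the cell
    b_w = p_w * math.exp(t_w)  # width scales the anchor prior
    b_h = p_h * math.exp(t_h)  # height scales the anchor prior
    return b_x, b_y, b_w, b_h


if __name__ == "__main__":
    # Zero raw outputs place the box at the cell center with the prior's size.
    print(decode_box(0.0, 0.0, 0.0, 0.0, 3, 4, 1.5, 2.0))
```

With all raw outputs at zero, the decoded box sits at the middle of its grid cell and keeps the anchor prior's width and height, which makes the role of each term easy to check.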