OBoW: Online Bag-of-Visual-Words Generation for Self-Supervised Learning

Spyros Gidaris, Andrei Bursuc, Gilles Puy, Nikos Komodakis, Matthieu Cord, Patrick Pérez

TL;DR

The paper addresses unsupervised image representation learning by introducing OBoW, a fully online teacher–student framework that reconstructs a dynamic bag-of-visual-words target from perturbed inputs. Key innovations include online EMA-based teacher updates, a queue-based online vocabulary, a dynamic BoW-prediction head, and multi-scale BoW targets with aggressive augmentations to foster contextual reasoning. Empirical results across ImageNet, Places205, VOC07, and COCO demonstrate state-of-the-art or competitive performance in linear, few-shot, and downstream tasks, with notable efficiency advantages over prior methods. The work advances unsupervised learning by combining BoW-guided reconstruction with online adaptability, offering strong transfer capabilities and practical applicability.
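
The EMA-based teacher update mentioned above can be illustrated with a minimal sketch. It assumes the parameters are stored as a flat list of floats and uses an arbitrary momentum value; the function name `ema_update` and the numbers are hypothetical and are not taken from the paper or its code.

```python
def ema_update(teacher, student, momentum=0.99):
    """Move every teacher parameter a small step toward the student's value.

    Standard exponential moving average:
        teacher <- momentum * teacher + (1 - momentum) * student
    The momentum value here is an illustrative assumption.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Toy example: the teacher trails the (here frozen) student.
teacher = [0.0, 1.0]
student = [1.0, 1.0]
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.5)
```

With a momentum close to 1 the teacher changes slowly, which is one way to keep the BoW targets stable while the student is trained by gradient descent.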

Abstract

Learning image representations without human supervision is an important and active research field. Several recent approaches have successfully leveraged the idea of making such a representation invariant under different types of perturbations, especially via contrastive-based instance discrimination training. Although effective visual representations should indeed exhibit such invariances, there are other important characteristics, such as encoding contextual reasoning skills, for which alternative reconstruction-based approaches might be better suited. With this in mind, we propose a teacher-student scheme to learn representations by training a convolutional net to reconstruct a bag-of-visual-words (BoW) representation of an image, given as input a perturbed version of that same image. Our strategy performs an online training of both the teacher network (whose role is to generate the BoW targets) and the student network (whose role is to learn representations), along with an online update of the visual-words vocabulary (used for the BoW targets). This idea effectively enables fully online BoW-guided unsupervised learning. Extensive experiments demonstrate the interest of our BoW-based strategy, which surpasses previous state-of-the-art methods (including contrastive-based ones) in several applications. For instance, in downstream tasks such as Pascal object detection, Pascal classification and Places205 classification, our method improves over all prior unsupervised approaches, thus establishing new state-of-the-art results that are also significantly better than even those of supervised pre-training. We provide the implementation code at https://github.com/valeoai/obow.
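
To make the notion of a BoW target more concrete, the following is a rough sketch of how teacher features might be turned into a BoW distribution that the student is trained to predict. The soft assignment through a softmax over negative squared distances, the max pooling over locations, the temperature value and the cross-entropy loss are all assumptions made for illustration; the text above only states that the teacher generates BoW targets over a vocabulary. All function names are hypothetical.

```python
import math


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soft_assign(feature, vocabulary, temperature=1.0):
    """Softmax over negative squared distances to every visual word (assumed form)."""
    logits = [-squared_distance(feature, word) / temperature for word in vocabulary]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bow_target(local_features, vocabulary, temperature=1.0):
    """Max-pool the per-location assignments, then L1-normalise (assumed pooling)."""
    assignments = [soft_assign(f, vocabulary, temperature) for f in local_features]
    pooled = [max(column) for column in zip(*assignments)]
    total = sum(pooled)
    return [p / total for p in pooled]


def cross_entropy(target, predicted, eps=1e-12):
    """Loss between the teacher's BoW target and the student's predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


# Toy example: 3 visual words in 2-D, teacher features at two spatial locations.
vocabulary = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
teacher_features = [[0.1, 0.0], [0.9, 0.1]]
target = bow_target(teacher_features, vocabulary, temperature=0.1)
uniform_loss = cross_entropy(target, [1.0 / 3.0] * 3)
```

In this toy case the two locations lie near the first and second words, so the target puts almost all of its mass on those two entries and very little on the third.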

Paper Structure

This paper contains 22 sections, 6 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Unsupervised learning with Bag-of-Words guidance. Two encoders $\mathrm{T}$ and $\mathrm{S}$ learn at different tempos by interacting and learning from each other. An image $\mathbf{x}$ is passed through the encoder $\mathrm{T}$ and its output feature maps $\mathrm{T}^{\ell}(\mathbf{x})$ are embedded into a BoW representation $y_{\mathrm{T}}(\mathbf{x})$ over a vocabulary $V$ of features from $\mathrm{T}$. The vocabulary $V$ is updated at each step. The encoder $\mathrm{S}$ aims to reconstruct $y_{\mathrm{T}}(\mathbf{x})$ from data-augmented instances $\tilde{\mathbf{x}}$. A dynamic BoW-prediction head learns to leverage the continuously updated vocabulary $V$ to compute the BoW representation from the features $\mathrm{S}(\tilde{\mathbf{x}})$. $\mathrm{T}$ slowly follows the learning trajectory of $\mathrm{S}$ via momentum updates.
  • Figure 2: Vocabulary queue from randomly sampled local features. For each input image $\mathbf{x}$ to $\mathrm{T}$, "local" features are pooled from $\mathrm{T}^{\ell}(\mathbf{x})$ by averaging over $3 \times 3$ sliding windows. One of the resulting vectors is selected randomly and added as a visual word to the vocabulary queue, replacing the oldest word in the vocabulary.
  • Figure 3: Dynamic BoW-prediction head. $\mathrm{G}(\cdot)$ learns to quickly adapt to the visual words in the continuously refreshed vocabulary $V$. The outputs $\mathrm{G}(V)$ are in fact weights that are used for mapping the features $\mathrm{S}(\tilde{\mathbf{x}})$ to the corresponding BoW vector $y_{\mathrm{S}}(\tilde{\mathbf{x}})$.
  • Figure 4: Reconstructing BoWs from small parts of the original image. Given a training image (left), we extract two types of image crops. The first type (middle) is obtained by randomly sampling an image region whose area covers no more than $60\%$ of the entire image, resizing it to $160 \times 160$ and then giving it as input to the student as part of the reconstruction task. The second type (right) is obtained by randomly selecting an area that covers between $60\%$ and $100\%$ of the entire image, resizing it to a $256 \times 256$ image, dividing it into $3 \times 3$ overlapping patches of size $96 \times 96$, and randomly choosing $5$ out of these $9$ patches (indicated with red rectangles) that are given as $5$ separate inputs to the student. The student must then reconstruct the original BoW target independently for each patch. The blue rectangle on the left image indicates the central $224 \times 224$ crop from which the teacher produces the BoW target. Note that, except for horizontal flipping, no other perturbation is applied to the teacher's inputs.
  • Figure 5: Examples of visual-word members from the conv5 layer of ResNet50. The visualizations are created by using the state of the queue-based visual-words vocabulary at the end of training. For each visual word, we depict the 8 image patches retrieved from ImageNet with the highest assignment score for that word.
  • ...and 1 more figure
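
The vocabulary queue described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, assuming the teacher feature map is a nested list of shape height × width × channels, that the $3 \times 3$ windows slide with stride 1 and no padding, and that the queue holds only 4 words; the function names, the queue size and the toy feature maps are hypothetical.

```python
import random
from collections import deque


def pool_3x3(feature_map):
    """Average the channel vectors over every 3x3 window (stride 1, no padding assumed)."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    pooled = []
    for i in range(height - 2):
        for j in range(width - 2):
            window = [feature_map[i + di][j + dj] for di in range(3) for dj in range(3)]
            pooled.append([sum(v[c] for v in window) / 9.0 for c in range(channels)])
    return pooled


def update_vocabulary(queue, feature_map, rng):
    """Add one randomly chosen pooled feature as a new visual word.

    A deque created with maxlen drops its oldest entry when a new one is
    appended to a full queue, which gives the first-in, first-out replacement.
    """
    queue.append(rng.choice(pool_3x3(feature_map)))


rng = random.Random(0)
vocabulary = deque(maxlen=4)  # hypothetical vocabulary size, kept tiny for the example
for step in range(6):
    # Constant 5x5x2 toy feature map whose entries all equal the step index.
    fmap = [[[float(step)] * 2 for _ in range(5)] for _ in range(5)]
    update_vocabulary(vocabulary, fmap, rng)
```

After six updates with a capacity of four, the words contributed by the first two images have been evicted and the queue holds the words from the four most recent images.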