
MixBCT: Towards Self-Adapting Backward-Compatible Training

Yu Liang, Yufeng Zhang, Shiliang Zhang, Yaowei Wang, Sheng Xiao, Rong Xiao, Xiaoyu Wang

TL;DR

MixBCT tackles backward-compatible training for large-scale retrieval by enabling a new model to leverage distribution information from old features without updating the old gallery. It introduces a simple, unified approach that mixes old and new features and trains the new classifier on these mixed representations, with an adaptive constraint informed by old feature dispersion. The method uses a single loss term, optionally denoises old features to form a more robust mixed representation, and demonstrates strong gains over prior prototype-based and instance-based BCT methods on MS1Mv3 and IJB-C across open-set and cross-model scenarios. This yields a practical, backfill-free deployment path that remains effective across varying old-model qualities and data-split configurations.

Abstract

Backward-compatible training circumvents the need for expensive updates to the old gallery database when deploying an advanced new model in the retrieval system. Previous methods achieved backward compatibility by aligning prototypes of the new model with those of the old one, yet they often overlooked the distribution of old features, limiting their effectiveness when the low quality of the old model results in weak feature discriminability. Instance-based methods like L2 regression take the distribution of old features into account but impose strong constraints on the performance of the new model itself. In this paper, we propose MixBCT, a simple yet highly effective backward-compatible training method that serves as a unified framework for old models of varying qualities. We construct a single loss function applied to mixed old and new features to facilitate backward-compatible training, which adaptively adjusts the constraint domain for new features based on the distribution of old features. We conducted extensive experiments on the large-scale face recognition datasets MS1Mv3 and IJB-C to verify the effectiveness of our method. The experimental results clearly demonstrate its superiority over previous methods. Code is available at https://github.com/yuleung/MixBCT.
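The core idea in the abstract, a single loss applied to mixed old and new features, can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper or its repository: the per-sample Bernoulli mixing rule, the `old_ratio` parameter, the plain linear classifier, and the softmax cross-entropy loss (the authors' actual loss and mixing schedule may differ).

```python
import numpy as np

def mix_features(new_feats, old_feats, old_ratio, rng):
    """Swap a random subset of new features for the old features of the same samples.

    Assumed scheme: each sample independently uses its old feature with
    probability `old_ratio` (a hypothetical parameter).
    """
    if new_feats.shape != old_feats.shape:
        raise ValueError("old and new features must share a shape")
    take_old = rng.random(new_feats.shape[0]) < old_ratio
    mixed = np.where(take_old[:, None], old_feats, new_feats)
    return mixed, take_old

def softmax_cross_entropy(feats, weights, labels):
    """One classification loss over whatever features the classifier receives."""
    logits = feats @ weights.T                               # (batch, classes)
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(labels.shape[0]), labels].mean())

rng = np.random.default_rng(0)
batch, dim, num_classes = 8, 4, 3
labels = rng.integers(0, num_classes, size=batch)
old_feats = rng.normal(size=(batch, dim))        # stand-in for phi_o(D_n)
new_feats = rng.normal(size=(batch, dim))        # stand-in for phi_n(D_n)
weights = rng.normal(size=(num_classes, dim))    # stand-in for the new classifier psi_n

mixed, take_old = mix_features(new_feats, old_feats, old_ratio=0.5, rng=rng)
loss = softmax_cross_entropy(mixed, weights, labels)
```

Because the classifier sees old and new features through the same loss, its class weights must separate both sets at once, which is one way to read the "adaptive constraint" described above.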

Paper Structure (29 sections, 8 equations, 6 figures, 9 tables, 1 algorithm)

Figures (6)

  • Figure 1: During the training phase, the old features $\phi_o(\mathcal{D}_n)$ are extracted and mixed with the new features $\phi_n(\mathcal{D}_n)$, and the mixture is fed into the new classifier $\psi_n$. Once backward-compatible training is complete, only the new model is required for feature extraction. Note that the dashed part of the workflow can be completed before backward-compatible training begins.
  • Figure 2: Consider any two classes $C^1$ and $C^2$. In each subgraph, the circle or ellipse marks the region occupied by the old features, and $C_{old}^1$ and $C_{old}^2$ are the old class centers of $C^1$ and $C^2$, respectively. From left to right, the quality of the old model decreases progressively. The light blue dashed line is the boundary hyperplane that existing old-prototype-based methods use to constrain the new features; it is the perpendicular bisector of the segment joining the two class centers. Specifically, the old-prototype-based methods restrict new features belonging to $C^1$ to the left of the light blue dashed line and new features belonging to $C^2$ to the right. The light orange dashed line represents the boundary hyperplane where our method constrains the new model's classifier weights. MixBCT imposes the constraints $d(\phi_{o/n}^1(x); C_{new}^1) < d(\phi_{o/n}^1(x); C_{new}^2)$ and $d(\phi_{o/n}^2(x); C_{new}^2) < d(\phi_{o/n}^2(x); C_{new}^1)$, which include the constraints on old features $d(\phi_{o}^1(x); C_{new}^1) < d(\phi_{o}^1(x); C_{new}^2)$ and $d(\phi_{o}^2(x); C_{new}^2) < d(\phi_{o}^2(x); C_{new}^1)$. Thus, the area not enclosed by the light orange dashed line forms the feasible domain.
  • Figure 3: Open-class scenario: comparison with state-of-the-art methods of 1:1 verification (left) and 1:N identification (right) performance on the IJB-C face recognition benchmark, for old models of different quality. ‘CT’ denotes ‘cross-test’, measuring backward-compatible performance, while ‘ST’ stands for ‘self-test’, measuring the negative impact of backward-compatible training on new models. The x-axis represents the model quality: points further to the right indicate lower quality of the old model and higher within-class variance of the old embedding.
  • Figure 4: Feature visualization, with different colors for different classes, when the old model has low quality and the old features cannot be separated well. ‘$\blacktriangledown$’: Old, ‘$\bigstar$’: NCCL, ‘$\times$’: UniBCT, ‘$\blacksquare$’: AdvBCT, ‘$\bullet$’: MixBCT.
  • Figure 5: Visualization of the denoising operation. Different colors represent different categories, and ‘$\bullet$’ marks noisy samples.
  • ...and 1 more figure
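The constraint stated in the Figure 2 caption, that every old or new feature of a class must lie closer to its own new class center than to any other, can be checked numerically. The sketch below is illustrative only: it assumes $d$ is the Euclidean distance (the caption does not specify the metric), and the function name `satisfies_constraint` and the toy 2-D data are hypothetical.

```python
import numpy as np

def satisfies_constraint(feats, labels, new_centers):
    """Return a boolean mask: True where d(x; C_new^own) < d(x; C_new^other) for all other classes.

    Assumption: d is the Euclidean distance.
    """
    # Pairwise distances between every feature and every new class center: (n, k)
    dists = np.linalg.norm(feats[:, None, :] - new_centers[None, :, :], axis=2)
    rows = np.arange(labels.shape[0])
    own = dists[rows, labels]
    others = dists.copy()
    others[rows, labels] = np.inf            # exclude the feature's own class
    return own < others.min(axis=1)

# Toy example with two classes and hand-picked 2-D points
new_centers = np.array([[-1.0, 0.0], [1.0, 0.0]])     # stand-ins for C_new^1, C_new^2
feats = np.array([[-0.8, 0.3], [0.2, 0.1], [0.9, -0.4], [-0.1, 0.5]])
labels = np.array([0, 0, 1, 1])

ok = satisfies_constraint(feats, labels, new_centers)
```

In this toy case the second and fourth points sit on the wrong side of the boundary between the two new centers, so the mask is `[True, False, True, False]`. In a training setting, such violations are what the classification loss on mixed features would penalize.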