Model Merging is Secretly Certifiable: Non-Vacuous Generalisation Bounds for Low-Shot Learning

Taehoon Kim, Henry Gouk, Minyoung Kim, Timothy Hospedales

TL;DR

The paper addresses the challenge of certifying IID generalisation for large neural networks in low-shot settings by linking model merging, a practical multi-source transfer approach, with PAC-Bayes generalisation bounds. By interpreting merging as a PAC-Bayes posterior and optionally optimising bound-aware objectives, it demonstrates non-vacuous certificates for CLIP-ViT-B/32 and Mistral-7B with as few as 100 examples, and shows that data-dependent priors further tighten these guarantees. The results indicate that off-the-shelf merging methods can be made certifiable with modest adjustments, and that large models can be reliably certified in data-scarce regimes, with implications for trustworthy AI and regulatory compliance. The work suggests future directions toward sparse, merge-based representations and tighter integration of bound objectives into practical learning pipelines for scalable certification.

Abstract

Certifying the IID generalisation ability of deep networks is the first of many requirements for trusting AI in high-stakes applications from medicine to security. However, when instantiating generalisation bounds for deep networks it remains challenging to obtain non-vacuous guarantees, especially when applying contemporary large models to the small-scale data prevalent in such high-stakes fields. In this paper, we draw a novel connection between a family of learning methods based on model fusion and generalisation certificates, and surprisingly show that with minor adjustments several existing learning strategies already provide non-trivial generalisation guarantees. Essentially, by focusing on data-driven learning of downstream tasks by fusion rather than fine-tuning, the certified generalisation gap becomes tiny and independent of the base network size, facilitating its certification. Our results show for the first time non-trivial generalisation guarantees for learning with as few as 100 examples, while using vision models such as ViT-B and language models such as Mistral-7B. This observation is significant as it has immediate implications for facilitating the certification of existing systems as trustworthy, and opens up new directions for research at the intersection of practice and theory.
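
To make concrete the claim that the certified gap can be independent of the base network size, the following is a minimal Python sketch of a standard PAC-Bayes-kl style bound applied to a posterior over a handful of merging coefficients. It is an illustration under stated assumptions, not the paper's implementation: the Gaussian prior and posterior with shared scalar standard deviations, the ln(2·sqrt(n)/δ) complexity term, the eight coefficients, and all numeric values are assumptions chosen for the example.

```python
import math


def binary_kl(q: float, p: float) -> float:
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12  # clamp to avoid log(0)
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))


def kl_inverse(q: float, c: float, tol: float = 1e-9) -> float:
    """Upper end of the interval {p >= q : kl(q || p) <= c}, by bisection."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(q, mid) > c:
            hi = mid
        else:
            lo = mid
    return hi  # return the upper side so the bound stays conservative


def gaussian_kl(mu_q, mu_p, sigma_q: float, sigma_p: float) -> float:
    """KL(Q || P) for Gaussians N(mu_q, sigma_q^2 I) and N(mu_p, sigma_p^2 I)."""
    d = len(mu_q)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_q, mu_p))
    return (
        d * math.log(sigma_p / sigma_q)
        + (d * sigma_q ** 2 + sq_dist) / (2 * sigma_p ** 2)
        - d / 2
    )


def pac_bayes_kl_bound(train_err: float, kl_qp: float, n: int,
                       delta: float = 0.05) -> float:
    """Upper bound on expected error holding with probability >= 1 - delta."""
    complexity = (kl_qp + math.log(2 * math.sqrt(n) / delta)) / n
    return kl_inverse(train_err, complexity)


# Hypothetical setting: 8 merging coefficients, 100 training examples.
prior_mean = [0.3] * 8       # assumed prior centre for the coefficients
posterior_mean = [0.35] * 8  # assumed learned coefficients
kl_qp = gaussian_kl(posterior_mean, prior_mean, sigma_q=0.1, sigma_p=0.1)
bound = pac_bayes_kl_bound(train_err=0.10, kl_qp=kl_qp, n=100)
print(f"KL(Q||P) = {kl_qp:.3f}, certified error bound = {bound:.3f}")
```

In this sketch the KL term is computed over the merging coefficients only, so its size is governed by how many coefficients are learned rather than by the parameter count of the underlying network.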

Paper Structure

This paper contains 24 sections, 2 theorems, 15 equations, 6 figures, 6 tables.

Key Result

Theorem 1

For any PAC-Bayes posterior, $Q$, and PAC-Bayes prior, $P$, we have with confidence at least $1 - \delta$ that

Figures (6)

  • Figure 1: Impact of learning with bound-optimization and data-dependent prior. CLIP-ViT-B/32 on EuroSAT, GTSRB, SVHN, MNIST and DTD. Each point corresponds to a different combination of dataset and merging algorithm. CG Gap refers to the certified generalisation gap. (a) Certified Generalisation Gap vs. Train Error, (b) Certified Generalisation Gap vs. Test Error. Bound optimization is often crucial for non-vacuous results (white zone). (c) Test Error vs. PAC-Bayes Bound. Combining bound optimization with DD prior leads to a good tradeoff for LW-Adamerge.
  • Figure 2: Effect of adopting a data-dependent prior (DDP) with layer-wise AdaMerging [yangadamerging] across 5 image classification datasets with CLIP-ViT-B/32. Shaded cells indicate regions with vacuous bounds.
  • Figure 3: Performance certificates as a function of total data size on 5 image classification datasets with CLIP-ViT-B/32. The DDP method refers to task-wise AdaMerging with a data-dependent prior. In DDP, we use half of the data for fitting a prior, and the rest for fitting a posterior and computing the bound. In Half Val, we use half of the data for model training and the rest for computing the "test set" (confidence interval) based bound.
  • Figure 4: Actual and certified generalisation gaps for different total data sizes (100, 500, 1000, 2000, 4000) on 5 image classification datasets with CLIP-ViT-B/32. We use the bound optimisation approach with task-wise AdaMerging. Act. Gap refers to the actual generalisation gap, computed by subtracting the train error from the test error, and Cert. Gap refers to the certified generalisation gap, computed by subtracting the train error from the PAC-Bayes bound. Notably, the test error can be certified to be within 5% of the training error.
  • Figure 5: Metrics (Train Accuracy, Test Accuracy, PAC-Bayes Bound, PAC-Bayes Upper Bound and KL-Divergence) as training proceeds on EuroSAT with CLIP-ViT-B/32. LW-Ada refers to layer-wise AdaMerging, LW-Ada+Bound refers to directly optimising the PAC-Bayes bound with layer-wise AdaMerging, and LW-Ada+DDP refers to directly optimising the PAC-Bayes bound using a data-dependent prior (DDP) with layer-wise AdaMerging.
  • ...and 1 more figure
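
The two gap definitions given in the Figure 4 caption amount to simple subtractions. The short Python sketch below spells them out; the function name and the numeric values are hypothetical and are not results from the paper.

```python
def generalisation_gaps(train_err: float, test_err: float, bound: float):
    """Return (actual gap, certified gap) as defined in the Figure 4 caption."""
    actual_gap = test_err - train_err   # Act. Gap: test error minus train error
    certified_gap = bound - train_err   # Cert. Gap: PAC-Bayes bound minus train error
    return actual_gap, certified_gap


# Hypothetical numbers for illustration only.
actual, certified = generalisation_gaps(train_err=0.10, test_err=0.12, bound=0.14)
print(f"actual gap = {actual:.2f}, certified gap = {certified:.2f}")
```

Whenever the bound holds and the test error does not exceed it, the certified gap is at least as large as the actual gap, so a small certified gap guarantees that the test error stays close to the training error.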

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2 [langford2001bounds; seeger2002]