Pushing the boundary on Natural Language Inference

Pablo Miralles-González, Javier Huertas-Tato, Alejandro Martín, David Camacho

TL;DR

This work tackles the brittleness of natural language inference by applying Group Relative Policy Optimization (GRPO) to train chain-of-thought reasoning in large language models, which removes the need for human-labeled rationales. The authors fine-tune 7B, 14B, and 32B models with LoRA/QLoRA, including AWQ-quantized variants, and report strong performance on standard and adversarial NLI benchmarks. Their 32B AWQ-quantized model surpasses prior state-of-the-art results on 7 of 11 adversarial sets within a 22GB memory footprint. The results indicate that robust reasoning can be retained under aggressive quantization when GRPO is combined with parameter-efficient adapters. Overall, the paper offers a scalable, practical framework for building robust NLI systems for memory-constrained environments without sacrificing inference quality.

Abstract

Natural Language Inference (NLI) is a central task in natural language understanding with applications in fact-checking, question answering, and information retrieval. Despite its importance, current NLI systems heavily rely on supervised learning with datasets that often contain annotation artifacts and biases, limiting generalization and real-world applicability. In this work, we apply a reinforcement learning-based approach using Group Relative Policy Optimization (GRPO) for Chain-of-Thought (CoT) learning in NLI, eliminating the need for labeled rationales and enabling this type of training on more challenging datasets such as ANLI. We fine-tune 7B, 14B, and 32B language models using parameter-efficient techniques (LoRA and QLoRA), demonstrating strong performance across standard and adversarial NLI benchmarks. Our 32B AWQ-quantized model surpasses state-of-the-art results on 7 out of 11 adversarial sets (or on all of them considering our replication) within a 22GB memory footprint, showing that robust reasoning can be retained under aggressive quantization. This work provides a scalable and practical framework for building robust NLI systems without sacrificing inference quality.
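
To make concrete why GRPO needs no labeled rationales, here is a minimal sketch of the group-relative advantage at the core of the method. Several completions are sampled for one premise/hypothesis pair, each is scored only on its final label, and the scores are normalized within the group. The binary reward, the function names, and the sample data are illustrative assumptions and not the authors' implementation. The population standard deviation is used here for simplicity, and actual implementations may differ.

```python
from statistics import mean, pstdev


def label_reward(predicted: str, gold: str) -> float:
    """Hypothetical reward: 1.0 if the final predicted label matches the gold label."""
    return 1.0 if predicted == gold else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Final labels from four sampled chains of thought for one example (made-up data).
predictions = ["entailment", "neutral", "entailment", "contradiction"]
gold = "entailment"

rewards = [label_reward(p, gold) for p in predictions]
advantages = group_relative_advantages(rewards)

for pred, adv in zip(predictions, advantages):
    print(f"{pred:>13}: advantage {adv:+.3f}")
```

Completions that reach the correct label receive a positive advantage and the rest a negative one, so the reasoning text is reinforced only indirectly through the label it leads to. When every completion in a group earns the same reward, all advantages are zero and that example contributes no learning signal.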

Paper Structure

This paper contains 36 sections, 1 equation, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Average accuracy on our dataset selection against model size on GPU. Models fine-tuned by GRPO have a star marker, whereas base models have a dot marker. We select our best models with a LoRA rank of 64.
  • Figure 2: Effect of GRPO training on ANLI accuracy by model size, test set and decoding temperature. For non-greedy decoding, five evaluations are performed, and the standard deviation is plotted as an error line. We use models from the family. GRPO is performed with LoRA and a rank of $64$.
  • Figure 3: Training metrics by LoRA rank for Qwen2.5-7B-Instruct.
  • Figure 4: Training metrics by number of parameters for Qwen2.5-*B-Instruct-AWQ models.
  • Figure 5: Confusion matrices between the base and GRPO models' predictions, and between predictions and gold labels, for ANLI R3 test examples where the base and GRPO models differ in output.