Seq-VCR: Preventing Collapse in Intermediate Transformer Representations for Enhanced Reasoning
Md Rifat Arefin, Gopeshh Subbaraj, Nicolas Gontier, Yann LeCun, Irina Rish, Ravid Shwartz-Ziv, Christopher Pal
TL;DR
This work identifies representation collapse in intermediate Transformer layers as a key bottleneck for multi-step arithmetic reasoning and introduces Sequential Variance-Covariance Regularization (Seq-VCR) to preserve representation diversity. By combining Seq-VCR with dummy pause tokens that stand in for chain-of-thought (CoT) tokens without explicit CoT supervision, the method reaches 99.5% exact match on 5×5 integer multiplication, surpassing GPT-4 with five-shot CoT prompting (44%) on that task. Seq-VCR promotes high-variance, low-covariance intermediate representations, evidenced by higher layer entropy and improved learning dynamics, while also reducing inference time relative to explicit CoT. The approach yields significant gains on the Arithmetic Expressions and LIS tasks, highlighting its potential as a robust, CoT-free mechanism to enhance Transformer reasoning. These results underscore the importance of preventing collapse in intermediate layers and suggest broader applicability to diverse sequential reasoning problems.
Abstract
Decoder-only Transformers often struggle with complex reasoning tasks, particularly arithmetic reasoning requiring multiple sequential operations. In this work, we identify representation collapse in the model's intermediate layers as a key factor limiting their reasoning capabilities. To address this, we propose Sequential Variance-Covariance Regularization (Seq-VCR), which enhances the entropy of intermediate representations and prevents collapse. Combined with dummy pause tokens as substitutes for chain-of-thought (CoT) tokens, our method significantly improves performance in arithmetic reasoning problems. In the challenging $5 \times 5$ integer multiplication task, our approach achieves $99.5\%$ exact match accuracy, outperforming models of the same size (which yield $0\%$ accuracy) and GPT-4 with five-shot CoT prompting ($44\%$). We also demonstrate superior results on arithmetic expression and longest increasing subsequence (LIS) datasets. Our findings highlight the importance of preventing intermediate layer representation collapse to enhance the reasoning capabilities of Transformers and show that Seq-VCR offers an effective solution without requiring explicit CoT supervision.
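To make the idea of a variance-covariance penalty concrete, the following is a minimal NumPy sketch in the general style of such regularizers, not the paper's exact formulation. The function name `variance_covariance_penalty`, the hinge target `gamma`, the stabilizer `eps`, and the choice to pool statistics over both batch and sequence positions are all assumptions made for illustration. The variance term is large when every feature has near-zero spread (collapse), and the covariance term is large when features are redundant.

```python
import numpy as np

def variance_covariance_penalty(h, gamma=1.0, eps=1e-4):
    """Illustrative penalty on intermediate representations (hypothetical helper).

    h: array of shape (batch, seq_len, dim) holding one layer's activations.
    Returns (var_term, cov_term); both are >= 0 and lower is better.
    Assumption: statistics are pooled over batch and sequence positions.
    """
    b, t, d = h.shape
    x = h.reshape(b * t, d)
    x = x - x.mean(axis=0, keepdims=True)        # center each feature

    # Variance term: hinge that penalizes features whose std falls below gamma.
    std = np.sqrt(x.var(axis=0) + eps)
    var_term = np.mean(np.maximum(0.0, gamma - std))

    # Covariance term: mean squared off-diagonal covariance per feature.
    cov = (x.T @ x) / (x.shape[0] - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_term = np.sum(off_diag ** 2) / d
    return var_term, cov_term

rng = np.random.default_rng(0)
collapsed = np.ones((8, 16, 4))                    # every token maps to the same vector
diverse = rng.normal(scale=2.0, size=(8, 16, 4))   # well-spread features
print(variance_covariance_penalty(collapsed))
print(variance_covariance_penalty(diverse))
```

In training, a weighted sum of the two terms would be added to the usual next-token loss for the chosen intermediate layer; the weights are hyperparameters not specified here.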
