Explore Papers
Read papers in beautiful, interactive HTML
Trending Today
EmbryoDiff: A Conditional Diffusion Framework with Multi-Focal Feature Fusion for Fine-Grained Embryo Developmental Stage Recognition
Authors: Yong Sun, Zhengjie Zhang, Junyu Shi, Zhiyuan Zhang, Lijiang Liu, Qiang Nie
Identification of fine-grained embryo developmental stages during In Vitro Fertilization (IVF) is crucial for assessing embryo viability. Although recent deep learning methods have achieved promising accuracy, existing discriminative models fail to exploit the distributional prior of embryonic development. Moreover, their reliance on single-focal information leads to incomplete embryonic representations, making them susceptible to feature ambiguity under cell occlusions. To address these limitations, we propose EmbryoDiff, a two-stage diffusion-based framework that formulates the task as a conditional sequence denoising process. Specifically, we first train and freeze a frame-level encoder to extract robust multi-focal features. In the second stage, we introduce a Multi-Focal Feature Fusion Strategy that aggregates information across focal planes to construct a 3D-aware morphological representation, effectively alleviating ambiguities arising from cell occlusions. Building on this fused representation, we derive complementary semantic and boundary cues and design a Hybrid Semantic-Boundary Condition Block to inject them into the diffusion-based denoising process, enabling accurate embryonic stage classification. Extensive experiments on two benchmark datasets show that our method achieves state-of-the-art results. Notably, with only a single denoising step, our model obtains the best average test performance, reaching 82.8% and 81.3% accuracy on the two datasets, respectively.
cs.CV
Training Neural Networks at Any Scale
Authors: Thomas Pethick, Kimon Antonakopoulos, Antonio Silveti-Falls, Leena Chennuru Vankadara, Volkan Cevher
This article reviews modern optimization methods for training neural networks with an emphasis on efficiency and scale. We present state-of-the-art optimization algorithms under a unified algorithmic template that highlights the importance of adapting to the structures in the problem. We then cover how to make these algorithms agnostic to the scale of the problem. Our exposition is intended as an introduction for both practitioners and researchers who wish to be involved in these exciting new developments.
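As a toy illustration of such a unified template (our construction, not the article's formulation), many first-order methods differ only in how they map the raw gradient to an update direction:

    import torch

    def sgd_direction(g):
        # Plain steepest descent in the Euclidean norm.
        return g

    def sign_direction(g):
        # signSGD/Lion-style: only the sign is kept, making the step size
        # invariant to the gradient's magnitude (one way to be scale-agnostic).
        return g.sign()

    def step(params, lr, direction=sgd_direction):
        # Unified template: choose a direction from the gradient (possibly
        # adapted to problem structure), then take a fixed-size step.
        with torch.no_grad():
            for p in params:
                if p.grad is not None:
                    p -= lr * direction(p.grad)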
cs.LG
Relaxation to an Ideal Chern Band through Coupling to a Markovian Bath
Authors: Bruno Mera, Tomoki Ozawa
We propose a microscopic, weak-coupling mechanism by which generic Chern bands relax toward ideal bands. We consider coupling interacting electrons to a Caldeira-Leggett-like Ohmic bosonic bath. Within the Born-Markov approximation, Slater determinant states of a Chern band under the Hartree-Fock approximation evolve toward Slater determinant states corresponding to an ideal Chern band. We validate our proposal by performing numerical simulations of a massive Dirac model, showing that the Berry curvature and quantum metric indeed co-evolve to saturate the trace condition. Our proposal provides a concrete dissipative route to realizing ideal Chern bands, a fundamental building block for the stabilization of fractional Chern insulators.
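For reference, the trace condition mentioned above relates the quantum metric g(k) and the Berry curvature Omega(k); an ideal Chern band is one that saturates the inequality at every momentum:

    \mathrm{tr}\, g(\mathbf{k}) \;\ge\; |\Omega(\mathbf{k})|
    \quad \text{for all } \mathbf{k},
    \qquad
    \text{ideal band:}\quad \mathrm{tr}\, g(\mathbf{k}) = |\Omega(\mathbf{k})|.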
cond-mat.mes-hall cond-mat.quant-gas math-ph
C2Views: Knowledge-based Colormap Design for Multiple-View Consistency
Authors: Yihan Hou, Yilin Ye, Liangwei Wang, Huamin Qu, Wei Zeng
Multiple-view (MV) visualization provides a comprehensive and integrated perspective on complex data, establishing itself as an effective method for visual communication and exploratory data analysis. While existing studies have predominantly focused on designing explicit visual linkages and coordinated interactions to facilitate the exploration of MV visualizations, these approaches often demand extra graphical and interactive effort, overlooking the potential of color as an effective channel for encoding data and relationships. Addressing this oversight, we introduce C2Views, a new framework for colormap design that implicitly shows the relation across views. We begin by structuring the components and their relationships within MVs into a knowledge-based graph specification, wherein colormaps, data, and views are denoted as entities, and the interactions among them are illustrated as relations. Building on this representation, we formulate the design criteria as an optimization problem and employ a genetic algorithm enhanced by Pareto optimality, generating colormaps that balance single-view effectiveness and multiple-view consistency. Our approach is further complemented with an interactive interface for user-intended refinement. We demonstrate the feasibility of C2Views through various colormap design examples for MVs, underscoring its adaptability to diverse data relationships and view layouts. Comparative user studies indicate that our method outperforms the existing approach in facilitating color distinction and enhancing multiple-view consistency, thereby simplifying data exploration processes.
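As a generic sketch of the Pareto-based selection step (not C2Views' actual algorithm; the objective names are ours), a genetic algorithm can retain the non-dominated colormaps in each generation:

    def dominates(a, b):
        # a, b: objective tuples, e.g. (single_view_effectiveness,
        # multi_view_consistency), higher is better on every axis.
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    def pareto_front(population, score):
        # Keep the candidates that no other candidate dominates.
        scored = [(c, score(c)) for c in population]
        return [c for c, s in scored
                if not any(dominates(t, s) for _, t in scored if t is not s)]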
cs.HC
6D Strawberry Pose Estimation: Real-time and Edge AI Solutions Using Purely Synthetic Training Data
Authors: Saptarshi Neil Sinha, Julius Kühn, Mika Silvan Goschke, Michael Weinmann
Automated and selective harvesting of fruits has become an important area of research, particularly due to challenges such as high costs and a shortage of seasonal labor in advanced economies. This paper focuses on 6D pose estimation of strawberries using purely synthetic data generated through a procedural pipeline for photorealistic rendering. We employ the YOLOX-6D-Pose algorithm, a single-shot approach that leverages the YOLOX backbone, known for its balance between speed and accuracy, and its support for edge inference. To address the limited availability of training data, we introduce a robust and flexible pipeline for generating synthetic strawberry data from various 3D models via a procedural Blender pipeline, focusing on enhancing the realism of the synthesized data relative to previous work so that it serves as a valuable resource for training pose estimation algorithms. Quantitative evaluations indicate that our models achieve comparable accuracy on both the NVIDIA RTX 3090 and Jetson Orin Nano across several ADD-S metrics, with the RTX 3090 demonstrating superior processing speed. However, the Jetson Orin Nano is particularly suited for resource-constrained environments, making it an excellent choice for deployment in agricultural robotics. Qualitative assessments further confirm the model's performance, demonstrating its capability to accurately infer the poses of ripe and partially ripe strawberries, while facing challenges in detecting unripe specimens. This suggests opportunities for future improvements, especially in enhancing detection capabilities for unripe strawberries (if desired) by exploring variations in color. Furthermore, the methodology presented could be adapted easily for other fruits such as apples, peaches, and plums, thereby expanding its applicability and impact in the field of agricultural automation.
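For context, the ADD-S metric used in these evaluations is a standard symmetric-object pose error; a minimal sketch (argument names are ours):

    import numpy as np
    from scipy.spatial import cKDTree

    def add_s(model_pts, R_gt, t_gt, R_pred, t_pred):
        # ADD-S: average distance from each ground-truth-posed model point
        # to its closest predicted-posed model point (handles symmetries).
        gt = model_pts @ R_gt.T + t_gt
        pred = model_pts @ R_pred.T + t_pred
        dists, _ = cKDTree(pred).query(gt, k=1)
        return dists.mean()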
cs.CV cs.RO
Sheaf Cohomology of Linear Predictive Coding Networks
Authors: Jeffrey Seely
Predictive coding (PC) replaces global backpropagation with local optimization over weights and activations. We show that linear PC networks admit a natural formulation as cellular sheaves: the sheaf coboundary maps activations to edge-wise prediction errors, and PC inference is diffusion under the sheaf Laplacian. Sheaf cohomology then characterizes irreducible error patterns that inference cannot remove. We analyze recurrent topologies where feedback loops create internal contradictions, introducing prediction errors unrelated to supervision. Using a Hodge decomposition, we determine when these contradictions cause learning to stall. The sheaf formalism provides both diagnostic tools for identifying problematic network configurations and design principles for effective weight initialization for recurrent PC networks.
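A toy numerical illustration of the stated correspondence (our construction, not the paper's code): build the sheaf coboundary on a small graph and run inference as diffusion under the sheaf Laplacian.

    import numpy as np

    def sheaf_laplacian(edges, maps, n_nodes, d):
        # For an oriented edge e = (u, v) with restriction maps (F_u, F_v),
        # the coboundary sends activations x to edge-wise prediction errors
        # (delta x)_e = F_v x_v - F_u x_u; the sheaf Laplacian is delta^T delta.
        delta = np.zeros((len(edges) * d, n_nodes * d))
        for i, ((u, v), (F_u, F_v)) in enumerate(zip(edges, maps)):
            delta[i*d:(i+1)*d, u*d:(u+1)*d] = -F_u
            delta[i*d:(i+1)*d, v*d:(v+1)*d] = F_v
        return delta.T @ delta

    def pc_inference(x, L, lr=0.1, steps=500):
        # Gradient flow on the total prediction error. Fixed points are
        # harmonic activations; error patterns that survive lie in the
        # sheaf cohomology and cannot be removed by inference.
        for _ in range(steps):
            x = x - lr * (L @ x)
        return x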
cs.LG
FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models
Authors: Yonatan Dukler, Guihong Li, Deval Shah, Vikram Appia, Emad Barsoum
Blocking communication presents a major hurdle to running MoEs efficiently in distributed settings. To address this, we present FarSkip-Collective, which modifies the architecture of modern models to enable overlapping of their computation with communication. Our approach introduces skip connections into the model architecture, and it is unclear a priori whether the modified architecture can remain as capable, especially for large state-of-the-art models and when all of the model layers are modified. We answer this question in the affirmative and fully convert a series of state-of-the-art models ranging from 16B to 109B parameters to enable overlapping of their communication while achieving accuracy on par with their original open-source releases. For example, we convert Llama 4 Scout (109B) via self-distillation and achieve accuracy within 1% of its instruction-tuned release averaged across a wide range of downstream evaluations. In addition to demonstrating the retained accuracy of the large modified models, we realize the benefits of FarSkip-Collective through optimized implementations that explicitly overlap communication with computation, accelerating both training and inference in existing frameworks.
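The enabling pattern, sketched generically in PyTorch (not FarSkip-Collective's implementation; function names are ours): once a skip connection removes the serial dependency, the collective can run asynchronously while the next block computes.

    import torch.distributed as dist

    def overlapped_block(x, expert_out, next_block):
        # Launch the collective without blocking ...
        work = dist.all_reduce(expert_out, async_op=True)
        # ... run computation that, thanks to the skip connection, no longer
        # depends on the communication result ...
        y = next_block(x)
        # ... and synchronize only where the result is finally consumed.
        work.wait()
        return y + expert_out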
cs.LG
Bridging Hidden States in Vision-Language Models
Authors: Benjamin Fein-Ashley, Jacob Fein-Ashley
Vision-Language Models (VLMs) are a new family of models that align image content with natural language. Existing approaches typically fuse either (a) early: by mixing tokens/features inside the encoders, or (b) late: by comparing pooled embeddings. Many methods also tie fusion to an autoregressive decoder. However, the hidden states of both modalities already carry rich, modality-specific structure (spatial layout in vision; syntax and semantics in text), so directly aligning these states is a natural way to match what the two modalities "think". We propose a lightweight fusion module: a few cross-only, bidirectional attention layers placed near the top of both encoders. Each layer projects the vision and text encoder hidden-state sequences into a shared space, attends across modalities, and sends gated residual updates back, with simple stabilizers to improve alignment. The encoders remain non-causal and strong for understanding, while generation stays cleanly decoupled via an optional decoder. Across standard retrieval, VQA, and visual reasoning benchmarks, BRIDGE outperforms comparable VLMs while preserving the bi-encoder efficiency of contrastive models. We make our code publicly available at https://github.com/jfeinashley/BRIDGE.
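A minimal sketch of one such cross-only, bidirectional fusion layer, reconstructed from the description above (not the released BRIDGE code; dimensions and names are ours):

    import torch
    import torch.nn as nn

    class CrossFusionLayer(nn.Module):
        def __init__(self, d_vis, d_txt, d_shared, n_heads=8):
            super().__init__()
            self.pv = nn.Linear(d_vis, d_shared)   # project into shared space
            self.pt = nn.Linear(d_txt, d_shared)
            self.t2v = nn.MultiheadAttention(d_shared, n_heads, batch_first=True)
            self.v2t = nn.MultiheadAttention(d_shared, n_heads, batch_first=True)
            self.ov = nn.Linear(d_shared, d_vis)
            self.ot = nn.Linear(d_shared, d_txt)
            # Gates start at zero so fusion begins as an identity map
            # (one simple stabilizer).
            self.gv = nn.Parameter(torch.zeros(1))
            self.gt = nn.Parameter(torch.zeros(1))

        def forward(self, vis, txt):
            v, t = self.pv(vis), self.pt(txt)
            dv, _ = self.t2v(v, t, t)   # vision queries attend over text
            dt, _ = self.v2t(t, v, v)   # text queries attend over vision
            return (vis + torch.tanh(self.gv) * self.ov(dv),
                    txt + torch.tanh(self.gt) * self.ot(dt))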
cs.CV
Stroke Modeling Enables Vectorized Character Generation with Large Vectorized Glyph Model
Authors: Xinyue Zhang, Haolong Li, Jiawei Ma, Chen Ye
Vectorized glyphs are widely used in poster design, network animation, art display, and various other fields due to their scalability and flexibility. In typography, they are often seen as special sequences composed of ordered strokes. This concept extends to the token sequence prediction abilities of large language models (LLMs), enabling vectorized character generation through stroke modeling. In this paper, we propose a novel Large Vectorized Glyph Model (LVGM) designed to generate vectorized Chinese glyphs by predicting the next stroke. Initially, we encode strokes into discrete latent variables called stroke embeddings. Subsequently, we train our LVGM by fine-tuning DeepSeek LLM to predict the next stroke embedding. Given only a limited set of strokes, it can generate complete characters, semantically elegant words, and even unseen verses in vectorized form. Moreover, we release a new large-scale Chinese SVG dataset containing 907,267 stroke-based samples for dynamically vectorized glyph generation. Experimental results show that our model exhibits scaling behavior with respect to data scale. Our generated vectorized glyphs have been validated by experts and relevant practitioners.
cs.CV
DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding
Authors: Dawei Zhu, Rui Meng, Jiefeng Chen, Sujian Li, Tomas Pfister, Jinsung Yoon
Comprehending long visual documents, where information is distributed across extensive pages of text and visual elements, is a critical but challenging task for modern Vision-Language Models (VLMs). Existing approaches falter on a fundamental challenge: evidence localization. They struggle to retrieve relevant pages and overlook fine-grained details within visual elements, leading to limited performance and model hallucination. To address this, we propose DocLens, a tool-augmented multi-agent framework that effectively "zooms in" on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate a single, reliable answer. Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts. The framework's superiority is particularly evident on vision-centric and unanswerable queries, demonstrating the power of its enhanced localization capabilities.
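In spirit, a sampling-adjudication step can be as simple as a majority vote over sampled candidates (a hedged guess at the mechanism; the paper's adjudicator may differ):

    from collections import Counter

    def adjudicate(sampled_answers):
        # Sample several candidate answers, then return the answer with
        # the strongest agreement as the single reliable output.
        return Counter(sampled_answers).most_common(1)[0][0]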
cs.CV cs.CL
PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning
Authors: Afra Feyza Akyürek, Advait Gosai, Chen Bo Calvin Zhang, Vipul Gupta, Jaehwan Jeong, Anisha Gunjal, Tahseen Rabbani, Maria Mazzone, David Randolph, Mohammad Mahmoudi Meymand, Gurshaan Chattha, Paula Rodriguez, Diego Mares, Pavit Singh, Michael Liu, Subodh Chawla, Pete Cline, Lucy Ogaz, Ernesto Hernandez, Zihao Wang, Pavi Bhatter, Marcos Ayestaran, Bing Liu, Yunzhong He
Frontier model progress is often measured by academic benchmarks, which offer a limited view of performance in real-world professional contexts. Existing evaluations often fail to assess open-ended, economically consequential tasks in high-stakes domains like Legal and Finance, where practical returns are paramount. To address this, we introduce Professional Reasoning Bench (PRBench), a realistic, open-ended, and difficult benchmark of real-world problems in Finance and Law. We open-source its 1,100 expert-authored tasks and 19,356 expert-curated criteria, making it, to our knowledge, the largest public, rubric-based benchmark for both legal and finance domains. We recruited 182 qualified professionals, holding JDs, CFAs, or 6+ years of experience, who contributed tasks inspired by their actual workflows. This process yields significant diversity, with tasks spanning 114 countries and 47 US jurisdictions. Our expert-curated rubrics are validated through a rigorous quality pipeline, including independent expert validation. Subsequent evaluation of 20 leading models reveals substantial room for improvement, with top scores of only 0.39 (Finance) and 0.37 (Legal) on our Hard subsets. We further catalog the associated economic impacts of the prompts and analyze performance using human-annotated rubric categories. Our analysis shows that models with similar overall scores can diverge significantly on specific capabilities. Common failure modes include inaccurate judgments, a lack of process transparency, and incomplete reasoning, highlighting critical gaps in their reliability for professional adoption.
cs.CL cs.CY
When Data is the Algorithm: A Systematic Study and Curation of Preference Optimization Datasets
Authors: Aladin Djuhera, Farhan Ahmed, Swanand Ravindra Kadhe, Syed Zawad, Heiko Ludwig, Holger Boche
Aligning large language models (LLMs) is a central objective of post-training, often achieved through reward modeling and reinforcement learning methods. Among these, direct preference optimization (DPO) has emerged as a widely adopted technique that fine-tunes LLMs on preferred completions over less favorable ones. While most frontier LLMs do not disclose their curated preference pairs, the broader LLM community has released several open-source DPO datasets, including TuluDPO, ORPO, UltraFeedback, HelpSteer, and Code-Preference-Pairs. However, systematic comparisons remain scarce, largely due to the high computational cost and the lack of rich quality annotations, making it difficult to understand how preferences were selected, which task types they span, and how well they reflect human judgment on a per-sample level. In this work, we present the first comprehensive, data-centric analysis of popular open-source DPO corpora. We leverage the Magpie framework to annotate each sample for task category, input quality, and preference reward, a reward-model-based signal that validates the preference order without relying on human annotations. This enables a scalable, fine-grained inspection of preference quality across datasets, revealing structural and qualitative discrepancies in reward margins. Building on these insights, we systematically curate a new DPO mixture, UltraMix, that draws selectively from all five corpora while removing noisy or redundant samples. UltraMix is 30% smaller than the best-performing individual dataset yet exceeds its performance across key benchmarks. We publicly release all annotations, metadata, and our curated mixture to facilitate future research in data-centric preference optimization.
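For readers unfamiliar with DPO, the objective these datasets feed is the standard pairwise loss, sketched here in PyTorch with sequence-level log-probabilities as inputs:

    import torch.nn.functional as F

    def dpo_loss(logp_chosen, logp_rejected,
                 ref_logp_chosen, ref_logp_rejected, beta=0.1):
        # Maximize the policy's implicit reward margin between the preferred
        # and rejected completions, measured relative to a frozen reference.
        margin = ((logp_chosen - ref_logp_chosen)
                  - (logp_rejected - ref_logp_rejected))
        return -F.logsigmoid(beta * margin).mean()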
cs.CL cs.AI
The Riemann Hypothesis Emerges in Dynamical Quantum Phase Transitions
Authors: ShiJie Wei, Yue Zhai, Quanfeng Lu, Wentao Yang, Pan Gao, Chao Wei, Junda Song, Franco Nori, Tao Xin, GuiLu Long
The Riemann Hypothesis (RH), one of the most profound unsolved problems in mathematics, concerns the nontrivial zeros of the Riemann zeta function. Establishing connections between the RH and physical phenomena could offer new perspectives on its physical origin and verification. Here, we establish a direct correspondence between the nontrivial zeros of the zeta function and dynamical quantum phase transitions (DQPTs) in two realizable quantum systems, characterized by the averaged accumulated phase factor and the Loschmidt amplitude, respectively. This precise correspondence reveals that the RH can be viewed as the emergence of DQPTs at a specific temperature. We experimentally demonstrate this correspondence on a five-qubit spin-based system and further propose a universal quantum simulation framework for efficiently realizing both systems with polynomial resources, offering a quantum advantage for numerical verification of the RH. These findings uncover an intrinsic link between nonequilibrium critical dynamics and the RH, positioning quantum computing as a powerful platform for exploring one of mathematics' most enduring conjectures and beyond.
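For orientation, the Loschmidt amplitude and the rate function whose nonanalyticities define DQPTs are (standard definitions, not the paper's specific construction):

    G(t) = \langle \psi_0 \,|\, e^{-iHt} \,|\, \psi_0 \rangle ,
    \qquad
    \lambda(t) = -\lim_{N \to \infty} \frac{1}{N} \ln |G(t)|^2 ,

with DQPTs occurring at the critical times t* where G(t*) = 0.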
quant-ph hep-th math-ph
Algebraic Consistency and Explicit Construction of One-Loop BCJ Numerators of Yang-Mills and Related Theories
Authors: Yi-Jian Du, Chih-Hao Fu, Yihong Wang, Chongsi Xie
We study the algebraic structure of one-loop BCJ numerators in Yang-Mills and related theories. Starting from the propagator matrix that connects colour-ordered integrands to numerators, we identify the consistency conditions that ensure the existence of Jacobi-satisfying numerator solutions and determine the unique construction. The relation between one-loop numerators and forward-limit tree numerators is clarified, together with the additional physical conditions required for a consistent double-copy interpretation.
We propose a two-step expansion strategy for obtaining explicit one-loop numerators. The Yang-Mills integrand is first decomposed into scalar-loop Yang-Mills-scalar building blocks, which are then expanded into bi-adjoint scalar integrands. We derive explicit results for up to three external gluons, showing how the kinematic consistency conditions uniquely determine the coefficients in each case. Similar results for Einstein-Yang-Mills and gravity amplitudes are also presented.
hep-th hep-ph
From Synthetic Scenes to Real Performance: Enhancing Spatial Reasoning in VLMs
Authors: Massimo Rizzoli, Simone Alghisi, Seyed Mahed Mousavi, Giuseppe Riccardi
Fine-tuning Vision-Language Models (VLMs) is a common strategy to improve performance following an ad-hoc data collection and annotation of real-world scenes. However, this process is often prone to biases, errors, and distribution imbalance, resulting in overfitting and imbalanced performance. Although a few studies have tried to address this problem by generating synthetic data, they lacked control over distribution bias and annotation quality. To address these challenges, we redesign the fine-tuning process in two ways. First, we control the generation of data and its annotations, ensuring it is free from bias, distribution imbalance, and annotation errors. We automatically construct the dataset by comprehensively sampling objects' attributes, including color, shape, size, and position within the scene. Second, using this annotated dataset, we fine-tune state-of-the-art VLMs and assess performance transferability to real-world data on the absolute position task. We conduct exhaustive evaluations on both synthetic and real-world benchmarks. Our experiments reveal two key findings: 1) fine-tuning on balanced synthetic data yields uniform performance across the visual scene and mitigates common biases; and 2) fine-tuning on synthetic stimuli significantly improves performance on real-world data (COCO), outperforming models fine-tuned in the matched setting.
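Comprehensive attribute sampling of this kind can be sketched as a full Cartesian product, so every combination is equally represented (the attribute values below are illustrative, not the paper's):

    from itertools import product

    colors = ["red", "green", "blue", "yellow"]
    shapes = ["cube", "sphere", "cylinder"]
    sizes = ["small", "large"]
    positions = ["left", "right", "top", "bottom"]

    # Every (color, shape, size, position) combination appears exactly once,
    # so the generated scenes carry no distributional bias by construction.
    scenes = [dict(color=c, shape=s, size=z, position=p)
              for c, s, z, p in product(colors, shapes, sizes, positions)]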
cs.CV cs.CL
Analytic structure of the high-energy gravitational amplitude: multi-H diagrams and classical 5PM logarithms
Authors: Francesco Alessio, Vittorio Del Duca, Riccardo Gonzo, Emanuele Rosi, Ira Z. Rothstein, Michael Saavedra
We investigate the high-energy, small-angle limit of two-body gravitational scattering. Using power counting arguments and dispersion relations in an effective field theory for the Regge regime, we derive the general loop expansion that determines how the leading Regge logarithms and their complex structure arise as a power series in $t/s$. Focusing on the tower of multi-H diagrams that govern the leading logarithmic behavior, we compute the leading double logarithm at four loops (5PM) using both effective field theory methods and the multi-Regge expansion, finding complete agreement. Finally, using the aforementioned dispersion relations, we extract the single logarithmic contribution to the imaginary part of the eikonal phase at 5PM in the Regge limit.
hep-th gr-qc hep-ph
DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding
Authors: Tanveer Hannan, Dimitrios Mallios, Parth Pathak, Faegheh Sardari, Thomas Seidl, Gedas Bertasius, Mohsen Fayyaz, Sunando Sengupta
Large Vision-Language Models (LVLMs) have demonstrated strong multimodal reasoning capabilities on long and complex documents. However, their high memory footprint makes them impractical for deployment on resource-constrained edge devices. We present DocSLM, an efficient Small Vision-Language Model designed for long-document understanding under constrained memory resources. DocSLM incorporates a Hierarchical Multimodal Compressor that jointly encodes visual, textual, and layout information from each page into a fixed-length sequence, greatly reducing memory consumption while preserving both local and global semantics. To enable scalable processing over arbitrarily long inputs, we introduce a Streaming Abstention mechanism that operates on document segments sequentially and filters low-confidence responses using an entropy-based uncertainty calibrator. Across multiple long multimodal document benchmarks, DocSLM matches or surpasses state-of-the-art methods while using 82% fewer visual tokens, 75% fewer parameters, and 71% lower latency, delivering reliable multimodal document understanding on lightweight edge devices. Code is available in the supplementary material.
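An entropy-based abstention check of the kind described can be sketched as follows (our illustration; the calibrator's details are not specified in the abstract):

    import torch

    def should_abstain(logits, tau):
        # Predictive entropy of the answer distribution for one segment;
        # responses above the calibrated threshold tau are filtered out.
        p = torch.softmax(logits, dim=-1)
        entropy = -(p * torch.log(p.clamp_min(1e-12))).sum(dim=-1)
        return entropy.mean().item() > tau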
cs.CV
iMAD: Intelligent Multi-Agent Debate for Efficient and Accurate LLM Inference
Authors: Wei Fan, JinYi Yoon, Bo Ji
Large Language Model (LLM) agent systems have advanced rapidly, driven by their strong generalization in zero-shot settings. To further enhance reasoning and accuracy on complex tasks, Multi-Agent Debate (MAD) has emerged as a promising framework that engages multiple LLM agents in structured debates to encourage diverse reasoning. However, triggering MAD for every query is inefficient, as it incurs substantial computational (token) cost and may even degrade accuracy by overturning correct single-agent answers. To address these limitations, we propose intelligent Multi-Agent Debate (iMAD), a token-efficient framework that selectively triggers MAD only when it is likely to be beneficial (i.e., correcting an initially wrong answer). To achieve this goal, iMAD learns generalizable model behaviors to make accurate debate decisions. Specifically, iMAD first prompts a single agent to produce a structured self-critique response, from which we extract 41 interpretable linguistic and semantic features capturing hesitation cues. Then, iMAD uses a lightweight debate-decision classifier, trained using our proposed FocusCal loss, to determine whether to trigger MAD, enabling robust debate decisions without test dataset-specific tuning. Through extensive experiments using six (visual) question answering datasets against five competitive baselines, we have shown that iMAD significantly reduces token usage (by up to 92%) while also improving final answer accuracy (by up to 13.5%).
cs.CL cs.AI cs.MA
Two-loop all-plus helicity amplitudes for self-dual Higgs boson with gluons via unitarity cut constraints
Authors: Simon Badger, Christian Biello, Colomba Brancaccio, Federico Ripani
We present the two-loop amplitudes for a self-dual Higgs boson with up to four positive helicity gluons in the heavy top-quark limit. Because the tree amplitudes in the all-plus sector vanish, we can construct simple representations of the polylogarithmic parts of the two-loop amplitudes using four-dimensional unitarity cuts into rational one-loop and tree amplitudes. The remaining rational function ambiguity is extracted from a tensor integral reduction over finite fields. The final expressions are presented using polylogarithms up to weight two and compact rational functions of spinor-helicity products.
hep-ph hep-th
Effects of Early-Universe Inhomogeneity on Bubble Formation: Primordial Black Holes as an Extreme Case
Authors: Yijie Chang, Shihang Tang, Haowen Deng, Yefeng Wang, Ran Ding, Fa Peng Huang
Our early Universe is not perfectly homogeneous: it may contain inhomogeneous sources that distort the local spacetime and modify the bubble nucleation rate. Taking the primordial black hole as an extreme example, we investigate the bubble nucleation rate of a first-order phase transition in the vicinity of primordial black holes or other primordial gravitational sources. Our analysis reveals that primordial black holes can reduce the effective action and thereby modify the nucleation rate through their gravitational effects, potentially altering the dynamics of the phase transition in the early universe. Because the gravitational effects of primordial black holes or other inhomogeneous sources can lead to the nucleation of non-spherically symmetric bubbles, this mechanism could also produce new gravitational wave signals.
hep-ph
Highly Cited Papers
Deep Residual Learning for Image Recognition
Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.

The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
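The core idea in code: a minimal PyTorch rendition of the basic residual block for the equal-dimension case (a sketch, not the paper's reference implementation):

    import torch.nn as nn

    class BasicBlock(nn.Module):
        # Learn the residual F(x) and add the identity shortcut, so the
        # block fits H(x) = F(x) + x instead of an unreferenced H(x).
        def __init__(self, channels):
            super().__init__()
            self.f = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
            )
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(self.f(x) + x)   # identity shortcut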
cs.CV
Attention Is All You Need
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
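The Transformer's central operation, scaled dot-product attention, in a few lines (following the paper's Eq. 1):

    import math
    import torch

    def attention(q, k, v, mask=None):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v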
cs.CL cs.LG
U-Net: Convolutional Networks for Biomedical Image Segmentation
Authors: Olaf Ronneberger, Philipp Fischer, Thomas Brox
There is broad consensus that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.
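One expanding-path step, sketched in PyTorch (our paraphrase of the architecture, not the released Caffe code):

    import torch

    def up_step(dec_feat, enc_feat, upconv, conv_block):
        # Upsample the decoder features, concatenate the matching
        # contracting-path features (the skip connection), then convolve:
        # localization from the decoder, context from the encoder.
        x = upconv(dec_feat)                  # e.g. nn.ConvTranspose2d(...)
        x = torch.cat([enc_feat, x], dim=1)   # channel-wise concatenation
        return conv_block(x)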
cs.CV
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Authors: Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features---using the recently popular terminology of neural networks with 'attention' mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.
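The RPN's sliding head in sketch form (a simplified PyTorch rendition; the paper predicts 2k softmax scores per position where we use k logits):

    import torch.nn as nn

    class RPNHead(nn.Module):
        def __init__(self, in_ch, k=9):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, in_ch, 3, padding=1)  # sliding window
            self.cls = nn.Conv2d(in_ch, k, 1)      # objectness per anchor
            self.reg = nn.Conv2d(in_ch, 4 * k, 1)  # box deltas per anchor

        def forward(self, feat):
            h = self.conv(feat).relu()
            return self.cls(h), self.reg(h)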
cs.CV
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Authors: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
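The patch-to-token step that gives the paper its title, in the standard strided-convolution form (a sketch of the usual implementation):

    import torch.nn as nn

    class PatchEmbed(nn.Module):
        # Split the image into P x P patches and linearly project each patch
        # to a token; a conv with kernel = stride = P does both at once.
        def __init__(self, in_ch=3, dim=768, patch=16):
            super().__init__()
            self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

        def forward(self, x):                    # (B, C, H, W)
            x = self.proj(x)                     # (B, dim, H/P, W/P)
            return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)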
cs.CV cs.AI cs.LG
Microsoft COCO: Common Objects in Context
Authors: Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
cs.CV
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Authors: Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala
Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it provides an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs.

In this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. We emphasize that every aspect of PyTorch is a regular Python program under the full control of its user. We also explain how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.

We demonstrate the efficiency of individual subsystems, as well as the overall speed of PyTorch on several common benchmarks.
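The imperative, define-by-run style in miniature (a toy regression loop, not from the paper):

    import torch

    model = torch.nn.Linear(10, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(100):                       # plain Python control flow
        x, y = torch.randn(32, 10), torch.randn(32, 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()                           # tape-based autograd
        opt.step()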
cs.LG cs.MS stat.ML
Mask R-CNN
Authors: Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick
We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code has been made available at: https://github.com/facebookresearch/Detectron
cs.CV
GPT-4 Technical Report
Authors: OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O'Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, Barret Zoph
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
cs.CL cs.AI
Caffe: Convolutional Architecture for Fast Feature Embedding
Authors: Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, Trevor Darrell
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (≈2.5 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
cs.CV cs.LG cs.NE
Continuous control with deep reinforcement learning
Authors: Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra
We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
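The actor update rests on the deterministic policy gradient: with critic Q_phi and deterministic policy mu_theta,

    \nabla_{\theta} J \;\approx\; \mathbb{E}_{s}\!\left[
        \nabla_{a} Q_{\phi}(s, a)\big|_{a=\mu_{\theta}(s)}\;
        \nabla_{\theta}\, \mu_{\theta}(s) \right].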
cs.LG stat.ML
Llama 2: Open Foundation and Fine-Tuned Chat Models
Authors: Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.
cs.CL cs.AI
Convolutional Neural Networks for Sentence Classification
Authors: Yoon Kim
We report on a series of experiments with convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. We additionally propose a simple modification to the architecture to allow for the use of both task-specific and static vectors. The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification.
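The architecture in sketch form (filter windows of 3, 4, 5 with 100 feature maps each, as in the paper; a simplified single-channel rendition):

    import torch
    import torch.nn as nn

    class TextCNN(nn.Module):
        def __init__(self, emb_dim=300, n_filters=100, widths=(3, 4, 5), n_classes=2):
            super().__init__()
            self.convs = nn.ModuleList(
                nn.Conv1d(emb_dim, n_filters, w) for w in widths)
            self.fc = nn.Linear(n_filters * len(widths), n_classes)

        def forward(self, x):             # x: (B, seq_len, emb_dim) word vectors
            x = x.transpose(1, 2)         # Conv1d expects channels first
            feats = [conv(x).relu().max(dim=2).values for conv in self.convs]
            return self.fc(torch.cat(feats, dim=1))   # max-over-time pooling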
cs.CL cs.NE
FaceNet: A Unified Embedding for Face Recognition and Clustering
Authors: Florian Schroff, Dmitry Kalenichenko, James Philbin
Despite significant recent advances in the field of face recognition, implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.

Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of our approach is much greater representational efficiency: we achieve state-of-the-art face recognition performance using only 128 bytes per face.

On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%. On YouTube Faces DB it achieves 95.12%. Our system cuts the error rate in comparison to the best published result by 30% on both datasets.

We also introduce the concept of harmonic embeddings, and a harmonic triplet loss, which describe different versions of face embeddings (produced by different networks) that are compatible with each other and allow for direct comparison between each other.
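The triplet loss at the heart of the method, with margin alpha and squared Euclidean distances (a minimal sketch following the paper's formulation):

    import torch.nn.functional as F

    def triplet_loss(anchor, positive, negative, alpha=0.2):
        # Pull matching faces together and push non-matching faces apart
        # until the negative is at least alpha farther than the positive.
        d_pos = (anchor - positive).pow(2).sum(dim=-1)
        d_neg = (anchor - negative).pow(2).sum(dim=-1)
        return F.relu(d_pos - d_neg + alpha).mean()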
cs.CV
Finding and evaluating community structure in networks
Authors: M. E. J. Newman, M. Girvan
We propose and study a set of algorithms for discovering community structure in networks -- natural divisions of network nodes into densely connected subgroups. Our algorithms all share two definitive features: first, they involve iterative removal of edges from the network to split it into communities, the edges removed being identified using one of a number of possible "betweenness" measures, and second, these measures are, crucially, recalculated after each removal. We also propose a measure for the strength of the community structure found by our algorithms, which gives us an objective metric for choosing the number of communities into which a network should be divided. We demonstrate that our algorithms are highly effective at discovering community structure in both computer-generated and real-world network data, and show how they can be used to shed light on the sometimes dauntingly complex structure of networked systems.
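The algorithm is available off the shelf in NetworkX, which makes the edge-removal procedure easy to try:

    import networkx as nx
    from networkx.algorithms.community import girvan_newman

    G = nx.karate_club_graph()
    # Repeatedly remove the highest edge-betweenness edge, recomputing
    # betweenness after every removal; each iteration yields the next
    # (finer) split into communities.
    first_split = next(girvan_newman(G))
    print([sorted(c) for c in first_split])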
cond-mat.stat-mech cond-mat.dis-nn
YOLOv4: Optimal Speed and Accuracy of Object Detection
Authors: Alexey Bochkovskiy, Chien-Yao Wang, Hong-Yuan Mark Liao
There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a realtime speed of ~65 FPS on Tesla V100. Source code is at https://github.com/AlexeyAB/darknet
cs.CV eess.IV
Representation Learning: A Review and New Perspectives
Authors: Yoshua Bengio, Aaron Courville, Pascal Vincent
The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.
cs.LG
Playing Atari with Deep Reinforcement Learning
Authors: Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
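The learning signal in sketch form: a one-step TD target regressed by the Q-network (a minimal PyTorch rendition; the paper additionally uses experience replay and a frame-stacking preprocessor):

    import torch
    import torch.nn.functional as F

    def q_loss(q_net, s, a, r, s_next, done, gamma=0.99):
        # y = r + gamma * max_a' Q(s', a'), with y = r at terminal states
        # (done is a 0/1 float tensor).
        q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            y = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
        return F.mse_loss(q, y)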
cs.LG
Pyramid Scene Parsing Network
Authors: Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya Jia
Scene parsing is challenging for unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective in producing good-quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction tasks. The proposed approach achieves state-of-the-art performance on various datasets. It came first in the ImageNet scene parsing challenge 2016, the PASCAL VOC 2012 benchmark, and the Cityscapes benchmark. A single PSPNet yields a new record of 85.4% mIoU accuracy on PASCAL VOC 2012 and 80.2% accuracy on Cityscapes.
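The pyramid pooling module in sketch form (bin sizes 1, 2, 3, 6 as in the paper; a simplified rendition omitting the BN/ReLU details):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PyramidPooling(nn.Module):
        # Pool the feature map into 1x1, 2x2, 3x3 and 6x6 bins, reduce
        # channels, upsample back, and concatenate as a global prior.
        def __init__(self, in_ch, bins=(1, 2, 3, 6)):
            super().__init__()
            self.stages = nn.ModuleList(
                nn.Sequential(nn.AdaptiveAvgPool2d(b),
                              nn.Conv2d(in_ch, in_ch // len(bins), 1, bias=False))
                for b in bins)

        def forward(self, x):
            h, w = x.shape[2:]
            pooled = [F.interpolate(stage(x), size=(h, w), mode="bilinear",
                                    align_corners=False)
                      for stage in self.stages]
            return torch.cat([x] + pooled, dim=1)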
cs.CV
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Authors: Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou
We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
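A chain-of-thought exemplar in practice (the first Q/A pair is the paper's running example; the model is prompted to produce intermediate steps before its final answer):

    prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis
    balls. Each can has 3 tennis balls. How many tennis balls does he have now?
    A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis
    balls. 5 + 6 = 11. The answer is 11.

    Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6
    more, how many apples do they have?
    A:"""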
cs.CL cs.AI
