ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning

CPAL 2026 (Oral)
Westlake University
*Corresponding author: wanghuan [at] westlake [dot] edu [dot] cn
ENCODE LAB

Abstract

Pruning is widely recognized as an effective way to reduce the parameter count of large language models (LLMs), enabling more efficient deployment and inference. One classic and prominent line of one-shot LLM pruning leverages second-order information (i.e., the Hessian), represented by the pioneering work SparseGPT. However, the predefined left-to-right pruning order in SparseGPT leads to suboptimal performance when the weights exhibit columnar patterns. This paper studies the effect of pruning order under the SparseGPT framework. Our analyses lead us to propose ROSE, a reordered SparseGPT method that prioritizes weight columns with larger potential pruning error so that they are processed earlier. ROSE first performs pre-pruning to identify weights that are highly likely to be pruned, and estimates both column- and block-level pruning losses. It then applies two-level reordering: columns within each block are sorted in descending order of column loss, and blocks are sorted in descending order of block loss. We further introduce the relative range of block loss as a metric to identify columnar layers, enabling adaptive reordering across the entire model. Extensive empirical results on prevalent LLMs (LLaMA2-7B/13B/70B, LLaMA3-8B, Mistral-7B) demonstrate that ROSE surpasses the original SparseGPT and other counterpart pruning methods.
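The abstract does not define the "relative range of block loss" used to flag columnar layers. The sketch below assumes the common statistical definition of relative range, (max − min) / mean, over the per-block losses of a layer; the threshold `tau` is an illustrative placeholder, not a value from the paper.

```python
import numpy as np

def relative_range(block_losses, tau=0.5):
    """Assumed reading of the 'relative range of block loss' metric.

    block_losses: per-block pruning-loss estimates for one layer.
    Returns (relative range, is_columnar), where a large relative range
    indicates that loss is concentrated in a few blocks, i.e. the layer
    exhibits a columnar weight pattern. `tau` is a hypothetical threshold.
    """
    losses = np.asarray(block_losses, dtype=float)
    rr = (losses.max() - losses.min()) / losses.mean()
    return rr, rr > tau
```

Layers flagged this way would be reordered, while near-uniform layers (relative range close to zero) would be left in the original left-to-right order.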

Motivation


(a) Change in reconstruction error of the "self_attn.o_proj" layer in the first Transformer block of LLaMA2-7B during SparseGPT pruning as the number of pruned blocks increases. The sharpest increase appears at a later stage. (b) Weight visualization shows a columnar pattern along the input channel, with one block containing the most concentrated high-magnitude weights. (c) Reconstruction error after reordering: pruning the high-error block earlier yields lower total error.

Overview of our ROSE method


(a) Overview of the difference between SparseGPT and ROSE. In SparseGPT, as pruning progresses, fewer weights remain for compensation; if high-error weights are pruned late, compensation is limited. ROSE reorders columns so that high-error weights are pruned early, preserving more parameters for compensation. (b) Illustration of ROSE on a columnar layer. Given the dense weight W and target sparsity rate p%, we calculate the importance score S and split it into blocks according to the block size. The smallest p% of values from each block are selected to form the loss matrix L. Column loss and block loss are then computed from the loss matrix. Columns within each block are reordered in descending order of column loss, and blocks are reordered in descending order of block loss.
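The two-level reordering described in (b) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the official implementation: the importance score S (e.g. SparseGPT-style scores) is taken as a given matrix, the per-block "smallest p%" selection is implemented with a flat threshold, and even divisibility by the block size is assumed for simplicity.

```python
import numpy as np

def rose_reorder(S, blocksize, p):
    """Sketch of ROSE's two-level column reordering.

    S: importance scores, shape (out_features, in_features).
    blocksize: number of columns per block (assumed to divide in_features).
    p: target sparsity rate in [0, 1].
    Returns a permutation of the column indices of S.
    """
    _, n_in = S.shape
    perm = np.arange(n_in)
    n_blocks = n_in // blocksize
    block_losses, block_perms = [], []
    for b in range(n_blocks):
        cols = slice(b * blocksize, (b + 1) * blocksize)
        Sb = S[:, cols]
        # Pre-pruning: the smallest p% of scores in this block are the
        # weights most likely to be pruned; they form the loss matrix L_b.
        k = int(np.ceil(p * Sb.size))
        thresh = np.partition(Sb, k - 1, axis=None)[k - 1]
        Lb = np.where(Sb <= thresh, Sb, 0.0)
        # Column loss: total likely-pruned importance per column.
        col_loss = Lb.sum(axis=0)
        # Level 1: columns within a block, descending column loss.
        order = np.argsort(-col_loss)
        block_perms.append(perm[cols][order])
        block_losses.append(Lb.sum())
    # Level 2: blocks, descending block loss.
    block_order = np.argsort(-np.asarray(block_losses))
    return np.concatenate([block_perms[i] for i in block_order])
```

Applying the returned permutation to the columns of W before SparseGPT pruning puts high-loss blocks and columns first, so the many still-unpruned weights can absorb their compensation updates.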

Main Results

BibTeX

@inproceedings{su2025rose,
  title={ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning},
  author={Su, Mingluo and Wang, Huan},
  booktitle={CPAL},
  year={2026}
}