[论文翻译]GIFT-SW: 大语言模型中显著权重的高斯噪声注入微调


原文地址:https://arxiv.org/pdf/2408.15300v1


GIFT-SW: Gaussian noise Injected Fine-Tuning of Salient Weights for LLMs

GIFT-SW: 大语言模型中显著权重的高斯噪声注入微调

Abstract

摘要

Parameter Efficient Fine-Tuning (PEFT) methods have gained popularity and democratized the usage of Large Language Models (LLMs). Recent studies have shown that a small subset of weights significantly impacts performance. Based on this observation, we introduce a novel PEFT method, called Gaussian noise Injected Fine-Tuning of Salient Weights (GIFT-SW). Our method updates only salient columns, while injecting Gaussian noise into non-salient ones. To identify these columns, we developed a generalized sensitivity metric that extends and unifies metrics from previous studies. Experiments with LLaMA models demonstrate that GIFT-SW outperforms full fine-tuning and modern PEFT methods under the same computational budget. Moreover, GIFT-SW offers practical advantages for recovering the performance of models subjected to mixed-precision quantization while keeping salient weights in full precision. Code is available in our repository.

参数高效微调 (Parameter Efficient Fine-Tuning, PEFT) 方法日益普及,推动了大语言模型 (Large Language Model, LLM) 的广泛应用。近期研究表明,模型权重中一个小子集对性能具有决定性影响。基于这一发现,我们提出了一种新型PEFT方法——显著权重高斯噪声注入微调 (Gaussian noise Injected Fine-Tuning of Salient Weights, GIFT-SW)。该方法仅更新显著权重列,同时向非显著列注入高斯噪声。为识别这些权重列,我们提出了一种通用敏感度度量标准,该标准扩展并统一了先前研究中的多种度量方法。在LLaMA模型上的实验表明,在相同计算预算下,GIFT-SW性能优于全量微调和现代PEFT方法。此外,GIFT-SW具有显著实用优势:通过保持显著权重全精度,可有效恢复混合精度量化模型的性能。代码已开源。

1 Introduction

1 引言

Modern LLMs demonstrate remarkable generalization capabilities on unseen tasks. However, fine-tuning remains crucial to enhance these models' performance or to restore the performance after compression techniques like quantization (Dettmers et al., 2024; Moskvoretskii et al., 2024), pruning (Frantar and Alistarh, 2023; Kim et al., 2023), or tensor decomposition have been applied. Given the large scale of modern LLMs, fine-tuning all parameters can be computationally and memory-intensive. To overcome this challenge, Parameter Efficient Fine-Tuning schemes have been developed, aimed at improving model performance while using limited computational and memory resources.

现代大语言模型在未见任务上展现出卓越的泛化能力。然而,微调对于提升模型性能或在应用量化 (Dettmers et al., 2024; Moskvoretskii et al., 2024)、剪枝 (Frantar and Alistarh, 2023; Kim et al., 2023) 或张量分解等压缩技术后恢复性能仍然至关重要。鉴于现代大语言模型的庞大规模,微调所有参数会带来极高的计算和内存开销。为应对这一挑战,参数高效微调方案应运而生,旨在使用有限的计算和内存资源提升模型性能。

To date, PEFT methods have not matched the accuracy of full fine-tuning (Nikdan et al., 2024), highlighting the need for new approaches that can close this gap while still minimizing resource use. Additionally, most PEFT methods involve adding extra parameters, which increases computational demands.

迄今为止,PEFT方法尚未达到全参数微调的精度 (Nikdan et al., 2024) ,这凸显了需要开发既能缩小这一差距又能最大限度减少资源消耗的新方法。此外,大多数PEFT方法需要添加额外参数,从而增加了计算需求。

To address those issues and enhance the performance of efficiently trained LLMs, we introduce a novel PEFT method, GIFT-SW. This approach focuses on updating a small subset of salient weights while injecting noise into the non-salient weights. The development of this method is grounded in observations from previous studies and the related questions they raise, which we aim to answer:

为了解决这些问题并提升高效训练大语言模型的性能,我们引入了一种新颖的参数高效微调(PEFT)方法GIFT-SW。该方法专注于更新一小部分显著权重,同时向非显著权重注入噪声。该方法的开发基于先前研究的观察结果及其引发的相关问题,我们旨在回答这些问题:

Previous research has shown that there is a small subset of salient weights which can significantly affect the effectiveness of post-training quantization (PTQ) (Dettmers et al., 2022, 2023; Kim et al., 2023) and pruning techniques (Yin et al., 2023; Frantar and Alistarh, 2023; Sun et al., 2023). Moreover, Gurnee et al. identified a group of "universal neurons" that are critical to a model's functionality, emphasizing the importance of selecting and updating these salient weights. Question 1: Is updating a small subset of salient weights sufficient to adjust the model?

先前研究表明,存在一小部分显著权重(salient weights)能极大影响训练后量化(PTQ) (Dettmers等, 2022, 2023; Kim等, 2023)和剪枝技术(Yin等, 2023; Frantar与Alistarh, 2023; Sun等, 2023)的效果。此外,Gurnee等发现了一组对模型功能至关重要的"通用神经元",凸显了选择并更新这些显著权重的重要性。问题1:仅更新一小部分显著权重是否足以调整模型?

Recent studies have demonstrated that Perturbed Gradient Descent (PGD), with noise injections applied both before and after the gradient step, can stabilize convergence and help prevent overfitting (Poole et al., 2014; Zhu et al., 2018; Jin et al., 2021). Question 2: Does injecting noise help convergence?

近期研究表明,在梯度步骤前后施加噪声扰动的扰动梯度下降(PGD)能够稳定收敛并防止过拟合 (Poole et al., 2014; Zhu et al., 2018; Jin et al., 2021)。问题2:注入噪声是否有助于收敛?

PGD is commonly employed to enhance model robustness by approximating the quantization process (Shvetsov et al., 2022; Shin et al., 2023; Défossez et al., 2021). This increased robustness can aid in maintaining the quality of the quantized model. Question 3: Does injecting noise help robustness?

PGD通常用于通过近似量化过程来增强模型鲁棒性 (Shvetsov et al., 2022; Shin et al., 2023; Défossez et al., 2021)。这种增强的鲁棒性有助于保持量化模型的质量。问题3:注入噪声是否有助于提升鲁棒性?

Selecting salient weights is a significant challenge, particularly in quantization and pruning, and it is central to our method. In our paper, we derive a general formulation for all previously established saliency metrics and present experiments to compare their effectiveness.

选择显著权重是一个重大挑战,尤其是在量化(quantization)和剪枝(pruning)过程中,这也是我们方法的核心。在论文中,我们推导了所有现有显著性指标(saliency metrics)的通用公式,并通过实验比较了它们的有效性。


Figure 1: Mean performance of different fine-tuning approaches for LLaMA models with scaling data budget. GIFT-SW shows superior performance with nearly all data budgets, also being as stable as full fine-tuning.

图 1: 不同微调方法在LLaMA模型上随数据预算变化的平均性能表现。GIFT-SW在几乎所有数据预算下都展现出更优性能,同时保持与全参数微调相当的稳定性。

The main contributions of our work can be summarized as follows:

我们工作的主要贡献可总结如下:

• We introduce a novel PEFT method for pretrained and quantized LLMs, called GIFT-SW. It is designed to fine-tune weights in salient columns while injecting Gaussian noise into non-salient weights, which are kept frozen during training.

• 我们提出了一种针对预训练和量化大语言模型 (LLM) 的新型参数高效微调 (PEFT) 方法 GIFT-SW。该方法通过在训练期间冻结非显著权重并向其注入高斯噪声,同时微调显著列中的权重来实现高效调优。

• We generalize sensitivity metrics for identifying salient columns in pre-trained LLMs. We compare various novel and existing instances of the proposed general form and identify a new metric, which on average outperforms the metrics previously studied in the literature (Xiao et al., 2023; Lee et al., 2024).

• 我们推广了用于识别预训练大语言模型中显著列的敏感性指标。通过比较所提出通用形式的多种新颖及现有实例,我们发现了一种新指标,该指标平均表现优于文献中先前研究的指标 (Xiao et al., 2023; Lee et al., 2024)。

• Experiments demonstrate that GIFT-SW outperforms modern PEFT methods and full fine-tuning baselines across most zero-shot tasks. GIFT-SW for LLaMA models achieves comparable accuracy to the corresponding state-of-the-art TÜLU2 models, despite fine-tuning only $3\%$ of the parameters and utilizing ten times less computational resources.

• 实验表明,在大多数零样本任务中,GIFT-SW 的表现优于现代参数高效微调 (PEFT) 方法和全微调基线。对于 LLaMA 模型,GIFT-SW 仅微调了 $3\%$ 的参数,并使用了少十倍的计算资源,但达到了与最先进的 TÜLU2 模型相当的准确率。

• We demonstrate that GIFT-SW is more stable with respect to the size of the training set compared with low-rank adapters.

• 我们证明,与低秩适配器 (low-rank adapters) 相比,GIFT-SW 在训练集规模变化时表现更稳定。

2 Related Work

2 相关工作

2.1 Parameter efficient fine-tuning of LLM

2.1 大语言模型 (Large Language Model) 的参数高效微调

One of the most popular and efficient methods is LoRA (Hu et al., 2021), which trains low-rank adapters. Recent modifications to the method aim to improve the initialization of the adapters (Liu et al., 2024) and enhance the low-rank representation of pre-trained weights by adding sparse adapters (Nikdan et al., 2024). Another improvement of the learning capacity of LoRA is given by DoRA (Liu et al., 2024), which fine-tunes magnitude and direction components of the pre-trained weights. This method achieves considerable performance across various fine-tuning tasks.

目前最高效的流行方法之一是LoRA (Hu et al., 2021),该方法通过训练低秩适配器实现。近期改进聚焦于优化适配器初始化策略 (Liu et al., 2024),并通过添加稀疏适配器增强预训练权重的低秩表征能力 (Nikdan et al., 2024)。针对LoRA学习能力的另一项改进是DoRA (Liu et al., 2024),该方法对预训练权重的幅值和方向分量分别微调,在各类微调任务中展现出显著性能提升。
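To make the adapter idea concrete, below is a minimal sketch of a LoRA-style linear layer in PyTorch; the class name `LoRALinear`, the rank `r`, and the scaling `alpha / r` are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                       # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)    # down-projection
        self.B = nn.Parameter(torch.zeros(base.out_features, r))          # up-projection, zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + scale * B A x; only A and B receive gradients
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```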

2.2 Salient Weights in LLMs

2.2 大语言模型中的显著权重

The identification of salient weights is one of the main problems in weight pruning. Recently, several approaches have been proposed to identify such weights in LLMs, including SparseGPT (Frantar and Alistarh, 2023), Wanda (Sun et al., 2023), and OWL (Yin et al., 2023).

权重剪枝中的核心问题之一是识别显著权重 (salient weights)。近期已有多种方法被提出用于在大语言模型中识别这类权重,包括 SparseGPT (Frantar and Alistarh, 2023)、Wanda (Sun et al., 2023) 和 OWL (Yin et al., 2023)。

Dettmers et al. (2022) demonstrated that a small subset of outliers in input activations has a substantial impact on LLM performance, highlighting the relationship between the activation outliers and the salient weights. Many subsequent Post-Training Quantization (PTQ) methods used similar or identical pruning metrics to identify these salient weights (Dettmers et al., 2023; Xiao et al., 2023; Lee et al., 2024).

Dettmers等人(2022)的研究表明,输入激活中的少量离群值对大语言模型性能具有显著影响,揭示了激活离群值与显著权重之间的关系。后续许多训练后量化(PTQ)方法采用类似或相同的剪枝指标来识别这些显著权重 (Dettmers等人, 2023; Xiao等人, 2023; Lee等人, 2024)。

In our work, we generalize the identification metrics for salient weights by considering metrics from both the literature on pruning and quantization.

在我们的工作中,我们通过综合考虑剪枝和量化文献中的指标,推广了显著权重的识别指标。

2.3 Structured and Non-structured Salient Weights selection

2.3 结构化和非结构化显著权重选择

Since salient weights account for only a few percent of all the weights, a straightforward approach to preserve them would be to store unstructured salient weights in a sparse matrix. Dettmers et al. (2023) demonstrated that this approach is computationally reasonable and leads to performance improvement. On the other hand, Xiao et al. (2023) revealed that outliers in activations are confined to a small fraction of weight channels, which was incorporated into SmoothQuant, where outlier columns are identified using a small calibration dataset. This concept is further developed in QUIK (Ashkboos et al., 2023), where outlier columns are retained in full precision, while other columns are quantized using GPTQ (Frantar et al., 2022). A similar procedure is used in OWQ (Lee et al., 2024), but with an OBD-based metric (LeCun et al., 1989).

由于显著权重仅占所有权重的很小比例,直接保留它们的方法是将非结构化的显著权重存储在稀疏矩阵中。Dettmers et al. (2023) 证明这种方法在计算上是合理的,并能带来性能提升。另一方面,Xiao et al. (2023) 发现激活中的异常值仅集中在少数权重通道中,这一发现被应用于 SmoothQuant 方法,其中通过小型校准数据集识别异常列。QUIK (Ashkboos et al., 2023) 进一步发展了这一概念,将异常列保留为全精度,而其他列则使用 GPTQ (Frantar et al., 2022) 进行量化。OWQ (Lee et al., 2024) 采用了类似流程,但使用了基于 OBD (LeCun et al., 1989) 的度量标准。

Since the literature does not establish whether structured or unstructured salient weight selection brings better results, and motivated by the computational efficiency mentioned in (Ashkboos et al., 2023), in our work we follow the second line of work and use structured column-wise salient weight selection.

由于文献中缺乏关于结构化与非结构化显著权重选择哪种方法效果更好的结论,并受到(Ashkboos et al., 2023)中计算效率观点的启发,我们在工作中采用了结构化列式显著权重选择的第二种研究路径。

2.4 Noise Injections

2.4 噪声注入

In this section, we briefly describe Gaussian Noise Injections (GNI) and their benefits. Then, we show that the approximation of quantization noise and GNI are identical. Therefore, GNI can also benefit further model quantization. Therefore, to examine our third question, we sample noise relative to quantization levels, leaving other sampling options for future work.

在本节中,我们简要介绍高斯噪声注入(Gaussian Noise Injections, GNI)及其优势,并证明量化噪声近似与GNI具有等价性。因此,GNI也能为后续模型量化带来增益。为验证第三个问题,我们采用与量化级别相关的噪声采样方案,其余采样方式留待未来研究。

Gaussian Noise Injections (GNI). Perturbed Gradient Descent (PGD) is a family of methods that involve adding or multiplying weights with samples from some random distribution during an optimization procedure. Gaussian noise injection (GNI) after the gradient step helps the model escape saddle points efficiently in non-convex optimization (Jin et al., 2021). However, when Gaussian noise is injected before the gradient step, it helps the model escape from spurious local optima (Zhu et al., 2018).

高斯噪声注入 (GNI)。扰动梯度下降 (PGD) 是一类在优化过程中通过添加或乘以来自某种随机分布的样本来调整权重的方法。在梯度步骤后注入高斯噪声 (GNI) 有助于模型在非凸优化中高效逃离鞍点 (Jin et al., 2021)。然而,当高斯噪声在梯度步骤前注入时,它有助于模型逃离虚假局部最优 (Zhu et al., 2018)。

$$
\begin{aligned}
\theta_{t+1} &\leftarrow \theta_{t}-\tau\left(\nabla f(\theta_{t})+\xi\right) \\
\theta_{t+1} &\leftarrow \theta_{t}-\tau\,\nabla f(\theta_{t}+\xi) \\
\xi &\sim \mathcal{N}(\mu,\sigma^{2})
\end{aligned}
$$
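A minimal sketch of the two perturbed-gradient-descent variants above, assuming a `grad_fn` that returns $\nabla f(\theta)$; the function names, learning rate, and noise scale are illustrative.

```python
import torch

def pgd_step_noise_after(theta, grad_fn, lr=1e-3, sigma=1e-2):
    """theta <- theta - lr * (grad f(theta) + xi): noise added to the gradient."""
    xi = torch.randn_like(theta) * sigma
    return theta - lr * (grad_fn(theta) + xi)

def pgd_step_noise_before(theta, grad_fn, lr=1e-3, sigma=1e-2):
    """theta <- theta - lr * grad f(theta + xi): gradient evaluated at perturbed weights."""
    xi = torch.randn_like(theta) * sigma
    return theta - lr * grad_fn(theta + xi)
```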

Moreover, practical benefits of noise injections are well documented in the literature; they are often discussed as regularization techniques (Bishop, 1995; Srivastava et al., 2014; Camuto et al., 2020), as methods to promote adversarial robustness (Panda and Roy, 2021), and as tools for data augmentation (Moreno-Barea et al., 2018).

此外,噪声注入的实际益处已在文献中得到充分证明,常被视作正则化技术 (Bishop, 1995; Srivastava et al., 2014; Camuto et al., 2020) 、对抗鲁棒性激发方法 (Panda and Roy, 2021) 以及数据增强手段 (Moreno-Barea et al., 2018) 进行讨论。

In our work we use GNI before evaluating the gradient. For this scenario, Orvieto et al. (2023) proposed to add noise to only one layer per training iteration to avoid variance explosion. It was empirically and theoretically demonstrated that GNI serves as a regularization. Liu et al. (2023) study fine-tuning of pre-trained language models with GNI. The authors propose first learning layer-wise variance parameters for the noise distributions and then fine-tuning the model by adding noise to all the weights. The obtained results showed that the approach is superior to independent layer-wise noise injections.

在我们的工作中,我们在评估梯度前使用了GNI。针对这一场景,Orvieto等人 (2023) 提出在训练迭代中仅对单层添加噪声以避免方差爆炸。实验与理论均证明GNI能起到正则化作用。Liu等人 (2023) 研究了基于GNI的预训练语言模型微调方法,作者提出先学习噪声分布的逐层方差参数,再通过向所有权重添加噪声进行模型微调。结果表明该方法优于独立的逐层噪声注入策略。

Quantization Noise Injections (QNI). Quantization aware training (QAT) of networks is applied to mitigate their accuracy degradation after quantization. However, uniform quantization $Q$ is a non-differentiable operation. For simplicity, it can be expressed as a composition of scaling and rounding operations, $Q(\mathbf{W})=\Delta\left[\frac{\mathbf{W}}{\Delta}\right]$. In terms of QAT, the operation $Q$ can be efficiently approximated with quantization noise $\pmb{\Omega}$ such that $\pmb{\Omega}=Q(\mathbf{W})-\mathbf{W}$ (Défossez et al., 2021; Shvetsov et al., 2022; Shin et al., 2023). Thus, training models with QNI is exactly the same as employing PGD with GNI before evaluating the gradient.

量化噪声注入(QNI)。量化感知训练(QAT)被应用于缓解网络量化后的精度下降问题。但均匀量化 $Q$ 是不可微操作,为简化表达可将其分解为缩放和取整运算 $Q(\mathbf{W})=\Delta\left[\frac{\mathbf{W}}{\Delta}\right]$。在QAT中,操作 $Q$ 可用量化噪声 $\pmb{\Omega}$ 有效近似,即 $\pmb{\Omega}=Q(\mathbf{W})-\mathbf{W}$ (Défossez等, 2021; Shvetsov等, 2022; Shin等, 2023)。因此,采用QNI训练模型等同于在梯度计算前执行带GNI的PGD。

Under some assumptions the noise $\pmb{\Omega}$ induced by uniform quantization can often be modeled by an additive noise that is uniformly distributed, uncorrelated with the input signal, and has a white spectrum (Widrow et al., 1996). However, in practice, the conditions are often not satisfied. Therefore employing a Gaussian distribution $\mathcal{N}(\mu,\sigma^{2})$ for $\pmb{\Omega}$ typically yields improved outcomes (Défossez et al., 2021; Shvetsov et al., 2022).

在某些假设条件下,均匀量化引入的噪声 $\pmb{\Omega}$ 通常可建模为与输入信号不相关、具有均匀分布和白频谱特性的加性噪声 (Widrow et al., 1996) 。但实际场景中这些条件往往难以满足,因此采用高斯分布 $\mathcal{N}(\mu,\sigma^{2})$ 描述 $\pmb{\Omega}$ 通常能获得更优结果 (Défossez et al., 2021; Shvetsov et al., 2022) 。
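The sketch below contrasts the exact quantization noise $\pmb{\Omega}=Q(\mathbf{W})-\mathbf{W}$ with a Gaussian surrogate whose scale is tied to the step size $\Delta$; the factor $\Delta/2$ mirrors the scaling used later in Equation 5 and is an assumption here.

```python
import torch

def exact_quantization_noise(W: torch.Tensor, delta: float) -> torch.Tensor:
    """Omega = Q(W) - W with Q(W) = delta * round(W / delta) (uniform quantization)."""
    return delta * torch.round(W / delta) - W

def gaussian_quantization_noise(W: torch.Tensor, delta: float) -> torch.Tensor:
    """Gaussian surrogate for the quantization noise, sampled anew at every forward pass."""
    return torch.randn_like(W) * (delta / 2.0)
```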

Although GNI is beneficial for model training there is no clear answer on how to choose noise parameters. Liu et al. (2023) determine noise parameters such that KL divergence between original and perturbed weights is minimized. Shin et al. (2023) identify parameters of the Gaussian distribution to resemble the weight distribution with a scale proportional to quantization step.

虽然GNI对模型训练有益,但如何选择噪声参数尚无明确答案。Liu等人(2023)通过最小化原始权重与扰动权重间的KL散度来确定噪声参数。Shin等人(2023)则通过识别高斯分布参数来模拟权重分布,其尺度与量化步长成比例。

2.5 Straight Through Estimator

2.5 直通估计器 (Straight Through Estimator)

The most popular QAT technique incorporating the quantization operation into the training process is Straight Through Estimation (STE) (Bengio et al., 2013; Shang et al., 2023), which basically reparameterizes gradients. However, Défossez et al. (2021) demonstrated that STE has some disadvantages compared with QNI, as STE is biased and may cause weight oscillation between quantization steps. Shin et al. (2023) demonstrated that pre-training models for subsequent quantization with QNI instead of STE results in better performance. More technical details are provided in Section C.

最流行的将量化操作融入训练过程的QAT技术是直通估计 (Straight Through Estimation, STE) (Bengio et al., 2013; Shang et al., 2023),其本质是对梯度进行重参数化。然而Défossez等人 (2021) 指出,相比QNI,STE存在偏置问题且可能导致量化步长间的权重振荡。Shin等人 (2023) 的研究表明,使用QNI而非STE对模型进行量化前预训练能获得更优性能。更多技术细节详见附录C。
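For comparison with QNI, here is a minimal straight-through estimator for the rounding step, written in PyTorch; `RoundSTE` and `fake_quantize_ste` are illustrative names rather than an existing library API.

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Rounding with a straight-through gradient: forward rounds, backward passes the gradient unchanged."""
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output      # identity gradient through the non-differentiable rounding

def fake_quantize_ste(W: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    # Q(W) = delta * round(W / delta), made differentiable w.r.t. W via the STE
    return delta * RoundSTE.apply(W / delta)
```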

3 Method

3 方法

GIFT-SW consists of the following steps: 1) identify salient columns in each weight matrix using a sensitivity metric computed on a calibration dataset; 2) freeze the non-salient weights and, at every forward pass, inject Gaussian noise scaled by the quantization step into them; 3) fine-tune only the weights in the salient columns.

GIFT-SW包含以下步骤:1) 使用在校准数据集上计算的敏感度指标识别每个权重矩阵中的显著列;2) 冻结非显著权重,并在每次前向传播时向其注入按量化步长缩放的高斯噪声;3) 仅微调显著列中的权重。

Thus, the method depends on two main design choices: 1) how to choose salient columns and 2) the parameters of noise injections. We cover the choice of metrics in Section 3.1. Noise injection details are provided in Section 3.2.

因此,该方法依赖于两个主要设计选择:1) 如何选择显著列;2) 噪声注入的参数。我们将在第3.1节讨论指标的选择,噪声注入的细节将在第3.2节提供。
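The following sketch shows how these two design choices come together in one training step for a single weight matrix; it assumes PyTorch, and the helper arguments (`salient_cols`, `delta_rows`, `loss_fn`) are hypothetical stand-ins for the quantities defined in Sections 3.1 and 3.2.

```python
import torch

def gift_sw_step(weight, salient_cols, delta_rows, loss_fn, optimizer):
    """One illustrative GIFT-SW step: perturb non-salient columns with quantization-scaled
    Gaussian noise, run the forward/backward pass, then update only the salient columns."""
    non_salient = torch.ones(weight.shape[1], dtype=torch.bool, device=weight.device)
    non_salient[salient_cols] = False

    backup = weight.data.clone()
    noise = torch.randn_like(weight) * (delta_rows.unsqueeze(1) / 2.0)   # row-wise scales (Eq. 5)
    weight.data[:, non_salient] += noise[:, non_salient]

    loss = loss_fn()                       # forward pass with perturbed non-salient weights
    loss.backward()

    weight.data.copy_(backup)              # non-salient weights stay frozen at their original values
    weight.grad[:, non_salient] = 0.0      # gradients flow only into salient columns
    optimizer.step()
    optimizer.zero_grad()
    return loss.detach()
```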

3.1 Generalizing parameter sensitivity metrics

3.1 泛化参数敏感性指标

Several approaches have been proposed recently to identify weights sensitive to quantization (Dettmers et al., 2023) or pruning (Sun et al., 2023). We generalize them as metrics for sensitivity to perturbations, and by applying these metrics, we determine which columns are more susceptible to degradation. Therefore, we avoid adding noise to such columns and use them to fine-tune the model.

最近提出了几种方法来识别对量化 (Dettmers et al., 2023) 或剪枝 (Sun et al., 2023) 敏感的权重。我们将这些方法推广为对扰动的敏感性度量,并通过应用这些度量来确定哪些列更容易受到性能下降的影响。因此,我们避免向这些列添加噪声,并利用它们来微调模型。


Figure 2: GIFT-SW procedure follows Equation 2. We first sample some noise, relative to quantization levels, then, perform forward pass, and then update salient weights only. In GIFT-SW, quantization, pruning or tensor decomposition can be applied to non-salient weights and then, salient weights can be fine-tuned effectively without changing non-salient weights structure. In our experiments we select only 128 columns of salient weights, unless specified otherwise.

图 2: GIFT-SW流程遵循公式2。我们首先采样相对于量化水平的噪声,执行前向传播,然后仅更新显著权重。在GIFT-SW中,可对非显著权重应用量化、剪枝或张量分解,随后无需改变非显著权重结构即可有效微调显著权重。实验中除非特别说明,我们仅选择128列显著权重。

The proposed sensitivity metric is written for a column $j$ of the weight matrix $\mathbf{W}$ as

所提出的敏感度度量针对权重矩阵 $\mathbf{W}$ 的第 $j$ 列可表示为

$$
s_{j}=\|\mathbf{D}_{j}\|_{\tau}\,\|\mathbf{X}_{j}\|_{\rho}^{\gamma},
$$

where $\mathbf{D}_{j}$ is a measure of weight perturbation, $s_{j}$ denotes the sensitivity of the column to perturbations, $\mathbf{X}$ is the input feature, and $\gamma$ takes on one of the values $1/2, 1, 2$. As discussed in Section 2.4, we could apply GNI as a source of perturbations, in which case we would compute $\mathbf{D}_{j}=\mathbf{W}_{:,j}+\boldsymbol{\xi}$. However, sampling noise $\boldsymbol{\xi}$ is not deterministic. To approximate the influence of the noise $\boldsymbol{\xi}$, we utilize perturbations caused by quantization. That leads to $\mathbf{D}_{j}=\mathbf{W}_{:,j}-Q(\mathbf{W}_{:,j})$, where $Q(\mathbf{W}_{:,j})$ corresponds to the weights subjected to uniform symmetric quantization (see Appendix A).

其中 $\mathbf{D}_{j}$ 是权重扰动的度量,$s_{j}$ 表示该列对扰动的敏感性,$\mathbf{X}$ 是输入特征,$\gamma$ 取 $1/2, 1, 2$ 之一。如第2.4节所述,我们可以将GNI作为扰动源,此时计算 $\mathbf{D}_{j}=\mathbf{W}_{:,j}+\boldsymbol{\xi}$。然而采样噪声 $\boldsymbol{\xi}$ 并非确定性的。为近似估计噪声 $\boldsymbol{\xi}$ 的影响,我们采用量化引起的扰动,这将得到 $\mathbf{D}_{j}=\mathbf{W}_{:,j}-Q(\mathbf{W}_{:,j})$,其中 $Q(\mathbf{W}_{:,j})$ 对应经过均匀对称量化的权重(参见附录A)。

The input feature $\mathbf{X}$ for each layer is computed using a number of random sentences from a calibration dataset. After that, sensitivity values $s_{j}$ are estimated for individual columns. Columns with the highest values are identified as the salient columns. Some details about the calibration dataset are described in Section 4.1.

每层的输入特征 $\mathbf{X}$ 通过从校准数据集中随机选取若干句子计算得出。随后,为各列估计敏感度值 $s_{j}$,并将数值最高的列标记为显著列。关于校准数据集的更多细节见第4.1节。

The metric given by Equation 4 is closely related to those studied in the recent literature on quantization. In particular, the metric $\|\mathbf{X}\|_{\infty}$ is employed in QUIK (Ashkboos et al., 2023) and SmoothQuant (Xiao et al., 2023). OWQ (Lee et al., 2024) adopts $\lambda_{j}\|\mathbf{D}_{j}\|_{2}^{2}$, where $\lambda_{j}=\|\mathbf{X}_{j}\|_{2}^{2}$ is the $j$-th diagonal element of the Hessian matrix $\mathbf{H}$ for the layer quantization error. It can be seen that the sensitivity metric used in OWQ is a column-wise modification of the salience measure provided in OBD (LeCun et al., 1989) for network pruning. The metric proposed in Wanda (Sun et al., 2023) is an element-wise variant of the metric $\|\mathbf{D}_{j}\|_{1}\|\mathbf{X}_{j}\|_{2}$, which can be easily obtained from Equation 4 with pruning as the source of perturbations for $\mathbf{D}_{j}$.

公式4给出的度量指标与近期量化文献中研究的指标密切相关。具体而言,$\|\mathbf{X}\|_{\infty}$ 被用于QUIK (Ashkboos等人,2023)和SmoothQuant (Xiao等人,2023)。OWQ (Lee等人,2024)采用 $\lambda_{j}\|\mathbf{D}_{j}\|_{2}^{2}$,其中 $\lambda_{j}=\|\mathbf{X}_{j}\|_{2}^{2}$ 是层量化误差Hessian矩阵 $\mathbf{H}$ 的第 $j$ 个对角元素。可以看出,OWQ使用的敏感度指标是对OBD (LeCun等人,1989)网络剪枝中显著性度量的列量化改进。Wanda (Sun等人,2023)提出的指标是度量 $\|\mathbf{D}_{j}\|_{1}\|\mathbf{X}_{j}\|_{2}$ 的逐元素变体,该指标可通过公式4将剪枝作为 $\mathbf{D}_{j}$ 的扰动源轻松推导得出。

In contrast to Wanda, we use the $\ell_{\infty}$ norm in our general Equation 4 due to the following observations. Examples contained in a calibration dataset induce different values of the input feature, and the use of the $\ell_{2}$ norm leads to averaging of the values along input channels. Therefore, the appearance of outlier values in the input activations can be obscured by a large number of lower values. The same conclusion also applies to the weight error: in the case of the $\ell_{2}$ norm, the error for each channel includes all deviations between the quantized and original weights, so rare considerable errors can be hidden by a large number of small deviations.

与Wanda不同,我们在通用公式4中采用 $\ell_{\infty}$ 范数,原因如下:校准数据集中的样本会导致输入特征值存在差异,使用 $\ell_{2}$ 范数会沿输入通道对数值进行平均。因此,输入激活中的异常值可能被大量较低数值掩盖。这一结论同样适用于权重误差。对于 $\ell_{2}$ 范数,每个通道的误差包含量化权重与原始权重间的所有偏差。因此,罕见的显著误差可能被大量微小偏差抵消。
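A sketch of how the sensitivity of Equation 4 can be computed and used to pick salient columns is shown below; the quantization scale here is a simple per-column max-based choice for illustration (the paper estimates scales as described in its Appendix A), and the defaults $\tau=\rho=\infty$, $\gamma=1$ are assumptions.

```python
import torch

def column_sensitivity(W, X, bits=4, gamma=1.0, tau=float("inf"), rho=float("inf")):
    """s_j = ||D_j||_tau * ||X_j||_rho^gamma (Eq. 4), with D_j the error of uniform
    symmetric quantization of column j. W: (out_features, in_features); X: (tokens, in_features)."""
    qmax = 2 ** (bits - 1) - 1
    scale = (W.abs().amax(dim=0) / qmax).clamp_min(1e-8)          # illustrative per-column scale
    Wq = torch.clamp(torch.round(W / scale), -qmax, qmax) * scale
    D = W - Wq

    d_norm = torch.linalg.vector_norm(D, ord=tau, dim=0)          # ||D_j||_tau over each column
    x_norm = torch.linalg.vector_norm(X, ord=rho, dim=0)          # ||X_j||_rho over calibration tokens
    return d_norm * x_norm ** gamma

def select_salient_columns(W, X, k=128, **metric_kwargs):
    """Return the indices of the k most sensitive (salient) columns."""
    return torch.topk(column_sensitivity(W, X, **metric_kwargs), k).indices
```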

3.2 Quantization Noise Injection

3.2 量化噪声注入

To improve our fine-tuning procedure with QNI, we avoid applying perturbations to sensitive weights. Therefore, after identifying columns that are sensitive to perturbations or salient during the finetuning stage, we inject quantization noise only into non-salient columns across all layers, as shown in Figure 2.

为改进采用QNI的微调过程,我们避免对敏感权重施加扰动。因此,在识别出对扰动敏感或在微调阶段显著的列后,我们仅在所有层的非显著列中注入量化噪声,如图2所示。

The scale parameters of the Gaussian noise are determined by the quantization step sizes, which are computed for each layer prior to the training process.

高斯噪声的尺度参数由量化步长决定,这些步长在训练前为每个层计算得出。

For the weight matrix $\mathbf{W}$ of a given layer in the model, the process of noise injection can be described as follows. During each forward pass in the training phase, we first sample the elements of a noise matrix $\pmb{\Omega}$ from the standard normal distribution $\mathcal{N}(0,1)$. Subsequently, the matrix $\pmb{\Omega}$ is scaled with the quantization step size $\pmb{\Delta}$. Finally, we add the scaled noise to the weights of the non-salient columns $\mathbf{W}_{[:,\,non\text{-}salient]}$. The noise injection operation $\mho$ is given as

对于模型中给定层的权重矩阵 $\mathbf{W}$,噪声注入的过程可描述如下。在训练阶段的每次前向传播时,我们首先从标准正态分布 $\mathcal{N}(0,1)$ 中采样噪声矩阵 $\pmb{\Omega}$ 的元素。随后,用量化步长 $\pmb{\Delta}$ 对该矩阵 $\pmb{\Omega}$ 进行缩放。最后,我们将缩放后的噪声加到非显著列 $\mathbf{W}_{[:,\,non\text{-}salient]}$ 的权重上。噪声注入操作 $\mho$ 的表达式为

$$
\mho(\mathbf{W})=\begin{cases}
\mathbf{W}_{[:,\,salient]}, \\
\mathbf{W}_{[:,\,non\text{-}salient]}+\frac{1}{2}\operatorname{diag}(\pmb{\Delta})\,\pmb{\Omega},
\end{cases}
$$

where $\operatorname{diag}(\pmb{\Delta})$ is the diagonal matrix with the elements of the vector $\pmb{\Delta}$.

其中 $\operatorname{diag}(\pmb{\Delta})$ 是以向量 $\pmb{\Delta}$ 的元素为对角元素的对角矩阵。

Only the weights of the salient columns $\mathbf{W}_{[:,\,salient]}$ are updated during training, whereas the weights of the other columns $\mathbf{W}_{[:,\,non\text{-}salient]}$ are frozen. We do not inject noise into the salient weights, since small perturbations in them can cause severe model degradation.

训练期间仅更新显著列权重 $\mathbf{W}_{[:,\,salient]}$,其他列权重 $\mathbf{W}_{[:,\,non\text{-}salient]}$ 则保持冻结。由于显著权重的微小扰动可能导致模型性能大幅下降,因此不对其注入噪声。
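A direct transcription of the operator in Equation 5, assuming PyTorch; `non_salient_mask` is a boolean mask over columns and `delta_rows` holds the per-row step sizes from Equation 6 (both names are illustrative).

```python
import torch

def inject_noise(W, non_salient_mask, delta_rows):
    """Salient columns pass through unchanged; non-salient columns receive Gaussian noise
    scaled by half of the row-wise quantization step (Eq. 5). A fresh Omega ~ N(0, 1)
    is sampled at every call, i.e. at every forward pass."""
    omega = torch.randn_like(W)
    W_noisy = W.clone()
    W_noisy[:, non_salient_mask] += 0.5 * delta_rows.unsqueeze(1) * omega[:, non_salient_mask]
    return W_noisy
```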

The quantization step size $\pmb{\Delta}$ is determined only for the weights in the non-salient columns $\mathbf{W}_{[:,\,non\text{-}salient]}$. To more closely match the initial distribution of the weights, the quantization scale factors included in $\pmb{\Delta}$ are estimated for each row individually. For the $i$-th row, the scale factor $\Delta_{i}$ is computed as

量化步长 $\pmb{\Delta}$ 仅针对非显著列权重 $\mathbf{W}_{[:,\,non\text{-}salient]}$ 确定。为了更贴近权重的初始分布,包含在 $\pmb{\Delta}$ 中的量化比例因子逐行单独估算。对于第 $i$ 行,比例因子 $\Delta_{i}$ 的计算公式为:

$$
\Delta_{i}=\frac{\alpha_{i}}{2^{b-1}-1},
$$

where $b$ is the bit-width and $\alpha_{i}$ is the quantization parameter. As in quantization methods, a smaller bit-width $b$ corresponds to higher quantization noise. The parameter $\alpha_{i}$ is estimated by optimizing the weight error through a linear search, as discussed in Appendix A.

其中 $b$ 为位宽,$\alpha_{i}$ 为量化参数。与量化方法类似,较小的位宽 $b$ 对应较高的量化噪声。参数 $\alpha_{i}$ 的估计方法是通过附录A讨论的线性搜索优化权重误差来实现的。

Based on Equations 5 and 6, the variance of the injected noise is determined by the distribution of non-salient weights across rows. We exclude salient columns from this distribution, as the salient weights may induce large quantization errors and distort the row-wise scale factors. This approach helps us minimize the noise variance, which, in turn, leads to a reduction in the deviation of the non-salient weights during training.

根据公式5和6,注入噪声的方差由各行非显著权重的分布决定。我们将显著列排除在此分布之外,因为显著权重可能引发较大的量化误差并扭曲行级缩放因子。这种方法有助于最小化噪声方差,从而减少训练过程中非显著权重的偏差。
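A hedged sketch of the per-row step-size estimation: the linear search over the clipping parameter $\alpha_i$ follows the description around Equation 6, but the grid size and search range are illustrative, and only the non-salient weights of the row should be passed in, as explained above.

```python
import torch

def row_quantization_step(w_row_non_salient: torch.Tensor, bits: int = 4, grid: int = 64) -> torch.Tensor:
    """Delta_i = alpha_i / (2^(b-1) - 1), where alpha_i minimizes the squared quantization
    error of the row over a simple linear search (illustrative grid and range)."""
    qmax = 2 ** (bits - 1) - 1
    w_max = w_row_non_salient.abs().max().clamp_min(1e-8)
    best_delta, best_err = w_max / qmax, float("inf")
    for t in torch.linspace(0.5, 1.0, grid):          # candidate clipping thresholds alpha_i = t * max|w|
        delta = (t * w_max) / qmax
        wq = torch.clamp(torch.round(w_row_non_salient / delta), -qmax, qmax) * delta
        err = torch.sum((w_row_non_salient - wq) ** 2).item()
        if err < best_err:
            best_err, best_delta = err, delta
    return best_delta
```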

By sampling noise in this way, we can also use it for the quantization pre-training experiments discussed in Section 6.3.

通过这种方式对噪声进行采样,我们可以将其用于第6.3节讨论的量化预训练实验。

4 Experiments

4 实验

In this section, we describe the experimental procedure used to compare the performance of GIFT-SW against other methods.

在本节中,我们将描述用于测试GIFT-SW与其他方法相比性能的实验流程。

4.1 Data

4.1 数据

Following previous studies (Nikdan et al., 2024; Hu et al., 2021; Liu et al., 2024), we focus on the instruction tuning task. For this purpose, we use the TULU-V2-Mix as the main source of data (Ivison et al., 2023), as it encompasses a wide range of instructions from different sources. This dataset has been filtered, contains a substantial amount of data without being too large, and models tuned to this set show superior performance. Additionally, we utilize the OpenOrca dataset (Mukherjee et al., 2023) to demonstrate that our method does not depend on a specific set of instructions.

遵循先前研究 (Nikdan et al., 2024; Hu et al., 2021; Liu et al., 2024) ,我们专注于指令微调任务。为此,我们采用 TULU-V2-Mix 作为主要数据源 (Ivison et al., 2023) ,因其涵盖多来源的广泛指令。该数据集经过过滤,数据量充足且规模适中,基于其微调的模型展现出卓越性能。此外,我们使用 OpenOrca 数据集 (Mukherjee et al., 2023) 来证明本方法不依赖于特定指令集。

Table 1: Mean accuracy of LLaMA models fine-tuned with various instructive datasets and different methods.

表 1: 不同指令数据集和方法微调后的LLaMA模型平均准确率

| Method | LLaMA2-7b TULU-V2-mix | LLaMA2-7b OpenOrca | LLaMA2-13b TULU-V2-mix | LLaMA2-13b OpenOrca | LLaMA3-8b TULU-V2-mix | LLaMA3-8b OpenOrca |
|---|---|---|---|---|---|---|
| FT | 71.97 | 71.88 | 75.09 | 75.21 | 76.13 | 77.02 |
| LoRA | 71.78 | 72.03 | 74.03 | 73.97 | 75.91 | 75.89 |
| DoRA | 70.89 | 70.97 | 74.01 | 73.96 | 75.63 | 75.72 |
| GIFT-SW | 73.33 | 72.33 | 75.93 | 76.02 | 76.37 | 76.78 |

Table 2: Mean accuracy of quantized and then fine-tuned models. For fine-tuning we used TÜLU-V2-mix.

| 位数 | 方法 | LLaMA2-7b | LLaMA2-13b | LLaMA3-8b |
|---|---|---|---|---|
| 4 bit | STE | 72.43 | 75.29 | 74.84 |
| 4 bit | QUIK + LoRA | 63.99 | 71.08 | 74.27 |
| 4 bit | GIFT-SW | 72.53 | 74.50 | 75.46 |
| 3 bit | STE | 69.82 | 74.37 | 70.24 |
| 3 bit | QUIK + LoRA | 62.91 | 71.30 | 71.65 |
| 3 bit | GIFT-SW | 71.00 | 74.34 | 73.27 |
| 2 bit | STE | 58.20 | 62.19 | 48.96 |
| 2 bit | QUIK + LoRA | 41.44 | 47.14 | 53.80 |
| 2 bit | GIFT-SW | 61.09 | 67.61 | 58.89 |

表 2: 量化后微调模型的平均准确率。微调采用TÜLU-V2-mix数据集。

The sensitivity metrics to find salient columns are estimated based on 512 random sentences from the Pile validation dataset (Xiao et al., 2023).

用于识别显著列的敏感度指标基于Pile验证数据集中的512个随机句子进行估算 (Xiao et al., 2023)。

4.2 Baselines

4.2 基线方法

We consider several baselines for both full precision and quantized experiments. All baselines are applied to LLaMA2-7b, LLaMA2-13b and LLaMA3-8b.

我们为全精度和量化实验考虑了多个基线方法。所有基线均应用于LLaMA2-7b、LLaMA2-13b和LLaMA3-8b模型。

Full precision v