Methods and tools for efficient training on a single GPU

<aside>

Optimizer choice

The most common optimizer used to train transformer models is Adam or AdamW (Adam with weight decay).

→ The optimizers most commonly used for transformer models are Adam and AdamW.

<aside>

Adam achieves good convergence by storing the rolling average of the previous gradients; however, it adds an additional memory footprint of the order of the number of model parameters.

→ **Adam** converges well by storing a moving average of previous gradients, but this adds memory usage proportional to the number of model parameters.
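To see where the extra memory comes from, here is a minimal pure-Python sketch of one Adam update. It is an illustration rather than the implementation used by any library. The function name `adam_step` and the toy values are made up for this note. The point is that Adam keeps two state buffers (`m` and `v`), each the same size as the parameters.

```python
import math

def adam_step(params, grads, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place; m and v are the per-parameter state buffers."""
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g        # rolling average of gradients
        v[i] = beta2 * v[i] + (1 - beta2) * g * g    # rolling average of squared gradients
        m_hat = m[i] / (1 - beta1 ** step)           # bias correction for early steps
        v_hat = v[i] / (1 - beta2 ** step)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

# Toy values, chosen only for illustration.
params = [0.5, -1.0, 2.0]
grads = [0.1, -0.2, 0.3]
m = [0.0] * len(params)  # one state value per parameter
v = [0.0] * len(params)  # a second state value per parameter
adam_step(params, grads, m, v, step=1)
state_floats = len(m) + len(v)  # two state values for every parameter
```

With fp32 state this works out to 8 extra bytes per parameter, which is why the footprint grows with model size.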

<aside>

To remedy this, you can use an alternative optimizer.

→ You can choose an optimizer that uses less memory.

<aside>

For example, if you have NVIDIA/apex installed for NVIDIA GPUs, or ROCmSoftwarePlatform/apex for AMD GPUs, adamw_apex_fused will give you the fastest training experience among all supported AdamW optimizers.

→ Using NVIDIA/apex or ROCmSoftwarePlatform/apex is recommended.

<aside>

Trainer integrates a variety of optimizers that can be used out of the box: adamw_hf, adamw_torch, adamw_torch_fused, adamw_apex_fused, adamw_anyprecision, adafactor, or adamw_bnb_8bit. More optimizers can be plugged in via a third-party implementation.

<aside>

Let’s take a closer look at two alternatives to the AdamW optimizer:

→ Two alternatives to the AdamW optimizer

  1. adafactor which is available in Trainer
  2. adamw_bnb_8bit is also available in Trainer, but a third-party integration is provided below for demonstration.

For comparison, for a 3B-parameter model, like “google-t5/t5-3b”:
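The comparison can be worked out with simple arithmetic. The sketch below assumes the commonly quoted optimizer-state costs: 8 bytes per parameter for standard AdamW, roughly 4 bytes for Adafactor (slightly more in practice), and 2 bytes for 8-bit Adam when all states are quantized. It also rounds the model to exactly 3 billion parameters. Treat the results as rough estimates, not measurements.

```python
# Assumed optimizer-state cost in bytes per parameter (illustrative figures).
BYTES_PER_PARAM = {
    "adamw": 8,           # two fp32 states per parameter
    "adafactor": 4,       # slightly more than this in practice
    "adamw_bnb_8bit": 2,  # two 8-bit states per parameter
}

def optimizer_state_gb(num_params, bytes_per_param):
    """Estimate optimizer-state memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

NUM_PARAMS = 3_000_000_000  # assumption: the 3B model rounded to 3 billion parameters
estimates = {
    name: optimizer_state_gb(NUM_PARAMS, cost)
    for name, cost in BYTES_PER_PARAM.items()
}
for name, gb in estimates.items():
    print(f"{name}: ~{gb:.0f} GB")
```

Under these assumptions the optimizer state needs about 24 GB with AdamW, about 12 GB with Adafactor, and about 6 GB with 8-bit Adam.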

</aside>

<aside>

Adafactor

Adafactor doesn’t store rolling averages for each element in weight matrices.

→ It does not store a moving average for every single element.

Instead, it keeps aggregated information (sums of rolling averages row- and column-wise), significantly reducing its footprint.

→ Instead, it stores sums of the moving averages per row and per column, which reduces memory usage.

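| Optimizer | How moving averages are stored | Memory usage |
| --- | --- | --- |
| Adam / AdamW | Stores the mean and variance estimates for every individual element | High |
| Adafactor | Stores only information summarized per row and per column | Low |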
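The row and column idea can be sketched in a few lines of plain Python. This is a simplified snapshot, not the full Adafactor algorithm. The real optimizer keeps exponential moving averages of these sums and updates them every step. The helper names `factor` and `reconstruct` and the matrix values are made up for illustration.

```python
def factor(matrix):
    """Return the row sums and column sums of a non-negative matrix."""
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    return row_sums, col_sums

def reconstruct(row_sums, col_sums):
    """Rank-1 estimate: entry (i, j) is row_sums[i] * col_sums[j] / total."""
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]

# Squared-gradient statistics for a 2 x 3 weight matrix (made-up values).
second_moment = [[1.0, 2.0, 3.0],
                 [2.0, 4.0, 6.0]]
rows, cols = factor(second_moment)
approx = reconstruct(rows, cols)

full_state = len(second_moment) * len(second_moment[0])  # n * m values
factored_state = len(rows) + len(cols)                   # n + m values
```

For an n x m matrix the stored state drops from n * m values to n + m. The toy matrix here happens to be rank 1, so the reconstruction is exact. In general it is only an approximation, which is one way to think about the convergence trade-off.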

However, compared to Adam, Adafactor may have slower convergence in certain cases.

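→ However, convergence may be slower in certain cases.

→ My guess: convergence may be slower because it does not store the moving average of every element.

→ It may be inefficient for small datasets or small models, which are more sensitive to changes in the gradient.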

You can switch to Adafactor by setting optim="adafactor" in TrainingArguments:

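from transformers import TrainingArguments

# default_args: shared keyword arguments (e.g. output_dir), assumed to be defined earlier
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    optim="adafactor",
    **default_args
)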

Combined with other approaches (gradient accumulation, gradient checkpointing, and mixed precision training), you can see up to a 3x reduction in memory usage while maintaining throughput!

→ Used together with gradient accumulation, gradient checkpointing, and mixed precision training, memory usage improves by up to 3x while training throughput is maintained.

→ All three methods are ultimately about reducing memory!!

However, as mentioned before, Adafactor can converge worse than Adam.

</aside>

<aside>

8-bit Adam

Instead of aggregating optimizer states like Adafactor, 8-bit Adam keeps the full state and quantizes it.

→ 8-bit Adam keeps the full state and quantizes it.

Quantization means that it stores the state with lower precision and dequantizes it only for the optimization.

→ Quantization stores the state in low precision and converts it back to higher precision only during the optimization step.
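The store-low, compute-high idea can be illustrated with a simple linear absmax quantizer. This is only a sketch of the concept. The scheme actually used by bitsandbytes is more elaborate (block-wise, with a non-linear 8-bit data type). The function names and values below are hypothetical.

```python
def quantize_absmax(values):
    """Map floats to signed 8-bit integer codes plus one float scale."""
    absmax = max(abs(x) for x in values) or 1.0  # avoid dividing by zero
    scale = absmax / 127
    codes = [round(x / scale) for x in values]   # each code fits in [-127, 127]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values for use in the update step."""
    return [c * scale for c in codes]

# Made-up optimizer state values.
state = [0.02, -0.5, 0.13, 0.0]
codes, scale = quantize_absmax(state)   # stored form: 1 byte per value
restored = dequantize(codes, scale)     # temporary form: full precision
max_error = max(abs(a - b) for a, b in zip(state, restored))
```

Each state value is stored in one byte instead of four, at the cost of a rounding error of at most half a quantization step.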

This is similar to the idea behind mixed precision training.

To use adamw_bnb_8bit, you simply need to set optim="adamw_bnb_8bit" in TrainingArguments:

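from transformers import TrainingArguments

# default_args: shared keyword arguments, assumed to be defined earlier
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    optim="adamw_bnb_8bit",
    **default_args
)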

However, we can also use a third-party implementation of the 8-bit optimizer for demonstration purposes to see how that can be integrated.

→ You can also build a custom optimizer directly with the bitsandbytes library.

First, follow the installation guide in the GitHub repo to install the bitsandbytes library that implements the 8-bit Adam optimizer.

pip install bitsandbytes

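Next, you need to initialize the optimizer. This involves two steps: first, group the model's parameters into those that receive weight decay and those that do not (usually biases and layer norm parameters); then pass the optimizer the same hyperparameters that the previously used AdamW optimizer had.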
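The initialization can be sketched as follows. This is a minimal illustration under stated assumptions, not the guide's exact code. The helper names `build_param_groups` and `make_adam_bnb_optim` are hypothetical. The name-based rule for skipping weight decay (names containing "bias" or "norm") is a simplifying assumption. `model` and `training_args` are expected to exist already. `bnb.optim.Adam8bit` is the 8-bit Adam class provided by bitsandbytes.

```python
def build_param_groups(named_parameters, weight_decay):
    """Split (name, parameter) pairs into decayed and non-decayed groups.

    Assumption: a name-based heuristic. Parameters whose name contains
    "bias" or "norm" are excluded from weight decay.
    """
    decay, no_decay = [], []
    for name, param in named_parameters:
        lowered = name.lower()
        if "bias" in lowered or "norm" in lowered:
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

def make_adam_bnb_optim(model, training_args):
    """Build an 8-bit Adam optimizer that reuses the Trainer's AdamW hyperparameters."""
    import bitsandbytes as bnb  # imported lazily; requires `pip install bitsandbytes`

    groups = build_param_groups(model.named_parameters(), training_args.weight_decay)
    return bnb.optim.Adam8bit(
        groups,
        lr=training_args.learning_rate,
        betas=(training_args.adam_beta1, training_args.adam_beta2),
        eps=training_args.adam_epsilon,
    )
```

With these helpers, `adam_bnb_optim = make_adam_bnb_optim(model, training_args)` would produce the optimizer object that is handed to the Trainer.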

Finally, pass the custom optimizer as an argument to the Trainer:

trainer = Trainer(model=model, args=training_args, train_dataset=ds, optimizers=(adam_bnb_optim, None))

Combined with other approaches (gradient accumulation, gradient checkpointing, and mixed precision training), you can expect about a 3x memory improvement and even slightly higher throughput than with Adafactor.

→ My guess: because it keeps all of the information without summarizing it (even though it is quantized), it may perform slightly better.

</aside>
