Week 6: Optimizer choice

Methods and tools for efficient training on a single GPU

<aside>

Optimizer choice

The most common optimizer used to train transformer models is Adam or AdamW (Adam with weight decay).

→ The optimizers most commonly used for transformer models are Adam and AdamW.

<aside>

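Adam (Adaptive Moment Estimation):
- An optimizer that builds on stochastic gradient descent (SGD)
- Uses exponential moving averages of past gradients and squared gradients to adapt the step size for each parameter
- Generally performs well in deep learning
AdamW:
- Similar to Adam, but applies weight decay as a separate step, which helps regularize the model and reduce overfitting
- With Adam, weight decay is typically implemented as L2 regularization added to the gradient; AdamW decouples the decay from the gradient update and applies it directly to the weights
- For most recent model training, AdamW is recommended over Adam </aside>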
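
To make the Adam vs. AdamW difference concrete, here is a minimal sketch of one update step for a single scalar weight. The function name `adam_step` and the default hyperparameters are illustrative assumptions, not the code of any particular library.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=True):
    """One update for a single scalar weight; returns (w, m, v)."""
    if not decoupled:
        # Adam + L2: the decay term is folded into the gradient, so it is
        # also rescaled by the adaptive denominator below.
        grad = grad + weight_decay * w
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialised averages (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    if decoupled:
        # AdamW: shrink the weight directly, independent of the gradient scale.
        w = w - lr * weight_decay * w
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Zero loss gradient, so only the weight decay acts on w = 1.0.
w_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, weight_decay=0.1, decoupled=False)
w_dec, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, weight_decay=0.1, decoupled=True)
print(w_l2, w_dec)  # roughly 0.999 vs 0.9999
```

In this toy case the L2 variant turns the decay into a full-size adaptive step of about `lr`, while the decoupled variant shrinks the weight by only `lr * weight_decay`. That is the sense in which AdamW applies weight decay separately.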

Adam achieves good convergence by storing the rolling average of the previous gradients; however, it adds an additional memory footprint of the order of the number of model parameters.

→ **Adam** converges well because it stores moving averages of previous gradients, but this adds memory use proportional to the number of model parameters.

<aside>

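Rolling average:
- Past gradient information is kept as an **exponential moving average**, so the update is less affected by abrupt changes
Memory footprint:
- Adam stores first- and second-moment estimates (moving averages of the gradient and of the squared gradient) for every parameter, so it uses more memory than plain SGD
- The memory burden grows with model size </aside>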

To remedy this, you can use an alternative optimizer.

→ You can choose an optimizer that uses less memory.

<aside>

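Rough estimate for a 3B-parameter model, counting gradients and optimizer state but not the weights:
SGD (gradients only) → 3B × 4 bytes = 12GB
Adam / AdamW (gradients + first moment + second moment, FP32) → 3B × 12 bytes = 36GB
Adam / AdamW (FP64) → 3B × 24 bytes = 72GB </aside>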

For example if you have NVIDIA/apex installed for NVIDIA GPUs, or ROCmSoftwarePlatform/apex for AMD GPUs, adamw_apex_fused will give you the fastest training experience among all supported AdamW optimizers.

→ If NVIDIA/apex or ROCmSoftwarePlatform/apex is installed, adamw_apex_fused is the fastest of the supported AdamW options.

<aside>

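Apex:
- NVIDIA's PyTorch extension library; ROCmSoftwarePlatform/apex is the fork for AMD GPUs
- Supports mixed-precision training and fused GPU kernels to speed up training
adamw_apex_fused:
- The AdamW variant provided by Apex
- Faster than a standard AdamW because it uses fused, GPU-optimized kernels </aside>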
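
One way to act on this is to choose the `optim` string based on whether Apex can be imported. The helper below is a hypothetical sketch (`pick_adamw_variant` is not part of Transformers), and it only checks that the package is importable, not that it was built correctly for your GPU.

```python
import importlib.util

def pick_adamw_variant():
    """Return an `optim` string for TrainingArguments (hypothetical helper)."""
    # find_spec returns None when the top-level package cannot be found.
    if importlib.util.find_spec("apex") is not None:
        return "adamw_apex_fused"
    # Fall back to the plain PyTorch AdamW when Apex is not installed.
    return "adamw_torch"

optim_name = pick_adamw_variant()
print(optim_name)
```

The returned string could then be passed as `optim=optim_name` when building `TrainingArguments`.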

Trainer integrates a variety of optimizers that can be used out of box: adamw_hf, adamw_torch, adamw_torch_fused, adamw_apex_fused, adamw_anyprecision, adafactor, or adamw_bnb_8bit. More optimizers can be plugged in via a third-party implementation.

<aside>

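Hugging Face Trainer:
- A high-level API designed to make training transformer models easy
- Convenient because it supports several optimizers out of the box
Notes on each optimizer:
- adamw_hf: the AdamW implementation that ships with Hugging Face Transformers
- adamw_torch: the standard PyTorch AdamW
- adamw_torch_fused: PyTorch AdamW using the fused implementation
- adamw_apex_fused: AdamW from the NVIDIA/AMD Apex library (the fastest of the supported AdamW optimizers)
- adamw_anyprecision: AdamW that lets you choose the precision of the optimizer states
- adafactor: a memory-efficient alternative to Adam (useful for large models)
- adamw_bnb_8bit: AdamW that stores its optimizer states in 8 bits (saves memory) </aside>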

Let’s take a closer look at two alternatives to the AdamW optimizer:

→ Two alternatives to the AdamW optimizer

- adafactor, which is available in Trainer
- adamw_bnb_8bit, which is also available in Trainer, but a third-party integration is provided below for demonstration.

For comparison, for a 3B-parameter model, like “google-t5/t5-3b”:

A standard AdamW optimizer will need 24GB of GPU memory because it uses 8 bytes for each parameter (8*3 => 24GB)

→ 8 bytes × 3B (3 billion parameters) = 24GB
Adafactor optimizer will need more than 12GB. It uses slightly more than 4 bytes for each parameter, so 4*3 and then some extra.
8bit BNB quantized optimizer will use only (2*3) 6GB if all optimizer states are quantized.

→ 2 bytes × 3B = 6GB (75% less than the 24GB for standard AdamW)
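
The arithmetic in this comparison can be reproduced with a few lines. This is a rough sketch that counts optimizer state only (weights, gradients, and activations are left out) and uses 1 GB = 1e9 bytes, as the figures above do. The dictionary keys are just labels chosen for this example.

```python
# Bytes of optimizer state per parameter, taken from the comparison above.
STATE_BYTES_PER_PARAM = {
    "adamw": 8,           # two 4-byte moving averages per parameter
    "adafactor": 4,       # lower bound: "slightly more than 4 bytes"
    "adamw_bnb_8bit": 2,  # two 1-byte quantized moving averages
}

def optimizer_state_gb(num_params, bytes_per_param):
    """Optimizer-state size in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

num_params = 3_000_000_000  # a 3B-parameter model such as google-t5/t5-3b
estimates = {name: optimizer_state_gb(num_params, nbytes)
             for name, nbytes in STATE_BYTES_PER_PARAM.items()}
savings = 1 - estimates["adamw_bnb_8bit"] / estimates["adamw"]

for name, gb in estimates.items():
    print(f"{name}: {gb:.0f} GB")
print(f"8-bit savings vs AdamW: {savings:.0%}")
```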

</aside>

<aside>

Adafactor

Adafactor doesn’t store rolling averages for each element in weight matrices.

→ It does not keep a moving average for every individual element.

Instead, it keeps aggregated information (sums of rolling averages row- and column-wise), significantly reducing its footprint.

→ Instead, it stores row-wise and column-wise sums of the moving averages, which reduces memory use.

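| Optimizer | How moving averages are stored | Memory use |
| --- | --- | --- |
| Adam / AdamW | First- and second-moment estimates for every individual element | High |
| Adafactor | Only information summarized per row and per column | Low |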
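
The row and column idea can be illustrated with plain Python lists. This is a simplified sketch of the factorization only, not the full Adafactor update. The matrix values are made up, and they were chosen to be rank-1 so that the reconstruction is exact; for a general matrix it is only an approximation.

```python
def factor(matrix):
    """Compress a non-negative matrix into its row sums and column sums."""
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    return row_sums, col_sums

def reconstruct(row_sums, col_sums):
    """Rank-1 estimate: entry (i, j) = row_sums[i] * col_sums[j] / total."""
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]

# Stand-in for a 3 x 4 matrix of squared-gradient averages (made-up values).
v = [[1.0, 2.0, 3.0, 4.0],
     [2.0, 4.0, 6.0, 8.0],
     [0.5, 1.0, 1.5, 2.0]]
rows, cols = factor(v)
v_hat = reconstruct(rows, cols)

stored_full = len(v) * len(v[0])         # values kept by a per-element scheme
stored_factored = len(rows) + len(cols)  # values kept by the factored scheme
print(stored_full, stored_factored)
```

For an n × m weight matrix the factored form stores n + m numbers instead of n × m, which is where the memory saving comes from.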

However, compared to Adam, Adafactor may have slower convergence in certain cases.

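→ However, convergence can be slower in certain cases.

→ My guess is that convergence can be slower because it does not store a moving average for every element.

→ It may be a poor fit for small datasets or small models, which are more sensitive to gradient changes.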

You can switch to Adafactor by setting optim="adafactor" in TrainingArguments:

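from transformers import TrainingArguments

# default_args: shared keyword arguments, assumed to be defined earlier
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    optim="adafactor",
    **default_args,
)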

Combined with other approaches (gradient accumulation, gradient checkpointing, and mixed precision training), you can see up to a 3x improvement in memory use while maintaining throughput!

→ Used together with gradient accumulation, gradient checkpointing, and mixed precision training, memory use can improve by up to 3x while throughput is maintained.

→ All three techniques ultimately serve to reduce memory use!

However, as mentioned before, the convergence of Adafactor can be worse than Adam.
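
As a sketch of what such a combination might look like, the dictionary below collects the relevant `TrainingArguments` keyword arguments. The batch size and accumulation steps are illustrative values, not recommendations, and the final call is left as a comment because it needs `transformers` and the `default_args` used elsewhere in these notes.

```python
# Keyword arguments in the shape TrainingArguments accepts.
combined_args = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,  # accumulate 4 micro-batches per update
    "gradient_checkpointing": True,    # recompute activations in the backward pass
    "fp16": True,                      # mixed precision training
    "optim": "adafactor",
}

# Number of samples per device that contribute to each optimizer update.
effective_batch_size = (combined_args["per_device_train_batch_size"]
                        * combined_args["gradient_accumulation_steps"])
print(effective_batch_size)

# With transformers installed:
# training_args = TrainingArguments(**combined_args, **default_args)
```

With these values the effective batch size per device stays at 4, matching the standalone Adafactor example, while each forward and backward pass only holds a single sample.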

</aside>

<aside>

8-bit Adam

Instead of aggregating optimizer states like Adafactor, 8-bit Adam keeps the full state and quantizes it.

→ 8-bit Adam keeps the full state and quantizes it.

Quantization means that it stores the state with lower precision and dequantizes it only for the optimization.

→ Quantization stores the state at lower precision and converts it back to higher precision only during the optimization step.


This is similar to the idea behind mixed precision training.
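
To illustrate the store-low, compute-high idea, here is a minimal sketch of linear absmax quantization to 8-bit integer codes. It is a simplified stand-in: bitsandbytes uses a more elaborate block-wise scheme, and the function names and sample values here are assumptions made for the example.

```python
def quantize_block(values):
    """Map floats to integer codes in [-127, 127] plus one float scale."""
    absmax = max(abs(x) for x in values) or 1.0  # avoid division by zero
    scale = absmax / 127.0
    codes = [round(x / scale) for x in values]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats just before they are used in the update."""
    return [c * scale for c in codes]

state = [0.0123, -0.5, 0.25, 0.0004]  # made-up optimizer-state values
codes, scale = quantize_block(state)
restored = dequantize_block(codes, scale)
max_error = max(abs(a - b) for a, b in zip(state, restored))
print(codes, max_error)
```

Each value now needs one byte plus a shared scale instead of four bytes, at the cost of a rounding error of at most half a quantization step.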

To use adamw_bnb_8bit, you simply need to set optim="adamw_bnb_8bit" in TrainingArguments:

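from transformers import TrainingArguments

# default_args: shared keyword arguments, assumed to be defined earlier
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    optim="adamw_bnb_8bit",
    **default_args,
)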

However, we can also use a third-party implementation of the 8-bit optimizer for demonstration purposes to see how that can be integrated.

→ You can also build a custom optimizer directly with the bitsandbytes library.

First, follow the installation guide in the GitHub repo to install the bitsandbytes library that implements the 8-bit Adam optimizer.

pip install bitsandbytes

Next you need to initialize the optimizer. This involves two steps:

- First, group the model’s parameters into two groups: one where weight decay should be applied, and one where it should not. Usually, biases and layer norm parameters are not weight decayed.
- Then do some argument housekeeping to use the same parameters as the previously used AdamW optimizer.
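
The two steps above can be sketched as follows. The grouping helper is hypothetical and matches parameters by name, which is a simplifying assumption (the marker strings depend on how a given model names its layers). The fake parameter list stands in for `model.named_parameters()`. The optimizer construction is left as comments because it needs bitsandbytes and a GPU setup.

```python
def split_weight_decay_groups(named_params, weight_decay,
                              no_decay_markers=("bias", "layer_norm", "LayerNorm")):
    """Build optimizer parameter groups from (name, parameter) pairs."""
    decay, no_decay = [], []
    for name, param in named_params:
        # Biases and layer norm parameters go into the no-decay group.
        if any(marker in name for marker in no_decay_markers):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Placeholder names and objects standing in for model.named_parameters().
fake_named_params = [
    ("encoder.block.0.dense.weight", "W0"),
    ("encoder.block.0.dense.bias", "b0"),
    ("encoder.block.0.layer_norm.weight", "ln0"),
]
groups = split_weight_decay_groups(fake_named_params, weight_decay=0.01)
print(groups)

# Step 2 (housekeeping): reuse the hyperparameters from training_args.
# import bitsandbytes as bnb
# adam_bnb_optim = bnb.optim.Adam8bit(
#     groups,
#     lr=training_args.learning_rate,
#     betas=(training_args.adam_beta1, training_args.adam_beta2),
#     eps=training_args.adam_epsilon,
# )
```

With a real model you would pass `model.named_parameters()` and `training_args.weight_decay` instead of the placeholders.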

Finally, pass the custom optimizer as an argument to the Trainer:

trainer = Trainer(model=model, args=training_args, train_dataset=ds, optimizers=(adam_bnb_optim, None))

Combined with other approaches (gradient accumulation, gradient checkpointing, and mixed precision training), you can expect about a 3x memory improvement and even slightly higher throughput than with Adafactor.

→ My guess is that it may perform slightly better because it keeps all the information without summarizing it (even though the state is quantized).

</aside>
