2️⃣Bitsandbytes

Hugging Face Bitsandbytes

What is Bitsandbytes?

Quantization is a technique for lowering the precision of the numeric values in a model.

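Instead of a high-precision data type such as 32-bit floating point, quantization represents values with a low-precision data type such as 8-bit integers. This substantially reduces memory usage and can speed up model execution while keeping accuracy at an acceptable level.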
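
As a rough illustration of the idea, the sketch below applies symmetric absmax quantization to a handful of made-up weights, mapping them to 8-bit integers and back. The function names and sample values are hypothetical, and the scheme is simplified for intuition only; it is not the algorithm Bitsandbytes uses internally.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [int(round(v / scale)) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 1.27, -1.0]  # made-up sample weights
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

print(q)  # [12, -50, 33, 127, -100]
print(max(abs(w - r) for w, r in zip(weights, restored)))  # rounding error
```

Each value is stored in one byte instead of four, and the rounding error per value is at most half of the scale.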

To make model quantization more accessible, Hugging Face has integrated seamlessly with the Bitsandbytes library. This integration streamlines the quantization process and lets users build efficient models with just a few lines of code.

Quantization
Integrations
Optimizers

%pip install transformers bitsandbytes accelerate

Load 4-bit Quantization

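One of the main features of this integration is the ability to load a model with 4-bit quantization. You can do this by setting the `load_in_4bit=True` argument when calling the `from_pretrained` method. This can reduce memory usage by up to roughly 4x compared with 16-bit weights.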

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-1b7"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id, 
    device_map="auto", 
    load_in_4bit=True
)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.

8-bit quantization works the same way: pass `load_in_8bit=True` instead of `load_in_4bit=True`.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-1b7"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id, 
    device_map="auto", 
    load_in_8bit=True
)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
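
The warning above recommends passing a `BitsAndBytesConfig` object through the `quantization_config` argument. A minimal sketch of that form follows; the helper name `load_quantized` and its `bits` parameter are hypothetical, introduced only for illustration.

```python
def load_quantized(model_id, bits=4):
    """Load a causal LM with 4-bit or 8-bit Bitsandbytes quantization."""
    if bits not in (4, 8):
        raise ValueError("bits must be 4 or 8")

    # Imported lazily: these pull in torch, and actually loading the
    # quantized model is expected to need a supported GPU setup.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quantization_config = BitsAndBytesConfig(
        load_in_4bit=(bits == 4),
        load_in_8bit=(bits == 8),
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=quantization_config,
    )
```

For example, `load_quantized("bigscience/bloom-1b7", bits=8)` would stand in for the `load_in_8bit=True` call, without relying on the deprecated arguments.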

Let's check the memory footprint of the loaded models.

print('4bit Model:', model_4bit.get_memory_footprint())
print('8bit Model:', model_8bit.get_memory_footprint())

4bit Model: 1632878592
8bit Model: 2236858368
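
`get_memory_footprint()` reports bytes, so the numbers are easier to compare after converting them. The short sketch below does that with the two values printed above; the variable names are only illustrative.

```python
# Byte counts copied from the output above.
footprints = {"4-bit": 1632878592, "8-bit": 2236858368}
GIB = 1024 ** 3  # bytes per GiB

for name, num_bytes in footprints.items():
    print(f"{name}: {num_bytes / GIB:.2f} GiB")
# 4-bit: 1.52 GiB
# 8-bit: 2.08 GiB

ratio = footprints["8-bit"] / footprints["4-bit"]
print(f"8-bit / 4-bit: {ratio:.2f}x")  # 8-bit / 4-bit: 1.37x
```

The ratio is well below 2x, presumably because not every parameter is quantized: parts of the model such as the embeddings stay in higher precision, so the total footprint shrinks less than the per-weight bit width suggests.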

Extra Cases

Beyond choosing 4-bit or 8-bit quantization, several other options are available when loading a model.

Compute Data type

You can change the data type used during computation by setting the `bnb_4bit_compute_dtype` argument to a different value (for example, `torch.bfloat16`).

This can improve speed in certain scenarios. Here is an example:

import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True, 
    bnb_4bit_compute_dtype=torch.bfloat16
)

NF4 Data type

The NF4 data type is designed for weights that were initialized from a normal distribution. You can use it by specifying `bnb_4bit_quant_type="nf4"`:

from transformers import BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True, 
    bnb_4bit_quant_type="nf4"
)

model_nf4 = AutoModelForCausalLM.from_pretrained(
    model_id, 
    quantization_config=nf4_config
)

`low_cpu_mem_usage` was None, now set to True since model is quantized.

Nested Quantization for Memory Efficiency

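To improve memory efficiency further without sacrificing performance, nested quantization is recommended. It applies a second round of quantization to the quantization constants themselves, and it has proven especially useful when fine-tuning large models: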

from transformers import BitsAndBytesConfig

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True, 
    bnb_4bit_use_double_quant=True
)

model_double_quant = AutoModelForCausalLM.from_pretrained(
    model_id, 
    quantization_config=double_quant_config
)

`low_cpu_mem_usage` was None, now set to True since model is quantized.
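
A back-of-the-envelope calculation shows where the extra savings come from. The block sizes below follow the setup described in the QLoRA paper (64 weights per block with one 32-bit constant, and a second level that stores those constants in 8 bits per block of 256); they are assumptions not stated on this page, and the library defaults may differ.

```python
# Assumed block-wise quantization parameters (from the QLoRA paper).
BLOCK_SIZE = 64      # weights sharing one quantization constant
CONST_BITS = 32      # bits per first-level constant
NESTED_BITS = 8      # bits per constant after nested quantization
NESTED_BLOCK = 256   # first-level constants sharing one second-level constant

# Overhead per weight when each block stores a 32-bit constant.
plain_overhead = CONST_BITS / BLOCK_SIZE

# Overhead per weight when the constants are themselves quantized.
nested_overhead = (
    NESTED_BITS / BLOCK_SIZE
    + CONST_BITS / (BLOCK_SIZE * NESTED_BLOCK)
)

saved = plain_overhead - nested_overhead
print(f"plain:  {plain_overhead:.3f} bits/param")   # plain:  0.500 bits/param
print(f"nested: {nested_overhead:.3f} bits/param")  # nested: 0.127 bits/param
print(f"saved:  {saved:.3f} bits/param")            # saved:  0.373 bits/param
```

Under these assumptions the saving is about 0.37 bits per parameter, which adds up to several hundred megabytes for a model with billions of parameters.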

Quantized Model from the Hub

Quantized models can be loaded easily with the `from_pretrained` method.

To verify that the saved weights are quantized, check the `quantization_config` attribute in the model configuration:

model = AutoModelForCausalLM.from_pretrained(
    "<model_name>", 
    device_map="auto"
)
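
The check described above can be sketched as a small helper. The name `is_prequantized` is hypothetical, and the stand-in config objects below exist only so the example runs without downloading a model; with a real model you would pass `model.config`.

```python
from types import SimpleNamespace


def is_prequantized(config):
    """Return True if the config carries a quantization_config entry."""
    return getattr(config, "quantization_config", None) is not None


# Stand-in configs for illustration only.
quantized_cfg = SimpleNamespace(quantization_config={"load_in_4bit": True})
plain_cfg = SimpleNamespace()

print(is_prequantized(quantized_cfg))  # True
print(is_prequantized(plain_cfg))      # False
```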

Support 8-bit Optimizer

Bitsandbytes supports the following optimizers:

Adagrad, Adagrad8bit, Adagrad32bit
Adam, Adam8bit, Adam32bit, PagedAdam, PagedAdam8bit, PagedAdam32bit
AdamW, AdamW8bit, AdamW32bit, PagedAdamW, PagedAdamW8bit, PagedAdamW32bit
LAMB, LAMB8bit, LAMB32bit
LARS, LARS8bit, LARS32bit, PytorchLARS
Lion, Lion8bit, Lion32bit, PagedLion, PagedLion8bit, PagedLion32bit
RMSprop, RMSprop8bit, RMSprop32bit
SGD, SGD8bit, SGD32bit
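
These optimizers live in the `bitsandbytes.optim` module and are meant to be used in place of their PyTorch counterparts. The sketch below shows one way to create `AdamW8bit`; the helper name `build_8bit_optimizer` and the default hyperparameters are hypothetical choices, not recommendations from this page.

```python
def build_8bit_optimizer(params, lr=1e-4, weight_decay=0.0):
    """Create an 8-bit AdamW optimizer for the given parameters."""
    if lr <= 0:
        raise ValueError("lr must be positive")

    # Imported lazily: bitsandbytes generally expects a supported GPU setup.
    import bitsandbytes as bnb

    return bnb.optim.AdamW8bit(params, lr=lr, weight_decay=weight_decay)
```

In a training loop, `optimizer = build_8bit_optimizer(model.parameters())` would then be used like any other PyTorch optimizer, with `optimizer.step()` and `optimizer.zero_grad()`.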

