Megatron-SWIFT: Installing the Required Packages and Training Efficiently

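Megatron-SWIFT is seeing growing use for training large AI models. In response to reader requests, this post collects the required package installation steps, a working environment setup, and tips for efficient training on a single page.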

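1. Environment setup: recommended versions

Operating system: Ubuntu 20.04 or later
Python: 3.9 or 3.10 (a venv or conda virtual environment is recommended)
CUDA: 11.8 or later, with a recent NVIDIA GPU
PyTorch: 2.5 or 2.6 (recommended)
Required libraries: pybind11, transformer_engine, apex, megatron-core, and others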
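
Before installing anything, it can help to confirm that the active interpreter matches the Python versions listed above. The snippet below is a minimal sketch: the function name python_version_supported is a hypothetical helper introduced here for illustration, and the accepted range is simply the 3.9 / 3.10 recommendation from this list.

```shell
# Hypothetical helper (not part of Megatron-SWIFT): succeeds when the given
# version string falls in the range recommended above (3.9.x or 3.10.x).
python_version_supported() {
  case "$1" in
    3.9|3.9.*|3.10|3.10.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Report on whichever python3 is first on PATH (ideally the venv/conda one).
if command -v python3 >/dev/null 2>&1; then
  py_ver="$(python3 -c 'import platform; print(platform.python_version())')"
  if python_version_supported "$py_ver"; then
    echo "Python $py_ver: within the recommended range"
  else
    echo "Python $py_ver: outside the recommended range (3.9 or 3.10)"
  fi
else
  echo "python3 not found on PATH"
fi
```

Run it inside the virtual environment you plan to train in, since the system interpreter may differ from the one in venv or conda.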

2. Package installation, step by step

Run the commands below in order to build the environment needed for Megatron-SWIFT training.

# Install a recommended PyTorch version (2.5 or 2.6)
pip install torch==2.6.0

# Install pybind11
pip install pybind11

# Install transformer_engine (if the build fails, see https://github.com/modelscope/ms-swift/issues/3793)
pip install git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.3
# If the command above fails, try this instead
pip install --no-build-isolation transformer_engine[pytorch]

# Install apex
git clone https://github.com/NVIDIA/apex
cd apex
# Check out a specific commit (see https://github.com/modelscope/ms-swift/issues/4176)
git checkout e13873debc4699d39c6861074b9a3b2a02327f92
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

# Install megatron-core
pip install git+https://github.com/NVIDIA/Megatron-LM.git@core_r0.12.0

# Install SWIFT and its dependencies (the PyPI package is ms-swift, not swift)
pip install ms-swift

If an installation step fails, check the official FAQ or the GitHub issues for a known fix.
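
Once the commands above have finished, a quick way to see whether each package landed in the active environment is to ask Python to locate it. This is a sketch, not an official check: check_module is a hypothetical helper, and the import names (transformer_engine, apex, megatron.core, swift) are the ones these packages are assumed to expose.

```shell
# Hypothetical helper: succeeds if Python can locate the named module.
# It uses importlib.util.find_spec, so the module itself is not imported.
check_module() {
  python3 - "$1" <<'PY'
import importlib.util
import sys

name = sys.argv[1]
try:
    found = importlib.util.find_spec(name) is not None
except (ImportError, ValueError):
    # Raised when a parent package (e.g. "megatron") is missing or malformed.
    found = False
sys.exit(0 if found else 1)
PY
}

# Import names below are assumptions about what each package installs.
for mod in torch pybind11 transformer_engine apex megatron.core swift; do
  if check_module "$mod"; then
    echo "ok       $mod"
  else
    echo "MISSING  $mod"
  fi
done
```

A module reported as MISSING only means Python could not find it; a module reported as ok can still fail at import time if it was built against a different CUDA or PyTorch version.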

3. Using a Docker image

A Docker image makes the environment setup simpler.

docker pull infotonic/megatron-swift:latest
docker run --gpus all -it infotonic/megatron-swift:latest

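4. Supported models and parallelism options

Megatron-SWIFT supports fine-tuning and pre-training of major models such as Qwen, Llama3, and DeepSeek, and several forms of parallelism (data, tensor, pipeline, MoE, and others) can be applied.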
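
When combining these parallelism modes, the GPU count has to factor cleanly: in Megatron-style training the data-parallel size is the total number of GPUs divided by the product of the tensor-parallel and pipeline-parallel sizes. The sketch below works through that arithmetic. The function name data_parallel_size is hypothetical, and it deliberately ignores context parallelism and MoE expert parallelism to keep the example small.

```shell
# Hypothetical helper: prints the data-parallel size for a given layout.
# Arguments: <world_size> <tensor_parallel_size> <pipeline_parallel_size>
data_parallel_size() {
  world_size=$1
  tp=$2
  pp=$3
  model_parallel=$((tp * pp))
  # The layout is only valid if TP * PP divides the number of GPUs.
  if [ $((world_size % model_parallel)) -ne 0 ]; then
    echo "world size $world_size is not divisible by TP*PP=$model_parallel" >&2
    return 1
  fi
  echo $((world_size / model_parallel))
}

# Example: 8 GPUs with TP=2 and PP=2 leave 2 data-parallel replicas.
echo "DP size: $(data_parallel_size 8 2 2)"
```

For instance, 8 GPUs with tensor parallelism 2 and pipeline parallelism 2 give a data-parallel size of 2, while 8 GPUs with tensor parallelism 3 is rejected because 3 does not divide 8.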

5. Hugging Face ↔ Megatron-Core model conversion example

CUDA_VISIBLE_DEVICES=0 swift export \
  --model Qwen/Qwen2.5-7B-Instruct \
  --to_mcore true \
  --torch_dtype bfloat16 \
  --output_dir Qwen2.5-7B-Instruct-mcore

6. Fine-tuning run example

PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron sft \
  --load Qwen2.5-7B-Instruct-mcore \
  ... (required options and dataset) ...
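
In the example above, NPROC_PER_NODE=2 matches the two devices listed in CUDA_VISIBLE_DEVICES. Assuming one process per visible GPU is what you want, the count can be derived rather than typed by hand, which avoids a mismatch when the device list changes. The helper name count_visible_gpus is hypothetical and introduced only for this sketch.

```shell
# Hypothetical helper: counts the comma-separated entries in a device list,
# e.g. "0,1" has two entries.
count_visible_gpus() {
  printf '%s\n' "$1" | awk -F',' '{ print NF }'
}

# Default to the two devices used in the example when nothing is set.
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-0,1}"
# Assumption: one training process per visible GPU.
export NPROC_PER_NODE="$(count_visible_gpus "$CUDA_VISIBLE_DEVICES")"

echo "NPROC_PER_NODE=$NPROC_PER_NODE for GPUs $CUDA_VISIBLE_DEVICES"
```

With both variables exported this way, the megatron sft command can be launched without repeating them on the command line.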

7. Checkpoint management and inference

CUDA_VISIBLE_DEVICES=0 swift infer \
  --model megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \
  --stream true \
  --temperature 0 \
  --max_new_tokens 2048
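
The -hf directory passed to swift infer above is presumably the result of converting the trained Megatron-Core checkpoint back to Hugging Face format, the reverse of the conversion in section 5. The sketch below shows what that step might look like. The --mcore_model and --to_hf flags are written as I recall them from the ms-swift documentation, so treat them as assumptions and check them against your installed version; vx-xxx is the article's placeholder and must be replaced with the real run directory.

```shell
# Placeholder path from the article; replace vx-xxx with your actual run directory.
MCORE_CKPT="megatron_output/Qwen2.5-7B-Instruct/vx-xxx"
HF_OUT="${MCORE_CKPT}-hf"

# Only attempt the conversion when swift is installed and the checkpoint exists.
if command -v swift >/dev/null 2>&1 && [ -d "$MCORE_CKPT" ]; then
  # Flag names are assumptions; verify them with the ms-swift docs for your version.
  CUDA_VISIBLE_DEVICES=0 swift export \
    --mcore_model "$MCORE_CKPT" \
    --to_hf true \
    --torch_dtype bfloat16 \
    --output_dir "$HF_OUT"
else
  echo "Skipping conversion: swift not found or $MCORE_CKPT does not exist"
fi
```

The guard makes the snippet safe to run on a machine without the checkpoint; it simply prints a message instead of failing.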

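8. Further reading and practical guides

How to use advanced features such as dataset streaming, sequence packing, and RLHF fine-tuning options (DPO, IPO)
Resolving installation problems through the FAQ and issue tracker (see GitHub)
An at-a-glance summary of the main hyperparameters and parallelism options

For detailed documentation and more tips, see "Megatron-SWIFT Training" in the swift 3.6.4 documentation.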