Questions
How do you use the `accelerate` library to train a model on multiple GPUs?
The Scenario
You are an ML engineer at a self-driving car company. You are training a large computer vision model on a dataset of millions of images. The training is taking weeks to complete on a single GPU, which is slowing down your team’s development cycle.
Your manager has given you access to a new server with 8 NVIDIA A100 GPUs. Your task is to modify the existing training script to use all 8 GPUs and reduce the training time to less than 2 days.
The current training script is a standard PyTorch training loop.
The Challenge
Explain how you would use the Hugging Face accelerate library to modify the training script to run on multiple GPUs. What are the key benefits of using accelerate over other distributed training libraries?
A junior engineer might try to implement distributed training from scratch using PyTorch's `DistributedDataParallel`. Writing the process-group setup, data sharding, and device placement by hand is complex and time-consuming, and it is easy to get subtly wrong. They might also not be aware of the `accelerate` library, which offers a much simpler and more robust path.
A senior engineer would immediately reach for the `accelerate` library. They could explain how to adapt the existing training script with minimal code changes, and articulate the key benefits of `accelerate`: ease of use, support for different kinds of hardware, and tight integration with the Hugging Face ecosystem.
Step 1: Why accelerate?
Before we dive into the code, let’s compare accelerate with PyTorch’s native DistributedDataParallel (DDP).
| Feature | accelerate | DistributedDataParallel (DDP) |
|---|---|---|
| Ease of use | Only a handful of line changes to an existing training loop. | More setup: process groups, samplers, and device placement are written by hand. |
| Hardware support | One API across single GPU, multi-GPU, TPU, and multi-machine setups; switching hardware needs no code changes. | Multi-GPU and multi-node only; TPUs require a separate stack (torch_xla). |
| Integration | Tightly integrated with the Hugging Face ecosystem (Transformers, Datasets, Trainer). | Core PyTorch API with no special Hugging Face integration. |
| Flexibility | High-level API with minimal boilerplate; it wraps DDP under the hood, so you can still drop down to lower-level control when needed. | Low-level API that is very flexible, but you manage process groups, samplers, and device placement yourself. |
For our use case, accelerate is the best choice: it drives all 8 GPUs with only a few code changes, uses DDP under the hood, and integrates cleanly with the Hugging Face ecosystem.
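To make the comparison concrete, here is a rough sketch of the boilerplate that hand-written DDP requires. This is not the project's actual script; names like `build_model`, `train_dataset`, `num_epochs`, and `train_one_epoch` are placeholders for the existing code.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = build_model().to(local_rank)          # build_model() is a placeholder
    model = DDP(model, device_ids=[local_rank])   # wrap the model manually

    # Each process must see a different shard of the data
    sampler = DistributedSampler(train_dataset)   # train_dataset is a placeholder
    train_dataloader = DataLoader(train_dataset, batch_size=32, sampler=sampler)

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)                  # reshuffle the shards each epoch
        train_one_epoch(model, train_dataloader)  # placeholder training loop

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

All of this process-group, sampler, and device bookkeeping is what `accelerator.prepare` hides behind a single call.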
Step 2: Modifying the Training Script
Here’s how we can modify the existing training script to use accelerate:
Original Script:

```python
import torch
from torch.utils.data import DataLoader
# ...

# Standard single-GPU setup: everything is moved to one device by hand
device = torch.device("cuda")
model.to(device)

for epoch in range(num_epochs):
    for batch in train_dataloader:
        optimizer.zero_grad()
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
```

Modified Script:
```python
from accelerate import Accelerator
from torch.utils.data import DataLoader
# ...

# 1. Initialize the accelerator (it detects the distributed setup at launch time)
accelerator = Accelerator()

# 2. Prepare the model, optimizer, and data loaders for distributed training
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

for epoch in range(num_epochs):
    for batch in train_dataloader:
        optimizer.zero_grad()
        # No need to move the batch to a device; the prepared dataloader
        # already delivers tensors on the right device for this process
        outputs = model(**batch)
        loss = outputs.loss
        # 3. Use accelerator.backward() instead of loss.backward()
        accelerator.backward(loss)
        optimizer.step()
```

As you can see, only a few lines change. The `accelerator.prepare` call handles device placement, wraps the model in `DistributedDataParallel`, and shards each dataloader across the GPUs so that every process sees a different slice of the data.
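One follow-up the modified script still needs is checkpointing: with 8 processes running, only one of them should write to disk, and the DDP wrapper has to be removed before saving. Below is a minimal sketch using accelerate's standard helpers; the file name and the decision to save once per epoch are assumptions, and it presumes `torch` is imported as in the original script.

```python
# At the end of each epoch, inside the epoch loop:
accelerator.wait_for_everyone()                    # let all GPUs finish the epoch
unwrapped_model = accelerator.unwrap_model(model)  # strip the DDP wrapper before saving
if accelerator.is_main_process:
    # Hypothetical checkpoint path; adjust to your storage layout
    torch.save(unwrapped_model.state_dict(), f"checkpoint_epoch_{epoch}.pt")
```

The `is_main_process` guard keeps the 8 processes from writing the same file concurrently, and `unwrap_model` ensures the saved state dict can be loaded later without any distributed wrapper.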
Step 3: Launching the Training
To launch the training on all 8 GPUs, we first configure `accelerate` by running `accelerate config` in the terminal. It asks a few questions about our setup (number of machines, number of GPUs, mixed precision, and so on) and writes the answers to a configuration file.
Then, we can launch the training with the following command:
```bash
accelerate launch your_script.py
```

`accelerate` will automatically spawn one process per GPU and handle the inter-process communication for us.
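If you prefer not to rely on a saved config file, the topology can also be passed directly on the command line. Flags such as the ones below exist in current accelerate releases, though exact names and defaults can vary between versions, so treat this as a sketch rather than the canonical invocation:

```bash
# Launch 8 processes, one per A100, without a saved config file
accelerate launch --multi_gpu --num_processes 8 your_script.py
```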
Practice Question
You want to train your model on a TPU. Which of the following would you use?