Training in FP16 and BF16
Hopwise supports mixed-precision training using FP16 (float16) and BF16 (bfloat16) to reduce memory usage and speed up training on compatible hardware.
Overview
Mixed-precision training uses lower precision (16-bit) floating point numbers for most operations while maintaining full precision (32-bit) for critical computations. This provides:
Reduced memory usage: Models use approximately half the GPU memory
Faster training: Modern GPUs have dedicated tensor cores for 16-bit operations
Maintained accuracy: Critical operations still run in full precision (see the sketch below)
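Under the hood, this dtype mixing is what PyTorch's autocast provides; hopwise handles it internally when train_precision is set, but a minimal standalone sketch of the underlying mechanism (an assumption about the internals, shown for intuition; requires a CUDA GPU) looks like this:

import torch

# Inside an autocast region, matmuls and convolutions run in float16,
# while precision-sensitive ops such as reductions stay in float32.
x = torch.randn(64, 64, device="cuda")
w = torch.randn(64, 64, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = x @ w      # executed in float16
    s = y.sum()    # autocast keeps this reduction in float32
print(y.dtype)  # torch.float16
print(s.dtype)  # torch.float32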
Configuration
Enable half-precision training in your configuration:
# FP16 training (recommended for NVIDIA GPUs)
train_precision: fp16
# BF16 training (recommended for newer GPUs like A100, RTX 40 series)
train_precision: bf16
Or via command line:
hopwise train --model BPR --dataset ml-100k --train_precision=fp16
Or in Python:
from hopwise.quick_start import run_hopwise
config_dict = {
    'train_precision': 'fp16',
}
run_hopwise(model='BPR', dataset='ml-100k', config_dict=config_dict)
Choosing Between FP16 and BF16

| Precision | Advantages | Best For |
|---|---|---|
| FP16 | Widest GPU support | V100, RTX 20/30 series |
| BF16 | Better numerical stability, no loss scaling needed | A100, H100, RTX 40 series |
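The stability gap comes from the exponent width: FP16 uses a 5-bit exponent (normal range roughly 6e-5 to 65504), while BF16 keeps FP32's 8-bit exponent and therefore its dynamic range, at the cost of fewer mantissa bits. A quick PyTorch illustration (runs on CPU):

import torch

# FP16's narrow range overflows and underflows easily;
# BF16 covers the same range as FP32.
print(torch.tensor(1e5, dtype=torch.float16))    # inf (FP16 max is ~65504)
print(torch.tensor(1e-8, dtype=torch.float16))   # 0.0 (underflows)
print(torch.tensor(1e5, dtype=torch.bfloat16))   # ~1e5
print(torch.tensor(1e-8, dtype=torch.bfloat16))  # ~1e-8

This is exactly why FP16 needs loss scaling while BF16 does not.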
Note
BF16 requires NVIDIA Ampere architecture (compute capability 8.0+) or newer.
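You can verify this from Python with standard PyTorch calls:

import torch

# Compute capability (8, 0) or higher means Ampere or newer.
print(torch.cuda.get_device_capability())  # e.g. (8, 0) on an A100
print(torch.cuda.is_bf16_supported())      # True if BF16 is usable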
Gradient Scaling (FP16)
When using FP16, hopwise automatically applies gradient scaling to prevent underflow:
# This is handled automatically, but you can customize:
config_dict = {
    'train_precision': 'fp16',
    'grad_scaler_init_scale': 65536.0,  # Initial scale factor
}
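For reference, this corresponds to PyTorch's GradScaler loop; a minimal sketch of what a trainer does with it (an illustration of the mechanism, not hopwise's actual Trainer code):

import torch

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(init_scale=65536.0)

x = torch.randn(4, 10, device="cuda")
y = torch.randn(4, 1, device="cuda")

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = torch.nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()  # scale the loss up so small gradients survive FP16
scaler.step(optimizer)         # unscales gradients; skips the step on inf/NaN
scaler.update()                # adapts the scale factor for the next iteration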
Best Practices
Start with FP32: Verify your model works in full precision first
Monitor loss: Watch for NaN or inf values during training (a helper for this is sketched after this list)
Adjust learning rate: Lower precision can change gradient magnitudes, so the optimal learning rate may differ slightly from your FP32 setting
Use BF16 if available: It’s more stable and doesn’t require gradient scaling
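For the loss-monitoring point above, a tiny helper (check_finite is a hypothetical name, not a hopwise API) lets a run fail fast instead of training through garbage:

import torch

def check_finite(loss: torch.Tensor) -> None:
    # NaN/inf losses usually signal FP16 overflow; stop and investigate.
    if not torch.isfinite(loss).all():
        raise RuntimeError(f"Non-finite loss detected: {loss}")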
Troubleshooting
NaN losses with FP16:
Try reducing the learning rate
Use BF16 instead if your GPU supports it
Increase the gradient scaler’s initial scale
Slow training:
Ensure your GPU has tensor cores (V100, A100, RTX series)
Check that CUDA is properly configured
Out of memory:
Half precision alone should already cut memory use by roughly 50%
For larger models, combine it with gradient checkpointing (see the sketch below)
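Gradient checkpointing is a framework-level feature, and whether a given hopwise model supports it varies; in plain PyTorch the technique looks like this (illustrative sketch):

import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())
x = torch.randn(4, 512, requires_grad=True)

# Activations inside `layer` are recomputed during backward instead of
# stored, trading extra compute for lower peak memory.
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()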