GPU Parallelism

Training Large Models on Multiple GPUs

https://lilianweng.github.io/posts/2021-09-25-train-large/

Bottleneck: data. Solution: improve token efficiency.