12.2. How to control the parallelism of a job?

DeePMD-kit has three levels of parallelism. To get the best performance, one should control the number of threads used by DeePMD-kit and make sure that the product of the parallel numbers is less than or equal to the number of cores available.

12.2.1. MPI (optional)

Parallelism for MPI is optional and used for multiple nodes, multiple GPU cards, or sometimes multiple CPU cores.

To enable MPI support for training, one should install Horovod in advance. Note that the parallelism mode is data parallelism, so the training time per batch is not expected to decrease.
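
As a hedged sketch, a data-parallel training run over 4 processes might look like the following (the GPU indices and the input file input.json are placeholders; consult the Horovod and DeePMD-kit documentation for the exact invocation supported by your version):

CUDA_VISIBLE_DEVICES=0,1,2,3 horovodrun -np 4 dp train input.json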

MPI support for inference is not provided by DeePMD-kit directly, but indirectly through third-party software. For example, LAMMPS can run simulations in parallel using the MPI parallel communication standard with distributed data. Such software has to be built against MPI.

Set the number of processes with:

mpirun -np $num_nodes dp
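
For inference through LAMMPS, the same launcher is applied to the LAMMPS binary itself. A minimal sketch, where the process count and the input script in.lammps are placeholders:

mpirun -np 4 lmp -in in.lammps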

Note that mpirun here should come from the same MPI implementation that was used to build the software. For example, one can compare the output of mpirun --version and lmp -h to check whether mpirun and LAMMPS use the same MPI version.
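
For instance, the comparison could be done as below (the grep pattern is only a heuristic; the exact output format depends on the LAMMPS version):

mpirun --version      # MPI implementation used for launching
lmp -h | grep -i mpi  # MPI information reported by the LAMMPS binary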

Sometimes, $num_nodes and the node information can be provided directly by the HPC scheduler system, if the MPI used here is the same as the MPI used to build the scheduler system. Otherwise, one has to assign this information manually.
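
As a hedged sketch, under a Slurm scheduler the process count is typically taken from the job allocation instead of being written by hand (the resource numbers and the input script are placeholders):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
# srun inherits the allocation, so the process count need not be given explicitly
srun lmp -in in.lammps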

12.2.2. Parallelism between independent operators

For CPU devices, TensorFlow and PyTorch use multiple streams to run independent operators (OPs) in parallel.

export DP_INTER_OP_PARALLELISM_THREADS=3

However, for GPU devices, TensorFlow uses only one compute stream and multiple copy streams. Note that some DeePMD-kit OPs do not have GPU support, so it is still encouraged to set these environmental variables even if one has a GPU.
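
For example, even on a GPU node one may still cap the CPU-side threads used by the OPs without GPU support (a hedged sketch; the device index and thread numbers are placeholders to be tuned):

export DP_INTER_OP_PARALLELISM_THREADS=2
export OMP_NUM_THREADS=4
CUDA_VISIBLE_DEVICES=0 dp train input.json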

12.2.3. Parallelism within individual operators

For CPU devices, DP_INTRA_OP_PARALLELISM_THREADS controls parallelism within TensorFlow (when TensorFlow is built against Eigen) and PyTorch native OPs.

export DP_INTRA_OP_PARALLELISM_THREADS=2

OMP_NUM_THREADS is the number of threads for OpenMP parallelism. It controls parallelism within TensorFlow (when TensorFlow is built upon Intel OneDNN), PyTorch (when PyTorch is built upon OpenMP) native OPs, and DeePMD-kit custom CPU OPs. It may also control parallelism for NumPy when NumPy is built against OpenMP, so one who uses GPUs for training should also pay attention to this environmental variable.

export OMP_NUM_THREADS=2

There are several other environmental variables for OpenMP, such as KMP_BLOCKTIME.
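
A few commonly tuned OpenMP-related variables are shown below as a hedged illustration; whether they help depends on the hardware and the build, so treat the values as a starting point rather than a recommendation:

export KMP_BLOCKTIME=0                              # time (ms) a thread spins after finishing work before sleeping
export KMP_AFFINITY=granularity=fine,compact,1,0    # pin threads to cores (Intel OpenMP runtime)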

See the Intel documentation and the PyTorch documentation for detailed information.

12.2.4. Tune the performance

There is no single parallel configuration that works for all situations, so you are encouraged to tune the configuration yourself after empirical testing.

Here are some empirical examples. If you wish to use 3 cores on each of 2 CPUs on one node, you may set the environmental variables and run DeePMD-kit as follows:

export OMP_NUM_THREADS=3
export DP_INTRA_OP_PARALLELISM_THREADS=3
export DP_INTER_OP_PARALLELISM_THREADS=2
dp train input.json

For a node with 128 cores, it is recommended to start with the following variables:

export OMP_NUM_THREADS=16
export DP_INTRA_OP_PARALLELISM_THREADS=16
export DP_INTER_OP_PARALLELISM_THREADS=8

Again, in general, one should make sure the product of the parallel numbers is less than or equal to the number of cores available. In the above case, \(16 \times 8 = 128\), so threads will not compete with each other.
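
As a hedged starting point, such a configuration can be derived from the core count reported by the system, keeping the product of the inter-op and intra-op thread numbers equal to the number of cores (the inter-op value of 8 is a placeholder to be tuned):

NCORES=$(nproc)  # logical cores available on this node
export DP_INTER_OP_PARALLELISM_THREADS=8
export DP_INTRA_OP_PARALLELISM_THREADS=$((NCORES / DP_INTER_OP_PARALLELISM_THREADS))
export OMP_NUM_THREADS=$DP_INTRA_OP_PARALLELISM_THREADS
dp train input.json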