5.2. Advanced options

In this section, we will take $deepmd_source_dir/examples/water/se_e2_a/input.json as an example of the input file.

5.2.1. Learning rate

See Learning rate for detailed documentation on learning rate schedules.

5.2.2. Optimizer

The optimizer section in input.json is given as follows:

    "optimizer" :{
        "type": "Adam",
        "_comment": "that's all"
    }

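The supported optimizer types depend on the backend:
- TensorFlow/Paddle: only Adam is supported.
- PyTorch: Adam, AdamW, LKF, AdaMuon, and HybridMuon are supported.

adam_beta1 and adam_beta2 set the exponential decay rates of the first- and second-moment estimates in Adam and AdamW.
In Adam, weight_decay is applied as an L2 penalty added to the gradient, whereas in AdamW it is decoupled from the gradient and applied directly to the parameters. The TensorFlow backend does not support weight decay in Adam.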
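
The difference between the two weight decay modes can be illustrated with a minimal sketch of a single textbook Adam/AdamW update on one scalar parameter. This is not DeePMD-kit's code; the function name adam_step and its arguments are hypothetical, and beta1 and beta2 stand in for adam_beta1 and adam_beta2.

```python
import math


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One textbook Adam (decoupled=False) or AdamW (decoupled=True) step
    for a single scalar parameter w. Returns the updated (w, m, v)."""
    if not decoupled:
        # Adam: weight decay acts as an L2 penalty folded into the gradient.
        grad = grad + weight_decay * w
    # beta1 and beta2 are the moment decay rates (adam_beta1, adam_beta2).
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
    if decoupled:
        # AdamW: shrink the parameter directly, bypassing the moments.
        w = w - lr * weight_decay * w
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Zero loss gradient, so only weight decay moves the parameter.
w_adam, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.01, weight_decay=0.1)
w_adamw, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.01,
                          weight_decay=0.1, decoupled=True)
print(w_adam, w_adamw)
```

With the L2 penalty, the decay term passes through Adam's normalization and the parameter moves by roughly the full learning rate (to about 0.99), while the decoupled variant shrinks it by lr * weight_decay (to 0.999).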

5.2.3. Training parameters

Other training parameters are given in the training section.

    "training": {
        "training_data": {
            "systems":    ["../data_water/data_0/", "../data_water/data_1/", "../data_water/data_2/"],
            "batch_size": "auto"
        },
        "validation_data":{
            "systems":    ["../data_water/data_3"],
            "batch_size": 1,
            "numb_btch":  3
        },
        "mixed_precision": {
            "output_prec":  "float32",
            "compute_prec": "float16"
        },

        "numb_steps": 1000000,
        "seed":       1,
        "disp_file":  "lcurve.out",
        "disp_freq":  100,
        "save_freq":  1000
    }

The sections training_data and validation_data give the training dataset and validation dataset, respectively. Taking the training dataset for example, the keys are explained below:

systems provides the paths of the training data systems. DeePMD-kit allows you to provide multiple systems with different numbers of atoms. This key can be a str or a list:
- str: systems should be a valid path. It can be a system directory path (containing type.raw) or a parent directory path that is searched recursively for all system subdirectories.
- list: systems gives a list of paths. Each string item in the list is processed the same way as individual string inputs, i.e., each path can be a system directory or a parent directory to recursively search for all system subdirectories.
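
The path expansion described above can be sketched as follows. This is a minimal illustration rather than DeePMD-kit's own implementation; the helper names find_systems and expand_systems are hypothetical, and the only rule assumed is the one stated above (a system directory is one that contains type.raw).

```python
from pathlib import Path


def find_systems(path):
    """Return the sorted system directories at or below path.

    A directory counts as a system if it contains a type.raw file.
    rglob also matches a type.raw located directly in path itself.
    """
    root = Path(path)
    return sorted(p.parent for p in root.rglob("type.raw"))


def expand_systems(systems):
    """Accept a str or a list of str and expand each entry the same way."""
    if isinstance(systems, str):
        systems = [systems]
    found = []
    for entry in systems:
        found.extend(find_systems(entry))
    return found
```

For example, expand_systems("../data_water") would return every subdirectory of ../data_water that holds a type.raw file.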
At each training step, DeePMD-kit randomly picks batch_size frame(s) from one of the systems. By default, the probability of using a system is proportional to the number of batches in the system. More options are available for automatically determining the probability of using systems. One can set the key auto_prob to:
- "prob_uniform": all systems are used with the same probability.
- "prob_sys_size": the probability of using a system is proportional to its size (number of frames).
- "prob_sys_size; sidx_0:eidx_0:w_0; sidx_1:eidx_1:w_1;...": the list of systems is divided into blocks. Block i contains the systems from index sidx_i up to, but not including, eidx_i. The probability of using a system from block i is proportional to w_i. Within one block, the probability of using a system is proportional to its size.
An example of using "auto_prob" is given below. Block 0 contains systems[0] and systems[1], and block 1 contains systems[2]. The probability of using systems[2] is 0.4, and the probabilities of using systems[0] and systems[1] sum to 0.6. If systems[1] has twice as many frames as systems[0], then the probability of using systems[1] is 0.4 and that of systems[0] is 0.2.

    "training_data": {
        "systems":    ["../data_water/data_0/", "../data_water/data_1/", "../data_water/data_2/"],
        "auto_prob":  "prob_sys_size; 0:2:0.6; 2:3:0.4",
        "batch_size": "auto"
    }
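
The arithmetic behind this example can be checked with a short sketch. It is an illustration of the rule as described above, not DeePMD-kit's parser; the function names and the frame counts (100, 200, 300) are hypothetical, and eidx is assumed to be exclusive, as the example implies.

```python
def parse_auto_prob(rule):
    """Split "prob_sys_size; 0:2:0.6; 2:3:0.4" into (sidx, eidx, weight) tuples."""
    parts = [p.strip() for p in rule.split(";") if p.strip()]
    blocks = []
    for part in parts[1:]:  # parts[0] is the "prob_sys_size" keyword
        sidx, eidx, weight = part.split(":")
        blocks.append((int(sidx), int(eidx), float(weight)))
    return blocks or None


def auto_prob_sys_size(n_frames, blocks=None):
    """Per-system probabilities for a "prob_sys_size; s:e:w; ..." style rule.

    n_frames: number of frames in each system.
    blocks: list of (sidx, eidx, weight); eidx is treated as exclusive.
            None means one block covering every system.
    """
    if blocks is None:
        blocks = [(0, len(n_frames), 1.0)]
    total_weight = sum(w for _, _, w in blocks)
    probs = [0.0] * len(n_frames)
    for sidx, eidx, weight in blocks:
        block_frames = sum(n_frames[sidx:eidx])
        for i in range(sidx, eidx):
            # Block share, split inside the block in proportion to size.
            probs[i] = (weight / total_weight) * n_frames[i] / block_frames
    return probs


blocks = parse_auto_prob("prob_sys_size; 0:2:0.6; 2:3:0.4")
probs = auto_prob_sys_size([100, 200, 300], blocks)
print(probs)
```

With these frame counts, the three probabilities come out at approximately 0.2, 0.4, and 0.4, matching the description above.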

The probability of using each system can also be specified explicitly with the key sys_probs, a list with one entry per system. For example:

    "training_data": {
        "systems":    ["../data_water/data_0/", "../data_water/data_1/", "../data_water/data_2/"],
        "sys_probs":  [0.5, 0.3, 0.2],
        "batch_size": "auto:32"
    }

The key batch_size specifies the number of frames used to train or validate the model in a training step. It can be set to:
- list: the length of which is the same as the systems. The batch size of each system is given by the elements of the list.
- int: all systems use the same batch size.
- "auto": the same as "auto:32", see "auto:N"
- "auto:N": automatically determines the batch size so that the batch_size times the number of atoms in the system is no less than N.
- "max:N": automatically determines the batch size so that the batch_size times the number of atoms in the system is no more than N. The minimum batch size is 1. Supported backends: PyTorch, Paddle
- "filter:N": the same as "max:N" but removes the systems with the number of atoms larger than N from the data set. Throws an error if no system is left in a dataset. Supported backends: PyTorch, Paddle
The key numb_btch in validation_data gives the number of batches used for model validation. Note that the batches may not come from the same system.
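
The automatic batch size rules can be written out as a minimal sketch. It only restates the rules listed above; the helper names resolve_batch_size and filter_systems are hypothetical and are not part of DeePMD-kit.

```python
import math


def resolve_batch_size(rule, natoms):
    """Batch size for one system with natoms atoms.

    rule is an int, "auto", "auto:N", or "max:N".
    """
    if isinstance(rule, int):
        return rule
    if rule == "auto":
        rule = "auto:32"
    kind, _, value = rule.partition(":")
    n = int(value)
    if kind == "auto":
        # Smallest batch size with batch_size * natoms >= N.
        return math.ceil(n / natoms)
    if kind == "max":
        # Largest batch size with batch_size * natoms <= N, but at least 1.
        return max(1, n // natoms)
    raise ValueError(f"unsupported batch_size rule: {rule!r}")


def filter_systems(natoms_per_system, n):
    """Apply the "filter:N" rule: drop systems with more than N atoms,
    then size the remaining ones with the "max:N" rule."""
    kept = {name: na for name, na in natoms_per_system.items() if na <= n}
    if not kept:
        raise ValueError("no system is left after filtering")
    return {name: resolve_batch_size(f"max:{n}", na) for name, na in kept.items()}
```

For a system with 6 atoms, "auto:32" gives a batch size of 6 (36 atoms per batch) while "max:32" gives 5 (30 atoms per batch); for a system with 192 atoms, both give 1.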

The section mixed_precision specifies the mixed precision settings, which will enable the mixed precision training workflow for DeePMD-kit. The keys are explained below:

- output_prec: the precision used in the output tensors. Only float32 is currently supported.
- compute_prec: the precision used in the computing tensors. Only float16 is currently supported.

Mixed precision training has several limitations:
- Only the se_e2_a descriptor is supported by the mixed precision training workflow.
- The precision of the embedding net and the fitting net is forced to float32.

Other keys in the training section are explained below:

- numb_steps: the number of training steps.
- seed: the random seed for getting frames from the training data set.
- disp_file: the file for printing the learning curve.
- disp_freq: the frequency of printing the learning curve, in units of training steps.
- save_freq: the frequency of saving checkpoints.
- save_dir: the directory where periodic checkpoints are written (PyTorch backend). It is created recursively if missing, while the model.ckpt.pt symlinks and the checkpoint pointer file stay in the working directory. Defaults to the working directory.
- ckpt_keep_ratio: an alternative to max_ckpt_keep (PyTorch backend) that keeps a sliding window of the ceil(ckpt_keep_ratio * ceil(numb_steps / save_freq)) most recent checkpoints, i.e. the final ckpt_keep_ratio fraction of the run by step. When set, it overrides max_ckpt_keep (and ema_ckpt_keep), and it works the same whether the run length is given by numb_steps or numb_epoch.
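
As a quick check of the ckpt_keep_ratio formula, the sketch below evaluates it directly. The function name checkpoints_to_keep is hypothetical; only the expression itself comes from the description above.

```python
import math


def checkpoints_to_keep(numb_steps, save_freq, ckpt_keep_ratio):
    """Evaluate ceil(ckpt_keep_ratio * ceil(numb_steps / save_freq))."""
    total_saves = math.ceil(numb_steps / save_freq)  # checkpoints written in the run
    return math.ceil(ckpt_keep_ratio * total_saves)  # size of the sliding window


# 1000 checkpoints are written in total; a ratio of 0.25 keeps the last 250.
print(checkpoints_to_keep(numb_steps=1000000, save_freq=1000, ckpt_keep_ratio=0.25))
```

With the numb_steps and save_freq values from the training example above, a ratio of 0.25 keeps the checkpoints from the last quarter of the run.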

5.2.4. Options and environment variables

Several command line options can be passed to dp train, which can be checked with

$ dp train --help

The command prints the following help message:

usage: dp train [-h] [-v {DEBUG,3,INFO,2,WARNING,1,ERROR,0}] [-l LOG_PATH]
                [-m {master,collect,workers}]
                [-i INIT_MODEL | -r RESTART | -f INIT_FRZ_MODEL | -t FINETUNE]
                [--use-pretrain-script] [-o OUTPUT] [--skip-neighbor-stat]
                [--model-branch MODEL_BRANCH] [--force-load]
                INPUT

positional arguments:
  INPUT                 the input parameter file in json or yaml format

options:
  -h, --help            show this help message and exit
  -v {DEBUG,3,INFO,2,WARNING,1,ERROR,0}, --log-level {DEBUG,3,INFO,2,WARNING,1,ERROR,0}
                        set verbosity level by string or number, 0=ERROR, 1=WARNING, 2=INFO and 3=DEBUG (default: INFO)
  -l LOG_PATH, --log-path LOG_PATH
                        set log file to log messages to disk, if not specified, the logs will only be output to console (default: None)
  -m {master,collect,workers}, --mpi-log {master,collect,workers}
                        Set the manner of logging when running with MPI. 'master' logs only on main process, 'collect' broadcasts logs from workers to master and 'workers' means each process will output its own log (default: master)
  -i INIT_MODEL, --init-model INIT_MODEL
                        Initialize the model by the provided path prefix of checkpoint files. (default: None)
  -r RESTART, --restart RESTART
                        Restart the training from the provided path prefix of checkpoint files. (default: None)
  -f INIT_FRZ_MODEL, --init-frz-model INIT_FRZ_MODEL
                        Initialize the training from the frozen model. (default: None)
  -t FINETUNE, --finetune FINETUNE
                        Finetune the frozen pretrained model. (default: None)
  --use-pretrain-script
                        When performing fine-tuning or init-model, utilize the model parameters provided by the script of the pretrained model rather than relying on user input. It is important to note that in TensorFlow, this behavior is the default and cannot be modified for fine-tuning.  (default: False)
  -o OUTPUT, --output OUTPUT
                        The output file of the parameters used in training. (default: out.json)
  --skip-neighbor-stat  Skip calculating neighbor statistics. Sel checking, automatic sel, and model compression will be disabled. (default: False)
  --model-branch MODEL_BRANCH
                        (Supported backend: PyTorch) Model branch chosen for fine-tuning if multi-task. If not specified, it will re-init the fitting net. (default: )
  --force-load          (Supported backend: PyTorch) Force load from ckpt, other missing tensors will init from scratch (default: False)

examples:
    dp train input.json
    dp train input.json --restart model.ckpt
    dp train input.json --init-model model.ckpt

--init-model model.ckpt initializes the training from an existing model stored under the checkpoint path prefix model.ckpt. The network architectures must match.

--restart model.ckpt continues the training from the checkpoint model.ckpt.

--init-frz-model frozen_model.pb initializes the training from an existing model stored in frozen_model.pb.

--skip-neighbor-stat skips the calculation of neighbor statistics, which is useful when that calculation is a performance concern. As a consequence, sel checking, automatic sel, and model compression are disabled.
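
When dp train is launched from a script, the usage line above implies that at most one of --init-model, --restart, --init-frz-model, and --finetune may be given. The sketch below assembles an argument list under that constraint. The helper build_train_command is hypothetical and only builds the command; it does not run it.

```python
import shlex


def build_train_command(input_file, init_model=None, restart=None,
                        init_frz_model=None, finetune=None,
                        skip_neighbor_stat=False):
    """Return the argument list for a dp train invocation."""
    start_options = {
        "--init-model": init_model,
        "--restart": restart,
        "--init-frz-model": init_frz_model,
        "--finetune": finetune,
    }
    chosen = [(flag, value) for flag, value in start_options.items()
              if value is not None]
    # The four start options are mutually exclusive in the usage line.
    if len(chosen) > 1:
        raise ValueError("choose at most one of: " + ", ".join(start_options))
    cmd = ["dp", "train", input_file]
    for flag, value in chosen:
        cmd += [flag, value]
    if skip_neighbor_stat:
        cmd.append("--skip-neighbor-stat")
    return cmd


print(shlex.join(build_train_command("input.json", restart="model.ckpt")))
```

The resulting list can be passed to subprocess.run, which avoids shell quoting problems with unusual path names.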

To maximize performance, control the number of threads as described in FAQ: How to control the parallelism of a job. See Runtime environment variables for all runtime environment variables.

5.2.5. Adjust `sel` of a frozen model

One can use the --init-frz-model option to adjust (increase or decrease) the sel of an existing model. First, adjust sel in input.json, for example from [46, 92] to [23, 46]:

"model": {
    "descriptor": {
        "sel": [23, 46]
    }
}

To obtain the new model at once, numb_steps should be set to zero:

"training": {
    "numb_steps": 0
}

Then, one can initialize the training from the frozen model and freeze the new model at once:

dp train input.json --init-frz-model frozen_model.pb
dp freeze -o frozen_model_adjusted_sel.pb

The two models should give the same result for any input that satisfies the sel constraints of both models.
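
The two input changes described in this section can also be applied programmatically. The sketch below is a minimal illustration on an in-memory dictionary; the helper adjust_sel_config is hypothetical, and the sample configuration contains only the keys relevant here.

```python
import copy
import json


def adjust_sel_config(config, new_sel):
    """Return a copy of config with the new sel and numb_steps set to 0."""
    new_config = copy.deepcopy(config)  # leave the caller's dict untouched
    new_config["model"]["descriptor"]["sel"] = list(new_sel)
    new_config["training"]["numb_steps"] = 0
    return new_config


original = {
    "model": {"descriptor": {"type": "se_e2_a", "sel": [46, 92]}},
    "training": {"numb_steps": 1000000},
}
adjusted = adjust_sel_config(original, [23, 46])
print(json.dumps(adjusted, indent=4))
```

In practice, the dictionary would be loaded from input.json with json.load and written back with json.dump before running dp train with --init-frz-model.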

Note: At this time, this feature is only supported by the se_e2_a descriptor with set_davg_zero enabled, or by a hybrid descriptor composed of such descriptors.