5.3. Learning rate

DeePMD-kit supports three learning rate schedules:

exp: Exponential decay with optional stepped or smooth mode
cosine: Cosine annealing for smooth decay curve
wsd: Warmup-stable-decay with configurable final decay rule

All schedules support an optional warmup phase where the learning rate gradually increases from a small initial value to the target start_lr.

This page focuses on schedule behavior, examples, and formulas. For the canonical argument definitions, see learning_rate.

5.3.1. Quick Start

5.3.1.1. Exponential decay (default)

"learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr": 1e-6,
    "decay_steps": 5000
}

5.3.1.2. Cosine annealing

"learning_rate": {
    "type": "cosine",
    "start_lr": 0.001,
    "stop_lr": 1e-6
}

5.3.1.3. Warmup-stable-decay

"learning_rate": {
    "type": "wsd",
    "start_lr": 0.001,
    "stop_lr": 1e-6,
    "decay_phase_ratio": 0.1
}

5.3.2. Common parameters

Use learning_rate as the canonical parameter reference. This page only highlights the argument combinations that matter when choosing a schedule:

Shared by exp, cosine, and wsd: start_lr plus exactly one of stop_lr or stop_lr_ratio.
Optional warmup for all schedules: warmup_steps or warmup_ratio, with optional warmup_start_factor.
Optional distributed scaling for all schedules: scale_by_worker.
Additional options for exp: decay_steps, decay_rate, and smooth.
Additional options for wsd: decay_phase_ratio and decay_type.
cosine has no extra schedule-specific arguments beyond the shared ones.
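
As an illustration of the argument combinations listed above, the checks can be sketched in Python roughly as follows. The helper name check_lr_config is hypothetical and not part of DeePMD-kit; DeePMD-kit performs its own argument validation, which may report problems differently.

```python
def check_lr_config(cfg: dict) -> list:
    """Return a list of problems found in a learning_rate dict (empty if none)."""
    problems = []
    if "start_lr" not in cfg:
        problems.append("start_lr is required")
    # Exactly one of stop_lr / stop_lr_ratio must be given.
    n_stop = sum(key in cfg for key in ("stop_lr", "stop_lr_ratio"))
    if n_stop != 1:
        problems.append("give exactly one of stop_lr or stop_lr_ratio")
    # warmup_steps and warmup_ratio are both optional but mutually exclusive.
    if "warmup_steps" in cfg and "warmup_ratio" in cfg:
        problems.append("warmup_steps and warmup_ratio are mutually exclusive")
    return problems


ok = {"type": "exp", "start_lr": 0.001, "stop_lr": 1e-6, "decay_steps": 5000}
print(check_lr_config(ok))  # []
```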

See Mathematical Theory for complete formulas.

5.3.3. Exponential Decay Schedule

The exponential decay schedule reduces the learning rate exponentially over training steps. It is the default schedule when type is omitted.

5.3.3.1. Stepped vs smooth mode

When smooth is set to true, the learning rate decays at every step instead of dropping once every decay_steps steps. The resulting curve is more gradual and is similar to PyTorch’s ExponentialLR, whereas the default stepped mode (smooth: false) is similar to PyTorch’s StepLR.

If decay_rate is not explicitly provided, DeePMD-kit computes it from start_lr and the requested final learning rate so that the schedule reaches the target by numb_steps. The exact expression is given in Mathematical Theory.
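
The following sketch evaluates that default decay rate using the expression from Mathematical Theory. The function name and the example value numb_steps = 1,000,000 are assumptions made for illustration; this is not DeePMD-kit's own code.

```python
def implied_decay_rate(start_lr, stop_lr, decay_steps, numb_steps, warmup_steps=0):
    """Decay rate r such that start_lr * r**(tau_decay / decay_steps) == stop_lr."""
    tau_decay = numb_steps - warmup_steps  # steps available after warmup
    return (stop_lr / start_lr) ** (decay_steps / tau_decay)


r = implied_decay_rate(start_lr=0.001, stop_lr=1e-6, decay_steps=5000, numb_steps=1_000_000)
final_lr = 0.001 * r ** (1_000_000 / 5000)  # 200 decay periods of 5000 steps each
print(f"r = {r:.4f}, final lr = {final_lr:.2e}")
```

With these values r is roughly 0.966, so each block of 5000 steps lowers the learning rate by about 3.4%, and 200 such blocks bring it from 0.001 down to 1e-6.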

5.3.3.2. Examples

Basic exponential decay without warmup:

"learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr": 1e-6,
    "decay_steps": 5000
}

Using stop_lr_ratio:

"learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr_ratio": 1e-3,
    "decay_steps": 5000
}

This is equivalent to stop_lr: 1e-6 (i.e., 0.001 * 1e-3).
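
The arithmetic behind this equivalence can be sketched as below. The helper resolve_stop_lr is hypothetical and only mirrors the rule that exactly one of stop_lr or stop_lr_ratio is given.

```python
import math


def resolve_stop_lr(start_lr, stop_lr=None, stop_lr_ratio=None):
    """Return the final learning rate from exactly one of stop_lr / stop_lr_ratio."""
    if (stop_lr is None) == (stop_lr_ratio is None):
        raise ValueError("give exactly one of stop_lr or stop_lr_ratio")
    # stop_lr_ratio is interpreted relative to start_lr.
    return stop_lr if stop_lr is not None else start_lr * stop_lr_ratio


# Both spellings describe the same final learning rate (up to floating-point rounding).
same = math.isclose(resolve_stop_lr(0.001, stop_lr_ratio=1e-3),
                    resolve_stop_lr(0.001, stop_lr=1e-6))
print(same)  # True
```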

With warmup (using warmup_steps):

"learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr": 1e-6,
    "decay_steps": 5000,
    "warmup_steps": 10000,
    "warmup_start_factor": 0.1
}

The learning rate starts at 0.0001 (i.e., warmup_start_factor * start_lr), increases linearly to start_lr over 10,000 steps, and then decays exponentially.

With warmup (using warmup_ratio):

"learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr_ratio": 1e-3,
    "decay_steps": 5000,
    "warmup_ratio": 0.05
}

If numb_steps is 1,000,000, warmup lasts 50,000 steps. The learning rate starts at 0.0 (the default warmup_start_factor is 0.0) and increases to start_lr.
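
A small sketch of how a warmup length could be derived from either option. The helper name is hypothetical, and truncating to an integer is an assumption; DeePMD-kit's exact rounding rule may differ.

```python
def resolve_warmup_steps(numb_steps, warmup_steps=None, warmup_ratio=None):
    """Warmup length in steps; warmup_steps and warmup_ratio are mutually exclusive."""
    if warmup_steps is not None and warmup_ratio is not None:
        raise ValueError("warmup_steps and warmup_ratio are mutually exclusive")
    if warmup_ratio is not None:
        # Assumption: truncate to an integer number of steps.
        return int(warmup_ratio * numb_steps)
    return warmup_steps or 0  # no warmup when neither option is given


print(resolve_warmup_steps(1_000_000, warmup_ratio=0.05))  # 50000
```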

Smooth exponential decay (with smooth):

"learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr": 1e-6,
    "decay_steps": 5000,
    "smooth": true
}

With smooth set to true, the learning rate changes at every step rather than once every decay_steps steps, similar to PyTorch’s ExponentialLR; the default stepped mode (smooth: false) is similar to PyTorch’s StepLR.

5.3.4. Cosine Annealing Schedule

The cosine annealing schedule smoothly decreases the learning rate following a cosine curve. It can converge better than exponential decay, although this depends on the model and dataset.

After warmup, the learning rate follows a cosine curve from start_lr to stop_lr or the value implied by stop_lr_ratio. The exact expression is given in Mathematical Theory.

5.3.4.1. Examples

Basic cosine annealing:

"learning_rate": {
    "type": "cosine",
    "start_lr": 0.001,
    "stop_lr": 1e-6
}

Using stop_lr_ratio:

"learning_rate": {
    "type": "cosine",
    "start_lr": 0.001,
    "stop_lr_ratio": 1e-3
}

With warmup:

"learning_rate": {
    "type": "cosine",
    "start_lr": 0.001,
    "stop_lr": 1e-6,
    "warmup_steps": 5000,
    "warmup_start_factor": 0.0
}

5.3.5. Warmup-Stable-Decay Schedule

The warmup-stable-decay (wsd) schedule keeps the learning rate at start_lr for most of the post-warmup training steps and then applies a shorter final decay phase.

The length of the final decay phase is controlled by decay_phase_ratio. The remaining post-warmup steps form the stable phase. The decay rule is selected by decay_type, which supports inverse_linear (default), cosine, and linear.

5.3.5.1. Examples

Basic WSD with default inverse-linear decay:

"learning_rate": {
    "type": "wsd",
    "start_lr": 0.001,
    "stop_lr": 1e-6,
    "decay_phase_ratio": 0.1
}

This configuration uses a stable phase for most of the post-warmup training and reserves the final 10% of total training steps for the decay phase.
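
To make the split concrete, the sketch below computes the phase lengths using the definitions from Mathematical Theory. It assumes that the decay-phase length is taken as a fraction of the total step count numb_steps, as the sentence above describes; the function name and the value numb_steps = 1,000,000 are illustrative only.

```python
import math


def wsd_phase_lengths(numb_steps, decay_phase_ratio, warmup_steps=0):
    """Return (stable_steps, decay_phase_steps) for a WSD schedule."""
    # Decay phase: a fraction of the total number of training steps.
    decay_phase_steps = math.floor(decay_phase_ratio * numb_steps)
    # Stable phase: whatever remains after warmup and the decay phase.
    stable_steps = (numb_steps - warmup_steps) - decay_phase_steps
    return stable_steps, decay_phase_steps


print(wsd_phase_lengths(1_000_000, 0.1))                     # (900000, 100000)
print(wsd_phase_lengths(1_000_000, 0.1, warmup_steps=5000))  # (895000, 100000)
```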

Using stop_lr_ratio:

"learning_rate": {
    "type": "wsd",
    "start_lr": 0.001,
    "stop_lr_ratio": 1e-3,
    "decay_phase_ratio": 0.1
}

This is equivalent to stop_lr: 1e-6 (i.e., 0.001 * 1e-3).

With warmup:

"learning_rate": {
    "type": "wsd",
    "start_lr": 0.001,
    "stop_lr": 1e-6,
    "decay_phase_ratio": 0.1,
    "warmup_steps": 5000,
    "warmup_start_factor": 0.0
}

Warmup first increases the learning rate to start_lr. After warmup, the schedule holds start_lr during the stable phase and then decays to stop_lr during the final decay phase.

WSD with cosine decay phase:

"learning_rate": {
    "type": "wsd",
    "start_lr": 0.001,
    "stop_lr": 1e-6,
    "decay_phase_ratio": 0.1,
    "decay_type": "cosine"
}

WSD with linear decay phase:

"learning_rate": {
    "type": "wsd",
    "start_lr": 0.001,
    "stop_lr": 1e-6,
    "decay_phase_ratio": 0.1,
    "decay_type": "linear"
}

5.3.6. Warmup Mechanism

Warmup is a technique to stabilize training in early stages by gradually increasing the learning rate from warmup_start_factor * start_lr to start_lr.

You can specify warmup duration using either warmup_steps (absolute) or warmup_ratio (relative to numb_steps). These are mutually exclusive.

The exact piecewise warmup formula is given in Mathematical Theory.

5.3.7. Mathematical Theory

5.3.7.1. Notation

Symbol	Description
\(\tau\)	Global step index (0-indexed)
\(\tau^{\text{warmup}}\)	Number of warmup steps
\(\tau^{\text{decay}}\)	Number of decay steps (numb_steps - warmup_steps)
\(\gamma^0\)	start_lr: Learning rate at start of decay phase
\(\gamma^{\text{stop}}\)	stop_lr: Learning rate at end of training
\(f^{\text{warmup}}\)	warmup_start_factor: Initial warmup LR factor
\(s\)	decay_steps: Decay period for exponential schedule
\(r\)	decay_rate: Decay rate for exponential schedule
\(\rho^{\text{wsd}}\)	decay_phase_ratio: Ratio of WSD decay phase
\(\tau^{\text{wsd}}\)	Number of WSD decay-phase steps
\(\tau^{\text{stable}}\)	Number of WSD stable-phase steps
\(\hat{\tau}\)	Normalized progress within the WSD decay phase

5.3.7.2. Complete warmup formula

For steps \(0 \leq \tau < \tau^{\text{warmup}}\):

\[\gamma(\tau) = f^{\text{warmup}} \cdot \gamma^0 + \frac{(1 - f^{\text{warmup}}) \cdot \gamma^0}{\tau^{\text{warmup}}} \cdot \tau\]

5.3.7.3. Exponential decay (stepped mode)

For steps \(\tau \geq \tau^{\text{warmup}}\):

\[\gamma(\tau) = \gamma^0 \cdot r^{\left\lfloor \frac{\tau - \tau^{\text{warmup}}}{s} \right\rfloor}\]

where, if decay_rate is not given explicitly, the decay rate \(r\) is computed as:

\[r = \left(\frac{\gamma^{\text{stop}}}{\gamma^0}\right)^{\frac{s}{\tau^{\text{decay}}}}\]

5.3.7.4. Exponential decay (smooth mode)

For steps \(\tau \geq \tau^{\text{warmup}}\):

\[\gamma(\tau) = \gamma^0 \cdot r^{\frac{\tau - \tau^{\text{warmup}}}{s}}\]
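
The two modes differ only in whether the exponent is floored. The sketch below implements both formulas side by side; the function name and the example values (decay_rate = 0.5, decay_steps = 100) are chosen purely for illustration.

```python
import math


def exp_lr(step, start_lr, decay_rate, decay_steps, warmup_steps=0, smooth=False):
    """Post-warmup exponential decay; assumes step >= warmup_steps."""
    progress = (step - warmup_steps) / decay_steps
    # Stepped mode floors the exponent, so the rate only changes every decay_steps steps.
    exponent = progress if smooth else math.floor(progress)
    return start_lr * decay_rate ** exponent


for step in (0, 99, 100, 150):
    stepped = exp_lr(step, 0.001, 0.5, 100)
    smoothed = exp_lr(step, 0.001, 0.5, 100, smooth=True)
    print(step, f"{stepped:.3e}", f"{smoothed:.3e}")
```

In stepped mode the value at step 99 is still start_lr and drops to half at step 100, while in smooth mode it has already decayed to nearly half by step 99. Both modes agree whenever the step offset is a multiple of decay_steps.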

5.3.7.5. Cosine annealing

For steps \(\tau \geq \tau^{\text{warmup}}\):

\[\gamma(\tau) = \gamma^{\text{stop}} + \frac{\gamma^0 - \gamma^{\text{stop}}}{2} \left(1 + \cos\left(\frac{\pi \cdot (\tau - \tau^{\text{warmup}})}{\tau^{\text{decay}}}\right)\right)\]

Equivalently, using \(\alpha = \gamma^{\text{stop}} / \gamma^0\):

\[\gamma(\tau) = \gamma^0 \cdot \left[\alpha + \frac{1 - \alpha}{2}\left(1 + \cos\left(\frac{\pi \cdot (\tau - \tau^{\text{warmup}})}{\tau^{\text{decay}}}\right)\right)\right]\]
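
As a quick numerical check that the two forms agree, both can be written out in Python. This is a sketch with hypothetical function names, using numb_steps = 1000 only to keep the numbers small.

```python
import math


def cosine_lr(step, start_lr, stop_lr, numb_steps, warmup_steps=0):
    """Post-warmup cosine annealing; assumes warmup_steps <= step <= numb_steps."""
    tau_decay = numb_steps - warmup_steps
    phase = math.pi * (step - warmup_steps) / tau_decay
    return stop_lr + 0.5 * (start_lr - stop_lr) * (1.0 + math.cos(phase))


def cosine_lr_alpha(step, start_lr, stop_lr, numb_steps, warmup_steps=0):
    """The same schedule written with alpha = stop_lr / start_lr."""
    alpha = stop_lr / start_lr
    tau_decay = numb_steps - warmup_steps
    phase = math.pi * (step - warmup_steps) / tau_decay
    return start_lr * (alpha + 0.5 * (1.0 - alpha) * (1.0 + math.cos(phase)))


for step in (0, 500, 1000):
    a = cosine_lr(step, 0.001, 1e-6, 1000)
    b = cosine_lr_alpha(step, 0.001, 1e-6, 1000)
    print(step, f"{a:.4e}", math.isclose(a, b))
```

The schedule starts at start_lr, passes through the midpoint (start_lr + stop_lr) / 2 halfway through the decay steps, and ends at stop_lr.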

5.3.7.6. Warmup-stable-decay

For WSD, define the final decay-phase length as follows, where \(\tau^{\text{stop}}\) denotes the total number of training steps (numb_steps):

\[\tau^{\text{wsd}} = \left\lfloor \rho^{\text{wsd}} \cdot \tau^{\text{stop}} \right\rfloor\]

and the stable-phase length as:

\[\tau^{\text{stable}} = \tau^{\text{decay}} - \tau^{\text{wsd}}\]

For steps in the stable phase,

\[\gamma(\tau) = \gamma^0, \qquad \tau^{\text{warmup}} \leq \tau < \tau^{\text{warmup}} + \tau^{\text{stable}}\]

For steps in the final decay phase, define the normalized decay progress:

\[\hat{\tau} = \frac{ \tau - \tau^{\text{warmup}} - \tau^{\text{stable}} }{ \tau^{\text{wsd}} }\]

Then the decay-phase formulas are:

Inverse-linear decay (decay_type: "inverse_linear"):

\[\gamma(\tau) = \frac{1}{ \hat{\tau} / \gamma^{\text{stop}} + (1 - \hat{\tau}) / \gamma^0 }\]

Cosine decay (decay_type: "cosine"):

\[\gamma(\tau) = \gamma^{\text{stop}} + \frac{\gamma^0 - \gamma^{\text{stop}}}{2} \left(1 + \cos\left(\pi \hat{\tau}\right)\right)\]

Linear decay (decay_type: "linear"):

\[\gamma(\tau) = \gamma^0 + \left(\gamma^{\text{stop}} - \gamma^0\right)\hat{\tau}\]

For steps beyond the end of the decay phase, the learning rate stays at \(\gamma^{\text{stop}}\).
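
The three decay rules can be compared with a short sketch that takes the normalized progress \(\hat{\tau}\) directly. The function name is hypothetical, and clamping \(\hat{\tau}\) to [0, 1] is one way to express that the rate is start_lr before the decay phase and stop_lr after it.

```python
import math


def wsd_decay_lr(tau_hat, start_lr, stop_lr, decay_type="inverse_linear"):
    """Learning rate in the WSD decay phase for normalized progress tau_hat."""
    # Clamp: 0 gives start_lr (stable phase), 1 gives stop_lr (after the decay phase).
    tau_hat = min(max(tau_hat, 0.0), 1.0)
    if decay_type == "inverse_linear":
        return 1.0 / (tau_hat / stop_lr + (1.0 - tau_hat) / start_lr)
    if decay_type == "cosine":
        return stop_lr + 0.5 * (start_lr - stop_lr) * (1.0 + math.cos(math.pi * tau_hat))
    if decay_type == "linear":
        return start_lr + (stop_lr - start_lr) * tau_hat
    raise ValueError(f"unknown decay_type: {decay_type}")


for decay_type in ("inverse_linear", "cosine", "linear"):
    mid = wsd_decay_lr(0.5, start_lr=0.001, stop_lr=1e-6, decay_type=decay_type)
    print(decay_type, f"{mid:.3e}")
```

All three rules start at start_lr and end at stop_lr, but they differ in between: halfway through the decay phase, inverse-linear decay is already near 2e-6, whereas the cosine and linear rules are both at about 5e-4.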

5.3.8. Migration from versions before 3.1.3

In version 3.1.2 and earlier, start_lr and stop_lr / stop_lr_ratio had default values and could be omitted. Starting from version 3.1.3, these parameters are required and must be explicitly specified.

Configuration in version 3.1.2:

"learning_rate": {
    "type": "exp",
    "decay_steps": 5000
}

Updated configuration (version 3.1.3+):

"learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr": 1e-6,
    "decay_steps": 5000
}

Or using stop_lr_ratio:

"learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr_ratio": 1e-3,
    "decay_steps": 5000
}

5.3.9. References

This section is built upon Jinzhe Zeng, Duo Zhang, Denghui Lu, Pinghui Mo, Zeyu Li, Yixiao Chen, Marián Rynik, Li’ang Huang, Ziyao Li, Shaochen Shi, Yingze Wang, Haotian Ye, Ping Tuo, Jiabin Yang, Ye Ding, Yifan Li, Davide Tisi, Qiyu Zeng, Han Bao, Yu Xia, Jiameng Huang, Koki Muraoka, Yibo Wang, Junhan Chang, Fengbo Yuan, Sigbjørn Løland Bore, Chun Cai, Yinnian Lin, Bo Wang, Jiayan Xu, Jia-Xin Zhu, Chenxing Luo, Yuzhi Zhang, Rhys E. A. Goodall, Wenshuo Liang, Anurag Kumar Singh, Sikai Yao, Jingchao Zhang, Renata Wentzcovitch, Jiequn Han, Jie Liu, Weile Jia, Darrin M. York, Weinan E, Roberto Car, Linfeng Zhang, Han Wang, J. Chem. Phys. 159, 054801 (2023) licensed under a Creative Commons Attribution (CC BY) license.
