2.2. Formats of a system
Two binary formats, NumPy and HDF5, are supported for training. The raw format is not directly supported, but a tool is provided to convert data from the raw format to the NumPy format.
2.2.1. NumPy format
In a system with the NumPy format, the system properties are stored as text files ending with .raw, such as type_map.raw, under the system directory. To train a non-periodic system, put an empty nopbc file under the system directory. Both input and labeled frame properties are saved as NumPy binary data (NPY) files ending with .npy in each of the set.* directories. For example, a system may contain the following files:
We assume that the atom types are the same in all frames. They are provided by type.raw, which has one line with the types of the atoms written one by one. The atom types should be integers. For example, the type.raw of a system that has 2 atoms with types 0 and 1:
$ cat type.raw
0 1
Sometimes one needs to map the integer types to atom names. The mapping can be given by the file type_map.raw. For example:
$ cat type_map.raw
The type 0 is named by "O" and the type 1 is named by
For training models with the descriptor se_atten, a new system format is supported that puts together frame-sparse systems with the same number of atoms.
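The NumPy layout described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the project's own tooling: the helper name write_numpy_system is hypothetical, the file names coord.npy, box.npy, energy.npy and force.npy and the flattened per-frame shapes are assumptions, and the atom names are example values only.

```python
import tempfile
from pathlib import Path

import numpy as np


def write_numpy_system(root, types, type_map, coord, box, energy, force, pbc=True):
    """Write one system with a single set directory (hypothetical helper)."""
    root = Path(root)
    set_dir = root / "set.000"
    set_dir.mkdir(parents=True, exist_ok=True)

    # System properties: plain text files ending with .raw.
    (root / "type.raw").write_text(" ".join(str(t) for t in types) + "\n")
    (root / "type_map.raw").write_text(" ".join(type_map) + "\n")

    # An empty nopbc file marks a non-periodic system.
    if not pbc:
        (root / "nopbc").touch()

    # Frame properties: one .npy file per property, one row per frame.
    # Assumed shapes: coord and force (nframes, natoms * 3), box (nframes, 9).
    nframes = coord.shape[0]
    np.save(set_dir / "coord.npy", coord.reshape(nframes, -1))
    np.save(set_dir / "force.npy", force.reshape(nframes, -1))
    np.save(set_dir / "box.npy", box.reshape(nframes, -1))
    np.save(set_dir / "energy.npy", energy.reshape(nframes))


# Example: 3 frames of a 2-atom system filled with placeholder values.
system_dir = Path(tempfile.mkdtemp()) / "my_system"
nframes, natoms = 3, 2
write_numpy_system(
    system_dir,
    types=[0, 1],
    type_map=["O", "H"],  # example names only
    coord=np.zeros((nframes, natoms, 3)),
    box=np.tile(np.eye(3).reshape(9) * 10.0, (nframes, 1)),
    energy=np.zeros(nframes),
    force=np.zeros((nframes, natoms, 3)),
)
print(sorted(p.relative_to(system_dir).as_posix() for p in system_dir.rglob("*")))
```

Flattening each frame to one row mirrors the raw format, where one line holds all values of one frame.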
2.2.2. HDF5 format
A system with the HDF5 format has the same structure as one with the NumPy format, but it is organized as an HDF5 group inside an HDF5 file. The file name of a NumPy file is the key in the HDF5 file, and the data is the value of the key. One needs to use # in a DP path to separate the path to the HDF5 file from the HDF5 path:
/path/to/data.hdf5#/H2O
Here, /path/to/data.hdf5 is the file path and /H2O is the HDF5 path. All HDF5 paths should start with /. There should be some data in the H2O group, such as
An HDF5 file with a large number of systems has better performance than multiple NumPy files in a large cluster.
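The # convention for DP paths can be illustrated with a small standard-library sketch. The function split_dp_path is a hypothetical helper written for this example, not part of the package, and treating a path without # as pointing at the root group / is an assumption.

```python
def split_dp_path(dp_path):
    """Split a DP path into (file path, HDF5 path) at the first '#'.

    Hypothetical helper for illustration only.
    """
    file_path, sep, hdf5_path = dp_path.partition("#")
    if not sep:
        # Assumption: without '#', refer to the root group of the file.
        hdf5_path = "/"
    if not hdf5_path.startswith("/"):
        raise ValueError(f"HDF5 path must start with '/': {hdf5_path!r}")
    return file_path, hdf5_path


file_path, hdf5_path = split_dp_path("/path/to/data.hdf5#/H2O")
print(file_path)  # /path/to/data.hdf5
print(hdf5_path)  # /H2O
```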
2.2.3. Raw format and data conversion
A raw file is a plain text file in which each property is written in its own file and each frame is written on one line. The raw format is not directly supported for training, but we provide a tool to convert raw files to the NumPy format.
In the raw format, each property is stored in a file ending with .raw, with one frame per line. For example, the default files that provide box, coordinate, force, energy and virial are box.raw, coord.raw, force.raw, energy.raw and virial.raw, respectively. Here is an example of force.raw:
$ cat force.raw
-0.724 2.039 -0.951 0.841 -0.464 0.363
6.737 1.554 -5.587 -2.803 0.062 2.222
-1.968 -0.163 1.020 -0.225 -0.789 0.343
force.raw contains 3 frames with each frame having the forces of 2 atoms, thus it has 3 lines and 6 columns. Each line provides all the 3 force components of 2 atoms in 1 frame. The first three numbers are the 3 force components of the first atom, while the second three numbers are the 3 force components of the second atom. Other files are organized similarly. The number of lines of all raw files should be identical.
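The line layout of force.raw can be checked with a short sketch that splits each line into per-atom triples. The function parse_per_atom_raw is a hypothetical helper, and the text is embedded as a string so the example runs without any files on disk.

```python
FORCE_RAW = """\
-0.724 2.039 -0.951 0.841 -0.464 0.363
6.737 1.554 -5.587 -2.803 0.062 2.222
-1.968 -0.163 1.020 -0.225 -0.789 0.343
"""


def parse_per_atom_raw(text, natoms):
    """Return a list of frames, each a list of (x, y, z) tuples per atom."""
    frames = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        values = [float(v) for v in line.split()]
        if len(values) != 3 * natoms:
            raise ValueError(
                f"line {line_no}: expected {3 * natoms} columns, got {len(values)}"
            )
        # Consecutive groups of three numbers belong to consecutive atoms.
        frames.append([tuple(values[3 * i : 3 * i + 3]) for i in range(natoms)])
    return frames


forces = parse_per_atom_raw(FORCE_RAW, natoms=2)
print(len(forces))   # number of frames
print(forces[0][0])  # force on the first atom in the first frame
print(forces[0][1])  # force on the second atom in the first frame
```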
One can use the script $deepmd_source_dir/data/raw/raw_to_set.sh to convert the prepared raw files to the NumPy format. For example, if we have raw files that contain 6000 frames,
$ ls
box.raw coord.raw energy.raw force.raw type.raw virial.raw
$ $deepmd_source_dir/data/raw/raw_to_set.sh 2000
nframe is 6000
nline per set is 2000
will make 3 sets
making set 0 ...
making set 1 ...
making set 2 ...
$ ls
box.raw coord.raw energy.raw force.raw set.000 set.001 set.002 type.raw virial.raw
It generates three sets, set.000, set.001 and set.002, with each set containing 2000 frames in the NumPy format.
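The arithmetic behind the split can be sketched as follows. This is not the script itself: plan_sets is a hypothetical helper, and the assumption that a final, smaller set holds any leftover frames is ours and is not confirmed by the text above.

```python
def plan_sets(nframes, nline_per_set):
    """Return (set name, first frame, one-past-last frame) for each set."""
    if nframes < 0 or nline_per_set <= 0:
        raise ValueError("nframes must be >= 0 and nline_per_set must be > 0")
    plan = []
    for index, start in enumerate(range(0, nframes, nline_per_set)):
        # Assumption: the last set takes whatever frames remain.
        stop = min(start + nline_per_set, nframes)
        plan.append((f"set.{index:03d}", start, stop))
    return plan


for name, start, stop in plan_sets(6000, 2000):
    print(name, "frames", start, "to", stop - 1)
```

With 6000 frames and 2000 lines per set, the plan contains set.000, set.001 and set.002, matching the listing above.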