Utopia 2
Framework for studying models of complex & adaptive systems.
Chunking Utilities

Provides algorithms for automatically optimizing the chunk size in which data is written to the hard disk when writing compressed or extendable HDF5 files. More...


Functions

template<typename Cont = std::vector< hsize_t >>
const Cont Utopia::DataIO::calc_chunksize (const hsize_t typesize, const Cont io_extend, Cont max_extend={}, const bool opt_inf_dims=true, const bool larger_high_dims=true, const unsigned int CHUNKSIZE_MAX=1048576, const unsigned int CHUNKSIZE_MIN=8192, const unsigned int CHUNKSIZE_BASE=262144)
 Try to guess a good chunksize for a dataset. More...
 

Detailed Description

Provides algorithms for automatically optimizing the chunk size in which data is written to the hard disk when writing compressed or extendable HDF5 files.

Algorithms for optimizing chunk size

General idea

The general idea of these algorithms is that in order for I/O operations to be fast, a reasonable chunk size needs to be given. Given the information known about the data to be written, an algorithm should automatically determine an optimal size for the chunks. What is optimal in the case of HDF5? Two main factors determine the speed of I/O operations in HDF5: the number of chunk lookups necessary and the size of the chunks. If either of the two is too large, performance suffers. To that end, these algorithms try to make the chunks as large as possible while staying below an upper limit, CHUNKSIZE_MAX, which by default corresponds to the default size of the HDF5 chunk cache.
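
As a rough illustration of this trade-off, consider the following self-contained byte arithmetic (all dataset and chunk numbers here are made-up examples, not values from the Utopia implementation):

```cpp
#include <cstddef>

// Bigger chunks mean fewer chunk lookups, but each chunk should still fit
// into the HDF5 chunk cache (1 MiB by default).
constexpr std::size_t typesize      = sizeof(double);      // 8 bytes
constexpr std::size_t dataset_elems = 1024 * 1024 * 16;    // 16 Mi elements
constexpr std::size_t chunksize_max = 1048576;             // 1 MiB

// Largest chunk (in elements) that still fits the default chunk cache:
constexpr std::size_t chunk_elems = chunksize_max / typesize;    // 131072
constexpr std::size_t n_lookups   = dataset_elems / chunk_elems; // 128 chunks

// A tiny chunk keeps each single write cheap, but multiplies the number
// of chunk lookups needed to cover the same dataset:
constexpr std::size_t n_lookups_small = dataset_elems / 1024;    // 16384
```

With the maximal chunk size, covering the example dataset takes 128 chunk lookups; with 1024-element chunks it takes 16384, which is why the algorithms push the chunk size towards (but not beyond) CHUNKSIZE_MAX.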

Note that the algorithms prioritize single I/O operations, such that writing is easy. Depending on the shape of your data and how you want to read it, this might not be ideal. For those cases, it might be more reasonable to specify the chunk sizes manually.

Implementation

The implementation is done via a main handler method, guess_chunksize, and two helper methods, which implement the algorithms. The main method checks the arguments and determines which algorithms can and need to be applied. The helper methods then carry out the optimization, working on a common chunks container.
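
The handler/helper structure described above can be sketched as follows. This is a minimal, self-contained illustration only: the helper names, their logic (halving the largest dimension, doubling unlimited dimensions), and the element-count target are simplifications and assumptions, not the actual Utopia code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Chunks = std::vector<std::size_t>;

// Number of elements in a chunk of the given extent.
std::size_t product(const Chunks& c) {
    std::size_t n = 1;
    for (auto v : c) n *= v;
    return n;
}

// Helper 1 (simplified stand-in): shrink the largest dimension until the
// chunk holds at most `target` elements.
void shrink_to_target(Chunks& chunks, std::size_t target) {
    while (product(chunks) > target) {
        auto it = std::max_element(chunks.begin(), chunks.end());
        if (*it == 1) break;            // cannot shrink any further
        *it = (*it + 1) / 2;            // halve, rounding up
    }
}

// Helper 2 (simplified stand-in): grow unlimited dimensions while the
// chunk stays within `target`.
void grow_inf_dims(Chunks& chunks, const std::vector<bool>& unlimited,
                   std::size_t target) {
    bool grew = true;
    while (grew) {
        grew = false;
        for (std::size_t i = 0; i < chunks.size(); ++i) {
            if (unlimited[i] && product(chunks) * 2 <= target) {
                chunks[i] *= 2;
                grew = true;
            }
        }
    }
}

// Main handler: decides which optimizations apply; both helpers work on
// the same `chunks` container.
Chunks guess_chunks(const Chunks& io_extend,
                    const std::vector<bool>& unlimited,
                    bool opt_inf_dims, std::size_t target) {
    Chunks chunks = io_extend;
    shrink_to_target(chunks, target);
    if (opt_inf_dims &&
        std::any_of(unlimited.begin(), unlimited.end(),
                    [](bool b) { return b; })) {
        grow_inf_dims(chunks, unlimited, target);
    }
    return chunks;
}
```

For example, `guess_chunks({4, 4}, {false, true}, true, 64)` grows the unlimited second dimension to `{4, 16}`, while `guess_chunks({1024, 8}, {false, false}, false, 1024)` shrinks the first dimension to `{128, 8}`.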

Function Documentation

◆ calc_chunksize()

template<typename Cont = std::vector<hsize_t>>
const Cont Utopia::DataIO::calc_chunksize(const hsize_t typesize,
                                          const Cont io_extend,
                                          Cont max_extend = {},
                                          const bool opt_inf_dims = true,
                                          const bool larger_high_dims = true,
                                          const unsigned int CHUNKSIZE_MAX = 1048576,
                                          const unsigned int CHUNKSIZE_MIN = 8192,
                                          const unsigned int CHUNKSIZE_BASE = 262144)

Try to guess a good chunksize for a dataset.

The premise is that a single write operation should be as fast as possible, i.e. that it occurs within one chunk. Also, if a maximum dataset extend is known, it is taken into account to determine more favourable chunk sizes.

Parameters
typesize	The size of each element in bytes
io_extend	The extend of one I/O operation. The rank of the dataset is extracted from this argument. The algorithm is written to make an I/O operation of this extend use as few chunks as possible.
max_extend	The maximum extend the dataset can have. If given, the chunk size is increased along the open dims to spread evenly and fill the max_extend as well as possible. If not given, the max_extend is assumed to be the same as the io_extend.
opt_inf_dims	Whether to optimize unlimited dimensions or not. If set, and there is still room left to optimize after the finite dimensions have been extended, the chunks in the unlimited dimensions are extended as far as possible.
larger_high_dims	If set, dimensions with higher indices are favourably enlarged and less favourably reduced. This can be useful if it is desired to keep these dimensions together, e.g. because they are written close to each other (e.g., as the inner part of a loop).
CHUNKSIZE_MAX	Largest chunk size; should not exceed 1 MiB by much, or, more precisely: should fit into the chunk cache, which (by default) is 1 MiB large.
CHUNKSIZE_MIN	Smallest chunk size; should be above a few KiB.
CHUNKSIZE_BASE	Base factor for the target chunk size (in bytes) if the max_extend is unlimited in all dimensions and opt_inf_dims is activated. This value is not used in any other scenario.
Template Parameters
Cont	The type of the container holding the io_extend, max_extend, and the returned chunks. If none is given, defaults to the largest possible, i.e. a std::vector of hsize_t elements.