Utopia 2
Framework for studying models of complex & adaptive systems.
Provides algorithms for automatically optimizing the chunk size in which data is written to disk when writing compressed or extendable HDF5 files.
Namespaces

namespace Utopia::DataIO::_chunk_helpers

Functions
template<typename Cont = std::vector<hsize_t>>
const Cont Utopia::DataIO::calc_chunksize(const hsize_t typesize, const Cont io_extend, Cont max_extend = {}, const bool opt_inf_dims = true, const bool larger_high_dims = true, const unsigned int CHUNKSIZE_MAX = 1048576, const unsigned int CHUNKSIZE_MIN = 8192, const unsigned int CHUNKSIZE_BASE = 262144)
    Try to guess a good chunksize for a dataset.
The general idea of these algorithms is that I/O operations are only fast if a reasonable chunk size is chosen. Given what is known about the data to be written, an algorithm should automatically determine an optimal size for the chunks. What is optimal in the case of HDF5? Two main factors determine the speed of I/O operations: the number of chunk lookups necessary and the size of the chunks. If either is too large, performance suffers. To that end, these algorithms try to make the chunks as large as possible while staying below an upper limit, CHUNKSIZE_MAX, which by default corresponds to the default size of the HDF5 chunk cache (1 MiB).
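The quantity being bounded is the byte size of a single chunk. The helper below is a minimal sketch of that criterion, not part of Utopia's API; the function name is illustrative only.

    // Minimal sketch of the size criterion -- not part of Utopia's API.
    #include <cstddef>
    #include <functional>
    #include <numeric>
    #include <vector>

    // Bytes occupied by one chunk: the product of the chunk extents,
    // multiplied by the size of a single element.
    std::size_t chunk_bytes(const std::vector<std::size_t>& chunk_dims,
                            const std::size_t typesize)
    {
        return std::accumulate(chunk_dims.begin(), chunk_dims.end(),
                               typesize, std::multiplies<std::size_t>());
    }

    // The algorithms roughly aim for
    //     CHUNKSIZE_MIN <= chunk_bytes(chunks, typesize) <= CHUNKSIZE_MAX
    // while making the chunks as large as possible within that window.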
Note that the algorithms prioritize single I/O operations, i.e. they aim to make writing fast. Depending on the shape of your data and how you want to read it back, this might not be ideal; in those cases, it can be more reasonable to specify the chunk sizes manually.
The implementation consists of a main handler method, guess_chunksize, and two helper methods, which implement the algorithms. The main method checks the arguments and determines which algorithms can and need to be applied; the helper methods then carry out the optimization, working on a common chunks container.
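To illustrate the handler-and-helpers structure described above, here is a purely hypothetical sketch: the names mirror the description, but the optimization step is simplified and does not reproduce Utopia's actual algorithms.

    // Hypothetical sketch of the handler/helper split -- not the real
    // Utopia implementation.
    #include <algorithm>
    #include <stdexcept>
    #include <vector>
    #include <hdf5.h>   // hsize_t

    namespace sketch {

    // Helper: shrink the chunk extents until the chunk fits below max_bytes.
    void opt_chunks(std::vector<hsize_t>& chunks,
                    const hsize_t typesize,
                    const hsize_t max_bytes)
    {
        auto bytes = [&chunks, typesize] {
            hsize_t b = typesize;
            for (const auto c : chunks) { b *= c; }
            return b;
        };
        // Halve the largest extent until the chunk is small enough.
        while (bytes() > max_bytes) {
            auto largest = std::max_element(chunks.begin(), chunks.end());
            if (*largest <= 1) { break; }
            *largest = (*largest + 1) / 2;
        }
    }

    // Handler: check the arguments, set up the common `chunks` container
    // from the I/O extend, then let the helper(s) optimize it in place.
    std::vector<hsize_t> guess_chunksize(const hsize_t typesize,
                                         const std::vector<hsize_t>& io_extend,
                                         const hsize_t max_bytes = 1048576)
    {
        if (io_extend.empty()) {
            throw std::invalid_argument("io_extend must not be empty!");
        }
        std::vector<hsize_t> chunks(io_extend);
        opt_chunks(chunks, typesize, max_bytes);
        return chunks;
    }

    } // namespace sketch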
const Cont Utopia::DataIO::calc_chunksize(const hsize_t typesize,
                                          const Cont io_extend,
                                          Cont max_extend = {},
                                          const bool opt_inf_dims = true,
                                          const bool larger_high_dims = true,
                                          const unsigned int CHUNKSIZE_MAX = 1048576,
                                          const unsigned int CHUNKSIZE_MIN = 8192,
                                          const unsigned int CHUNKSIZE_BASE = 262144)
Try to guess a good chunksize for a dataset.
The premise is that a single write operation should be as fast as possible, i.e. that it occurs within one chunk. Also, if a maximum dataset extend is known, it is taken into account to determine more favourable chunk sizes.
Parameters

    typesize            The size of each element in bytes.
    io_extend           The extend of one I/O operation. The rank of the dataset is extracted from this argument. The algorithm is written to make an I/O operation of this extend use as few chunks as possible.
    max_extend          The maximum extend the dataset can have. If given, the chunk size is increased along the open dims to spread evenly and fill the max_extend as best as possible. If not given, the max_extend is assumed to be the same as the io_extend.
    opt_inf_dims        Whether to optimize unlimited dimensions. If set, and there is still room left to optimize after the finite dimensions have been extended, the chunks in the unlimited dimensions are extended as far as possible.
    larger_high_dims    If set, dimensions with higher indices are preferentially enlarged and less readily reduced. This can be useful if these dimensions should be kept together, e.g. because they are written close to each other (e.g. as the inner part of a loop).
    CHUNKSIZE_MAX       Largest chunksize; should not exceed 1 MiB by much, or, more precisely, should fit into the chunk cache, which by default is 1 MiB large.
    CHUNKSIZE_MIN       Smallest chunksize; should be at least a few KiB.
    CHUNKSIZE_BASE      Base factor for the target chunksize (in bytes) if max_extend is unlimited in all dimensions and opt_inf_dims is activated. This value is not used in any other scenario.

Template Parameters

    Cont                The type of the container holding the io_extend, max_extend, and the returned chunks. If none is given, defaults to the largest possible, i.e. a std::vector of hsize_t elements.
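A usage sketch with the documented defaults follows. The include path and the use of H5S_UNLIMITED to mark the open dimension in max_extend are assumptions and may differ in your setup.

    // Assumed include path -- adjust to your Utopia installation.
    #include <utopia/data_io/hdfchunking.hh>
    #include <hdf5.h>
    #include <iostream>
    #include <vector>

    int main()
    {
        // One write operation covers a 1 x 1000 block of 8-byte elements;
        // the first dimension is assumed to be marked as unlimited.
        const hsize_t typesize = 8;
        const std::vector<hsize_t> io_extend  = {1, 1000};
        const std::vector<hsize_t> max_extend = {H5S_UNLIMITED, 1000};

        // All other arguments keep their documented defaults.
        const auto chunks =
            Utopia::DataIO::calc_chunksize(typesize, io_extend, max_extend);

        for (const auto c : chunks) {
            std::cout << c << " ";
        }
        std::cout << std::endl;
    }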