Utopia 2
Framework for studying models of complex & adaptive systems.
Chunking Utilities

Provides algorithms for automatically optimizing the chunk size in which data is written to the hard disk when writing compressed or extendable HDF5 files. More...


Namespaces

namespace  Utopia::DataIO::_chunk_helpers
 

Functions

template<typename Cont = std::vector< hsize_t >>
const Cont Utopia::DataIO::calc_chunksize (const hsize_t typesize, const Cont io_extend, Cont max_extend={}, const bool opt_inf_dims=true, const bool larger_high_dims=true, const unsigned int CHUNKSIZE_MAX=1048576, const unsigned int CHUNKSIZE_MIN=8192, const unsigned int CHUNKSIZE_BASE=262144)
 Try to guess a good chunksize for a dataset.
 

Detailed Description

Provides algorithms for automatically optimizing the chunk size in which data is written to the hard disk when writing compressed or extendable HDF5 files.

Algorithms for optimizing chunk size

General idea

The general idea of these algorithms is that in order for I/O operations to be fast, a reasonable chunk size needs to be given. Given the information known about the data to be written, an algorithm should automatically determine an optimal size for the chunks. What is optimal in the case of HDF5? Two main factors determine the speed of I/O operations in HDF5: the number of chunk lookups necessary and the size of the chunks. If either of the two is too large, performance suffers. To that end, these algorithms try to make the chunks as large as possible while staying below an upper limit, CHUNKSIZE_MAX, which by default corresponds to the default size of the HDF5 chunk cache (1 MiB).
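
As a back-of-the-envelope illustration (numbers chosen here, not taken from the sources): writing 512 × 512 double values (typesize 8) in a single operation amounts to 512 · 512 · 8 B = 2 MiB, twice the default CHUNKSIZE_MAX of 1 MiB. Halving one dimension, e.g. to chunks of 256 × 512 elements (exactly 1 MiB), lets each such write touch only two chunks; finding this kind of balance between chunk size and lookup count is what the algorithms automate.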

Note that the algorithms prioritize single I/O operations, i.e. they optimize for writing. Depending on the shape of your data and how you intend to read it, this might not be ideal; in those cases, it can be more reasonable to specify the chunk sizes manually.

Implementation

The implementation consists of a main handler method, calc_chunksize, and two helper methods, opt_chunks_target and opt_chunks_with_max_extend, which implement the actual algorithms. The main method checks the arguments and determines which algorithms can and need to be applied; the helper methods then carry out the optimization, working on a common chunks container.
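
In outline, the dispatch inside calc_chunksize proceeds as follows (a condensed summary of the full listing below):

    // 1. Validate io_extend and max_extend; determine the rank and whether the
    //    dataset is finite (dset_finite) or unlimited in all dims (all_dims_inf)
    // 2. typesize > CHUNKSIZE_MAX / 2            -> chunks of a single element
    // 3. Finite dataset fits into CHUNKSIZE_MAX  -> one chunk covering max_extend
    // 4. One I/O operation exceeds CHUNKSIZE_MAX -> opt_chunks_target towards
    //    CHUNKSIZE_MAX (reduces chunk extensions)
    // 5. All dims unlimited and chunks small     -> opt_chunks_target towards
    //    CHUNKSIZE_BASE (enlarges chunk extensions)
    // 6. Afterwards, if max_extend info is still usable
    //                                            -> opt_chunks_with_max_extend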

Function Documentation

◆ calc_chunksize()

template<typename Cont = std::vector<hsize_t>>
const Cont Utopia::DataIO::calc_chunksize(const hsize_t typesize,
                                          const Cont io_extend,
                                          Cont max_extend = {},
                                          const bool opt_inf_dims = true,
                                          const bool larger_high_dims = true,
                                          const unsigned int CHUNKSIZE_MAX = 1048576,
                                          const unsigned int CHUNKSIZE_MIN = 8192,
                                          const unsigned int CHUNKSIZE_BASE = 262144)

Try to guess a good chunksize for a dataset.

The premise is that a single write operation should be as fast as possible, i.e. that it occurs within one chunk. Also, if a maximum dataset extend is known, it is taken into account to determine more favourable chunk sizes.

Parameters
    typesize          The size of each element in bytes.
    io_extend         The extend of one I/O operation. The rank of the
                      dataset is extracted from this argument. The algorithm
                      is written to make an I/O operation of this extend use
                      as few chunks as possible.
    max_extend        The maximum extend the dataset can have. If given, the
                      chunk size is increased along the open dims to spread
                      evenly and fill the max_extend as well as possible. If
                      not given, the max_extend is assumed to be the same as
                      the io_extend.
    opt_inf_dims      Whether to optimize unlimited dimensions or not. If
                      set, and there is still room left to optimize after
                      the finite dimensions have been extended, the chunks
                      in the unlimited dimensions are extended as far as
                      possible.
    larger_high_dims  If set, dimensions with higher indices are
                      preferentially enlarged and less readily reduced. This
                      can be useful if these dimensions are to be kept
                      together, e.g. because they are written close to each
                      other (e.g., as the inner part of a loop).
    CHUNKSIZE_MAX     Largest chunksize; should not exceed 1 MiB by much,
                      or, more precisely: should fit into the chunk cache,
                      which is 1 MiB large by default.
    CHUNKSIZE_MIN     Smallest chunksize; should be above a few KiB.
    CHUNKSIZE_BASE    Base factor for the target chunksize (in bytes) if the
                      max_extend is unlimited in all dimensions and
                      opt_inf_dims is set. This value is not used in any
                      other scenario.

Template Parameters
    Cont              The type of the container holding the io_extend,
                      max_extend, and the returned chunks. If none is given,
                      defaults to the largest possible, i.e. a std::vector
                      of hsize_t elements.
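
A minimal usage sketch follows; the include path shown is an assumption, and note that calc_chunksize requires a spdlog logger named "data_io" to have been set up beforehand (see the listing below):

    #include <vector>
    #include <hdf5.h>                            // hsize_t, H5S_UNLIMITED
    #include <spdlog/sinks/stdout_color_sinks.h>
    #include <utopia/data_io/hdfchunking.hh>     // assumed include path

    int main()
    {
        // calc_chunksize fetches spdlog::get("data_io"), so register it first
        spdlog::stdout_color_mt("data_io");

        // 2D dataset, written one row of 1024 doubles at a time; the first
        // dimension may grow without limit
        const std::vector<hsize_t> io_extend{1, 1024};
        const std::vector<hsize_t> max_extend{H5S_UNLIMITED, 1024};

        const auto chunks = Utopia::DataIO::calc_chunksize(
            sizeof(double), io_extend, max_extend);
        // chunks now holds one chunk extend value per dimension
    }
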
Definition (from hdfchunking.hh):

template<typename Cont = std::vector<hsize_t>>
const Cont calc_chunksize(const hsize_t typesize,
                          const Cont io_extend,
                          Cont max_extend = {},
                          const bool opt_inf_dims = true,
                          const bool larger_high_dims = true,
                          const unsigned int CHUNKSIZE_MAX = 1048576,  // 1M
                          const unsigned int CHUNKSIZE_MIN = 8192,     // 8k
                          const unsigned int CHUNKSIZE_BASE = 262144)  // 256k
{
    // Make the helper functions available
    using namespace _chunk_helpers;

    // Helper lambda for calculating bytesize of a chunks configuration
    auto bytes = [&typesize](Cont c) {
        return typesize *
               std::accumulate(c.begin(), c.end(), 1, std::multiplies<>());
    };

    // Get a logger to use here; note that it needs to have been set up
    // outside of here beforehand!
    const auto log = spdlog::get("data_io");

    // .. Check correctness of arguments and extract some info ................
    // Get the rank
    unsigned short rank = io_extend.size();

    // For scalar datasets, chunking is not available
    if (rank == 0)
    {
        throw std::invalid_argument("Cannot guess chunksize for a scalar "
                                    "dataset!");
    }

    // Make sure io_extend has no illegal values (<= 0)
    for (const auto& val : io_extend)
    {
        if (val <= 0)
        {
            throw std::invalid_argument(
                "Argument 'io_extend' contained "
                "illegal (zero or negative) value(s)! io_extend: " +
                to_str(io_extend));
        }
    }

    // Find out if the max_extend is given and determine whether dset is finite
    bool dset_finite;
    bool all_dims_inf;

    if (max_extend.size())
    {
        // Yes, was given. Need to check that the max_extend values are ok.
        // Check that it matches the rank
        if (max_extend.size() != rank)
        {
            throw std::invalid_argument(
                "Argument 'max_extend' does not have the same dimensionality "
                "as the rank of this dataset (as extracted from the "
                "'io_extend' argument).");
        }

        // And that all values are valid, i.e. larger than corresp. io_extend
        for (unsigned short i = 0; i < rank; i++)
        {
            if (max_extend[i] < io_extend[i])
            {
                throw std::invalid_argument(
                    "Index " + std::to_string(i) +
                    " of argument 'max_extend' (" + to_str(max_extend) +
                    ") was smaller than the corresponding 'io_extend' (" +
                    to_str(io_extend) + ") value! ");
            }
        }
        // max_extend content is valid now

        // Now extract information on the properties of max_extend
        // Need to check whether any dataset dimension can be infinitely long
        dset_finite = (std::find(max_extend.begin(),
                                 max_extend.end(),
                                 H5S_UNLIMITED) ==
                       max_extend.end()); // i.e., H5S_UNLIMITED _not_ found

        // Or even all are infinitely long
        all_dims_inf = true;
        for (const auto& ext : max_extend)
        {
            if (ext < H5S_UNLIMITED)
            {
                // This one is not infinite
                all_dims_inf = false;
                break;
            }
        }
    }
    else
    {
        // max_extend not given
        // Have to assume the max_extend is the same as the io_extend
        // Thus, the properties are known:
        dset_finite = true;
        all_dims_inf = false;

        // Set the values to those of io_extend
        max_extend.insert(
            max_extend.begin(), io_extend.begin(), io_extend.end());
    }

    // NOTE max_extend is now a vector of same rank as io_extend
    log->info("Calculating optimal chunk size for io_extend {} and "
              "max_extend {} ...",
              to_str(io_extend),
              to_str(max_extend));
    log->debug("rank: {}", rank);
    log->debug("finite dset? {}", dset_finite);
    log->debug("all dims infinite? {}", all_dims_inf);
    log->debug("optimize inf dims? {}", opt_inf_dims);
    log->debug("larger high dims? {}", larger_high_dims);
    log->debug("typesize: {}", typesize);
    log->debug("max. chunksize: {:7d} ({:.1f} kiB)",
               CHUNKSIZE_MAX,
               CHUNKSIZE_MAX / 1024.);
    log->debug("min. chunksize: {:7d} ({:.1f} kiB)",
               CHUNKSIZE_MIN,
               CHUNKSIZE_MIN / 1024.);
    log->debug("base chunksize: {:7d} ({:.1f} kiB)",
               CHUNKSIZE_BASE,
               CHUNKSIZE_BASE / 1024.);

    // .. For the simple cases, evaluate the chunksize directly ...............

    // For large typesizes, each chunk can at most contain a single element.
    // Chunks that extend to more than one element require a typesize smaller
    // than half the maximum chunksize.
    if (typesize > CHUNKSIZE_MAX / 2)
    {
        log->debug("Type size > 1/2 max. chunksize -> Each cell needs to be "
                   "its own chunk.");
        return Cont(rank, 1);
    }

    // For a finite dataset that would fit into CHUNKSIZE_MAX when maximally
    // extended, we can only have (and only need!) a single chunk
    if (dset_finite && (bytes(max_extend) <= CHUNKSIZE_MAX))
    {
        log->debug("Maximally extended dataset will fit into single chunk.");
        return Cont(max_extend);
    }

    // .. Step 1: Optimize for one I/O operation fitting into chunk ...........
    log->debug("Cannot apply simple optimizations. Try to fit single I/O "
               "operation into a chunk ...");

    // Create the temporary container that will store the chunksize values.
    // It starts with a copy of the extend values for I/O operations.
    Cont _chunks(io_extend);

    // Determine the size (in bytes) of a write operation with this extend
    const auto bytes_io = bytes(io_extend);
    log->debug(
        "I/O operation size: {:7d} ({:.1f} kiB)", bytes_io, bytes_io / 1024.);

    // Determine if an I/O operation fits into a single chunk, then decide on
    // how to optimize accordingly
    if (bytes_io > CHUNKSIZE_MAX)
    {
        // The I/O operation does _not_ fit into a chunk
        // Aim to fit the I/O operation into the chunk -> target: max chunksize
        log->debug("Single I/O operation does not fit into chunk.");
        log->debug("Trying to use the fewest possible chunks for a single "
                   "I/O operation ...");

        opt_chunks_target(_chunks,
                          CHUNKSIZE_MAX, // <- target value
                          typesize,
                          CHUNKSIZE_MAX,
                          CHUNKSIZE_MIN,
                          larger_high_dims,
                          log);
        // NOTE The algorithm is also able to _increase_ the chunk size in
        //      certain dimensions. However, with _chunks == io_extend and the
        //      knowledge that the current bytesize of _chunks is above the
        //      maximum size, the chunk extensions will only be _reduced_.
    }
    else if (all_dims_inf && opt_inf_dims && bytes(_chunks) < CHUNKSIZE_BASE)
    {
        // The I/O operation _does_ fit into a chunk, but the dataset is
        // infinite in _all directions_ and small chunksizes can be very
        // inefficient -> optimize towards some base value
        log->debug("Single I/O operation does fit into chunk.");
        log->debug("Optimizing chunks in unlimited dimensions to be closer "
                   "to base chunksize ...");

        opt_chunks_target(_chunks,
                          CHUNKSIZE_BASE, // <- target value
                          typesize,
                          CHUNKSIZE_MAX,
                          CHUNKSIZE_MIN,
                          larger_high_dims,
                          log);
        // NOTE There is no issue with going beyond the maximum chunksize here
    }
    else
    {
        // No other optimization towards a target size makes sense
        log->debug("Single I/O operation does fit into a chunk.");
    }

    // To be on the safe side: Check that _chunks did not exceed max_extend
    for (unsigned short i = 0; i < rank; i++)
    {
        if (_chunks[i] > max_extend[i])
        {
            log->warn("Optimization led to chunks larger than max_extend. "
                      "This should not have happened!");
            _chunks[i] = max_extend[i];
        }
    }

    // .. Step 2: Optimize by taking the max_extend into account ..............

    // This is only possible if the current chunk size is not already above the
    // upper limit, CHUNKSIZE_MAX, and the max_extend is not already reached.
    // Also, it should not be enabled if the optimization towards unlimited
    // dimensions was already performed.
    if (!(opt_inf_dims && all_dims_inf) && (_chunks != max_extend) &&
        (bytes(_chunks) < CHUNKSIZE_MAX))
    {
        log->debug("Have max_extend information and can (potentially) use it "
                   "to optimize chunk extensions.");

        opt_chunks_with_max_extend(_chunks,
                                   max_extend,
                                   typesize,
                                   CHUNKSIZE_MAX,
                                   opt_inf_dims,
                                   larger_high_dims,
                                   log);
    }
    // else: no further optimization possible

    // Done.
    // Make sure that the chunksize is smaller than the maximum chunksize
    if (bytes(_chunks) > CHUNKSIZE_MAX)
    {
        throw std::runtime_error(
            "Byte size of chunks " + to_str(_chunks) +
            " is larger than CHUNKSIZE_MAX! This should not have happened!");
    }

    // Create a const version of the temporary chunks vector
    const Cont chunks(_chunks);
    log->info("Optimized chunk size: {}", to_str(chunks));

    return chunks;
}
References

    void opt_chunks_with_max_extend(Cont &chunks, const Cont &max_extend, const hsize_t typesize, const unsigned int CHUNKSIZE_MAX, const bool opt_inf_dims, const bool larger_high_dims, const Logger &log)
        Optimize chunk sizes using max_extend information.
        Definition: hdfchunking.hh:305

    void opt_chunks_target(Cont &chunks, double bytes_target, const hsize_t typesize, const unsigned int CHUNKSIZE_MAX, const unsigned int CHUNKSIZE_MIN, const bool larger_high_dims, const Logger &log)
        Optimizes the chunks along all axes to find a good default.
        Definition: hdfchunking.hh:113

    std::string to_str(const Cont &vec)
        Helper function to create a string representation of containers.
        Definition: hdfchunking.hh:65