Utopia 2
Framework for studying models of complex & adaptive systems.
Chunking Utilities

Provides algorithms for automatically optimizing the chunk size in which data is written to the hard disk when writing compressed or extendable HDF5 files. More...


Namespaces

namespace  Utopia::DataIO::_chunk_helpers
 

Functions

template<typename Cont = std::vector< hsize_t >>
const Cont Utopia::DataIO::calc_chunksize (const hsize_t typesize, const Cont io_extend, Cont max_extend={}, const bool opt_inf_dims=true, const bool larger_high_dims=true, const unsigned int CHUNKSIZE_MAX=1048576, const unsigned int CHUNKSIZE_MIN=8192, const unsigned int CHUNKSIZE_BASE=262144)
 Try to guess a good chunksize for a dataset.
 

Detailed Description

Provides algorithms for automatically optimizing the chunk size in which data is written to the hard disk when writing compressed or extendable HDF5 files.

Algorithms for optimizing chunk size

General idea

The general idea of these algorithms is that in order for I/O operations to be fast, a reasonable chunk size needs to be given. Given the information known about the data to be written, an algorithm should automatically determine an optimal size for the chunks. What is optimal in the case of HDF5? Two main factors determine the speed of I/O operations in HDF5: the number of chunk lookups necessary and the size of the chunks. If either of the two is too large, performance suffers. To that end, these algorithms try to make the chunks as large as possible while staying below an upper limit, CHUNKSIZE_MAX, which by default corresponds to the default size of the HDF5 chunk cache (1 MiB).
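
As a back-of-the-envelope illustration (numbers chosen here, not taken from the sources): writing 512 × 512 double values (typesize 8) in a single operation amounts to 512 · 512 · 8 B = 2 MiB, twice the default CHUNKSIZE_MAX of 1 MiB. Halving one dimension, e.g. to chunks of 256 × 512 elements (exactly 1 MiB), lets each such write touch only two chunks; finding this kind of balance between chunk size and lookup count is what the algorithms automate.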

Note that the algorithms prioritize single I/O operations, i.e. they optimize for writing. Depending on the shape of your data and how you intend to read it, this might not be ideal; in those cases, it can be more reasonable to specify the chunk sizes manually.

Implementation

The implementation consists of a main handler method, calc_chunksize, and two helper methods, opt_chunks_target and opt_chunks_with_max_extend, which implement the actual algorithms. The main method checks the arguments and determines which algorithms can and need to be applied; the helper methods then carry out the optimization, working on a common chunks container.
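
In outline, the dispatch inside calc_chunksize proceeds as follows (a condensed summary of the full listing below):

    // 1. Validate io_extend and max_extend; determine the rank and whether the
    //    dataset is finite (dset_finite) or unlimited in all dims (all_dims_inf)
    // 2. typesize > CHUNKSIZE_MAX / 2            -> chunks of a single element
    // 3. Finite dataset fits into CHUNKSIZE_MAX  -> one chunk covering max_extend
    // 4. One I/O operation exceeds CHUNKSIZE_MAX -> opt_chunks_target towards
    //    CHUNKSIZE_MAX (reduces chunk extensions)
    // 5. All dims unlimited and chunks small     -> opt_chunks_target towards
    //    CHUNKSIZE_BASE (enlarges chunk extensions)
    // 6. Afterwards, if max_extend info is still usable
    //                                            -> opt_chunks_with_max_extend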

Function Documentation

◆ calc_chunksize()

template<typename Cont = std::vector<hsize_t>>
const Cont Utopia::DataIO::calc_chunksize(const hsize_t typesize,
                                          const Cont io_extend,
                                          Cont max_extend = {},
                                          const bool opt_inf_dims = true,
                                          const bool larger_high_dims = true,
                                          const unsigned int CHUNKSIZE_MAX = 1048576,
                                          const unsigned int CHUNKSIZE_MIN = 8192,
                                          const unsigned int CHUNKSIZE_BASE = 262144)

Try to guess a good chunksize for a dataset.

The premise is that a single write operation should be as fast as possible, i.e. that it occurs within one chunk. Also, if a maximum dataset extend is known, it is taken into account to determine more favourable chunk sizes.

Parameters
    typesize          The size of each element in bytes.
    io_extend         The extend of one I/O operation. The rank of the
                      dataset is extracted from this argument. The algorithm
                      is written to make an I/O operation of this extend use
                      as few chunks as possible.
    max_extend        The maximum extend the dataset can have. If given, the
                      chunk size is increased along the open dims to spread
                      evenly and fill the max_extend as well as possible. If
                      not given, the max_extend is assumed to be the same as
                      the io_extend.
    opt_inf_dims      Whether to optimize unlimited dimensions or not. If
                      set, and there is still room left to optimize after
                      the finite dimensions have been extended, the chunks
                      in the unlimited dimensions are extended as far as
                      possible.
    larger_high_dims  If set, dimensions with higher indices are
                      preferentially enlarged and less readily reduced. This
                      can be useful if these dimensions are to be kept
                      together, e.g. because they are written close to each
                      other (e.g., as the inner part of a loop).
    CHUNKSIZE_MAX     Largest chunksize; should not exceed 1 MiB by much,
                      or, more precisely: should fit into the chunk cache,
                      which is 1 MiB large by default.
    CHUNKSIZE_MIN     Smallest chunksize; should be above a few KiB.
    CHUNKSIZE_BASE    Base factor for the target chunksize (in bytes) if the
                      max_extend is unlimited in all dimensions and
                      opt_inf_dims is set. This value is not used in any
                      other scenario.

Template Parameters
    Cont              The type of the container holding the io_extend,
                      max_extend, and the returned chunks. If none is given,
                      defaults to the largest possible, i.e. a std::vector
                      of hsize_t elements.
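
A minimal usage sketch follows; the include path shown is an assumption, and note that calc_chunksize requires a spdlog logger named "data_io" to have been set up beforehand (see the listing below):

    #include <vector>
    #include <hdf5.h>                            // hsize_t, H5S_UNLIMITED
    #include <spdlog/sinks/stdout_color_sinks.h>
    #include <utopia/data_io/hdfchunking.hh>     // assumed include path

    int main()
    {
        // calc_chunksize fetches spdlog::get("data_io"), so register it first
        spdlog::stdout_color_mt("data_io");

        // 2D dataset, written one row of 1024 doubles at a time; the first
        // dimension may grow without limit
        const std::vector<hsize_t> io_extend{1, 1024};
        const std::vector<hsize_t> max_extend{H5S_UNLIMITED, 1024};

        const auto chunks = Utopia::DataIO::calc_chunksize(
            sizeof(double), io_extend, max_extend);
        // chunks now holds one chunk extend value per dimension
    }
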
Definition (from hdfchunking.hh):

template<typename Cont = std::vector<hsize_t>>
const Cont calc_chunksize(const hsize_t typesize,
                          const Cont io_extend,
                          Cont max_extend = {},
                          const bool opt_inf_dims = true,
                          const bool larger_high_dims = true,
                          const unsigned int CHUNKSIZE_MAX = 1048576,  // 1M
                          const unsigned int CHUNKSIZE_MIN = 8192,     // 8k
                          const unsigned int CHUNKSIZE_BASE = 262144)  // 256k
{
    // Make the helper functions available
    using namespace _chunk_helpers;

    // Helper lambda for calculating bytesize of a chunks configuration
    auto bytes = [&typesize](Cont c) {
        return typesize *
               std::accumulate(c.begin(), c.end(), 1, std::multiplies<>());
    };

    // Get a logger to use here; note that it needs to have been set up
    // outside of here beforehand!
    const auto log = spdlog::get("data_io");

    // .. Check correctness of arguments and extract some info ................
    // Get the rank
    unsigned short rank = io_extend.size();

    // For scalar datasets, chunking is not available
    if (rank == 0)
    {
        throw std::invalid_argument("Cannot guess chunksize for a scalar "
                                    "dataset!");
    }

    // Make sure io_extend has no illegal values (<= 0)
    for (const auto& val : io_extend)
    {
        if (val <= 0)
        {
            throw std::invalid_argument(
                "Argument 'io_extend' contained "
                "illegal (zero or negative) value(s)! io_extend: " +
                to_str(io_extend));
        }
    }

    // Find out if the max_extend is given and determine whether dset is finite
    bool dset_finite;
    bool all_dims_inf;

    if (max_extend.size())
    {
        // Yes, was given. Need to check that the max_extend values are ok.
        // Check that it matches the rank
        if (max_extend.size() != rank)
        {
            throw std::invalid_argument(
                "Argument 'max_extend' does not have the same dimensionality "
                "as the rank of this dataset (as extracted from the "
                "'io_extend' argument).");
        }

        // And that all values are valid, i.e. larger than corresp. io_extend
        for (unsigned short i = 0; i < rank; i++)
        {
            if (max_extend[i] < io_extend[i])
            {
                throw std::invalid_argument(
                    "Index " + std::to_string(i) +
                    " of argument 'max_extend' (" + to_str(max_extend) +
                    ") was smaller than the corresponding 'io_extend' (" +
                    to_str(io_extend) + ") value! ");
            }
        }
        // max_extend content is valid now

        // Now extract information on the properties of max_extend
        // Need to check whether any dataset dimension can be infinitely long
        dset_finite = (std::find(max_extend.begin(),
                                 max_extend.end(),
                                 H5S_UNLIMITED) ==
                       max_extend.end()); // i.e., H5S_UNLIMITED _not_ found

        // Or even all are infinitely long
        all_dims_inf = true;
        for (const auto& ext : max_extend)
        {
            if (ext < H5S_UNLIMITED)
            {
                // This one is not infinite
                all_dims_inf = false;
                break;
            }
        }
    }
    else
    {
        // max_extend not given
        // Have to assume the max_extend is the same as the io_extend
        // Thus, the properties are known:
        dset_finite = true;
        all_dims_inf = false;

        // Set the values to those of io_extend
        max_extend.insert(
            max_extend.begin(), io_extend.begin(), io_extend.end());
    }

    // NOTE max_extend is now a vector of same rank as io_extend
    log->info("Calculating optimal chunk size for io_extend {} and "
              "max_extend {} ...",
              to_str(io_extend),
              to_str(max_extend));
    log->debug("rank: {}", rank);
    log->debug("finite dset? {}", dset_finite);
    log->debug("all dims infinite? {}", all_dims_inf);
    log->debug("optimize inf dims? {}", opt_inf_dims);
    log->debug("larger high dims? {}", larger_high_dims);
    log->debug("typesize: {}", typesize);
    log->debug("max. chunksize: {:7d} ({:.1f} kiB)",
               CHUNKSIZE_MAX,
               CHUNKSIZE_MAX / 1024.);
    log->debug("min. chunksize: {:7d} ({:.1f} kiB)",
               CHUNKSIZE_MIN,
               CHUNKSIZE_MIN / 1024.);
    log->debug("base chunksize: {:7d} ({:.1f} kiB)",
               CHUNKSIZE_BASE,
               CHUNKSIZE_BASE / 1024.);

    // .. For the simple cases, evaluate the chunksize directly ...............

    // For large typesizes, each chunk can at most contain a single element.
    // Chunks that extend to more than one element require a typesize smaller
    // than half the maximum chunksize.
    if (typesize > CHUNKSIZE_MAX / 2)
    {
        log->debug("Type size > 1/2 max. chunksize -> Each cell needs to be "
                   "its own chunk.");
        return Cont(rank, 1);
    }

    // For a finite dataset that would fit into CHUNKSIZE_MAX when maximally
    // extended, we can only have (and only need!) a single chunk
    if (dset_finite && (bytes(max_extend) <= CHUNKSIZE_MAX))
    {
        log->debug("Maximally extended dataset will fit into single chunk.");
        return Cont(max_extend);
    }

    // .. Step 1: Optimize for one I/O operation fitting into chunk ...........
    log->debug("Cannot apply simple optimizations. Try to fit single I/O "
               "operation into a chunk ...");

    // Create the temporary container that will store the chunksize values.
    // It starts with a copy of the extend values for I/O operations.
    Cont _chunks(io_extend);

    // Determine the size (in bytes) of a write operation with this extend
    const auto bytes_io = bytes(io_extend);
    log->debug(
        "I/O operation size: {:7d} ({:.1f} kiB)", bytes_io, bytes_io / 1024.);

    // Determine if an I/O operation fits into a single chunk, then decide on
    // how to optimize accordingly
    if (bytes_io > CHUNKSIZE_MAX)
    {
        // The I/O operation does _not_ fit into a chunk
        // Aim to fit the I/O operation into the chunk -> target: max chunksize
        log->debug("Single I/O operation does not fit into chunk.");
        log->debug("Trying to use the fewest possible chunks for a single "
                   "I/O operation ...");

        opt_chunks_target(_chunks,
                          CHUNKSIZE_MAX, // <- target value
                          typesize,
                          CHUNKSIZE_MAX,
                          CHUNKSIZE_MIN,
                          larger_high_dims,
                          log);
        // NOTE The algorithm is also able to _increase_ the chunk size in
        //      certain dimensions. However, with _chunks == io_extend and the
        //      knowledge that the current bytesize of _chunks is above the
        //      maximum size, the chunk extensions will only be _reduced_.
    }
    else if (all_dims_inf && opt_inf_dims && bytes(_chunks) < CHUNKSIZE_BASE)
    {
        // The I/O operation _does_ fit into a chunk, but the dataset is
        // infinite in _all directions_ and small chunksizes can be very
        // inefficient -> optimize towards some base value
        log->debug("Single I/O operation does fit into chunk.");
        log->debug("Optimizing chunks in unlimited dimensions to be closer "
                   "to base chunksize ...");

        opt_chunks_target(_chunks,
                          CHUNKSIZE_BASE, // <- target value
                          typesize,
                          CHUNKSIZE_MAX,
                          CHUNKSIZE_MIN,
                          larger_high_dims,
                          log);
        // NOTE There is no issue with going beyond the maximum chunksize here
    }
    else
    {
        // No other optimization towards a target size makes sense
        log->debug("Single I/O operation does fit into a chunk.");
    }

    // To be on the safe side: Check that _chunks did not exceed max_extend
    for (unsigned short i = 0; i < rank; i++)
    {
        if (_chunks[i] > max_extend[i])
        {
            log->warn("Optimization led to chunks larger than max_extend. "
                      "This should not have happened!");
            _chunks[i] = max_extend[i];
        }
    }

    // .. Step 2: Optimize by taking the max_extend into account ..............

    // This is only possible if the current chunk size is not already above the
    // upper limit, CHUNKSIZE_MAX, and the max_extend is not already reached.
    // Also, it should not be enabled if the optimization towards unlimited
    // dimensions was already performed.
    if (!(opt_inf_dims && all_dims_inf) && (_chunks != max_extend) &&
        (bytes(_chunks) < CHUNKSIZE_MAX))
    {
        log->debug("Have max_extend information and can (potentially) use it "
                   "to optimize chunk extensions.");

        opt_chunks_with_max_extend(_chunks,
                                   max_extend,
                                   typesize,
                                   CHUNKSIZE_MAX,
                                   opt_inf_dims,
                                   larger_high_dims,
                                   log);
    }
    // else: no further optimization possible

    // Done.
    // Make sure that the chunksize is smaller than the maximum chunksize
    if (bytes(_chunks) > CHUNKSIZE_MAX)
    {
        throw std::runtime_error(
            "Byte size of chunks " + to_str(_chunks) +
            " is larger than CHUNKSIZE_MAX! This should not have happened!");
    }

    // Create a const version of the temporary chunks vector
    const Cont chunks(_chunks);
    log->info("Optimized chunk size: {}", to_str(chunks));

    return chunks;
}
References

    void opt_chunks_with_max_extend(Cont &chunks, const Cont &max_extend, const hsize_t typesize, const unsigned int CHUNKSIZE_MAX, const bool opt_inf_dims, const bool larger_high_dims, const Logger &log)
        Optimize chunk sizes using max_extend information.
        Definition: hdfchunking.hh:305

    void opt_chunks_target(Cont &chunks, double bytes_target, const hsize_t typesize, const unsigned int CHUNKSIZE_MAX, const unsigned int CHUNKSIZE_MIN, const bool larger_high_dims, const Logger &log)
        Optimizes the chunks along all axes to find a good default.
        Definition: hdfchunking.hh:113

    std::string to_str(const Cont &vec)
        Helper function to create a string representation of containers.
        Definition: hdfchunking.hh:65