Handling Data#

In this section we will take a look at utopya’s capabilities to handle simulation data.

Todo

Expand this page.


The DataManager#

Objects of this class are the home of all your simulation data. One such DataManager object is initialized together with the Multiverse and is thereafter available as its dm attribute. It is set up with a load configuration and, upon invocation of its load_from_cfg() method, loads the simulation data according to that configuration. It is equipped to handle hierarchical data, storing it as a data tree.

Hint

To visually inspect the tree representation, you can use the tree property: print(dm.tree). This also works with every group-like member of the tree.

This functionality is all based on the dantro package, which provides a uniform interface to handle hierarchically structured data. However, while the interface is uniform, the parts of the data tree can be adapted to ideally handle the underlying data.

One example of such a specialization is the GridDC class, a data container that represents data from a grid. It is tightly coupled to Utopia's data output on the C++ side, where the most efficient way to write data is along the index of the entities rather than along the x and y coordinates. For analysis, however, one expects data with the dimensions x, y, and time; the GridDC takes care of reshaping the data accordingly.

Handling Large Amounts of Data#

To handle large amounts of simulation data (which is not uncommon), the DataManager provides so-called proxy loading for HDF5 data: instead of loading the data directly into memory, only the structure and metadata of the HDF5 file are used to generate the data tree. Where the data would normally be stored in the data containers, a proxy object is placed (in this case: Hdf5Proxy). Upon access to the data, the proxy is automatically resolved: the data is loaded into memory and replaces the proxy object in the data container.

Objects that were loaded as proxy are marked with (proxy) in the tree representation. To load HDF5 data as proxy, use the hdf5_proxy loader in the Default Load Configuration.

These proxy objects already make handling large amounts of data much easier, because the data is only loaded if needed.

Loading files in parallel#

Even though data is only loaded into proxy objects, this process can take a considerable amount of time if the to-be-loaded HDF5 files contain many groups or datasets. In such a case, the DataManager is busy mostly with creating the corresponding Python objects, and less so with loading the actual data from the files. (In other words, this task is CPU-bound, not I/O-bound.)

Consequently, using multiple CPUs to build the data tree is beneficial in such scenarios. The dantro data loading interface supports parallel loading, and Utopia allows controlling this behavior directly via the CLI:

utopia eval MyModel --load-parallel

The above command enables parallel loading and uses all available CPUs; see the CLI --help for details.

If you want more control, you can also directly configure it via the meta-configuration. Have a look at the corresponding section in the Multiverse Base Configuration for available options, e.g. for using parallel loading depending on the number of files or their total file size:

parallel:
  enabled: false

  # Number of processes to use; negative is deduced from os.cpu_count()
  processes: ~

  # Threshold values for parallel loading; if any is below these
  # numbers, loading will *not* be in parallel.
  min_files: 5
  min_total_size: 104857600  # 100 MiB

Hint

The parallel option is available for every entry in the data_manager.load_cfg. However, given the constant overhead of starting new loader processes, it makes the most sense for the data entry, where the HDF5 files’ content is loaded.

How about huge amounts of data?#

There will be scenarios in which the data to be analyzed exceeds the limits of the physical memory of the machine. Here, proxy objects don’t help, as they only postpone the loading.

For that purpose, dantro, which heavily relies on xarray for the representation of numerical data, can make use of xarray's dask integration. The dask package allows working on chunked data, e.g. HDF5 data, loading only those parts that are necessary for a calculation and freeing up the memory again afterwards. Additionally, dask first builds a tree of the operations that are to be performed and optimizes that tree; only when the actual numerical result is needed does the data have to be loaded. Furthermore, as the data is chunked, computations can potentially profit from parallelization. More on that can be found in the dask documentation.

To use dask when loading Utopia data, the proxy needs to be told not to resolve into the actual data, but into a dask representation of it. This is done by setting the resolve_as_dask flag. Arguments can be passed to the proxy by adding the proxy_kwargs argument to the configuration of a data entry. Add the following part to the root level of your run configuration, which will update the defaults:

data_manager:
  load_cfg:
    data:
      proxy_kwargs:
        resolve_as_dask: true

parameter_space:
  # ... your usual arguments

Note

When plotting data via utopia eval, you can also specify a run configuration. Check the utopia eval --help to find out how.

Once this has succeeded, you will see proxy (hdf5, dask) in the tree representation of your loaded data.

There are two other ways to set this entry (following Utopia’s configuration hierarchy principle):

  • In the CLI, you can additionally use the --set-cfg argument for utopia eval and utopia run to set the entry:

utopia eval MyModel --set-cfg data_manager.load_cfg.data.proxy_kwargs.resolve_as_dask=true
  • To permanently set this entry, you can write it to your user configuration:

utopia config user --get --set data_manager.load_cfg.data.proxy_kwargs.resolve_as_dask=true

This then applies to all models you work with. As dask slows down some operations, setting this permanently only makes sense if you mostly work with large data and tend to forget to enable dask otherwise.

Configuration and API Reference#

Default Load Configuration#

Below, the default DataManager configuration is included, which also specifies the default load configuration. Each entry of the load_cfg key refers to one so-called “data entry”. Files that match the glob_str are loaded using a certain loader and placed at a target_path within the data tree.

# The DataManager takes care of loading the data into a tree-like structure
# after the simulations are finished.
# It is based on the DataManager class from the dantro package. See there for
# full documentation.
data_manager:
  # Where to create the output directory for this DataManager, relative to
  # the run directory of the Multiverse.
  out_dir: eval/{timestamp:}
  # The {timestamp:} placeholder is replaced by the current timestamp such that
  # future DataManager instances that operate on the same data directory do
  # not create collisions.
  # Directories are created recursively, if they do not exist.

  # Define the structure of the data tree beforehand; this allows to specify
  # the types of groups before content is loaded into them.
  # NOTE The strings given to the Cls argument are mapped to a type using a
  #      class variable of the DataManager
  create_groups:
    - path: multiverse
      Cls: MultiverseGroup

  # Where the default tree cache file is located relative to the data
  # directory. This is used when calling DataManager.dump and .restore without
  # any arguments, as done e.g. in the Utopia CLI.
  default_tree_cache_path: data/.tree_cache.d3

  # Supply a default load configuration for the DataManager
  # This can then be invoked using the dm.load_from_cfg() method.
  load_cfg:
    # Load the frontend configuration files from the config/ directory
    # Each file refers to a level of the configuration that is supplied to
    # the Multiverse: base <- user <- model <- run <- update
    cfg:
      loader: yaml                          # The loader function to use
      glob_str: 'config/*.yml'              # Which files to load
      ignore:                               # Which files to ignore
        - config/parameter_space.yml
        - config/parameter_space_info.yml
        - config/full_parameter_space.yml
        - config/full_parameter_space_info.yml
      required: true                        # Whether these files are required
      path_regex: config/(\w+)_cfg.yml      # Extract info from the file path
      target_path: cfg/{match:}             # ...and use in target path

    # Load the parameter space object into the MultiverseGroup attributes
    pspace:
      loader: yaml_to_object                # Load into ObjectContainer
      glob_str: config/parameter_space.yml
      required: true
      load_as_attr: true
      unpack_data: true                     # ... and store as ParamSpace obj.
      target_path: multiverse

    # Load the configuration files that are generated for _each_ simulation
    # These hold all information that is available to a single simulation and
    # are in an explicit, human-readable form.
    uni_cfg:
      loader: yaml
      glob_str: data/uni*/config.yml
      required: true
      path_regex: data/uni(\d+)/config.yml
      target_path: multiverse/{match:}/cfg
      parallel:
        enabled: true
        min_files: 1000
        min_total_size: 1048576  # 1 MiB

    # Load the binary output data from each simulation.
    data:
      loader: hdf5_proxy
      glob_str: data/uni*/data.h5
      required: true
      path_regex: data/uni(\d+)/data.h5
      target_path: multiverse/{match:}/data
      enable_mapping: true   # see DataManager for content -> type mapping

      # Options for loading data in parallel (speeds up CPU-limited loading)
      parallel:
        enabled: false

        # Number of processes to use; negative is deduced from os.cpu_count()
        processes: ~

        # Threshold values for parallel loading; if any is below these
        # numbers, loading will *not* be in parallel.
        min_files: 5
        min_total_size: 104857600  # 100 MiB

DataManager#

class utopya.datamanager.DataManager(data_dir: str, *, name: Optional[str] = None, load_cfg: Optional[Union[dict, str]] = None, out_dir: Union[str, bool] = '_output/{timestamp:}', out_dir_kwargs: Optional[dict] = None, create_groups: Optional[List[Union[str, dict]]] = None, condensed_tree_params: Optional[dict] = None, default_tree_cache_path: Optional[str] = None)

Bases: dantro.data_loaders.AllAvailableLoadersMixin, dantro.data_mngr.DataManager

This class manages the data that is written out by Utopia simulations.

It is based on the dantro.DataManager class and adds the functionality for specific loader functions that are needed in Utopia: Hdf5 and Yaml.

Furthermore, to enable file caching via the DAG framework, all available data loaders are included here.

Initializes a DataManager for the specified data directory.

Parameters
  • data_dir (str) – the directory the data can be found in. If this is a relative path, it is considered relative to the current working directory.

  • name (str, optional) – which name to give to the DataManager. If no name is given, the data directory's basename will be used

  • load_cfg (Union[dict, str], optional) – The base configuration used for loading data. If a string is given, assumes it to be the path to a YAML file and loads it using the load_yml() function. If None is given, it can still be supplied to the load() method later on.

  • out_dir (Union[str, bool], optional) – where output is written to. If this is given as a relative path, it is considered relative to the data_dir. A formatting operation with the keys timestamp and name is performed on this, where the latter is the name of the data manager. If set to False, no output directory is created.

  • out_dir_kwargs (dict, optional) – Additional arguments that affect how the output directory is created.

  • create_groups (List[Union[str, dict]], optional) – If given, these groups will be created after initialization. If the list entries are strings, the default group class will be used; if they are dicts, the name key specifies the name of the group and the Cls key specifies the type. If a string is given instead of a type, the lookup happens from the _DATA_GROUP_CLASSES variable.

  • condensed_tree_params (dict, optional) – If given, will set the parameters used for the condensed tree representation. Available options: max_level and condense_thresh, where the latter may be a callable. See dantro.base.BaseDataGroup._tree_repr() for more information.

  • default_tree_cache_path (str, optional) – The path to the default tree cache file. If not given, uses the value from the class variable _DEFAULT_TREE_CACHE_PATH. Whichever value was chosen is then prepared using the _parse_file_path() method, which regards relative paths as being relative to the associated data directory.

__abstractmethods__ = frozenset({})
__annotations__ = {}
__module__ = 'utopya.datamanager'