edflow.data.util.cached_dset module

Summary

Classes:

CachedDataset

Wraps a dataset of single examples in a cached (on-disk) version, which can be accessed much faster at runtime.

ExamplesFolder

Contains all examples and labels of a cached dataset.

PathCachedDataset

Used for simplified decorator interface to dataset caching.

Functions:

cachable

Decorator to cache datasets.

make_client_manager

make_server_manager

pickle_and_queue

Parallelizable function to retrieve and queue examples from a Dataset.

Reference

edflow.data.util.cached_dset.make_server_manager(port=63127, authkey=b'edcache')[source]
edflow.data.util.cached_dset.make_client_manager(ip, port=63127, authkey=b'edcache')[source]
edflow.data.util.cached_dset.pickle_and_queue(dataset_factory, inqueue, outqueue, naming_template='example_{}.p')[source]

Parallelizable function to retrieve and queue examples from a Dataset.

Parameters
  • dataset_factory (chainer.DatasetMixin) – A dataset factory, with methods described in CachedDataset.

  • inqueue (mp.Queue) – Queue from which lists of indices to cache are read.

  • outqueue (mp.Queue) – Queue into which the pickled examples are put.

  • naming_template (str) – Formattable string, which defines the name of the stored file given its index.
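The worker behaviour described by these parameters can be sketched as follows. This is a simplified, single-process stand-in (using queue.Queue instead of mp.Queue and omitting edflow's actual error handling), not the real implementation:

```python
import pickle
from queue import Queue  # stands in for mp.Queue in this single-process sketch

def pickle_and_queue_sketch(dataset_factory, inqueue, outqueue,
                            naming_template="example_{}.p"):
    """Pull chunks of indices from inqueue, pickle the corresponding
    examples and push (filename, payload) pairs into outqueue."""
    dataset = dataset_factory()
    while not inqueue.empty():
        indices = inqueue.get()  # one chunk (list) of indices
        for i in indices:
            name = naming_template.format(i)
            outqueue.put((name, pickle.dumps(dataset[i])))
```

A plain list works as a minimal dataset stand-in here, since it supports __len__ and __getitem__.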

class edflow.data.util.cached_dset.ExamplesFolder(root)[source]

Bases: object

Contains all examples and labels of a cached dataset.

__init__(root)[source]

Initialize self. See help(type(self)) for accurate signature.

read(name)[source]
class edflow.data.util.cached_dset.CachedDataset(dataset, force_cache=False, keep_existing=True, _legacy=True, chunk_size=64)[source]

Bases: edflow.data.dataset_mixin.DatasetMixin

Wraps a dataset of single examples in a cached (on-disk) version, which can be accessed much faster at runtime.

To avoid creating the dataset multiple times, it is checked if the cached version already exists.

Calling __getitem__ on this class will try to retrieve the samples from the cached dataset to reduce the preprocessing overhead.

The cached dataset is stored in the root directory of the base dataset, in the subfolder cached, under the name name.zip.
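For illustration, the cache location described above can be computed like this (a sketch; the helper name is made up and not part of edflow):

```python
import os

def cache_path(root, name):
    # <root>/cached/<name>.zip, as described above
    return os.path.join(root, "cached", name + ".zip")
```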

Besides the usual DatasetMixin interface, datasets to be cached must also implement

root  # (str) root folder to cache into
name  # (str) unique name

Optionally but highly recommended, they should provide

in_memory_keys # list(str) keys which will be collected from examples

The collected values are stored in a dict of lists, mapping each in_memory_key to a list containing the i-th value at the i-th place. This data structure is then exposed via the attribute labels and enables rapid iteration over useful labels without loading each example separately. That way, downstream datasets can filter the indices of the cached dataset efficiently, e.g. filtering based on train/eval splits.
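The dict-of-lists layout of labels can be illustrated with a small, made-up set of examples (the keys and values below are hypothetical):

```python
# Hypothetical examples; only "split" is declared an in-memory key.
examples = [
    {"image_path": "img_0.png", "split": "train"},
    {"image_path": "img_1.png", "split": "eval"},
    {"image_path": "img_2.png", "split": "train"},
]
in_memory_keys = ["split"]

# The i-th value of each key lands at the i-th place of its list.
labels = {k: [ex[k] for ex in examples] for k in in_memory_keys}

# Downstream filtering without loading any full example:
train_indices = [i for i, s in enumerate(labels["split"]) if s == "train"]
```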

Caching proceeds as follows. Expose a method which returns the dataset to be cached, e.g.

def DataToCache():
    path = "/path/to/data"
    return MyCachableDataset(path)

Start the caching server on host <server_ip_or_hostname>:

edcache --server --dataset import.path.to.DataToCache

Wake up a worker bee on the same or different hosts:

edcache --address <server_ip_or_hostname> --dataset import.path.to.DataToCache

Start a cacherhive!

__init__(dataset, force_cache=False, keep_existing=True, _legacy=True, chunk_size=64)[source]

Given a dataset class, stores all of its examples, if this has not yet happened.

Parameters
  • dataset (object) –

    Dataset class which defines the following methods:

    • root: returns the path to the raw data

    • name: returns the name of the dataset (should be unique)

    • __len__: number of examples in the dataset

    • __getitem__: returns a single datum

    • in_memory_keys: returns all keys that are stored alongside the dataset in a labels.p file. This allows retrieving labels more quickly and can be used to filter the data more easily.

  • force_cache (bool) – If True, the dataset is cached again even if a cached version already exists; the existing cache is overwritten.

  • keep_existing (bool) – If True, existing entries in the cache are not recomputed and only missing examples are appended to the cache. Useful if caching was interrupted.

  • _legacy (bool) – Read from the cached zip file. Deprecated mode; future datasets should not write into zips, as read times are very long.

  • chunk_size (int) – Length of the index list that is sent to each worker.
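How chunk_size partitions the index list can be sketched with the following illustrative helper (not part of edflow):

```python
def chunk_indices(n_examples, chunk_size=64):
    """Split range(n_examples) into lists of at most chunk_size indices,
    one list per message sent to a caching worker."""
    indices = list(range(n_examples))
    return [indices[i:i + chunk_size]
            for i in range(0, len(indices), chunk_size)]
```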

classmethod from_cache(root, name, _legacy=True)[source]

Use this constructor to avoid initialization of the original dataset, which can be useful if only the cached zip file is available or to avoid expensive dataset constructors.

property fork_safe_zip
cache_dataset()[source]

Checks if a dataset is stored. If not, iterates over all possible indices and stores the examples in a file, as well as the labels.

property labels

Returns the labels associated with the base dataset, but from the cached source.

property root

Returns the root to the base dataset.

get_example(i)[source]

Given an index i, returns an example.

class edflow.data.util.cached_dset.PathCachedDataset(dataset, path)[source]

Bases: edflow.data.util.cached_dset.CachedDataset

Used for simplified decorator interface to dataset caching.

__init__(dataset, path)[source]

Like CachedDataset, but caches into an explicitly given path instead of deriving the location from the dataset.

Parameters
  • dataset (DatasetMixin) – Dataset to cache; same interface requirements as for CachedDataset.

  • path (str) – Path to the zip file to cache into.

edflow.data.util.cached_dset.cachable(path)[source]

Decorator to cache datasets. If the dataset is not yet cached, a caching server is started; subsequent calls simply load from the cache. Currently all workers must be able to see the path. Be careful: function parameters are ignored on future calls. Can be used on any callable that returns a dataset. Currently the path should point to a zip file to cache into, i.e. it should end in .zip.
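The load-from-cache-on-later-calls behaviour, including the caveat that function parameters are ignored once the cache exists, can be sketched with a simplified, pickle-based stand-in for the decorator (not edflow's actual server-based implementation):

```python
import functools
import os
import pickle

def cachable_sketch(path):
    """Simplified stand-in for cachable: the first call computes the
    dataset and pickles it to path; later calls load the pickle and
    ignore all function parameters."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return pickle.load(f)
            data = fn(*args, **kwargs)
            with open(path, "wb") as f:
                pickle.dump(data, f)
            return data
        return wrapper
    return decorator
```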