edflow.data.util.cached_dset module
Summary
Classes:
CachedDataset – Using a Dataset of single examples creates a cached version, which can be accessed much faster at runtime.
ExamplesFolder – Contains all examples and labels of a cached dataset.
PathCachedDataset – Used for simplified decorator interface to dataset caching.
Functions:
cachable – Decorator to cache datasets.
pickle_and_queue – Parallelizable function to retrieve and queue examples from a Dataset.
Reference
-
edflow.data.util.cached_dset.pickle_and_queue(dataset_factory, inqueue, outqueue, naming_template='example_{}.p')[source]
Parallelizable function to retrieve and queue examples from a Dataset.
- Parameters
dataset_factory (chainer.DatasetMixin) – A dataset factory, with methods described in CachedDataset.
inqueue (mp.Queue) – Queue from which lists of indices are retrieved, used to fetch samples from the dataset.
outqueue (mp.Queue) – Queue to put the samples in.
naming_template (str) – Formattable string, which defines the name of the stored file given its index.
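The worker pattern this function implements can be sketched as follows. This is a simplified, single-process stand-in with assumed names (`pickle_worker`, the `None` sentinel protocol), not edflow's actual code:

```python
import pickle
from queue import Queue  # stands in for mp.Queue in this single-process sketch

def pickle_worker(dataset, inqueue, outqueue, naming_template="example_{}.p"):
    # Read lists of indices from inqueue until a None sentinel arrives,
    # pickle each example and put (filename, bytes) on outqueue.
    while True:
        indices = inqueue.get()
        if indices is None:
            break
        for i in indices:
            payload = pickle.dumps(dataset[i])
            outqueue.put((naming_template.format(i), payload))

dataset = [{"value": i} for i in range(4)]
inq, outq = Queue(), Queue()
inq.put([0, 1]); inq.put([2, 3]); inq.put(None)
pickle_worker(dataset, inq, outq)
name, payload = outq.get()
print(name, pickle.loads(payload))  # example_0.p {'value': 0}
```

In the real, parallel setting several such workers would consume index chunks from the same input queue, which is what makes the function parallelizable.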
-
class edflow.data.util.cached_dset.ExamplesFolder(root)[source]
Bases: object
Contains all examples and labels of a cached dataset.
-
class edflow.data.util.cached_dset.CachedDataset(dataset, force_cache=False, keep_existing=True, _legacy=True, chunk_size=64)[source]
Bases: edflow.data.dataset_mixin.DatasetMixin
Using a Dataset of single examples creates a cached (precomputed and stored) version, which can be accessed much faster at runtime.
To avoid creating the dataset multiple times, it is checked whether the cached version already exists.
Calling __getitem__ on this class will try to retrieve the samples from the cached dataset, reducing the preprocessing overhead.
The cached dataset is stored in the root directory of the base dataset, in the subfolder cached, under the name name.zip.
Besides the usual DatasetMixin interface, datasets to be cached must also implement
root # (str) root folder to cache into
name # (str) unique name
Optionally, but highly recommended, they should provide
in_memory_keys # list(str) keys which will be collected from examples
The collected values are stored in a dict of lists, mapping an in_memory_key to a list containing the i-th value at the i-th place. This data structure is then exposed via the attribute labels and enables rapid iteration over useful labels without loading each example separately. That way, downstream datasets can filter the indices of the cached dataset efficiently, e.g. based on train/eval splits.
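As a minimal illustration of this dict-of-lists structure (the keys and values below are made up), filtering indices via labels avoids loading any example:

```python
# Illustrative labels structure: each in_memory_key maps to a list whose
# i-th entry belongs to the i-th example of the dataset.
labels = {
    "split": ["train", "eval", "train", "train"],
    "class": [0, 1, 0, 2],
}

# Downstream filtering using only the labels, without touching the examples.
train_indices = [i for i, s in enumerate(labels["split"]) if s == "train"]
print(train_indices)  # [0, 2, 3]
```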
Caching proceeds as follows: Expose a method which returns the dataset to be cached, e.g.

def DataToCache():
    path = "/path/to/data"
    return MyCachableDataset(path)

Start the caching server on host <server_ip_or_hostname>:
edcache --server --dataset import.path.to.DataToCache
Wake up a worker bee on the same or different hosts:
edcache --address <server_ip_or_hostname> --dataset import.path.to.DataToCache
Start a cacherhive!
-
__init__(dataset, force_cache=False, keep_existing=True, _legacy=True, chunk_size=64)[source]
Given a dataset class, stores all examples in the dataset, if this has not yet happened.
- Parameters
dataset (object) – Dataset class which defines the following methods:
root: returns the path to the raw data
name: returns the name of the dataset; should be unique
__len__: number of examples in the dataset
__getitem__: returns a single datum
in_memory_keys: returns all keys that are stored alongside the dataset in a labels.p file. This allows to retrieve labels more quickly and can be used to filter the data more easily.
force_cache (bool) – If True, the dataset is cached even if a cached version already exists; the existing cache is overwritten.
keep_existing (bool) – If True, existing entries in the cache will not be recomputed and only non-existing examples are appended to the cache. Useful if caching was interrupted.
_legacy (bool) – Read from the cached zip file. Deprecated mode. Future datasets should not write into zips as read times are very long.
chunk_size (int) – Length of the index list that is sent to the worker.
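A sketch of the interface described above, with a plain object standing in for a DatasetMixin subclass (all names and values are illustrative):

```python
# Hypothetical sketch: the attributes CachedDataset expects from a dataset
# to be cached. A plain class stands in for edflow's DatasetMixin here.
class MyCachableDataset:
    def __init__(self, path):
        self.path = path
        self._data = [{"value": i * i} for i in range(4)]

    @property
    def root(self):
        return self.path  # folder the cache will be written into

    @property
    def name(self):
        return "my_dataset"  # should be unique per dataset

    @property
    def in_memory_keys(self):
        return ["value"]  # keys collected into labels.p

    def __len__(self):
        return len(self._data)

    def __getitem__(self, i):
        return self._data[i]  # a single example

ds = MyCachableDataset("/tmp/data")
print(len(ds), ds[2]["value"])  # 4 4
```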
-
classmethod from_cache(root, name, _legacy=True)[source]
Use this constructor to avoid initialization of the original dataset, which can be useful if only the cached zip file is available or to avoid expensive dataset constructors.
-
property fork_safe_zip
-
cache_dataset()[source]
Checks if a dataset is stored. If not, iterates over all possible indices and stores the examples in a file, as well as the labels.
-
property labels
Returns the labels associated with the base dataset, but from the cached source.
-
property root
Returns the root of the base dataset.
-
class edflow.data.util.cached_dset.PathCachedDataset(dataset, path)[source]
Bases: edflow.data.util.cached_dset.CachedDataset
Used for simplified decorator interface to dataset caching.
-
__init__(dataset, path)[source]
Given a dataset class, stores all examples in the dataset, if this has not yet happened.
- Parameters
dataset (object) – Dataset class which defines the following methods:
root: returns the path to the raw data
name: returns the name of the dataset; should be unique
__len__: number of examples in the dataset
__getitem__: returns a single datum
in_memory_keys: returns all keys that are stored alongside the dataset in a labels.p file. This allows to retrieve labels more quickly and can be used to filter the data more easily.
-
edflow.data.util.cached_dset.cachable(path)[source]
Decorator to cache datasets. If not cached, will start a caching server; subsequent calls will simply load from the cache. Currently all workers must be able to see the path. Be careful: function parameters are ignored on future calls. Can be used on any callable that returns a dataset. Currently the path should be the path to a zip file to cache into, i.e. it should end in .zip.
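The call pattern can be sketched with a much-simplified stand-in decorator (pickling to a file instead of starting edflow's caching server); note how parameters are ignored once the cache exists, as warned above:

```python
import os
import pickle
import tempfile
from functools import wraps

def cachable(path):
    """Simplified stand-in for the decorator described above: the first call
    builds and pickles the dataset to `path`; later calls load from the
    cache and ignore the function's parameters."""
    def decorator(dataset_fn):
        @wraps(dataset_fn)
        def wrapper(*args, **kwargs):
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return pickle.load(f)  # cache hit: arguments are ignored
            dataset = dataset_fn(*args, **kwargs)
            with open(path, "wb") as f:
                pickle.dump(dataset, f)
            return dataset
        return wrapper
    return decorator

cache_path = os.path.join(tempfile.mkdtemp(), "demo_dset.p")

@cachable(cache_path)
def make_dataset(n):
    return list(range(n))  # stands in for an expensive dataset build

print(make_dataset(5))   # [0, 1, 2, 3, 4]
print(make_dataset(99))  # [0, 1, 2, 3, 4]  (cache hit, parameter ignored)
```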