edflow.data.dataset_mixin module

Summary

Classes:

`ConcatenatedDataset`	A dataset which concatenates given datasets.
`DatasetMixin`	Our fork of the chainer-`Dataset` class. Every Dataset used with `edflow` should at some point inherit from this base class.
`SubDataset`	A subset of a given dataset.

Reference

class edflow.data.dataset_mixin.DatasetMixin

Bases: object

Our fork of the chainer-Dataset class. Every Dataset used with edflow should at some point inherit from this base class.

Notes

Necessary and best practices

When implementing your own dataset you need to specify the following methods:

__len__ defines how many examples are in the dataset.

get_example returns one of those examples given an index. The example must be a dictionary.

Labels

Additionally, the dataset class should specify an attribute labels, which works like a dictionary with an array behind each keyword, where each array has the same length as the dataset. The dictionary can also be empty if you do not want to define labels.

The philosophy behind having both a get_example() method and the labels attribute is to split the dataset into compute heavy and easy parts. Labels should be quick to load at construction time, e.g. by loading a .npy file or a .csv. They can then be used to quickly manipulate the dataset. When getting the actual example we can do the heavy lifting like loading and/or manipulating images.
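The split between cheap labels and expensive examples can be sketched as follows. This is only an illustration: SquaresDataset is a hypothetical class that does not inherit from DatasetMixin, so that the snippet runs without edflow installed, and the "image" is a placeholder array rather than anything loaded from disk.

```python
import numpy as np


class SquaresDataset:
    """Hypothetical dataset; with edflow it would inherit from DatasetMixin."""

    def __init__(self, n=6):
        # Cheap part: labels are built once at construction time as
        # numpy arrays with one entry per example.
        self.labels = {
            "index": np.arange(n),
            "is_even": np.arange(n) % 2 == 0,
        }

    def __len__(self):
        return len(self.labels["index"])

    def get_example(self, idx):
        # Expensive part: stands in for loading or manipulating an image.
        image = np.full((4, 4), float(idx) ** 2)
        return {"image": image}


dataset = SquaresDataset()

# Labels allow selecting examples without running the expensive part.
even_indices = np.flatnonzero(dataset.labels["is_even"])
even_examples = [dataset.get_example(int(i)) for i in even_indices]
```

Here only the three selected examples are ever constructed; the selection itself touches nothing but the label arrays.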

Warning

Labels must be dicts of numpy arrays and not lists! Otherwise many operations do not work and result in incomprehensible errors.

Batching

As one usually works with batched datasets, the compute heavy steps can be hidden through parallelization. This is all done by make_batches(), which edflow invokes automatically.

Default Behaviour

As one sometimes stacks and chains multiple levels of datasets, it can become cumbersome to define __len__, get_example and labels if all one wants is to forward them to the respective implementations of some other dataset, as can be seen in the code example below:

from edflow.data.dataset_mixin import DatasetMixin

class SomeDerivedDataset(DatasetMixin):
    def __init__(self):
        self.other_data = SomeOtherDataset()
        self.labels = self.other_data.labels

    def __len__(self):
        return len(self.other_data)

    def get_example(self, idx):
        return self.other_data[idx]

This boilerplate can be omitted by defining a data attribute when constructing the dataset. DatasetMixin implements these methods with the default behaviour of wrapping the corresponding methods of the underlying data attribute. Thus the above example becomes

class SomeDerivedDataset(DatasetMixin):
    def __init__(self):
        self.data = SomeOtherDataset()

If self.data has a labels attribute, labels of the derived dataset will be taken from self.data.
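To make the forwarding idea concrete, here is a minimal sketch of how such defaults could work. DelegatingSketch is an illustrative stand-in written for this example and is not edflow's DatasetMixin code; TensDataset and Wrapper are hypothetical names as well.

```python
import numpy as np


class DelegatingSketch:
    """Illustrative stand-in; NOT edflow's actual DatasetMixin code."""

    def __len__(self):
        # Default: forward to the wrapped dataset.
        return len(self.data)

    def get_example(self, idx):
        # Default: forward to the wrapped dataset.
        return self.data.get_example(idx)

    def __getitem__(self, idx):
        return self.get_example(idx)

    @property
    def labels(self):
        # Default: take the labels from the wrapped dataset.
        return self.data.labels


class TensDataset(DelegatingSketch):
    """Hypothetical innermost dataset that implements everything itself."""

    def __len__(self):
        return 4

    def get_example(self, idx):
        return {"value": idx * 10}

    @property
    def labels(self):
        return {"parity": np.arange(4) % 2}


class Wrapper(DelegatingSketch):
    """Only sets data; length, examples and labels are forwarded."""

    def __init__(self):
        self.data = TensDataset()


wrapped = Wrapper()
```

A derived dataset then overrides only the methods whose behaviour it actually wants to change.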

+ and *

Sometimes you want to concatenate two datasets or multiply the length of one dataset by concatenating it several times to itself. This can easily be done by adding Datasets or multiplying one by an integer factor.

A = C + B  # Adding two datasets
D = 3 * A  # Multiplying a dataset by an integer

The above is equivalent to

A = ConcatenatedDataset(C, B)  # Adding two datasets
D = ConcatenatedDataset(A, A, A)  # Multiplying a dataset by an integer

Labels in the example dict

Oftentimes it is good to store and load some values as labels, as this can increase performance and decrease storage size, e.g. when storing scalar values. If you need these values to be returned by the get_example() method, simply activate this behaviour by setting the attribute append_labels to True.

from edflow.data.dataset_mixin import DatasetMixin

class SomeDerivedDataset(DatasetMixin):
    def __init__(self):
        self.labels = {'a': [1, 2, 3]}
        self.append_labels = True

    def get_example(self, idx):
        return {'a' : idx**2, 'b': idx}

    def __len__(self):
        return 3

S = SomeDerivedDataset()
a = S[2]
print(a)  # {'a': 3, 'b': 2}

S.append_labels = False
a = S[2]
print(a)  # {'a': 4, 'b': 2}

Labels are appended to your example after all code in your get_example method has been executed. Thus, if a key in your labels also appears in the example, the label entry overrides the value in your example, as can be seen in the example above.
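The override order amounts to a dictionary update applied after the example has been built. The helper merge_labels below is a hypothetical function written for illustration, not part of edflow; a tuple is used for the label values only to keep the snippet dependency-free, whereas real labels should be numpy arrays.

```python
def merge_labels(example, labels, idx):
    """Return a copy of example with the label values at idx written on top."""
    merged = dict(example)
    # Label entries come last, so they win over equally named example keys.
    merged.update({key: values[idx] for key, values in labels.items()})
    return merged


labels = {"a": (1, 2, 3)}
example = {"a": 2 ** 2, "b": 2}  # what get_example(2) returns above
merged = merge_labels(example, labels, 2)
```

The key "a" ends up with the label value 3 instead of the computed 4, while "b" is left untouched.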

get_example(*args, **kwargs)

Note

Please see the documentation of DatasetMixin to avoid confusion.

Add default behaviour for datasets defining an attribute data, which in turn is a dataset. This happens often when stacking several datasets on top of each other.

The default behaviour now is to return self.data.get_example(idx) if possible, and otherwise revert to the original behaviour.

property labels

Add default behaviour for datasets defining an attribute data, which in turn is a dataset. This happens often when stacking several datasets on top of each other.

The default behaviour is to return self.data.labels if possible, and otherwise revert to the original behaviour.

property append_labels

property expand

class edflow.data.dataset_mixin.ConcatenatedDataset(*datasets, balanced=False)

Bases: edflow.data.dataset_mixin.DatasetMixin

A dataset which concatenates given datasets.

__init__(*datasets, balanced=False)

Parameters

*datasets (DatasetMixin) – All datasets we want to concatenate
balanced (bool) – If True all datasets are padded to the length of the longest dataset. Padding is done in a cycled fashion.
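One plausible reading of cycled padding is that indices past the end of a shorter dataset wrap around to its beginning. The function below sketches that index arithmetic under two assumptions that the text above does not confirm: the padded datasets are laid out one after another, and wrapping uses a plain modulo. balanced_lookup is a hypothetical helper, and edflow's actual implementation may differ.

```python
def balanced_lookup(lengths, i):
    """Map a global index i to (dataset index, index inside that dataset).

    Assumes every dataset is padded to the length of the longest one and
    that padded positions cycle through the dataset from its start.
    """
    longest = max(lengths)
    dataset_idx, padded_idx = divmod(i, longest)
    return dataset_idx, padded_idx % lengths[dataset_idx]


# Two datasets of length 2 and 5: both count as length 5 when balanced.
lookup = [balanced_lookup([2, 5], i) for i in range(10)]
```

Under these assumptions the short dataset contributes its examples in the order 0, 1, 0, 1, 0, so both datasets are seen equally often.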

get_example(i): Get example and add dataset index to it.

property labels

Add default behaviour for datasets defining an attribute data, which in turn is a dataset. This happens often when stacking several datasets on top of each other.

The default behaviour is to return self.data.labels if possible, and otherwise revert to the original behaviour.

class edflow.data.dataset_mixin.SubDataset(data, subindices)

Bases: edflow.data.dataset_mixin.DatasetMixin

A subset of a given dataset.
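Conceptually, a subset view maps its own index i to subindices[i] in the wrapped dataset. The sketch below illustrates this with two hypothetical classes, SubsetSketch and LettersDataset; it is not edflow's SubDataset code, and subsetting the labels with the same indices is an assumption made here for illustration.

```python
import numpy as np


class LettersDataset:
    """Hypothetical base dataset with five examples."""

    def __init__(self):
        self.labels = {"code": np.array([97, 98, 99, 100, 101])}

    def __len__(self):
        return 5

    def get_example(self, idx):
        return {"letter": "abcde"[idx]}


class SubsetSketch:
    """Illustrative subset view; NOT edflow's SubDataset implementation."""

    def __init__(self, data, subindices):
        self.data = data
        self.subindices = np.asarray(subindices)

    def __len__(self):
        return len(self.subindices)

    def get_example(self, i):
        # Translate the subset index into an index of the wrapped dataset.
        return self.data.get_example(int(self.subindices[i]))

    @property
    def labels(self):
        # numpy fancy indexing selects the matching label entries.
        return {k: v[self.subindices] for k, v in self.data.labels.items()}


subset = SubsetSketch(LettersDataset(), [4, 0, 2])
```

Selecting label entries this way relies on the labels being numpy arrays, since a plain list cannot be indexed with an array of indices.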

__init__(data, subindices): Initialize self. See help(type(self)) for accurate signature.

get_example(i): Get example and process. Wrapped to make sure stacktrace is printed in case something goes wrong and we are in a MultiprocessIterator.

property labels

Add default behaviour for datasets defining an attribute data, which in turn is a dataset. This happens often when stacking several datasets on top of each other.

The default behaviour is to return self.data.labels if possible, and otherwise revert to the original behaviour.