edflow.data.dataset_mixin module
Summary
Classes:
ConcatenatedDataset | A dataset which concatenates given datasets.
DatasetMixin | Our fork of the chainer-Dataset class.
SubDataset | A subset of a given dataset.
Reference
class edflow.data.dataset_mixin.DatasetMixin
Bases: object
Our fork of the chainer-Dataset class. Every dataset used with edflow should at some point inherit from this base class.
Notes
Necessary and best practices
When implementing your own dataset you need to specify the following methods:
__len__ defines how many examples are in the dataset.
get_example returns one of those examples given an index. The example must be a dictionary.
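The two required methods can be sketched as follows. This is a minimal illustration written as a plain Python class so that it runs without edflow installed; in real use the class would inherit from DatasetMixin. The class name `SquaresDataset` and the keys of the example dictionary are hypothetical.

```python
class SquaresDataset:
    """Hypothetical toy dataset; with edflow it would subclass DatasetMixin."""

    def __init__(self, n_examples):
        self.n_examples = n_examples

    def __len__(self):
        # Number of examples in the dataset.
        return self.n_examples

    def get_example(self, idx):
        # Every example is a dictionary.
        return {"index": idx, "square": idx ** 2}


dataset = SquaresDataset(4)
print(len(dataset))            # 4
print(dataset.get_example(3))  # {'index': 3, 'square': 9}
```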
Labels
Additionally, the dataset class should specify an attribute labels, which works like a dictionary with lists or arrays behind each keyword that have the same length as the dataset. The dictionary can also be empty if you do not want to define labels.
The philosophy behind having both a get_example() method and the labels attribute is to split the dataset into compute-heavy and cheap parts. Labels should be quick to load at construction time, e.g. by loading a .npy file or a .csv. They can then be used to quickly manipulate the dataset. When getting the actual example we can do the heavy lifting like loading and/or manipulating images.
Warning
Labels must be dicts of numpy arrays and not lists! Otherwise many operations do not work and result in incomprehensible errors.
Batching
As one usually works with batched datasets, the compute-heavy steps can be hidden through parallelization. This is all done by make_batches(), which is invoked by edflow automatically.
Default Behaviour
As one sometimes stacks and chains multiple levels of datasets, it can become cumbersome to define __len__, get_example and labels if all one wants to do is forward to the respective implementations of some other dataset, as can be seen in the code example below:

    class SomeDerivedDataset(DatasetMixin):
        def __init__(self):
            self.other_data = SomeOtherDataset()
            self.labels = self.other_data.labels

        def __len__(self):
            return len(self.other_data)

        def get_example(self, idx):
            return self.other_data[idx]

This can be omitted by defining a data attribute when constructing the dataset. DatasetMixin implements these methods with the default behaviour of wrapping the corresponding methods of the underlying data attribute. Thus the above example becomes

    class SomeDerivedDataset(DatasetMixin):
        def __init__(self):
            self.data = SomeOtherDataset()

If self.data has a labels attribute, labels of the derived dataset will be taken from self.data.
``+`` and ``*``
Sometimes you want to concatenate two datasets or multiply the length of one dataset by concatenating it several times to itself. This can easily be done by adding Datasets or multiplying one by an integer factor.

    A = C + B  # Adding two datasets
    D = 3 * A  # Multiplying a dataset by an integer

The above is equivalent to

    A = ConcatenatedDataset(C, B)  # Adding two datasets
    D = ConcatenatedDataset(A, A, A)  # Multiplying a dataset by an integer

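How such operators can map onto concatenation is sketched below. This is not edflow's implementation: `ToyDataset` and `ToyConcatenated` are hypothetical stand-ins that only illustrate the idea of overloading `+` and `*` and of translating a global index into an index of one of the concatenated datasets.

```python
import bisect


class ToyDataset:
    """Hypothetical stand-in for a dataset; not edflow's DatasetMixin."""

    def __init__(self, name, length):
        self.name = name
        self.length = length

    def __len__(self):
        return self.length

    def get_example(self, idx):
        return {"source": self.name, "index": idx}

    def __add__(self, other):
        # C + B gives the concatenation of both datasets.
        return ToyConcatenated(self, other)

    def __mul__(self, factor):
        # A * 3 gives A concatenated with itself three times.
        if not isinstance(factor, int):
            return NotImplemented
        return ToyConcatenated(*([self] * factor))

    # 3 * A should behave like A * 3.
    __rmul__ = __mul__


class ToyConcatenated(ToyDataset):
    """Hypothetical stand-in for ConcatenatedDataset."""

    def __init__(self, *datasets):
        self.datasets = datasets
        # Cumulative lengths, e.g. lengths (2, 3) give offsets [2, 5].
        self.offsets = []
        total = 0
        for d in datasets:
            total += len(d)
            self.offsets.append(total)

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def get_example(self, idx):
        # Find the dataset that contains the global index idx.
        which = bisect.bisect_right(self.offsets, idx)
        start = self.offsets[which - 1] if which > 0 else 0
        return self.datasets[which].get_example(idx - start)


C = ToyDataset("C", 2)
B = ToyDataset("B", 3)
A = C + B
D = 3 * A
print(len(A), len(D))    # 5 15
print(A.get_example(2))  # {'source': 'B', 'index': 0}
print(D.get_example(7))  # {'source': 'B', 'index': 0}
```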
Labels in the example ``dict``
Oftentimes it is good to store and load some values as labels, as it can increase performance and decrease storage size, e.g. when storing scalar values. If you need these values to be returned by the get_example() method, simply activate this behaviour by setting the attribute append_labels to True.

    class SomeDerivedDataset(DatasetMixin):
        def __init__(self):
            self.labels = {'a': [1, 2, 3]}
            self.append_labels = True

        def get_example(self, idx):
            return {'a': idx**2, 'b': idx}

        def __len__(self):
            return 3

    S = SomeDerivedDataset()
    a = S[2]
    print(a)  # {'a': 3, 'b': 2}

    S.append_labels = False
    a = S[2]
    print(a)  # {'a': 4, 'b': 2}

Labels are appended to your example after all code in your get_example method has been executed. Thus, if there are keys in your labels which can also be found in the example, the label entries will override the values in your example, as can be seen in the example above.
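The override order can be reproduced with plain dictionaries. The helper `append_labels_to_example` below is hypothetical and only mirrors the behaviour described here; it is not part of edflow.

```python
def append_labels_to_example(example, labels, idx):
    """Hypothetical helper: merge the idx-th entry of every label into example."""
    merged = dict(example)  # copy so the original example is left untouched
    for key, values in labels.items():
        # Label entries are written last, so they win on key collisions.
        merged[key] = values[idx]
    return merged


labels = {"a": [1, 2, 3]}
example = {"a": 2 ** 2, "b": 2}  # same values as get_example(2) in the example above

merged = append_labels_to_example(example, labels, 2)
print(merged)   # {'a': 3, 'b': 2}
print(example)  # {'a': 4, 'b': 2}
```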
get_example(*args, **kwargs)
Note
Please see the documentation of DatasetMixin to avoid confusion.
Add default behaviour for datasets defining an attribute data, which in turn is a dataset. This happens often when stacking several datasets on top of each other.
The default behaviour now is to return self.data.get_example(idx) if possible, and otherwise revert to the original behaviour.
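A rough sketch of this kind of delegation is given below. `DelegatingDataset` is a hypothetical stand-in, not edflow's code, and the "original behaviour" is assumed here to be raising NotImplementedError when neither get_example nor data is defined.

```python
class DelegatingDataset:
    """Hypothetical stand-in illustrating the fallback to self.data."""

    def __len__(self):
        return len(self.data)

    def get_example(self, idx):
        data = getattr(self, "data", None)
        if data is not None:
            # Forward the call to the wrapped dataset.
            return data.get_example(idx)
        # Assumption: without a data attribute the method must be overridden.
        raise NotImplementedError("define get_example or a data attribute")


class Numbers:
    """Hypothetical innermost dataset."""

    def __len__(self):
        return 3

    def get_example(self, idx):
        return {"value": idx * 10}


class Wrapper(DelegatingDataset):
    def __init__(self):
        self.data = Numbers()


wrapped = Wrapper()
print(len(wrapped))            # 3
print(wrapped.get_example(2))  # {'value': 20}
```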
property labels
Add default behaviour for datasets defining an attribute data, which in turn is a dataset. This happens often when stacking several datasets on top of each other.
The default behaviour is to return self.data.labels if possible, and otherwise revert to the original behaviour.
property append_labels
property expand
class edflow.data.dataset_mixin.ConcatenatedDataset(*datasets, balanced=False)
Bases: edflow.data.dataset_mixin.DatasetMixin
A dataset which concatenates given datasets.
__init__(*datasets, balanced=False)
Parameters
*datasets (DatasetMixin) – All datasets we want to concatenate.
balanced (bool) – If True, all datasets are padded to the length of the longest dataset. Padding is done in a cycled fashion.
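One plausible reading of cycled padding is shown below: a shorter dataset is repeated from its start until it reaches the length of the longest one. The helper `cycled_indices` is hypothetical, and the exact order in which edflow arranges the padded examples is not confirmed by this page.

```python
def cycled_indices(length, target_length):
    """Hypothetical helper: indices that pad a dataset by cycling through it."""
    return [i % length for i in range(target_length)]


lengths = {"small": 2, "large": 5}
longest = max(lengths.values())

# Each dataset is padded to the length of the longest one.
balanced = {name: cycled_indices(n, longest) for name, n in lengths.items()}
print(balanced["small"])  # [0, 1, 0, 1, 0]
print(balanced["large"])  # [0, 1, 2, 3, 4]
print(sum(len(v) for v in balanced.values()))  # 10
```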
property labels
Add default behaviour for datasets defining an attribute data, which in turn is a dataset. This happens often when stacking several datasets on top of each other.
The default behaviour is to return self.data.labels if possible, and otherwise revert to the original behaviour.
class edflow.data.dataset_mixin.SubDataset(data, subindices)
Bases: edflow.data.dataset_mixin.DatasetMixin
A subset of a given dataset.
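The idea of selecting a subset through a list of indices can be sketched as follows. `ToySubset` and `Letters` are hypothetical stand-ins; the sketch assumes that position i of the subset maps to entry subindices[i] of the full dataset, which this page does not spell out.

```python
class ToySubset:
    """Hypothetical stand-in for SubDataset(data, subindices)."""

    def __init__(self, data, subindices):
        self.data = data
        self.subindices = subindices

    def __len__(self):
        return len(self.subindices)

    def get_example(self, i):
        # Position i in the subset maps to subindices[i] in the full dataset.
        return self.data.get_example(self.subindices[i])


class Letters:
    """Hypothetical full dataset with one example per letter."""

    def __len__(self):
        return 5

    def get_example(self, idx):
        return {"letter": "abcde"[idx]}


subset = ToySubset(Letters(), [4, 0, 2])
print(len(subset))            # 3
print(subset.get_example(0))  # {'letter': 'e'}
```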
get_example(i)
Get example and process. Wrapped to make sure the stack trace is printed in case something goes wrong and we are in a MultiprocessIterator.
property labels
Add default behaviour for datasets defining an attribute data, which in turn is a dataset. This happens often when stacking several datasets on top of each other.
The default behaviour is to return self.data.labels if possible, and otherwise revert to the original behaviour.