edflow.data.util.util_dsets module

Summary

Classes:

`DataFolder`	Given the root of a possibly nested folder containing datafiles and a Callable that generates the labels to the datafile from its full name, this class creates a labeled dataset.
`RandomlyJoinedDataset`	Load multiple examples which have the same label.

Functions:

`JoinedDataset`	Concat n_joins random samples based on the condition that example_i[key] == example_j[key] for all i,j.
`getDebugDataset`	Loads a dataset from the config and makes it reasonably small.

Reference

edflow.data.util.util_dsets.JoinedDataset(dataset, key, n_joins): Concatenate n_joins random samples such that example_i[key] == example_j[key] for all i, j. key must be among the labels of dataset.
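The join condition can be illustrated with a minimal, self-contained sketch. It does not use edflow: `join_by_key`, the toy records, and the `pid` label are hypothetical stand-ins, and the real function may combine the samples differently.

```python
import random
from collections import defaultdict


def join_by_key(examples, key, n_joins, seed=0):
    """Illustrative stand-in, not edflow's implementation."""
    rng = random.Random(seed)

    # Group example indices by the value of the join label.
    groups = defaultdict(list)
    for idx, example in enumerate(examples):
        groups[example[key]].append(idx)

    joined = []
    for example in examples:
        # Draw n_joins - 1 random partners that share example[key].
        partners = [rng.choice(groups[example[key]]) for _ in range(n_joins - 1)]
        joined.append([example] + [examples[j] for j in partners])
    return joined


toy = [
    {"pid": 0, "x": "a"},
    {"pid": 1, "x": "b"},
    {"pid": 0, "x": "c"},
    {"pid": 1, "x": "d"},
]
joined = join_by_key(toy, key="pid", n_joins=3)
```

Every group in `joined` holds `n_joins` samples that agree on `pid`, which is the condition example_i[key] == example_j[key] stated above.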

edflow.data.util.util_dsets.getDebugDataset(config)

Loads a dataset from the config and makes it reasonably small. The config syntax works as in getSeqDataset(). See there for more extensive documentation.

Parameters

config (dict) –

An edflow config with at least the key debugdataset and, nested inside it, the keys dataset and debug_length, which define the base dataset and its size.

Returns

A dataset based on the base dataset, of the specified length.

Return type

SubDataset
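A config of the shape described above might look like the following sketch. The import path `my_project.data.MyDataset` and the length 25 are placeholders, and passing the dataset as an import-path string is an assumption; see getSeqDataset() for the authoritative syntax.

```python
# Minimal config sketch for getDebugDataset (values are placeholders).
config = {
    "debugdataset": {
        # Assumed: the base dataset is named by an import path.
        "dataset": "my_project.data.MyDataset",
        # Desired size of the shrunken dataset.
        "debug_length": 25,
    },
}

# With edflow installed, the call would presumably be:
#   from edflow.data.util.util_dsets import getDebugDataset
#   small = getDebugDataset(config)
```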

class edflow.data.util.util_dsets.RandomlyJoinedDataset(config)

Bases: edflow.data.dataset_mixin.DatasetMixin, edflow.util.PRNGMixin

Load multiple examples which have the same label.

Required config parameters:

RandomlyJoinedDataset/dataset: The dataset from which to load examples.
RandomlyJoinedDataset/key: The key of the label to join on.

Optional config parameters:

test_mode=False: If True, behaves deterministically.
RandomlyJoinedDataset/n_joins=2: How many examples to load.
RandomlyJoinedDataset/balance=False: If True and not in test_mode, sample join labels uniformly.
RandomlyJoinedDataset/avoid_identity=True: If True and not in test_mode, never return a pair containing the same image if possible.
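As a sketch, the parameters listed above could be written as a nested config. This assumes that `/` in a parameter name denotes nesting inside the config dict; the dataset import path and the label name `pid` are placeholders.

```python
# Config sketch for RandomlyJoinedDataset (assumes "/" means nesting).
config = {
    "test_mode": False,  # optional, top-level
    "RandomlyJoinedDataset": {
        "dataset": "my_project.data.MyDataset",  # required, placeholder path
        "key": "pid",                            # required, placeholder label
        "n_joins": 2,                            # optional, default 2
        "balance": False,                        # optional, default False
        "avoid_identity": True,                  # optional, default True
    },
}
```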

The i-th example returns:

'examples': A list of examples, where each example has the same label as specified by key. If balance is False, the first element of the list will be the i-th example of the dataset.

The dataset's labels are the same as those of dataset. Be careful: examples[j] of the i-th example does not correspond to the i-th entry of the labels but to the examples[j]["index_"]-th entry.
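The index caveat is easiest to see on toy data. In this sketch, `labels` and `item` are hand-written stand-ins for the dataset's labels and for one returned example; the label names and values are hypothetical.

```python
# Stand-in for the labels of the underlying dataset.
labels = {"pid": [0, 1, 0, 1], "name": ["a", "b", "c", "d"]}

# Stand-in for the 0-th joined example with n_joins=2:
# the partner was drawn from position 2, not position 0.
item = {
    "examples": [
        {"index_": 0, "image": "img_a"},
        {"index_": 2, "image": "img_c"},
    ]
}

# Look up labels through index_ rather than through the outer index.
names = [labels["name"][ex["index_"]] for ex in item["examples"]]
pids = [labels["pid"][ex["index_"]] for ex in item["examples"]]
```

Here `names` is `["a", "c"]`, whereas indexing the labels with the outer index 0 would yield `"a"` for both entries.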

__init__(config): Initialize self. See help(type(self)) for accurate signature.

property labels: Careful: this can only give the labels of the original item, not of the joined ones. Use examples[j]["index_"] to get the correct label index.

get_example(i)

Note

Please see the documentation of DatasetMixin to avoid confusion.

Add default behaviour for datasets defining an attribute data, which in turn is a dataset. This happens often when stacking several datasets on top of each other.

The default behaviour now is to return self.data.get_example(idx) if possible, and otherwise revert to the original behaviour.

class edflow.data.util.util_dsets.DataFolder(image_root, read_fn, label_fn, sort_keys=None, in_memory_keys=None, legacy=True, show_bar=False)

Bases: edflow.data.dataset_mixin.DatasetMixin

Given the root of a possibly nested folder containing datafiles and a Callable that generates the labels to the datafile from its full name, this class creates a labeled dataset.

Unwanted data can be filtered out by having label_fn return None for those specific files. The actual files are only read when __getitem__ is called.

If for example label_fn returns a dict with the keys ['a', 'b', 'c'] and read_fn returns one with keys ['d', 'e'] then the dict returned by __getitem__ will contain the keys ['a', 'b', 'c', 'd', 'e', 'file_path_', 'index_'].
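The key merging and the None filter can be sketched without edflow. `collect` and `get_example` below are hypothetical helpers that mimic the behaviour described above; the file names and the `cls`/`num`/`content` keys are made up for illustration.

```python
import os
import tempfile


def label_fn(path):
    """Derive labels from the file name; return None to skip a file."""
    name = os.path.basename(path)
    if not name.endswith(".txt"):
        return None
    cls, num = name[:-4].split("_")
    return {"cls": cls, "num": int(num)}


def read_fn(path):
    """Read the datum itself; only called when an example is requested."""
    with open(path) as f:
        return {"content": f.read()}


def collect(root, label_fn):
    """Walk a nested folder and keep (path, labels) for labeled files."""
    kept = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            labels = label_fn(path)
            if labels is not None:
                kept.append((path, labels))
    kept.sort(key=lambda pair: pair[0])
    return kept


def get_example(kept, i, read_fn):
    """Merge labels, datum, file path and index into one dict."""
    path, labels = kept[i]
    example = dict(labels)
    example.update(read_fn(path))
    example["file_path_"] = path
    example["index_"] = i
    return example


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "nested"))
    files = [
        ("cat_1.txt", "meow"),
        (os.path.join("nested", "dog_2.txt"), "woof"),
        ("notes.md", "skip me"),  # label_fn returns None, so it is ignored
    ]
    for rel, text in files:
        with open(os.path.join(root, rel), "w") as f:
            f.write(text)

    kept = collect(root, label_fn)
    n_kept = len(kept)
    first = get_example(kept, 0, read_fn)
```

Only the two `.txt` files survive the filter, and `first` carries the label keys, the datum key, `file_path_` and `index_` in a single dict.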

__init__(image_root, read_fn, label_fn, sort_keys=None, in_memory_keys=None, legacy=True, show_bar=False)

Parameters

image_root (str) – Root containing the files of interest.
read_fn (Callable) – Given the path to a file, returns the datum as a dict.
label_fn (Callable) – Given the path to a file, returns a dict of labels. If label_fn returns None, this file is ignored.
sort_keys (list) – A hierarchy of keys by which the data in this Dataset are sorted.
in_memory_keys (list) – Keys which will be collected from examples when the dataset is cached.
legacy (bool) – Use the old read method, where only the path to the current file is passed to the reader. The new version also sees all labels that have been collected previously.
show_bar (bool) – Show a loading bar when loading labels.

get_example(i): Load the files specified in example i.