docarray.array.mixins.dataloader.helper module#

class docarray.array.mixins.dataloader.helper.DocumentArrayLoader(path, protocol='protobuf', compress=None, show_progress=False)[source]#

Bases: ParallelMixin, GroupMixin

apply(*args, **kwargs)#

Apply func to every Document in itself, returning itself after modification.

Parameters:
  • func – a function that takes Document as input and outputs Document.

  • backend

    thread for multi-threading and process for multi-processing. Defaults to thread. In general, if your func is IO-bound then thread is a good choice. If your func is CPU-bound, then you may use process. In practice, you should experiment to find the best choice. However, if you wish to modify the elements in-place, you should always use the thread backend, regardless of whether func is IO- or CPU-bound.

    Warning

    When using the process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes variables via pickle and works in another process; the passed object and the original object do not share the same memory.

  • num_worker – the number of parallel workers. If not given, then the number of CPUs in the system will be used.

  • pool – use an existing/external process or thread pool. If given, backend is ignored and you will be responsible for closing the pool.

  • show_progress – show a progress bar

Return type:

T

Returns:

itself after modification

apply_batch(*args, **kwargs)#

Batches itself into mini-batches, applies func to every mini-batch, and returns itself after the modifications.

EXAMPLE USAGE

from docarray import Document, DocumentArray

da = DocumentArray([Document(text='The cake is a lie') for _ in range(100)])


def func(batch):
    batch.texts = [t.upper() for t in batch.texts]
    return batch


da.apply_batch(func, batch_size=10)
print(da.texts[:3])
['THE CAKE IS A LIE', 'THE CAKE IS A LIE', 'THE CAKE IS A LIE']
Parameters:
  • func – a function that takes DocumentArray as input and outputs DocumentArray.

  • backend

    thread for multi-threading and process for multi-processing. Defaults to thread. In general, if your func is IO-bound then thread is a good choice. If your func is CPU-bound, then you may use process. In practice, you should experiment to find the best choice. However, if you wish to modify the elements in-place, you should always use the thread backend, regardless of whether func is IO- or CPU-bound.

    Warning

    When using the process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes variables via pickle and works in another process; the passed object and the original object do not share the same memory.

  • num_worker – the number of parallel workers. If not given, then the number of CPUs in the system will be used.

  • batch_size – Size of each generated batch (except the last batch, which might be smaller). Default: 32

  • shuffle – If set, shuffle the Documents before dividing into minibatches.

  • show_progress – show a progress bar

  • pool – use an existing/external process or thread pool. If given, backend is ignored and you will be responsible for closing the pool.

Return type:

T

Returns:

itself after modification

batch(batch_size, shuffle=False, show_progress=False)#

Creates a Generator that yields DocumentArray of size batch_size until docs is fully traversed along the traversal_path. None docs are filtered out, and docs can optionally be filtered by checking for the existence of a Document attribute. Note that the last batch might be smaller than batch_size.

Parameters:
  • batch_size (int) – Size of each generated batch (except the last one, which might be smaller, default: 32)

  • shuffle (bool) – If set, shuffle the Documents before dividing into minibatches.

  • show_progress (bool) – if set, show a progress bar when batching documents.

Yield:

a Generator of DocumentArray, each in the length of batch_size

Return type:

Generator[DocumentArray, None, None]

batch_ids(batch_size, shuffle=False)#

Creates a Generator that yields lists of ids of size batch_size until self is fully traversed. Note that the last batch might be smaller than batch_size.

Parameters:
  • batch_size (int) – Size of each generated batch (except the last one, which might be smaller)

  • shuffle (bool) – If set, shuffle the Documents before dividing into minibatches.

Yield:

a Generator of list of IDs, each in the length of batch_size

Return type:

Generator[List[str], None, None]

map(func, backend='thread', num_worker=None, show_progress=False, pool=None)#

Return an iterator that applies function to every element of iterable in parallel, yielding the results.

See also

  • To process on a batch of elements, please use map_batch();

  • To return a DocumentArray, please use apply().

Parameters:
  • func (Callable[[Document], T]) – a function that takes Document as input and outputs anything. You can either modify elements in-place (only with the thread backend) or work later on the returned elements.

  • backend (str) –

    thread for multi-threading and process for multi-processing. Defaults to thread. In general, if your func is IO-bound then thread is a good choice. If your func is CPU-bound, then you may use process. In practice, you should experiment to find the best choice. However, if you wish to modify the elements in-place, you should always use the thread backend, regardless of whether func is IO- or CPU-bound.

    Warning

    When using the process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes variables via pickle and works in another process; the passed object and the original object do not share the same memory.

  • num_worker (Optional[int]) – the number of parallel workers. If not given, then the number of CPUs in the system will be used.

  • show_progress (bool) – show a progress bar

  • pool (Union[Pool, ThreadPool, None]) – use an existing/external process or thread pool. If given, backend is ignored and you will be responsible for closing the pool.

Yield:

anything returned from func

Return type:

Generator[T, None, None]

map_batch(func, batch_size, backend='thread', num_worker=None, shuffle=False, show_progress=False, pool=None)#

Return an iterator that applies function to every minibatch of iterable in parallel, yielding the results. Each element in the returned iterator is a DocumentArray.

See also

  • To process a single element, please use map();

  • To return DocumentArray, please use apply_batch().

Parameters:
  • batch_size (int) – Size of each generated batch (except the last one, which might be smaller, default: 32)

  • shuffle (bool) – If set, shuffle the Documents before dividing into minibatches.

  • func (Callable[[DocumentArray], T]) – a function that takes DocumentArray as input and outputs anything. You can either modify elements in-place (only with the thread backend) or work later on the returned elements.

  • backend (str) –

    thread for multi-threading and process for multi-processing. Defaults to thread. In general, if your func is IO-bound then thread is a good choice. If your func is CPU-bound, then you may use process. In practice, you should experiment to find the best choice. However, if you wish to modify the elements in-place, you should always use the thread backend, regardless of whether func is IO- or CPU-bound.

    Warning

    When using the process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes variables via pickle and works in another process; the passed object and the original object do not share the same memory.

  • num_worker (Optional[int]) – the number of parallel workers. If not given, then the number of CPUs in the system will be used.

  • show_progress (bool) – show a progress bar

  • pool (Union[Pool, ThreadPool, None]) – use an existing/external pool. If given, backend is ignored and you will be responsible for closing the pool.

Yield:

anything returned from func

Return type:

Generator[T, None, None]

split_by_tag(tag)#

Split the DocumentArray into multiple DocumentArray according to the tag value of each Document.

Parameters:

tag (str) – the tag name to split by; its value is looked up in each Document's tags.

Return type:

Dict[Any, DocumentArray]

Returns:

a dict where Documents with the same value on tag are grouped together; their order is preserved from the original DocumentArray.

Note

If the Documents' tags do not contain the specified tag, an empty dict is returned.