docarray.array.mixins.dataloader.helper module#

class docarray.array.mixins.dataloader.helper.DocumentArrayLoader(path, protocol='protobuf', compress=None, show_progress=False)[source]#

Bases: ParallelMixin, GroupMixin

apply(*args, **kwargs)#

Apply func to every Document in itself, returning itself after modification.

Parameters:
  • func – a function that takes Document as input and outputs Document.

  • backend

    thread for multi-threading and process for multi-processing. Defaults to thread. In general, if your func is IO-bound then thread is a good choice. If your func is CPU-bound, then you may use process. In practice, you should experiment to find the best choice. However, if you wish to modify the elements in-place, you should always use the thread backend, regardless of whether func is IO- or CPU-bound.

    Warning

    When using the process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes variables via pickle and works in another process; the passed object and the original object do not share the same memory.

  • num_worker – the number of parallel workers. If not given, then the number of CPUs in the system will be used.

  • pool – use an existing/external process or thread pool. If given, backend is ignored and you will be responsible for closing the pool.

  • show_progress – show a progress bar

Return type:

T

Returns:

itself after modification

apply_batch(*args, **kwargs)#

Batches itself into mini-batches, applies func to every mini-batch, and returns itself after the modifications.

EXAMPLE USAGE

from docarray import Document, DocumentArray

da = DocumentArray([Document(text='The cake is a lie') for _ in range(100)])


def func(batch):
    batch.texts = [t.upper() for t in batch.texts]
    return batch


da.apply_batch(func, batch_size=10)
print(da.texts[:3])
['THE CAKE IS A LIE', 'THE CAKE IS A LIE', 'THE CAKE IS A LIE']
Parameters:
  • func – a function that takes DocumentArray as input and outputs DocumentArray.

  • backend

    thread for multi-threading and process for multi-processing. Defaults to thread. In general, if your func is IO-bound then thread is a good choice. If your func is CPU-bound, then you may use process. In practice, you should experiment to find the best choice. However, if you wish to modify the elements in-place, you should always use the thread backend, regardless of whether func is IO- or CPU-bound.

    Warning

    When using the process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes variables via pickle and works in another process; the passed object and the original object do not share the same memory.

  • num_worker – the number of parallel workers. If not given, then the number of CPUs in the system will be used.

  • batch_size – Size of each generated batch (except the last batch, which might be smaller). Default: 32

  • shuffle – If set, shuffle the Documents before dividing into minibatches.

  • show_progress – show a progress bar

  • pool – use an existing/external process or thread pool. If given, backend is ignored and you will be responsible for closing the pool.

Return type:

T

Returns:

itself after modification

batch(batch_size, shuffle=False, show_progress=False)#

Creates a Generator that yields DocumentArray of size batch_size until docs is fully traversed along the traversal_path. None docs are filtered out, and docs can optionally be filtered by checking for the existence of a Document attribute. Note that the last batch might be smaller than batch_size.

Parameters:
  • batch_size (int) – Size of each generated batch (except the last one, which might be smaller, default: 32)

  • shuffle (bool) – If set, shuffle the Documents before dividing into minibatches.

  • show_progress (bool) – if set, show a progress bar when batching documents.

Yield:

a Generator of DocumentArray, each in the length of batch_size

Return type:

Generator[DocumentArray, None, None]

batch_ids(batch_size, shuffle=False)#

Creates a Generator that yields lists of ids of size batch_size until self is fully traversed. Note that the last batch might be smaller than batch_size.

Parameters:
  • batch_size (int) – Size of each generated batch (except the last one, which might be smaller)

  • shuffle (bool) – If set, shuffle the Documents before dividing into minibatches.

Yield:

a Generator of list of IDs, each in the length of batch_size

Return type:

Generator[List[str], None, None]

map(func, backend='thread', num_worker=None, show_progress=False, pool=None)#

Return an iterator that applies function to every element of iterable in parallel, yielding the results.

See also

  • To process on a batch of elements, please use map_batch();

  • To return a DocumentArray, please use apply().

Parameters:
  • func (Callable[[Document], T]) – a function that takes Document as input and outputs anything. You can either modify elements in-place (only with the thread backend) or work later on the returned elements.

  • backend (str) –

    thread for multi-threading and process for multi-processing. Defaults to thread. In general, if your func is IO-bound then thread is a good choice. If your func is CPU-bound, then you may use process. In practice, you should experiment to find the best choice. However, if you wish to modify the elements in-place, you should always use the thread backend, regardless of whether func is IO- or CPU-bound.

    Warning

    When using the process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes variables via pickle and works in another process; the passed object and the original object do not share the same memory.

  • num_worker (Optional[int]) – the number of parallel workers. If not given, then the number of CPUs in the system will be used.

  • show_progress (bool) – show a progress bar

  • pool (Union[Pool, ThreadPool, None]) – use an existing/external process or thread pool. If given, backend is ignored and you will be responsible for closing the pool.

Yield:

anything returned from func

Return type:

Generator[T, None, None]

map_batch(func, batch_size, backend='thread', num_worker=None, shuffle=False, show_progress=False, pool=None)#

Return an iterator that applies function to every minibatch of iterable in parallel, yielding the results. Each element in the returned iterator is a DocumentArray.

See also

  • To process a single element, please use map();

  • To return DocumentArray, please use apply_batch().

Parameters:
  • batch_size (int) – Size of each generated batch (except the last one, which might be smaller, default: 32)

  • shuffle (bool) – If set, shuffle the Documents before dividing into minibatches.

  • func (Callable[[DocumentArray], T]) – a function that takes DocumentArray as input and outputs anything. You can either modify elements in-place (only with the thread backend) or work later on the returned elements.

  • backend (str) –

    thread for multi-threading and process for multi-processing. Defaults to thread. In general, if your func is IO-bound then thread is a good choice. If your func is CPU-bound, then you may use process. In practice, you should experiment to find the best choice. However, if you wish to modify the elements in-place, you should always use the thread backend, regardless of whether func is IO- or CPU-bound.

    Warning

    When using the process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes variables via pickle and works in another process; the passed object and the original object do not share the same memory.

  • num_worker (Optional[int]) – the number of parallel workers. If not given, then the number of CPUs in the system will be used.

  • show_progress (bool) – show a progress bar

  • pool (Union[Pool, ThreadPool, None]) – use an existing/external pool. If given, backend is ignored and you will be responsible for closing the pool.

Yield:

anything returned from func

Return type:

Generator[T, None, None]

split_by_tag(tag)#

Split the DocumentArray into multiple DocumentArray according to the tag value of each Document.

Parameters:

tag (str) – the tag name to split by; its value is looked up in each Document's tags.

Return type:

Dict[Any, DocumentArray]

Returns:

a dict where Documents with the same value on tag are grouped together; their order is preserved from the original DocumentArray.

Note

If the Documents' tags do not contain the specified tag, an empty dict is returned.