docarray.array.mixins.dataloader package

Submodules

Module contents

class docarray.array.mixins.dataloader.DataLoaderMixin

Bases: object

classmethod dataloader(path, func, batch_size, protocol='protobuf', compress=None, backend='thread', num_worker=None, pool=None, show_progress=False)

Load array elements from path, group them into batches, map func over the batches in parallel, and yield each batch as a DocumentArray.
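The batch-then-map-in-parallel pattern described above can be sketched with the standard library alone. This is a minimal illustration, not docarray's implementation: the names iter_batches and batched_parallel_map are hypothetical, plain lists stand in for DocumentArray, the items come from an in-memory iterable rather than a file, and the sketch yields whatever func returns for each batch.

```python
from itertools import islice
from multiprocessing.pool import ThreadPool
from typing import Callable, Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to batch_size items (hypothetical helper)."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch  # the last batch may be smaller than batch_size


def batched_parallel_map(
    items: Iterable[T],
    func: Callable[[List[T]], R],
    batch_size: int,
    num_worker: Optional[int] = None,
) -> Iterator[R]:
    """Apply func to each batch on a thread pool, yielding results in input order."""
    # ThreadPool(None) uses the number of CPUs reported by the system.
    with ThreadPool(num_worker) as pool:
        # imap returns results in the same order as the input batches.
        yield from pool.imap(func, iter_batches(items, batch_size))


# Seven items in batches of three: two full batches and one smaller final batch.
batch_sums = list(batched_parallel_map(range(7), sum, batch_size=3, num_worker=2))
```

Because the function is a generator, no work starts until the caller begins iterating, and the pool is shut down once the generator is exhausted.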

Parameters:
  • path (Union[str, Path]) – Path or filename where the data is stored.

  • func (Callable[[DocumentArray], T]) – a function that takes a DocumentArray as input and may return anything. You can either modify elements in place (only with the thread backend) or work with the returned values later.

  • batch_size (int) – Size of each generated batch (except the last one, which might be smaller)

  • protocol (str) – serialization protocol to use

  • compress (Optional[str]) – compression algorithm to use

  • backend (str) –

    Whether to use multi-process or multi-thread as the parallelization backend. In general, if your func is IO-bound, thread is usually good enough. If your func is CPU-bound, process may be the better choice. In practice, you should try both to find the best value for your workload. However, if you wish to modify the elements in place, you should always use the thread backend, regardless of whether func is IO-bound or CPU-bound.

    Warning

    When using the process backend, you should not expect func to modify elements in place. This is because the multiprocessing backend passes the variable via pickle and works on it in another process. The passed object and the original object do not share the same memory.

  • num_worker (Optional[int]) – the number of parallel workers. If not given, then the number of CPUs in the system will be used.

  • pool (Union[Pool, ThreadPool, None]) – use an existing/external pool. If given, backend is ignored and you will be responsible for closing the pool.

  • show_progress (bool) – if set, show a progress bar
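The difference between the two backends for in-place modification can be illustrated without docarray. The sketch below is an assumption-laden stand-in: plain dicts play the role of documents, tag_in_place is a hypothetical func, and a pickle round trip simulates what a worker process receives instead of actually spawning one.

```python
import pickle
from multiprocessing.pool import ThreadPool


def tag_in_place(batch):
    """Hypothetical func: mutate every element of the batch in place."""
    for doc in batch:
        doc["tagged"] = True


# Thread backend: workers share memory with the caller,
# so the mutation is visible on the original objects.
thread_docs = [{"id": 0}, {"id": 1}]
with ThreadPool(2) as pool:
    pool.map(tag_in_place, [thread_docs])

# Process backend (simulated): arguments are pickled and rebuilt in the worker,
# so the worker mutates a copy and the caller's objects stay untouched.
process_docs = [{"id": 0}, {"id": 1}]
worker_copy = pickle.loads(pickle.dumps(process_docs))
tag_in_place(worker_copy)

thread_tagged = all("tagged" in doc for doc in thread_docs)
process_tagged = any("tagged" in doc for doc in process_docs)
```

With a process pool, the usual workaround is to have func return the modified batch and use the returned value instead of relying on side effects.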

Return type:

Generator[DocumentArray, None, None]

Returns: