docarray.array.mixins.dataloader package#
Submodules#
Module contents#
- class docarray.array.mixins.dataloader.DataLoaderMixin[source]#
Bases: object
- classmethod dataloader(path, func, batch_size, protocol='protobuf', compress=None, backend='thread', num_worker=None, pool=None, show_progress=False)[source]#
Load array elements, batch them, and map each batch through a function in parallel, finally yielding each batch as a DocumentArray.
- Parameters:
  - path (Union[str, Path]) – Path or filename where the data is stored.
  - func (Callable[[DocumentArray], T]) – a function that takes a DocumentArray as input and outputs anything. You can either modify elements in-place (only with the thread backend) or work on the returned elements afterwards.
  - batch_size (int) – Size of each generated batch (except the last one, which might be smaller).
  - protocol (str) – protocol to use.
  - compress (Optional[str]) – compression algorithm to use.
  - backend (str) – whether to use multi-process or multi-thread as the parallelization backend. In general, if your func is IO-bound then thread is probably good enough. If your func is CPU-bound, use process. In practice, you should try both to find the best value. However, if you wish to modify elements in-place, regardless of IO/CPU-bound, you should always use the thread backend.

    Warning: When using the process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes variables via pickle and works in another process; the passed object and the original object do not share the same memory.
  - num_worker (Optional[int]) – the number of parallel workers. If not given, the number of CPUs in the system is used.
  - pool (Union[Pool, ThreadPool, None]) – use an existing/external pool. If given, backend is ignored and you are responsible for closing the pool.
  - show_progress (bool) – if set, show a progress bar.
- Return type:
  Generator[DocumentArray, None, None]
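The batch-then-map-in-parallel pattern described above can be sketched with the standard library alone. This is a minimal illustration, not docarray's implementation: `batched_loader` is a hypothetical helper that splits a sequence into fixed-size batches and maps a function over them in a `ThreadPool`, yielding mapped batches in order, analogous to `dataloader` with `backend='thread'`.

```python
from multiprocessing.pool import ThreadPool


def batched_loader(items, func, batch_size, num_worker=None):
    """Yield batches of `items`, each mapped through `func` in a thread pool.

    Hypothetical sketch of the dataloader pattern: batch, map in
    parallel, yield results in batch order (the last batch may be
    smaller, as in the documented `batch_size` behaviour).
    """
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    # ThreadPool(None) defaults to the number of CPUs, mirroring the
    # documented `num_worker` default; imap preserves batch order.
    with ThreadPool(num_worker) as pool:
        yield from pool.imap(func, batches)


results = list(batched_loader(list(range(10)),
                              lambda batch: [x * 2 for x in batch],
                              batch_size=4))
# three batches of sizes 4, 4 and 2, each doubled element-wise
```

With the real API, the equivalent call would use the documented signature, e.g. `DataLoaderMixin.dataloader(path, func, batch_size, backend='thread')`, which yields each mapped batch as a DocumentArray.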