docarray.array.mixins.dataloader package
Submodules
Module contents
- class docarray.array.mixins.dataloader.DataLoaderMixin
Bases: object
- classmethod dataloader(path, func, batch_size, protocol='protobuf', compress=None, backend='thread', num_worker=None, pool=None, show_progress=False)
Loads array elements, batches them, maps each batch with a function in parallel, and finally yields each batch as a DocumentArray.
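The batch-then-map-then-yield flow described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not docarray's implementation: `iter_batches` and `map_batches` are hypothetical helper names, a plain `range` stands in for the stored Documents, and the sketch yields whatever `func` returns for each batch.

```python
from itertools import islice
from multiprocessing.pool import ThreadPool


def iter_batches(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items (hypothetical helper)."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def map_batches(items, func, batch_size, num_worker=None):
    """Apply `func` to each batch on a thread pool and yield the results in order."""
    # ThreadPool(None) falls back to the number of CPUs reported by the system.
    with ThreadPool(num_worker) as pool:
        yield from pool.imap(func, iter_batches(items, batch_size))


# `len` plays the role of `func`; the last batch is smaller than batch_size.
batch_sizes = list(map_batches(range(10), len, batch_size=4, num_worker=2))
print(batch_sizes)  # [4, 4, 2]
```

Ten items with a batch size of 4 give two full batches and a final batch of 2, which matches the `batch_size` note that the last batch might be smaller.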
- Parameters:
  - path (Union[str, Path]) – Path or filename where the data is stored.
  - func (Callable[[DocumentArray], T]) – A function that takes a DocumentArray as input and returns anything. You can either modify elements in place (only with the thread backend) or work with the returned values later.
  - batch_size (int) – Size of each generated batch (except the last one, which might be smaller).
  - protocol (str) – Protocol to use.
  - compress (Optional[str]) – Compression algorithm to use.
  - backend (str) – Whether to use multi-process or multi-thread as the parallelization backend. In general, if func is IO-bound, thread is probably good enough; if func is CPU-bound, you may want process. In practice, try both to find the best value. However, if you wish to modify the elements in place, always use the thread backend, regardless of whether func is IO-bound or CPU-bound.

    Warning: When using the process backend, do not expect func to modify elements in place. The multiprocessing backend passes the variable via pickle and works in another process, so the passed object and the original object do not share the same memory.
  - num_worker (Optional[int]) – Number of parallel workers. If not given, the number of CPUs in the system is used.
  - pool (Union[Pool, ThreadPool, None]) – An existing/external pool to use. If given, backend is ignored and you are responsible for closing the pool.
  - show_progress (bool) – If set, show a progress bar.
- Return type:
Generator[DocumentArray, None, None]
- Returns:
A generator that yields one DocumentArray per batch.
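The warning about in-place modification under the process backend can be illustrated without docarray itself. The sketch below is a stand-in under stated assumptions: plain dicts play the role of Documents, `tag_batch` is a hypothetical `func`, and a pickle round trip simulates how multiprocessing hands arguments to a worker process.

```python
import pickle
from multiprocessing.pool import ThreadPool


def tag_batch(batch):
    """Hypothetical stand-in for `func`: mutates every element in place."""
    for item in batch:
        item["seen"] = True
    return len(batch)


# Plain dicts stand in for Documents; a list stands in for one batch.
batch = [{"id": i, "seen": False} for i in range(4)]

# Thread backend: workers share memory with the caller,
# so in-place edits are visible afterwards.
with ThreadPool(2) as pool:
    pool.map(tag_batch, [batch])
thread_visible = all(item["seen"] for item in batch)

# Process backend (simulated): the argument crosses the process boundary
# as a pickle, so the worker mutates a copy and the caller's objects stay untouched.
original = [{"id": i, "seen": False} for i in range(4)]
worker_copy = pickle.loads(pickle.dumps(original))
tag_batch(worker_copy)
process_visible = any(item["seen"] for item in original)

print(thread_visible, process_visible)  # True False
```

With a process-style backend, the practical consequence is that `func` should return the data it produces rather than rely on side effects on its input.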