docarray.array.mixins.parallel module#
- class docarray.array.mixins.parallel.ParallelMixin[source]#
Bases:
object
Helper functions that provide parallel map to
DocumentArray
- apply(func: Callable[[Document], Document], backend: str = 'thread', num_worker: Optional[int] = None, show_progress: bool = False, pool: Optional[Union[Pool, ThreadPool]] = None) T [source]#
Apply
func
to every Document in itself, return itself after modification.- Parameters:
func – a function that takes
Document
as input and outputsDocument
.backend –
thread for multi-threading and process for multi-processing. Defaults to thread. In general, if your
func
is IO-bound then thread is a good choice. If yourfunc
is CPU-bound, then you may use process. In practice, you should try yourselves to figure out the best value. However, if you wish to modify the elements in-place, regardless of IO/CPU-bound, you should always use thread backend.Warning
When using process backend, you should not expect
func
modify elements in-place. This is because the multiprocessing backend passes the variable via pickle and works in another process. The passed object and the original object do not share the same memory.num_worker – the number of parallel workers. If not given, then the number of CPUs in the system will be used.
pool – use an existing/external process or thread pool. If given, backend is ignored and you will be responsible for closing the pool.
show_progress – show a progress bar
- Return type:
T
- Returns:
itself after modification
- map(func, backend='thread', num_worker=None, show_progress=False, pool=None)[source]#
Return an iterator that applies function to every element of iterable in parallel, yielding the results.
See also
To process on a batch of elements, please use
map_batch()
;To return a
DocumentArray
, please useapply()
.
- Parameters:
func (
Callable
[[Document], T]) – a function that takesDocument
as input and outputs anything. You can either modify elements in-place (only with thread backend) or work later on return elements.backend (
str
) –thread for multi-threading and process for multi-processing. Defaults to thread. In general, if your
func
is IO-bound then thread is a good choice. If yourfunc
is CPU-bound, then you may use process. In practice, you should try yourselves to figure out the best value. However, if you wish to modify the elements in-place, regardless of IO/CPU-bound, you should always use thread backend.Warning
When using process backend, you should not expect
func
modify elements in-place. This is because the multiprocessing backing passes the variable via pickle and works in another process. The passed object and the original object do not share the same memory.num_worker (
Optional
[int
]) – the number of parallel workers. If not given, then the number of CPUs in the system will be used.show_progress (
bool
) – show a progress barpool (
Union
[Pool, ThreadPool,None
]) – use an existing/external process or thread pool. If given, backend is ignored and you will be responsible for closing the pool.
- Yield:
anything return from
func
- Return type:
Generator
[T,None
,None
]
- apply_batch(func: Callable[[DocumentArray], DocumentArray], batch_size: int, backend: str = 'thread', num_worker: Optional[int] = None, shuffle: bool = False, show_progress: bool = False, pool: Optional[Union[Pool, ThreadPool]] = None) T [source]#
Batches itself into mini-batches, applies func to every mini-batch, and return itself after the modifications.
EXAMPLE USAGE
from docarray import Document, DocumentArray da = DocumentArray([Document(text='The cake is a lie') for _ in range(100)]) def func(doc): da.texts = [t.upper() for t in da.texts] return da da.apply_batch(func, batch_size=10) print(da.texts[:3])
['THE CAKE IS A LIE', 'THE CAKE IS A LIE', 'THE CAKE IS A LIE']
- Parameters:
func – a function that takes
DocumentArray
as input and outputsDocumentArray
.backend –
thread for multi-threading and process for multi-processing. Defaults to thread. In general, if your
func
is IO-bound then thread is a good choice. If yourfunc
is CPU-bound, then you may use process. In practice, you should try yourselves to figure out the best value. However, if you wish to modify the elements in-place, regardless of IO/CPU-bound, you should always use thread backend.Warning
When using process backend, you should not expect
func
modify elements in-place. This is because the multiprocessing backend passes the variable via pickle and works in another process. The passed object and the original object do not share the same memory.num_worker – the number of parallel workers. If not given, then the number of CPUs in the system will be used.
batch_size – Size of each generated batch (except the last batch, which might be smaller). Default: 32
shuffle – If set, shuffle the Documents before dividing into minibatches.
show_progress – show a progress bar
pool – use an existing/external process or thread pool. If given, backend is ignored and you will be responsible for closing the pool.
- Return type:
T
- Returns:
itself after modification
- map_batch(func, batch_size, backend='thread', num_worker=None, shuffle=False, show_progress=False, pool=None)[source]#
Return an iterator that applies function to every minibatch of iterable in parallel, yielding the results. Each element in the returned iterator is
DocumentArray
.See also
To process single element, please use
map()
;To return
DocumentArray
, please useapply_batch()
.
- Parameters:
batch_size (
int
) – Size of each generated batch (except the last one, which might be smaller, default: 32)shuffle (
bool
) – If set, shuffle the Documents before dividing into minibatches.func (
Callable
[[DocumentArray], T]) – a function that takesDocumentArray
as input and outputs anything. You can either modify elements in-place (only with thread backend) or work later on return elements.backend (
str
) –if to use multi-process or multi-thread as the parallelization backend. In general, if your
func
is IO-bound then perhaps thread is good enough. If yourfunc
is CPU-bound then you may use process. In practice, you should try yourselves to figure out the best value. However, if you wish to modify the elements in-place, regardless of IO/CPU-bound, you should always use thread backend.Warning
When using process backend, you should not expect
func
modify elements in-place. This is because the multiprocessing backing pass the variable via pickle and work in another process. The passed object and the original object do not share the same memory.num_worker (
Optional
[int
]) – the number of parallel workers. If not given, then the number of CPUs in the system will be used.show_progress (
bool
) – show a progress barpool (
Union
[Pool, ThreadPool,None
]) – use an existing/external pool. If given, backend is ignored and you will be responsible for closing the pool.
- Yield:
anything return from
func
- Return type:
Generator
[T,None
,None
]