docarray.array.mixins.dataloader.helper module#
- class docarray.array.mixins.dataloader.helper.DocumentArrayLoader(path, protocol='protobuf', compress=None, show_progress=False)[source]#
Bases: ParallelMixin, GroupMixin
- apply(*args, **kwargs)#
Apply func to every Document in itself, return itself after modification.
- Parameters:
func – a function that takes Document as input and outputs Document.
backend – thread for multi-threading and process for multi-processing. Defaults to thread. In general, if your func is IO-bound then thread is a good choice. If your func is CPU-bound, then you may use process. In practice, you should try both to figure out the best value. However, if you wish to modify the elements in-place, regardless of IO/CPU-bound, you should always use the thread backend.
Warning
When using the process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes the variable via pickle and works in another process. The passed object and the original object do not share the same memory.
num_worker – the number of parallel workers. If not given, the number of CPUs in the system will be used.
pool – use an existing/external process or thread pool. If given, backend is ignored and you are responsible for closing the pool.
show_progress – show a progress bar
- Return type:
T
- Returns:
itself after modification
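The write-back behavior described above can be illustrated with a minimal pure-Python sketch (this is not the docarray implementation; `apply_sketch` is a hypothetical stand-in, and a plain list stands in for a DocumentArray):

```python
# A minimal pure-Python sketch of the apply() semantics above (not the
# docarray implementation): run func over every element with a thread
# pool, write each result back in place, and return the same container.
from multiprocessing.pool import ThreadPool

def apply_sketch(docs, func, num_worker=None):
    with ThreadPool(num_worker) as pool:
        # imap preserves input order, so results line up with indices
        for i, result in enumerate(pool.imap(func, docs)):
            docs[i] = result
    return docs

texts = ['the cake', 'is a lie']
same = apply_sketch(texts, str.upper)
assert same is texts  # returns itself after modification
print(texts)          # ['THE CAKE', 'IS A LIE']
```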
- apply_batch(*args, **kwargs)#
Batches itself into mini-batches, applies func to every mini-batch, and returns itself after the modifications.
EXAMPLE USAGE

```python
from docarray import Document, DocumentArray

da = DocumentArray([Document(text='The cake is a lie') for _ in range(100)])

def func(batch):
    batch.texts = [t.upper() for t in batch.texts]
    return batch

da.apply_batch(func, batch_size=10)
print(da.texts[:3])
```

['THE CAKE IS A LIE', 'THE CAKE IS A LIE', 'THE CAKE IS A LIE']
- Parameters:
func – a function that takes DocumentArray as input and outputs DocumentArray.
backend – thread for multi-threading and process for multi-processing. Defaults to thread. In general, if your func is IO-bound then thread is a good choice. If your func is CPU-bound, then you may use process. In practice, you should try both to figure out the best value. However, if you wish to modify the elements in-place, regardless of IO/CPU-bound, you should always use the thread backend.
Warning
When using the process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes the variable via pickle and works in another process. The passed object and the original object do not share the same memory.
num_worker – the number of parallel workers. If not given, the number of CPUs in the system will be used.
batch_size – Size of each generated batch (except the last batch, which might be smaller). Default: 32
shuffle – If set, shuffle the Documents before dividing into minibatches.
show_progress – show a progress bar
pool – use an existing/external process or thread pool. If given, backend is ignored and you are responsible for closing the pool.
- Return type:
T
- Returns:
itself after modification
- batch(batch_size, shuffle=False, show_progress=False)#
Creates a Generator that yields DocumentArrays of size batch_size until the docs are fully traversed along the traversal_path. The None docs are filtered out, and optionally the docs can be filtered by checking for the existence of a Document attribute. Note that the last batch might be smaller than batch_size.
- Parameters:
batch_size (int) – Size of each generated batch (except the last one, which might be smaller, default: 32)
shuffle (bool) – If set, shuffle the Documents before dividing into minibatches.
show_progress (bool) – if set, show a progress bar when batching documents.
- Yield:
a Generator of DocumentArray, each of length batch_size
- Return type:
Generator[DocumentArray, None, None]
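The slicing behavior (in particular the smaller final batch) can be sketched in pure Python, with a plain sequence standing in for a DocumentArray (`batch_sketch` is a hypothetical stand-in, not the docarray implementation):

```python
# Sketch of the batch() slicing semantics (not the docarray
# implementation): yield fixed-size slices of a sequence, optionally
# shuffling first; the last batch may be smaller than batch_size.
import random

def batch_sketch(seq, batch_size, shuffle=False):
    docs = list(seq)
    if shuffle:
        random.shuffle(docs)
    for i in range(0, len(docs), batch_size):
        yield docs[i:i + batch_size]

sizes = [len(b) for b in batch_sketch(range(10), batch_size=3)]
print(sizes)  # [3, 3, 3, 1] -- the last batch is smaller
```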
- batch_ids(batch_size, shuffle=False)#
Creates a Generator that yields lists of IDs of size batch_size until self is fully traversed. Note that the last batch might be smaller than batch_size.
- Parameters:
batch_size (int) – Size of each generated batch (except the last one, which might be smaller)
shuffle (bool) – If set, shuffle the Documents before dividing into minibatches.
- Yield:
a Generator of lists of IDs, each of length batch_size
- Return type:
Generator[List[str], None, None]
- map(func, backend='thread', num_worker=None, show_progress=False, pool=None)#
Return an iterator that applies func to every element of itself in parallel, yielding the results.
See also
To process a batch of elements, please use map_batch(); to return a DocumentArray, please use apply().
- Parameters:
func (Callable[[Document], T]) – a function that takes Document as input and outputs anything. You can either modify elements in-place (only with the thread backend) or work on the returned elements later.
backend (str) – thread for multi-threading and process for multi-processing. Defaults to thread. In general, if your func is IO-bound then thread is a good choice. If your func is CPU-bound, then you may use process. In practice, you should try both to figure out the best value. However, if you wish to modify the elements in-place, regardless of IO/CPU-bound, you should always use the thread backend.
Warning
When using the process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes the variable via pickle and works in another process. The passed object and the original object do not share the same memory.
num_worker (Optional[int]) – the number of parallel workers. If not given, the number of CPUs in the system will be used.
show_progress (bool) – show a progress bar
pool (Union[Pool, ThreadPool, None]) – use an existing/external process or thread pool. If given, backend is ignored and you are responsible for closing the pool.
- Yield:
anything returned from func
- Return type:
Generator[T, None, None]
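Why the thread backend permits in-place modification can be shown with a small stdlib sketch (not the docarray implementation): threads share memory, so mutations made by func inside the pool are visible to the caller, whereas the process backend pickles each element into a child process and such mutations are lost.

```python
# Sketch: with a thread pool, in-place mutations made by func are
# visible to the caller because threads share memory. (A process pool
# would pickle each element into a child process, losing the mutation.)
from multiprocessing.pool import ThreadPool

def func(doc):
    doc['text'] = doc['text'].upper()  # in-place mutation, shared memory

docs = [{'text': 'hello'}, {'text': 'world'}]
with ThreadPool(2) as pool:
    list(pool.imap(func, docs))  # drain the iterator so func runs on all
print(docs)  # [{'text': 'HELLO'}, {'text': 'WORLD'}]
```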
- map_batch(func, batch_size, backend='thread', num_worker=None, shuffle=False, show_progress=False, pool=None)#
Return an iterator that applies func to every mini-batch of itself in parallel, yielding the results. Each element in the returned iterator is a DocumentArray.
See also
To process a single element, please use map(); to return a DocumentArray, please use apply_batch().
- Parameters:
func (Callable[[DocumentArray], T]) – a function that takes DocumentArray as input and outputs anything. You can either modify elements in-place (only with the thread backend) or work on the returned elements later.
batch_size (int) – Size of each generated batch (except the last one, which might be smaller, default: 32)
backend (str) – whether to use multi-process or multi-thread as the parallelization backend. In general, if your func is IO-bound then thread is good enough. If your func is CPU-bound, then you may use process. In practice, you should try both to figure out the best value. However, if you wish to modify the elements in-place, regardless of IO/CPU-bound, you should always use the thread backend.
Warning
When using the process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes the variable via pickle and works in another process. The passed object and the original object do not share the same memory.
num_worker (Optional[int]) – the number of parallel workers. If not given, the number of CPUs in the system will be used.
shuffle (bool) – If set, shuffle the Documents before dividing into minibatches.
show_progress (bool) – show a progress bar
pool (Union[Pool, ThreadPool, None]) – use an existing/external pool. If given, backend is ignored and you are responsible for closing the pool.
- Yield:
anything returned from func
- Return type:
Generator[T, None, None]
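Conceptually, map_batch() composes the batching above with a parallel map: chunk the collection, then feed each chunk to func through the pool. A pure-Python sketch of that composition (`map_batch_sketch` is a hypothetical stand-in, not the docarray implementation):

```python
# Sketch of the map_batch() semantics (not the docarray implementation):
# chunk the sequence into mini-batches, then feed each batch to func
# through a thread pool, yielding one result per batch.
from multiprocessing.pool import ThreadPool

def map_batch_sketch(seq, func, batch_size, num_worker=None):
    batches = [seq[i:i + batch_size] for i in range(0, len(seq), batch_size)]
    with ThreadPool(num_worker) as pool:
        yield from pool.imap(func, batches)

results = list(map_batch_sketch(list(range(7)), sum, batch_size=3))
print(results)  # [3, 12, 6] -- sums of [0, 1, 2], [3, 4, 5], [6]
```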
- split_by_tag(tag)#
Split the DocumentArray into multiple DocumentArrays according to the tag value of each Document.
- Parameters:
tag (str) – the name of the tag to split by, stored in tags.
- Return type:
Dict[Any, DocumentArray]
- Returns:
a dict where Documents with the same value on tag are grouped together; their order is preserved from the original DocumentArray.
Note
If the tags of the Documents do not contain the specified tag, return an empty dict.
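The grouping behavior can be sketched with plain dicts standing in for Documents (`split_by_tag_sketch` is a hypothetical stand-in, not the docarray implementation):

```python
# Sketch of the split_by_tag() grouping behavior (not the docarray
# implementation): group by a tag value, preserve original order, and
# skip Documents without the tag -- an empty dict results if no
# Document carries it.
def split_by_tag_sketch(docs, tag):
    groups = {}
    for doc in docs:
        if tag in doc.get('tags', {}):
            groups.setdefault(doc['tags'][tag], []).append(doc)
    return groups

docs = [
    {'id': 'a', 'tags': {'color': 'red'}},
    {'id': 'b', 'tags': {'color': 'blue'}},
    {'id': 'c', 'tags': {'color': 'red'}},
]
groups = split_by_tag_sketch(docs, 'color')
print([d['id'] for d in groups['red']])      # ['a', 'c'] -- order preserved
print(split_by_tag_sketch(docs, 'missing'))  # {}
```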