docarray.array.document module#
- class docarray.array.document.DocumentArray(_docs: Optional[DocumentArraySourceType] = None, copy: bool = False, subindex_configs: Optional[Dict[str, None]] = None)[source]#
- class docarray.array.document.DocumentArray(_docs: Optional[DocumentArraySourceType] = None, storage: str = 'sqlite', config: Optional[Union[SqliteConfig, Dict]] = None, subindex_configs: Optional[Dict[str, Dict]] = None)
- class docarray.array.document.DocumentArray(_docs: Optional[DocumentArraySourceType] = None, storage: str = 'weaviate', config: Optional[Union[WeaviateConfig, Dict]] = None, subindex_configs: Optional[Dict[str, Dict]] = None)
- class docarray.array.document.DocumentArray(_docs: Optional[DocumentArraySourceType] = None, storage: str = 'annlite', config: Optional[Union[AnnliteConfig, Dict]] = None, subindex_configs: Optional[Dict[str, Dict]] = None)
- class docarray.array.document.DocumentArray(_docs: Optional[DocumentArraySourceType] = None, storage: str = 'elasticsearch', config: Optional[Union[ElasticConfig, Dict]] = None, subindex_configs: Optional[Dict[str, Dict]] = None)
- class docarray.array.document.DocumentArray(_docs: Optional[DocumentArraySourceType] = None, storage: str = 'redis', config: Optional[Union[RedisConfig, Dict]] = None)
- class docarray.array.document.DocumentArray(_docs: Optional[DocumentArraySourceType] = None, storage: str = 'milvus', config: Optional[Union[MilvusConfig, Dict]] = None)
- class docarray.array.document.DocumentArray(_docs: Optional[DocumentArraySourceType] = None, storage: str = 'opensearch', config: Optional[Union[OpenSearchConfig, Dict]] = None)
Bases: AllMixins, BaseDocumentArray
DocumentArray is a list-like container of Document objects.

A DocumentArray can be used to store, embed, and retrieve Document objects.

    from docarray import Document, DocumentArray

    da = DocumentArray(
        [Document(text='The cake is a lie'), Document(text='Do a barrel roll!')]
    )
    da.apply(Document.embed_feature_hashing)

    query = Document(text='Can i have some cake?').embed_feature_hashing()
    query.match(da, metric='jaccard', use_scipy=True)

    print(query.matches[:, ('text', 'scores__jaccard__value')])
[['The cake is a lie', 'Do a barrel roll!'], [0.9, 1.0]]
A DocumentArray can also embed its contents using a neural network, process them using an external Flow or Executor, and persist Documents in a Document Store for fast vector search:
    from docarray import Document, DocumentArray
    import numpy as np

    n_dim = 3
    metric = 'Euclidean'

    # initialize a DocumentArray with the Annlite Document Store
    da = DocumentArray(
        storage='annlite',
        config={'n_dim': n_dim, 'columns': [('price', 'float')], 'metric': metric},
    )

    # add Documents to the DocumentArray
    with da:
        da.extend(
            [
                Document(id=f'r{i}', embedding=i * np.ones(n_dim), tags={'price': i})
                for i in range(10)
            ]
        )

    # perform vector search
    np_query = np.ones(n_dim) * 8
    results = da.find(np_query)
See also
For further details, see our user guide.
- append(value)#
S.append(value) – append value to the end of the sequence
- apply(*args, **kwargs)#
Apply func to every Document in itself, return itself after modification.

- Parameters:
func – a function that takes Document as input and outputs Document.

backend – thread for multi-threading and process for multi-processing. Defaults to thread. In general, if your func is IO-bound then thread is a good choice. If your func is CPU-bound, then you may use process. In practice, you should try it yourself to figure out the best value. However, if you wish to modify the elements in-place, regardless of IO/CPU-bound, you should always use the thread backend.

Warning

When using the process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes the variable via pickle and works in another process. The passed object and the original object do not share the same memory.

num_worker – the number of parallel workers. If not given, then the number of CPUs in the system will be used.

pool – use an existing/external process or thread pool. If given, backend is ignored and you will be responsible for closing the pool.

show_progress – show a progress bar
- Return type:
T
- Returns:
itself after modification
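A minimal usage sketch (not from the original docstring): to_upper below is a hypothetical helper that upper-cases each Document's text in place, using the default thread backend.

    from docarray import Document, DocumentArray

    da = DocumentArray([Document(text='hello world') for _ in range(3)])

    def to_upper(doc):
        # modify the Document in place and return it
        doc.text = doc.text.upper()
        return doc

    da.apply(to_upper)
    print(da.texts)  # ['HELLO WORLD', 'HELLO WORLD', 'HELLO WORLD']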
- apply_batch(*args, **kwargs)#
Batches itself into mini-batches, applies func to every mini-batch, and returns itself after the modifications.

EXAMPLE USAGE

    from docarray import Document, DocumentArray

    da = DocumentArray([Document(text='The cake is a lie') for _ in range(100)])

    def func(batch):
        # each batch is a DocumentArray; modify it in place and return it
        batch.texts = [t.upper() for t in batch.texts]
        return batch

    da.apply_batch(func, batch_size=10)
    print(da.texts[:3])
['THE CAKE IS A LIE', 'THE CAKE IS A LIE', 'THE CAKE IS A LIE']
- Parameters:
func – a function that takes DocumentArray as input and outputs DocumentArray.

backend – thread for multi-threading and process for multi-processing. Defaults to thread. In general, if your func is IO-bound then thread is a good choice. If your func is CPU-bound, then you may use process. In practice, you should try it yourself to figure out the best value. However, if you wish to modify the elements in-place, regardless of IO/CPU-bound, you should always use the thread backend.

Warning

When using the process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes the variable via pickle and works in another process. The passed object and the original object do not share the same memory.

num_worker – the number of parallel workers. If not given, then the number of CPUs in the system will be used.

batch_size – Size of each generated batch (except the last batch, which might be smaller). Default: 32

shuffle – If set, shuffle the Documents before dividing into minibatches.

show_progress – show a progress bar

pool – use an existing/external process or thread pool. If given, backend is ignored and you will be responsible for closing the pool.
- Return type:
T
- Returns:
itself after modification
- batch(batch_size, shuffle=False, show_progress=False)#
Creates a Generator that yields DocumentArrays of size batch_size until self is fully traversed. The None docs are filtered out, and optionally the docs can be filtered by checking for the existence of a Document attribute. Note that the last batch might be smaller than batch_size.
- Parameters:
batch_size (
int
) – Size of each generated batch (except the last one, which might be smaller, default: 32)shuffle (
bool
) – If set, shuffle the Documents before dividing into minibatches.show_progress (
bool
) – if set, show a progress bar when batching documents.
- Yield:
a Generator of DocumentArray, each in the length of batch_size
- Return type:
Generator
[DocumentArray
,None
,None
]
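For illustration, a small sketch on an in-memory DocumentArray (assumed setup); the last yielded batch is smaller when the length is not divisible by batch_size.

    from docarray import DocumentArray

    da = DocumentArray.empty(7)
    for minibatch in da.batch(batch_size=3):
        print(len(minibatch))  # 3, 3, 1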
- batch_ids(batch_size, shuffle=False)#
Creates a Generator that yields lists of ids of size batch_size until self is fully traversed. Note that the last batch might be smaller than batch_size.
- Parameters:
batch_size (
int
) – Size of each generated batch (except the last one, which might be smaller)shuffle (
bool
) – If set, shuffle the Documents before dividing into minibatches.
- Yield:
a Generator of list of IDs, each in the length of batch_size
- Return type:
Generator
[List
[str
],None
,None
]
- property blobs: Optional[List[bytes]]#
Get the blob attribute of all Documents.
- Return type:
Optional
[List
[bytes
]]- Returns:
a list of blobs
- clear() None -- remove all items from S #
- static cloud_delete(name)#
Delete a DocumentArray from the cloud.

- Parameters:
name (str) – the name of the DocumentArray to delete.
- Return type:
None
- static cloud_list(show_table=False)#
List all available arrays in the cloud.
- Parameters:
show_table (
bool
) – if true, show the table of the arrays.- Return type:
List
[str
]- Returns:
List of available DocumentArray’s names.
- classmethod cloud_pull(cls, name, show_progress=False, local_cache=True, *args, **kwargs)#
Pull a DocumentArray from the Jina Cloud Service to local.

- Parameters:
name (str) – the upload name set during push()

show_progress (bool) – if set, show a progress bar while pulling

local_cache (bool) – store the downloaded DocumentArray in a local folder
- Return type:
T
- Returns:
a
DocumentArray
object
- cloud_push(name, show_progress=False, public=True, branding=None)#
Push this DocumentArray object to Jina Cloud which can be later retrieved via
cloud_pull()
Note
Push with the same
name
will override the existing content. This works like a public clipboard where everyone can override anyone's content, so to make your content survive longer you may want to use a longer and more complicated name.
The lifetime of the content is not guaranteed at the moment; it could be a day or a week. Do not use it for persistence. Only use it for temporary transmission/storage.
- Parameters:
name (
str
) – a name that later can be used for retrieve thisDocumentArray
.show_progress (
bool
) – if to show a progress bar on pullingpublic (
bool
) – by default anyone can pull a DocumentArray if they know its name. Setting this to False will allow only the creator to pull it. This feature requires you to log in first.branding (
Optional
[Dict
]) – a dict of branding information to be sent to Jina Cloud. {“icon”: “emoji”, “background”: “#fff”}
- Return type:
Dict
- property contents: Optional[Union[Sequence[DocumentContentType], ArrayType]]#
Get the
content
of all Documents.- Return type:
Union
[Sequence
[DocumentContentType], ArrayType,None
]- Returns:
a list of texts, blobs or
ArrayType
- count(value) integer -- return number of occurrences of value #
- classmethod dataloader(path, func, batch_size, protocol='protobuf', compress=None, backend='thread', num_worker=None, pool=None, show_progress=False)#
Load array elements, batch them, and map them with a function in parallel; finally, yield each batch as a DocumentArray.
- Parameters:
path (
Union
[str
,Path
]) – Path or filename where the data is stored.func (
Callable
[[DocumentArray
], T]) – a function that takesDocumentArray
as input and outputs anything. You can either modify elements in-place (only with thread backend) or work later on return elements.batch_size (
int
) – Size of each generated batch (except the last one, which might be smaller)protocol (
str
) – protocol to usecompress (
Optional
[str
]) – compress algorithm to usebackend (
str
) –if to use multi-process or multi-thread as the parallelization backend. In general, if your
func
is IO-bound then perhaps thread is good enough. If yourfunc
is CPU-bound then you may use process. In practice, you should try yourselves to figure out the best value. However, if you wish to modify the elements in-place, regardless of IO/CPU-bound, you should always use thread backend.Warning
When using the process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes the variable via pickle and works in another process. The passed object and the original object do not share the same memory.num_worker (
Optional
[int
]) – the number of parallel workers. If not given, then the number of CPUs in the system will be used.pool (
Union
[Pool, ThreadPool,None
]) – use an existing/external pool. If given, backend is ignored and you will be responsible for closing the pool.show_progress (
bool
) – if set, show a progressbar
- Return type:
Generator
[DocumentArray
,None
,None
]- Returns:
- embed(embed_model, device='cpu', batch_size=256, to_numpy=False, collate_fn=None)#
Fill the embedding of Documents in place by using embed_model. For the evaluation of a model, one can directly use the embed_and_evaluate() function.

- Parameters:
embed_model (AnyDNN) – The embedding model written in Keras/Pytorch/Paddle
device (
str
) – The computational device for embed_model, can be either cpu or cuda.batch_size (
int
) – Number of Documents in a batch for embeddingto_numpy (
bool
) – If to store embeddings back to Document innumpy.ndarray
or original framework format.collate_fn (
Optional
[CollateFnType]) – create a mini-batch of Input(s) from the given DocumentArray. Default built-in collate_fn is to use the tensors of the documents.
- Return type:
T
- Returns:
itself after modification.
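A hedged sketch, assuming PyTorch is installed and that the Documents carry float32 tensors matching the model input; the torch.nn.Linear below is a toy stand-in for a real embedding model.

    import numpy as np
    import torch
    from docarray import Document, DocumentArray

    da = DocumentArray(
        Document(tensor=np.random.random(5).astype('float32')) for _ in range(4)
    )
    model = torch.nn.Linear(5, 3)  # toy stand-in for a real embedding model

    da.embed(model, device='cpu', batch_size=2)
    print(da.embeddings.shape)  # (4, 3)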
- embed_and_evaluate(metrics, index_data=None, ground_truth=None, metric_names=None, strict=True, label_tag='label', embed_models=None, embed_funcs=None, device='cpu', batch_size=256, collate_fns=None, distance='cosine', limit=20, normalization=None, exclude_self=False, use_scipy=False, num_worker=1, match_batch_size=100000, query_sample_size=1000, **kwargs)#
Computes ranking evaluation metrics for a given DocumentArray. This function does embedding and matching in the same turn, so you don't need to call embed and match before it. Instead, it embeds the documents in self (and index_data, when provided) and computes the nearest neighbours itself. This might be done in batches for the index_data object to reduce the memory consumption of the evaluation process. The evaluation itself can be done against a ground_truth DocumentArray or on the basis of labels, as is possible with the evaluate() function.

- Parameters:
metrics (
List
[Union
[str
,Callable
[...
,float
]]]) – List of metric names or metric functions to be computedindex_data (
Optional
[DocumentArray
]) – The other DocumentArray to match against, if not given, self will be matched against itself. This means that every document in will be compared to all other documents in self to determine the nearest neighbors.ground_truth (
Optional
[DocumentArray
]) – The ground_truth DocumentArray that the DocumentArray compares to.metric_names (
Optional
[str
]) – If provided, the results of the metrics computation will be stored in the evaluations field of each Document with these names. If not provided, the names will be derived from the metric function names.strict (
bool
) – If set, then left and right sides are required to be fully aligned: on the length, and on the semantic of length. This prevents you from accidentally evaluating on irrelevant matches.label_tag (
str
) – Specifies the tag which contains the labels.embed_models (
Union
[AnyDNN,Tuple
[AnyDNN, AnyDNN],None
]) – One or two embedding model written in Keras / Pytorch / Paddle for embedding self and index_data.embed_funcs (
Union
[Callable
,Tuple
[Callable
,Callable
],None
]) – As an alternative to embedding models, custom embedding functions can be provided.device (
str
) – the computational device for embed_models, and the matching can be either cpu or cuda.batch_size (
Union
[int
,Tuple
[int
,int
]]) – Number of documents in a batch for embedding.collate_fns (
Union
[CollateFnType,None
,Tuple
[Optional
[CollateFnType],Optional
[CollateFnType]]]) – For each embedding function the respective collate function creates a mini-batch of input(s) from the given DocumentArray. If not provided a default built-in collate_fn uses the tensors of the documents to create input batches.distance (
Union
[str
,Callable
[[ArrayType, ArrayType],ndarray
]]) – The distance metric.limit (
Union
[int
,float
,None
]) – The maximum number of matches, when not given defaults to 20.normalization (
Optional
[Tuple
[float
,float
]]) – A tuple [a, b] to be used with min-max normalization, the min distance will be rescaled to a, the max distance will be rescaled to b all values will be rescaled into range [a, b].exclude_self (
bool
) – If set, Documents inindex_data
with sameid
as the left-hand values will not be considered as matches.use_scipy (
bool
) – if set, usescipy
as the computation backend. Note,scipy
does not support distance on sparse matrix.num_worker (
int
) – Specifies the number of workers for the execution of the match function.kwargs – Additional keyword arguments to be passed to the metric functions.
query_sample_size (
int
) – For a large number of documents in self the evaluation becomes infeasible, especially, if index_data is large. Therefore, queries are sampled if the number of documents in self exceeds query_sample_size. Usually, this has only small impact on the mean metric values returned by this function. To prevent sampling, you can set query_sample_size to None.
- Param match_batch_size:
The number of documents which are embedded and matched at once. Set this value to a lower value, if you experience high memory consumption.
- Return type:
Union
[float
,List
[float
],None
]- Returns:
A dictionary which stores for each metric name the average evaluation score.
- property embeddings: Optional[ArrayType]#
Return a
ArrayType
stacking all the embedding attributes as rows.- Return type:
Optional
[ArrayType]- Returns:
a
ArrayType
of embedding
- classmethod empty(size=0, *args, **kwargs)#
Create a
DocumentArray
object withsize
emptyDocument
objects.- Parameters:
size (
int
) – the number of empty Documents in this container- Return type:
T
- Returns:
a
DocumentArray
object
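A one-line sketch:

    from docarray import DocumentArray

    da = DocumentArray.empty(3)
    print(len(da))  # 3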
- evaluate(metrics, ground_truth=None, hash_fn=None, metric_names=None, strict=True, label_tag='label', num_relevant_documents_per_label=None, **kwargs)#
Compute ranking evaluation metrics for a given DocumentArray when compared with a ground truth.
If one provides a ground_truth DocumentArray that is structurally identical to self, this function compares the matches of documents inside the DocumentArray to this ground_truth. Alternatively, one can directly annotate the documents by adding labels in the form of tags with the key specified in the label_tag attribute. Those tags need to be added to self as well as to the documents in the matches properties.
This method will fill the evaluations field of Documents inside this DocumentArray and will return the average of the computations
- Parameters:
metrics (
List
[Union
[str
,Callable
[...
,float
]]]) – List of metric names or metric functions to be computedground_truth (
Optional
[DocumentArray
]) – The ground_truth DocumentArray that the DocumentArray compares to.hash_fn (
Optional
[Callable
[[Document
],str
]]) – For the evaluation against a ground_truth DocumentArray, this function is used for generating hashes which are used to compare the documents. If not given,Document.id
is used.metric_names (
Optional
[List
[str
]]) – If provided, the results of the metrics computation will be stored in the evaluations field of each Document with these names. If not provided, the names will be derived from the metric function names.strict (
bool
) – If set, then left and right sides are required to be fully aligned: on the length, and on the semantic of length. This prevents you from accidentally evaluating on irrelevant matches.label_tag (
str
) – Specifies the tag which contains the labels.num_relevant_documents_per_label (
Optional
[Dict
[Any
,int
]]) – Some metrics, e.g., recall@k, require the number of relevant documents. To apply those to a labeled dataset, one can provide a dictionary which maps labels to the total number of documents with this label.kwargs – Additional keyword arguments to be passed to the metric functions.
- Return type:
Dict
[str
,float
]- Returns:
A dictionary which stores for each metric name the average evaluation score.
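A hedged, self-contained sketch of label-based evaluation on hypothetical random data (assuming numpy is available): every Document carries a 'label' tag, matches are computed first, then precision_at_k is averaged over the array.

    import numpy as np
    from docarray import Document, DocumentArray

    # hypothetical labelled dataset: two classes, random embeddings
    da = DocumentArray(
        Document(embedding=np.random.random(8), tags={'label': i % 2}) for i in range(20)
    )
    da.match(da, exclude_self=True, limit=5)

    scores = da.evaluate(metrics=['precision_at_k'], label_tag='label')
    print(scores)  # e.g. {'precision_at_k': 0.52}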
- extend(values)#
S.extend(iterable) – extend sequence by appending elements from the iterable
- find(query=None, metric='cosine', limit=20, metric_name=None, exclude_self=False, filter=None, only_id=False, index='text', return_root=False, on=None, **kwargs)#
Returns matching Documents given an input query.

If the query is a DocumentArray, Document or ArrayType, exhaustive or approximate nearest neighbor search will be performed, depending on whether the storage backend supports ANN. Furthermore, if filter is not None, pre-filtering will be applied along with the vector search.

If the query is a dict object, or query is None and filter is not None, Documents will be filtered and all Documents that match the filter will be returned. In this case, query (if it is a dict) or filter will be used for filtering. The object must follow the backend-specific filter format if the backend supports filtering, or DocArray's query language format. In the latter case, filtering will be applied on the client side, not the backend side.

If the query is a string or list of strings, a search by text will be performed if the backend supports indexing and searching text fields. If not, a NotImplementedError will be raised.
- Parameters:
query (
Union
[DocumentArray
,Document
, ArrayType,Dict
,str
,List
[str
],None
]) – the input query to search bylimit (
Union
[int
,float
,None
]) – the maximum number of matches, when not given defaults to 20.metric_name (
Optional
[str
]) – if provided, then match result will be marked with this string.metric (
Union
[str
,Callable
[[ArrayType, ArrayType],ndarray
]]) – the distance metric.exclude_self (
bool
) – if set, Documents in results with sameid
as the query values will not be considered as matches. This is only applied when the input query is Document or DocumentArray.filter (
Union
[Dict
,str
,None
]) – filter query used for pre-filtering or filteringonly_id (
bool
) – if set, then returning matches will only containid
index (
str
) – if the query is a string, text search will be performed on the index field, otherwise, this parameter is ignored. By default, the Document text attribute will be used for search, otherwise the tag field specified by index will be used. You can only use this parameter if the storage backend supports searching by text.return_root (
Optional
[bool
]) – if set, then the root-level DocumentArray will be returnedon (
Optional
[str
]) – specifies a subindex to search on. If set, the returned DocumentArray will be retrieved from the given subindex.kwargs – other kwargs.
- Return type:
Union
[DocumentArray
,List
[DocumentArray
]]- Returns:
a list of DocumentArrays containing the closest Document objects for each of the queries in query.
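A minimal sketch of vector search on the default in-memory storage (assumed setup, random data; with no ANN backend configured the search is exhaustive, and a single query vector is assumed to return the DocumentArray of closest matches):

    import numpy as np
    from docarray import Document, DocumentArray

    da = DocumentArray(Document(embedding=np.random.random(3)) for _ in range(10))

    results = da.find(np.random.random(3), limit=5)  # single query vector
    print(len(results))  # number of matches returned (up to the limit)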
- flatten()#
Flatten all nested chunks and matches into one
DocumentArray
.Note
Flattening an already flattened DocumentArray has no effect.
- Return type:
- Returns:
a flattened
DocumentArray
object.
- classmethod from_base64(data, protocol='pickle-array', compress=None, _show_progress=False, *args, **kwargs)#
- Return type:
T
- classmethod from_bytes(data, protocol='pickle-array', compress=None, _show_progress=False, *args, **kwargs)#
- Return type:
T
- classmethod from_csv(*args, **kwargs)#
- Return type:
T
- classmethod from_dataframe(df, *args, **kwargs)#
Import a
DocumentArray
from apandas.DataFrame
object.- Parameters:
df (DataFrame) – a
pandas.DataFrame
object.- Return type:
T
- Returns:
a
DocumentArray
object
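A round-trip sketch (assumes pandas is installed):

    from docarray import Document, DocumentArray

    da = DocumentArray([Document(text='hello'), Document(text='world')])
    df = da.to_dataframe()                  # export to pandas
    da2 = DocumentArray.from_dataframe(df)  # and back
    print(da2.texts)  # ['hello', 'world']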
- classmethod from_dict(values, protocol='jsonschema', **kwargs)#
- Return type:
T
- classmethod from_files(*args, **kwargs)#
- Return type:
T
- classmethod from_huggingface_datasets(*args, **kwargs)#
- Return type:
T
- classmethod from_json(file, protocol='jsonschema', **kwargs)#
- Return type:
T
- classmethod from_lines(*args, **kwargs)#
- Return type:
T
- classmethod from_list(values, protocol='jsonschema', **kwargs)#
- Return type:
T
- classmethod from_ndarray(*args, **kwargs)#
- Return type:
T
- classmethod from_ndjson(*args, **kwargs)#
- Return type:
T
- classmethod from_protobuf(pb_msg)#
- Return type:
T
- classmethod from_pydantic_model(model)#
Convert a list of PydanticDocument into DocumentArray
- Parameters:
model (
List
[BaseModel]) – the list of pydantic data model objects that represents a DocumentArray- Return type:
T
- Returns:
a DocumentArray
- classmethod from_strawberry_type(model)#
Convert a list of Strawberry documents into a DocumentArray
- Parameters:
model (
List
[StrawberryDocument]) – the list of strawberry type objects that represents a DocumentArray- Return type:
T
- Returns:
a DocumentArray
- classmethod get_json_schema(indent=2)#
Return a JSON Schema of DocumentArray class.
- Return type:
str
- get_vocabulary(min_freq=1, text_attrs=('text',))#
Get the text vocabulary built from all Documents, as a dict that maps each word to an index.
- Parameters:
text_attrs (
Tuple
[str
,...
]) – the textual attributes where vocabulary will be derived frommin_freq (
int
) – the minimum word frequency to be considered into the vocabulary.
- Return type:
Dict
[str
,int
]- Returns:
a vocabulary dictionary where the key is the word and the value is the index. Indices start at 2: 0 is reserved for padding and 1 for the unknown token.
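A small sketch; the exact index assignment may vary, but indices start at 2 as described above.

    from docarray import Document, DocumentArray

    da = DocumentArray([Document(text='hello world'), Document(text='hello docarray')])
    vocab = da.get_vocabulary(min_freq=1)
    print(vocab)  # e.g. {'hello': 2, 'world': 3, 'docarray': 4}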
- index(value[, start[, stop]]) integer -- return first index of value. #
Raises ValueError if the value is not present.
Supporting start and stop arguments is optional, but recommended.
- abstract insert(index, value)#
S.insert(index, value) – insert value before index
- classmethod load(file, file_format='binary', encoding='utf-8', **kwargs)#
Load array elements from a JSON, binary, or CSV file.
- Parameters:
file (
Union
[str
,TextIO
,BinaryIO
]) – File or filename to which the data is saved.file_format (
str
) – json or binary or csv. JSON and CSV files are human-readable, but the binary format gives a much smaller size and faster save/load speed. The CSV format has very limited compatibility; a complex DocumentArray with nested structure cannot be restored from a CSV file.encoding (
str
) – encoding used to load data from a file (it only applies to JSON and CSV format). By default,utf-8
is used.
- Return type:
T
- Returns:
the loaded DocumentArray object
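A save/load round-trip sketch in JSON format (hypothetical file name), pairing this method with save() documented below:

    from docarray import Document, DocumentArray

    da = DocumentArray([Document(text='hello'), Document(text='world')])
    da.save('docs.json', file_format='json')

    da2 = DocumentArray.load('docs.json', file_format='json')
    print(da2.texts)  # ['hello', 'world']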
- classmethod load_binary(file, protocol='pickle-array', compress=None, _show_progress=False, streaming=False, *args, **kwargs)#
Load array elements from a compressed binary file.
- Parameters:
file (
Union
[str
,BinaryIO
,bytes
,Path
]) – File or filename or serialized bytes where the data is stored.protocol (
str
) – protocol to usecompress (
Optional
[str
]) – compress algorithm to use_show_progress (
bool
) – show progress bar, only works when protocol is pickle or protobufstreaming (
bool
) – if True returns a generator over Document objects.
In case protocol is pickle the Documents are streamed from disk to save memory usage.

- Return type:
Union[DocumentArray, Generator[Document, None, None]]
- Returns:
a DocumentArray object

Note
If file is str it can specify protocol and compress as file extensions. This functionality assumes file=file_name.$protocol.$compress where $protocol and $compress refer to a string interpolation of the respective protocol and compress methods. For example if file=my_docarray.protobuf.lz4 then the binary data will be loaded assuming protocol=protobuf and compress=lz4.
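A sketch of the extension-driven behaviour described in the note above (hypothetical file name; the suffixes imply protocol=protobuf and compress=gzip):

    from docarray import DocumentArray

    da = DocumentArray.empty(5)
    da.save_binary('my_docarray.protobuf.gzip')   # protocol and compress inferred from suffixes
    da2 = DocumentArray.load_binary('my_docarray.protobuf.gzip')
    print(len(da2))  # 5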
- classmethod load_csv(file, field_resolver=None, encoding='utf-8')#
Load array elements from a CSV file.
- Parameters:
file (
Union
[str
,TextIO
]) – File or filename to which the data is saved.field_resolver (
Optional
[Dict
[str
,str
]]) – a map from field names defined in JSON, dict to the field names defined in Document.encoding (
str
) – encoding used to read a CSV file. By default,utf-8
is used.
- Return type:
T
- Returns:
a DocumentArray object
- classmethod load_json(file, protocol='jsonschema', encoding='utf-8', **kwargs)#
Load array elements from a JSON file.
- Parameters:
file (
Union
[str
,TextIO
]) – File or filename or a JSON string to which the data is saved.protocol (
str
) – jsonschema or protobufencoding (
str
) – encoding used to load data from a JSON file. By default,utf-8
is used.
- Return type:
T
- Returns:
a DocumentArrayLike object
- map(func, backend='thread', num_worker=None, show_progress=False, pool=None)#
Return an iterator that applies func to every Document in this DocumentArray in parallel, yielding the results.
See also
To process on a batch of elements, please use
map_batch()
;To return a
DocumentArray
, please useapply()
.
- Parameters:
func (
Callable
[[Document], T]) – a function that takesDocument
as input and outputs anything. You can either modify elements in-place (only with thread backend) or work later on return elements.backend (
str
) –thread for multi-threading and process for multi-processing. Defaults to thread. In general, if your
func
is IO-bound then thread is a good choice. If yourfunc
is CPU-bound, then you may use process. In practice, you should try yourselves to figure out the best value. However, if you wish to modify the elements in-place, regardless of IO/CPU-bound, you should always use thread backend.Warning
When using the process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes the variable via pickle and works in another process. The passed object and the original object do not share the same memory.num_worker (
Optional
[int
]) – the number of parallel workers. If not given, then the number of CPUs in the system will be used.show_progress (
bool
) – show a progress barpool (
Union
[Pool, ThreadPool,None
]) – use an existing/external process or thread pool. If given, backend is ignored and you will be responsible for closing the pool.
- Yield:
anything return from
func
- Return type:
Generator
[T,None
,None
]
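A sketch with a trivial function; with the default thread backend a lambda is fine (the process backend would require a picklable function).

    from docarray import Document, DocumentArray

    da = DocumentArray([Document(text='a'), Document(text='bb'), Document(text='ccc')])
    for length in da.map(lambda d: len(d.text)):
        print(length)  # 1, 2, 3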
- map_batch(func, batch_size, backend='thread', num_worker=None, shuffle=False, show_progress=False, pool=None)#
Return an iterator that applies func to every mini-batch in parallel, yielding the results. Each element in the returned iterator is a DocumentArray.

See also
To process single element, please use
map()
;To return
DocumentArray
, please useapply_batch()
.
- Parameters:
batch_size (
int
) – Size of each generated batch (except the last one, which might be smaller, default: 32)shuffle (
bool
) – If set, shuffle the Documents before dividing into minibatches.func (
Callable
[[DocumentArray], T]) – a function that takesDocumentArray
as input and outputs anything. You can either modify elements in-place (only with thread backend) or work later on return elements.backend (
str
) –if to use multi-process or multi-thread as the parallelization backend. In general, if your
func
is IO-bound then perhaps thread is good enough. If yourfunc
is CPU-bound then you may use process. In practice, you should try yourselves to figure out the best value. However, if you wish to modify the elements in-place, regardless of IO/CPU-bound, you should always use thread backend.Warning
When using the process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes the variable via pickle and works in another process. The passed object and the original object do not share the same memory.num_worker (
Optional
[int
]) – the number of parallel workers. If not given, then the number of CPUs in the system will be used.show_progress (
bool
) – show a progress barpool (
Union
[Pool, ThreadPool,None
]) – use an existing/external pool. If given, backend is ignored and you will be responsible for closing the pool.
- Yield:
anything return from
func
- Return type:
Generator
[T,None
,None
]
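A sketch that reduces each mini-batch to its list of texts:

    from docarray import Document, DocumentArray

    da = DocumentArray([Document(text=str(i)) for i in range(5)])
    for texts in da.map_batch(lambda batch: batch.texts, batch_size=2):
        print(texts)  # ['0', '1'], ['2', '3'], ['4']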
- match(darray, metric='cosine', limit=20, normalization=None, metric_name=None, batch_size=None, exclude_self=False, filter=None, only_id=False, use_scipy=False, device='cpu', num_worker=1, on=None, **kwargs)#
Compute embedding-based nearest neighbours in another DocumentArray for each Document in self, and store the results in matches. For the purpose of evaluation, one can also directly use the embed_and_evaluate() function.

Note

'cosine', 'euclidean' and 'sqeuclidean' are supported natively without extra dependencies. You can use other distance metrics provided by scipy, such as braycurtis, canberra, chebyshev, cityblock, correlation, cosine, dice, euclidean, hamming, jaccard, jensenshannon, kulsinski, mahalanobis, matching, minkowski, rogerstanimoto, russellrao, seuclidean, sokalmichener, sokalsneath, sqeuclidean, wminkowski, yule. To use a scipy metric, please set use_scipy=True.

To make all match values fall in [0, 1], use dA.match(dB, normalization=(0, 1)).

To invert the distance as a score and make all values fall in range [0, 1], use dA.match(dB, normalization=(1, 0)). Note how normalization differs from the previous example.

If a custom distance metric is provided, make sure that it returns distances and not similarities, meaning the smaller the better.
- Parameters:
darray (DocumentArray) – the other DocumentArray to match against
metric (
Union
[str
,Callable
[[ArrayType, ArrayType],ndarray
]]) – the distance metriclimit (
Union
[int
,float
,None
]) – the maximum number of matches, when not given defaults to 20.normalization (
Optional
[Tuple
[float
,float
]]) – a tuple [a, b] to be used with min-max normalization, the min distance will be rescaled to a, the max distance will be rescaled to b all values will be rescaled into range [a, b].metric_name (
Optional
[str
]) – if provided, then match result will be marked with this string.batch_size (
Optional
[int
]) – if provided, thendarray
is loaded in batches, where each of them is at mostbatch_size
elements. When darray is big, this can significantly speedup the computation.exclude_self (
bool
) – if set, Documents indarray
with sameid
as the left-hand values will not be considered as matches.filter (
Optional
[Dict
]) – filter query used for pre-filteringonly_id (
bool
) – if set, then returning matches will only containid
use_scipy (
bool
) – if set, usescipy
as the computation backend. Note,scipy
does not support distance on sparse matrix.device (
str
) – the computational device for.match()
, can be either cpu or cuda.num_worker (
Optional
[int
]) –the number of parallel workers. If not given, then the number of CPUs in the system will be used.
Note
This argument is only effective when
batch_size
is set.on (
Optional
[str
]) – specifies a subindex to search on. If set, the returned DocumentArray will be retrieved from the given subindex.kwargs – other kwargs.
- Return type:
None
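A minimal sketch with hypothetical random embeddings; the nearest neighbours of each query end up in .matches:

    import numpy as np
    from docarray import Document, DocumentArray

    queries = DocumentArray(Document(embedding=np.random.random(4)) for _ in range(2))
    index = DocumentArray(Document(embedding=np.random.random(4)) for _ in range(10))

    queries.match(index, metric='cosine', limit=3)
    print(len(queries[0].matches))                         # 3
    print(queries[0].matches[:, 'scores__cosine__value'])  # distances, ascending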
- plot_embeddings(title='MyDocumentArray', path=None, image_sprites=False, min_image_size=16, channel_axis=-1, start_server=True, host='127.0.0.1', port=None, image_source='tensor', exclude_fields_metas=None)#
Interactively visualize embeddings using the Embedding Projector and store the visualization information.

- Parameters:
title (
str
) – the title of this visualization. If you want to compare multiple embeddings at the same time, make sure to give different names each time and setpath
to the same value.host (
str
) – if set, bind the embedding-projector frontend to given host. Otherwise localhost is used.port (
Optional
[int
]) – if set, run the embedding-projector frontend at given port. Otherwise a random port is used.image_sprites (
bool
) – if set, visualize the dots usinguri
andtensor
.path (
Optional
[str
]) – if set, then append the visualization to an existing folder, where you can compare multiple embeddings at the same time. Make sure to use a differenttitle
each time .min_image_size (
int
) – only used when image_sprites=True. the minimum size of the imagechannel_axis (
int
) – only used when image_sprites=True. the axis id of the color channel,-1
indicates the color channel info at the last axisstart_server (
bool
) – if set, start a HTTP server and open the frontend directly. Otherwise, you need to rely onreturn
path and serve by yourself.image_source (
str
) – specify where the image comes from, can beuri
ortensor
. empty tensor will fallback to uriexclude_fields_metas (
Optional
[List
[str
]]) – specify the fields that you want to exclude from metadata tsv file
- Return type:
str
- Returns:
the path to the embeddings visualization info.
- plot_image_sprites(output=None, canvas_size=512, min_size=16, channel_axis=-1, image_source='tensor', skip_empty=False, show_progress=False, show_index=False, fig_size=(10, 10), keep_aspect_ratio=False)#
Generate a sprite image for all image tensors in this DocumentArray-like object.
An image sprite is a collection of images put into a single image. It is always square-sized. Each sub-image is also square-sized and equally-sized.
- Parameters:
output (
Optional
[str
]) – Optional path to store the visualization. If not given, show in UIcanvas_size (
int
) – the size of the canvasmin_size (
int
) – the minimum size of the imagechannel_axis (
int
) – the axis id of the color channel,-1
indicates the color channel info at the last axisimage_source (
str
) – specify where the image comes from, can beuri
ortensor
. empty tensor will fallback to uriskip_empty (
bool
) – skip Document who has no .uri or .tensor.show_index (
bool
) – show the index on the top-right corner of every imagefig_size (
Optional
[Tuple
[int
,int
]]) – the size of the figureshow_progress (
bool
) – show a progressbar while plotting.keep_aspect_ratio (
bool
) – preserve the aspect ratio of the image by using the aspect ratio of the first image in self.
- Return type:
None
- pop([index]) item -- remove and return item at index (default last). #
Raise IndexError if list is empty or index is out of range.
- post(host, show_progress=False, batch_size=None, parameters=None, **kwargs)#
Post itself to a remote Flow/Sandbox and get the modified DocumentArray back
- Parameters:
host (
str
) – a host string. Can be one of the following: - grpc://192.168.0.123:8080/endpoint - ws://192.168.0.123:8080/endpoint - http://192.168.0.123:8080/endpoint - jinahub://Hello/endpoint - jinahub+docker://Hello/endpoint - jinahub+docker://Hello/v0.0.1/endpoint - jinahub+docker://Hello/latest/endpoint - jinahub+sandbox://Hello/endpointshow_progress (
bool
) – if to show a progressbarbatch_size (
Optional
[int
]) – number of Document on each requestparameters (
Optional
[Dict
]) – parameters to send in the request
- Return type:
- Returns:
the new DocumentArray returned from remote
- classmethod pull(cls, name, show_progress=False, local_cache=True, *args, **kwargs)#
Pull a DocumentArray from the Jina Cloud Service to local.

- Parameters:
name (str) – the upload name set during push()

show_progress (bool) – if set, show a progress bar while pulling

local_cache (bool) – store the downloaded DocumentArray in a local folder
- Return type:
T
- Returns:
a
DocumentArray
object
- push(name, show_progress=False, public=True, branding=None)#
Push this DocumentArray object to Jina Cloud which can be later retrieved via
pull()
Note
Push with the same
name
will override the existing content. This works like a public clipboard where everyone can override anyone's content, so to make your content survive longer you may want to use a longer and more complicated name.
The lifetime of the content is not guaranteed at the moment; it could be a day or a week. Do not use it for persistence. Only use it for temporary transmission/storage.
- Parameters:
name (
str
) – a name that later can be used for retrieve thisDocumentArray
.show_progress (
bool
) – if to show a progress bar on pullingpublic (
bool
) – by default anyone can pull a DocumentArray if they know its name. Setting this to False will allow only the creator to pull it. This feature requires you to log in first.branding (
Optional
[Dict
]) – a dict of branding information to be sent to Jina Cloud. {“icon”: “emoji”, “background”: “#fff”}
- Return type:
Dict
- reduce(other)#
Reduces other and the current DocumentArray into one DocumentArray in-place. Changes are applied to the current DocumentArray. Reducing 2 DocumentArrays consists in adding Documents in the second DocumentArray to the first DocumentArray if they do not exist. If a Document exists in both DocumentArrays, the data properties are merged with priority to the first Document (that is, to the current DocumentArray's Document). The matches and chunks are also reduced in the same way.

- Parameters:
other (T) – DocumentArray
- Return type:
T
- Returns:
DocumentArray
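A short sketch of the merge semantics with hypothetical Documents: the Document with id '2' exists in both arrays, so the first array's data wins.

    from docarray import Document, DocumentArray

    a = DocumentArray([Document(id='1', text='a'), Document(id='2', text='b')])
    b = DocumentArray([Document(id='2', text='B'), Document(id='3', text='c')])

    a.reduce(b)
    print(a[:, 'id'])    # ['1', '2', '3']
    print(a[:, 'text'])  # ['a', 'b', 'c'] -- the first array's 'b' is kept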
- reduce_all(others)#
Reduces a list of DocumentArrays and this DocumentArray into one DocumentArray. Changes are applied to this DocumentArray in-place.
Reduction consists in reducing this DocumentArray with every DocumentArray in others sequentially using DocumentArray.reduce(). The resulting DocumentArray contains Documents of all DocumentArrays. If a Document exists in many DocumentArrays, data properties are merged with priority to the left-most DocumentArrays (that is, if a data attribute is set in a Document belonging to many DocumentArrays, the attribute value of the left-most DocumentArray is kept). Matches and chunks of a Document belonging to many DocumentArrays are also reduced in the same way. Other non-data properties are ignored.

Note

- Matches are not kept in a sorted order when they are reduced. You might want to re-sort them in a later step.
- The final result depends on the order of DocumentArrays when applying reduction.
- Parameters:
others (
List
[T]) – List of DocumentArrays to be reduced- Return type:
T
- Returns:
the resulting DocumentArray
- remove(value)#
S.remove(value) – remove first occurrence of value. Raise ValueError if the value is not present.
- reverse()#
S.reverse() – reverse IN PLACE
- sample(k, seed=None)#
Randomly sample k elements from the DocumentArray without replacement.

- Parameters:
k (
int
) – Number of elements to sample from the document array.seed (
Optional
[int
]) – initialize the random number generator, by default is None. If set will save the state of the random function to produce certain outputs.
- Return type:
- Returns:
A sampled list of
Document
represented asDocumentArray
.
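A short sketch; passing a seed makes the sample reproducible.

    from docarray import DocumentArray

    da = DocumentArray.empty(100)
    sub = da.sample(k=10, seed=42)
    print(len(sub))  # 10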
- save(file, file_format='binary', encoding='utf-8')#
Save array elements into a JSON, binary, or CSV file.
- Parameters:
file (
Union
[str
,TextIO
,BinaryIO
]) – File or filename to which the data is saved.file_format (
str
) – json or binary or csv. JSON and CSV files are human-readable, but the binary format gives a much smaller size and faster save/load speed. Note that the CSV format has very limited compatibility; a complex DocumentArray with nested structure cannot be restored from a CSV file.encoding (
str
) – encoding used to save data into a file (it only applies to JSON and CSV format). By default,utf-8
is used.
- Return type:
None
- save_binary(file, protocol='pickle-array', compress=None)#
Save array elements into a binary file.
- Parameters:
file (
Union
[str
,BinaryIO
]) – File or filename to which the data is saved.protocol (
str
) – protocol to usecompress (
Optional
[str
]) –compress algorithm to use
Note
If file is str it can specify protocol and compress as file extensions. This functionality assumes file=file_name.$protocol.$compress where $protocol and $compress refer to a string interpolation of the respective protocol and compress methods. For example if file=my_docarray.protobuf.lz4 then the binary data will be created using protocol=protobuf and compress=lz4.
Comparing to
save_json()
, it is faster and the file is smaller, but not human-readable.Note
To get a binary presentation in memory, use
bytes(...)
.- Return type:
None
- save_csv(file, flatten_tags=True, exclude_fields=None, dialect='excel', with_header=True, encoding='utf-8')#
Save array elements into a CSV file.
- Parameters:
file (
Union
[str
,TextIO
]) – File or filename to which the data is saved.flatten_tags (
bool
) – if set, then all fields inDocument.tags
will be flattened intotag__fieldname
and stored as separated columns. It is useful whentags
contain a lot of information.exclude_fields (
Optional
[Sequence
[str
]]) – if set, those fields won't show up in the output CSVdialect (
Union
[str
,Dialect
]) – define a set of parameters specific to a particular CSV dialect. could be a string that represents predefined dialects in your system, or could be acsv.Dialect
class that groups specific formatting parameters together.encoding (
str
) – encoding used to save the data into a CSV file. By default,utf-8
is used.
- Return type:
None
- save_embeddings_csv(file, encoding='utf-8', **kwargs)#
Save embeddings to a CSV file
This function utilizes numpy.savetxt() internally.

- Parameters:
file (
Union
[str
,TextIO
]) – File or filename to which the data is saved.encoding (
str
) – encoding used to save the data into a file. By default,utf-8
is used.kwargs – extra kwargs will be passed to
numpy.savetxt()
.
- Return type:
None
- save_gif(output, channel_axis=-1, duration=200, size_ratio=1.0, inline_display=False, image_source='tensor', skip_empty=False, show_index=False, show_progress=False)#
Save a gif of the DocumentArray. Each frame corresponds to a Document.uri/.tensor in the DocumentArray.
- Parameters:
output (
str
) – the file path to save the gif to.channel_axis (
int
) – the color channel axis of the tensor.duration (
int
) – the duration of each frame in milliseconds.size_ratio (
float
) – the size ratio of each frame.inline_display (
bool
) – if to show the gif in Jupyter notebook.image_source (
str
) – the source of the image in the Document attribute.skip_empty (
bool
) – if to skip empty documents.show_index (
bool
) – if to show the index of the document in the top-right corner.show_progress (
bool
) – if to show a progress bar.
- Return type:
None
- Returns:
- save_json(file, protocol='jsonschema', encoding='utf-8', **kwargs)#
Save array elements into a JSON file.
Comparing to
save_binary()
, it is human-readable but slower to save/load and the file size larger.- Parameters:
file (
Union
[str
,TextIO
]) – File or filename to which the data is saved.protocol (
str
) – jsonschema or protobufencoding (
str
) – encoding used to save data into a JSON file. By default,utf-8
is used.
- Return type:
None
- shuffle(seed=None)#
Randomly shuffle documents within the
DocumentArray
.- Parameters:
seed (
Optional
[int
]) – initialize the random number generator, by default is None. If set will save the state of the random function to produce certain outputs.- Return type:
- Returns:
The shuffled list of
Document
represented asDocumentArray
.
- split_by_tag(tag)#
Split the DocumentArray into multiple DocumentArrays according to the tag value of each Document.
- Parameters:
tag (
str
) – the tag name, stored in tags, to split by.- Return type:
Dict
[Any
,DocumentArray
]- Returns:
a dict where Documents with the same value on tag are grouped together, their orders are preserved from the original
DocumentArray
.
Note
If the tags of the Documents do not contain the specified tag, return an empty dict.
- summary()#
Print the structure and attribute summary of this DocumentArray object.
Warning
Calling summary() on a large DocumentArray can be slow.
- property tensors: Optional[ArrayType]#
Return a
ArrayType
stacking alltensor
.The tensor attributes are stacked together along a newly created first dimension (as if you would stack using
np.stack(X, axis=0)
).Warning
This operation assumes all tensors have the same shape and dtype. All dtype and shape values are assumed to be equal to the values of the first element in the DocumentArray
- Return type:
Optional
[ArrayType]- Returns:
a
ArrayType
of tensors
- property texts: Optional[List[str]]#
Get
text
of all Documents- Return type:
Optional
[List
[str
]]- Returns:
a list of texts
- to_base64(protocol='pickle-array', compress=None, _show_progress=False)#
- Return type:
str
- to_bytes(protocol='pickle-array', compress=None, _file_ctx=None, _show_progress=False)#
Serialize itself into bytes.
For more Pythonic code, please use
bytes(...)
.- Parameters:
_file_ctx (
Optional
[BinaryIO
]) – File or filename or serialized bytes where the data is stored.protocol (
str
) – protocol to usecompress (
Optional
[str
]) – compress algorithm to use_show_progress (
bool
) – show progress bar, only works when protocol is pickle or protobuf
- Return type:
bytes
- Returns:
the binary serialization in bytes
- to_dataframe(**kwargs)#
Export itself to a
pandas.DataFrame
object.- Parameters:
kwargs – the extra kwargs will be passed to
pandas.DataFrame.from_dict()
.- Return type:
DataFrame
- Returns:
a
pandas.DataFrame
object
- to_dict(protocol='jsonschema', **kwargs)#
Convert the object into a Python list of Document dicts.
- Parameters:
protocol (
str
) – jsonschema or protobuf- Return type:
List
- Returns:
a Python list
- to_json(protocol='jsonschema', **kwargs)#
Convert the object into a JSON string. Can be loaded via
load_json()
.- Parameters:
protocol (
str
) – jsonschema or protobuf- Return type:
str
- Returns:
a JSON string
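A JSON round-trip sketch, reloading the string with load_json():

    from docarray import Document, DocumentArray

    da = DocumentArray([Document(text='hi')])
    json_str = da.to_json()
    da2 = DocumentArray.load_json(json_str)
    print(da2.texts)  # ['hi']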
- to_list(protocol='jsonschema', **kwargs)#
Convert the object into a Python list.
- Parameters:
protocol (
str
) – jsonschema or protobuf- Return type:
List
- Returns:
a Python list
- to_protobuf(ndarray_type=None)#
Convert DocumentArray into a Protobuf message.
- Parameters:
ndarray_type (
Optional
[str
]) – can belist
ornumpy
, if set it will force all ndarray-like object from all Documents toList
ornumpy.ndarray
.- Return type:
DocumentArrayProto
- Returns:
the protobuf message
- to_pydantic_model()#
Convert a DocumentArray object into a list of Pydantic models.
- Return type:
List
[PydanticDocument
]
- to_strawberry_type()#
Convert a DocumentArray object into a list of Strawberry type objects.
- Return type:
List
[StrawberryDocument]
- traverse(traversal_paths, filter_fn=None)#
Return an Iterator of :class:
TraversableSequence
of the leaves when applying the traversal_paths. Each :class:TraversableSequence
is either the root Documents, a ChunkArray or a MatchArray.- Parameters:
traversal_paths (
str
) – a comma-separated string that represents the traversal pathfilter_fn (
Optional
[Callable
[[Document
],bool
]]) – function to filter docs during traversal
- Yield:
TraversableSequence of the leaves when applying the traversal_paths.
Example on traversal_paths:

r: docs in this TraversableSequence
m: all match-documents at adjacency 1
c: all child-documents at granularity 1
r.[attribute]: access attribute of a multi modal document
cc: all child-documents at granularity 2
mm: all match-documents at adjacency 2
cm: all match-document at adjacency 1 and granularity 1
r,c: docs in this TraversableSequence and all child-documents at granularity 1
r[start:end]: access sub document array using slice
- Return type:
Iterable
[T]
- traverse_flat(traversal_paths, filter_fn=None)#
Returns a single flattened :class:
TraversableSequence
with all Documents, that are reached via thetraversal_paths
.Warning
When defining the
traversal_paths
with multiple paths, the returned :class:Documents
are determined at once and not on the fly. This is a different behavior then in :method:traverse
and :method:traverse_flattened_per_path
!- Parameters:
traversal_paths (
str
) – a list of string that represents the traversal pathfilter_fn (
Optional
[Callable
[[Document
],bool
]]) – function to filter docs during traversal
- Return type:
- Returns:
a single :class:
TraversableSequence
containing the document of all leaves when applying the traversal_paths.
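A sketch flattening chunks of hypothetical nested Documents:

    from docarray import Document, DocumentArray

    da = DocumentArray(
        [Document(text='root', chunks=[Document(text='c1'), Document(text='c2')])]
    )
    flat = da.traverse_flat('c')  # all chunk-level Documents
    print(flat.texts)             # ['c1', 'c2']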
- traverse_flat_per_path(traversal_paths, filter_fn=None)#
Returns a flattened :class:
TraversableSequence
per path intraversal_paths
with all Documents, that are reached by the path.- Parameters:
traversal_paths (
str
) – a comma-separated string that represents the traversal pathfilter_fn (
Optional
[Callable
[[Document
],bool
]]) – function to filter docs during traversal
- Yield:
TraversableSequence containing the documents of all leaves per path.