docarray.array.document module#

class docarray.array.document.DocumentArray(_docs: Optional[DocumentArraySourceType] = None, copy: bool = False, subindex_configs: Optional[Dict[str, None]] = None)[source]#
class docarray.array.document.DocumentArray(_docs: Optional[DocumentArraySourceType] = None, storage: str = 'sqlite', config: Optional[Union[SqliteConfig, Dict]] = None, subindex_configs: Optional[Dict[str, Dict]] = None)
class docarray.array.document.DocumentArray(_docs: Optional[DocumentArraySourceType] = None, storage: str = 'weaviate', config: Optional[Union[WeaviateConfig, Dict]] = None, subindex_configs: Optional[Dict[str, Dict]] = None)
class docarray.array.document.DocumentArray(_docs: Optional[DocumentArraySourceType] = None, storage: str = 'annlite', config: Optional[Union[AnnliteConfig, Dict]] = None, subindex_configs: Optional[Dict[str, Dict]] = None)
class docarray.array.document.DocumentArray(_docs: Optional[DocumentArraySourceType] = None, storage: str = 'elasticsearch', config: Optional[Union[ElasticConfig, Dict]] = None, subindex_configs: Optional[Dict[str, Dict]] = None)
class docarray.array.document.DocumentArray(_docs: Optional[DocumentArraySourceType] = None, storage: str = 'redis', config: Optional[Union[RedisConfig, Dict]] = None)
class docarray.array.document.DocumentArray(_docs: Optional[DocumentArraySourceType] = None, storage: str = 'milvus', config: Optional[Union[MilvusConfig, Dict]] = None)
class docarray.array.document.DocumentArray(_docs: Optional[DocumentArraySourceType] = None, storage: str = 'opensearch', config: Optional[Union[OpenSearchConfig, Dict]] = None)

Bases: AllMixins, BaseDocumentArray

DocumentArray is a list-like container of Document objects.

A DocumentArray can be used to store, embed, and retrieve Document objects.

from docarray import Document, DocumentArray

da = DocumentArray(
    [Document(text='The cake is a lie'), Document(text='Do a barrel roll!')]
)
da.apply(Document.embed_feature_hashing)

query = Document(text='Can i have some cake?').embed_feature_hashing()
query.match(da, metric='jaccard', use_scipy=True)

print(query.matches[:, ('text', 'scores__jaccard__value')])
[['The cake is a lie', 'Do a barrel roll!'], [0.9, 1.0]]

A DocumentArray can also embed its contents using a neural network, process them using an external Flow or Executor, and persist Documents in a Document Store for fast vector search:

from docarray import Document, DocumentArray
import numpy as np

n_dim = 3
metric = 'Euclidean'

# initialize a DocumentArray with AnnLite Document Store
da = DocumentArray(
    storage='annlite',
    config={'n_dim': n_dim, 'columns': [('price', 'float')], 'metric': metric},
)
# add Documents to the DocumentArray
with da:
    da.extend(
        [
            Document(id=f'r{i}', embedding=i * np.ones(n_dim), tags={'price': i})
            for i in range(10)
        ]
    )
# perform vector search
np_query = np.ones(n_dim) * 8
results = da.find(np_query)

See also

For further details, see our user guide.

append(value)#

S.append(value) – append value to the end of the sequence

apply(*args, **kwargs)#

Apply func to every Document in itself and return itself after modification.

Parameters:
  • func – a function that takes Document as input and outputs Document.

  • backend

    thread for multi-threading and process for multi-processing. Defaults to thread. In general, if your func is IO-bound then thread is a good choice. If your func is CPU-bound, then you may use process. In practice, you should try both to figure out which works best. However, if you wish to modify the elements in-place, regardless of IO/CPU-bound, you should always use the thread backend.

    Warning

    When using process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes the variable via pickle and works in another process. The passed object and the original object do not share the same memory.

  • num_worker – the number of parallel workers. If not given, then the number of CPUs in the system will be used.

  • pool – use an existing/external process or thread pool. If given, backend is ignored and you will be responsible for closing the pool.

  • show_progress – show a progress bar

Return type:

T

Returns:

itself after modification
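
For example, a minimal sketch that upper-cases every Document's text in place (to_upper is a hypothetical helper, not part of the API):

from docarray import Document, DocumentArray


def to_upper(doc):
    doc.text = doc.text.upper()
    return doc


da = DocumentArray([Document(text='hello'), Document(text='world')])
da.apply(to_upper)
print(da.texts)
['HELLO', 'WORLD']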

apply_batch(*args, **kwargs)#

Batches itself into mini-batches, applies func to every mini-batch, and returns itself after the modifications.

EXAMPLE USAGE

from docarray import Document, DocumentArray

da = DocumentArray([Document(text='The cake is a lie') for _ in range(100)])


def func(batch):
    batch.texts = [t.upper() for t in batch.texts]
    return batch


da.apply_batch(func, batch_size=10)
print(da.texts[:3])
['THE CAKE IS A LIE', 'THE CAKE IS A LIE', 'THE CAKE IS A LIE']
Parameters:
  • func – a function that takes DocumentArray as input and outputs DocumentArray.

  • backend

    thread for multi-threading and process for multi-processing. Defaults to thread. In general, if your func is IO-bound then thread is a good choice. If your func is CPU-bound, then you may use process. In practice, you should try both to figure out which works best. However, if you wish to modify the elements in-place, regardless of IO/CPU-bound, you should always use the thread backend.

    Warning

    When using process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes the variable via pickle and works in another process. The passed object and the original object do not share the same memory.

  • num_worker – the number of parallel workers. If not given, then the number of CPUs in the system will be used.

  • batch_size – Size of each generated batch (except the last batch, which might be smaller). Default: 32

  • shuffle – If set, shuffle the Documents before dividing into minibatches.

  • show_progress – show a progress bar

  • pool – use an existing/external process or thread pool. If given, backend is ignored and you will be responsible for closing the pool.

Return type:

T

Returns:

itself after modification

batch(batch_size, shuffle=False, show_progress=False)#

Creates a Generator that yields DocumentArray of size batch_size until self is fully traversed. Note that the last batch might be smaller than batch_size.

Parameters:
  • batch_size (int) – Size of each generated batch (except the last one, which might be smaller, default: 32)

  • shuffle (bool) – If set, shuffle the Documents before dividing into minibatches.

  • show_progress (bool) – if set, show a progress bar when batching documents.

Yield:

a Generator of DocumentArray, each in the length of batch_size

Return type:

Generator[DocumentArray, None, None]
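
For example, a minimal sketch of iterating over mini-batches:

from docarray import Document, DocumentArray

da = DocumentArray([Document(text=str(i)) for i in range(10)])
print([len(b) for b in da.batch(batch_size=4)])
[4, 4, 2]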

batch_ids(batch_size, shuffle=False)#

Creates a Generator that yields lists of ids of size batch_size until self is fully traversed. Note, that the last batch might be smaller than batch_size.

Parameters:
  • batch_size (int) – Size of each generated batch (except the last one, which might be smaller)

  • shuffle (bool) – If set, shuffle the Documents before dividing into minibatches.

Yield:

a Generator of list of IDs, each in the length of batch_size

Return type:

Generator[List[str], None, None]

property blobs: Optional[List[bytes]]#

Get the blob attribute of all Documents.

Return type:

Optional[List[bytes]]

Returns:

a list of blobs

clear() None -- remove all items from S#
static cloud_delete(name)#

Delete a DocumentArray from the cloud.

Parameters:

name (str) – the name of the DocumentArray to delete.

Return type:

None

static cloud_list(show_table=False)#

List all available arrays in the cloud.

Parameters:

show_table (bool) – if true, show the table of the arrays.

Return type:

List[str]

Returns:

List of available DocumentArray’s names.

classmethod cloud_pull(cls, name, show_progress=False, local_cache=True, *args, **kwargs)#

Pull a DocumentArray from Jina Cloud Service to local.

Parameters:
  • name (str) – the upload name set during push()

  • show_progress (bool) – whether to show a progress bar while pulling

  • local_cache (bool) – store the downloaded DocumentArray in a local folder

Return type:

T

Returns:

a DocumentArray object

cloud_push(name, show_progress=False, public=True, branding=None)#

Push this DocumentArray object to Jina Cloud, from where it can later be retrieved via pull()

Note

  • Pushing with the same name will override the existing content.

  • This is kinda like a public clipboard where everyone can override anyone’s content. So to make your content survive longer, you may want to use a longer and more complicated name.

  • The lifetime of the content is not guaranteed at the moment: it could be a day, it could be a week. Do not use it for persistence. Only use it for temporary transmission/storage.

Parameters:
  • name (str) – a name that can later be used to retrieve this DocumentArray.

  • show_progress (bool) – whether to show a progress bar while pushing

  • public (bool) – by default anyone can pull a DocumentArray if they know its name. Setting this to False will allow only the creator to pull it. This feature of course requires you to log in first.

  • branding (Optional[Dict]) – a dict of branding information to be sent to Jina Cloud, e.g. {“icon”: “emoji”, “background”: “#fff”}

Return type:

Dict

property contents: Optional[Union[Sequence[DocumentContentType], ArrayType]]#

Get the content of all Documents.

Return type:

Union[Sequence[DocumentContentType], ArrayType, None]

Returns:

a list of texts, blobs or ArrayType

count(value) integer -- return number of occurrences of value#
classmethod dataloader(path, func, batch_size, protocol='protobuf', compress=None, backend='thread', num_worker=None, pool=None, show_progress=False)#

Load array elements, batch them, and map them with a function in parallel; finally yield each batch as a DocumentArray.

Parameters:
  • path (Union[str, Path]) – Path or filename where the data is stored.

  • func (Callable[[DocumentArray], T]) – a function that takes DocumentArray as input and outputs anything. You can either modify elements in-place (only with thread backend) or work later on return elements.

  • batch_size (int) – Size of each generated batch (except the last one, which might be smaller)

  • protocol (str) – protocol to use

  • compress (Optional[str]) – compress algorithm to use

  • backend (str) –

    whether to use multi-process or multi-thread as the parallelization backend. In general, if your func is IO-bound then thread is good enough. If your func is CPU-bound then you may use process. In practice, you should try both to figure out which works best. However, if you wish to modify the elements in-place, regardless of IO/CPU-bound, you should always use the thread backend.

    Warning

    When using process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes the variable via pickle and works in another process. The passed object and the original object do not share the same memory.

  • num_worker (Optional[int]) – the number of parallel workers. If not given, then the number of CPUs in the system will be used.

  • pool (Union[Pool, ThreadPool, None]) – use an existing/external pool. If given, backend is ignored and you will be responsible for closing the pool.

  • show_progress (bool) – if set, show a progress bar

Return type:

Generator[DocumentArray, None, None]

embed(embed_model, device='cpu', batch_size=256, to_numpy=False, collate_fn=None)#

Fill the embedding of Documents in-place by using embed_model. For the evaluation of a model, one can directly use the embed_and_evaluate() function.

Parameters:
  • embed_model (AnyDNN) – The embedding model written in Keras/Pytorch/Paddle

  • device (str) – The computational device for embed_model, can be either cpu or cuda.

  • batch_size (int) – Number of Documents in a batch for embedding

  • to_numpy (bool) – Whether to store embeddings back to the Document in numpy.ndarray or in the original framework format.

  • collate_fn (Optional[CollateFnType]) – create a mini-batch of Input(s) from the given DocumentArray. Default built-in collate_fn is to use the tensors of the documents.

Return type:

T

Returns:

itself after modification.
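
A minimal sketch, assuming PyTorch is installed and each Document carries a tensor that the model accepts:

import numpy as np
import torch
from docarray import Document, DocumentArray

da = DocumentArray(
    [Document(tensor=np.random.rand(8).astype('float32')) for _ in range(4)]
)
model = torch.nn.Linear(8, 2)  # any Keras/PyTorch/Paddle model works here
da.embed(model, device='cpu', to_numpy=True)
print(da.embeddings.shape)
(4, 2)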

embed_and_evaluate(metrics, index_data=None, ground_truth=None, metric_names=None, strict=True, label_tag='label', embed_models=None, embed_funcs=None, device='cpu', batch_size=256, collate_fns=None, distance='cosine', limit=20, normalization=None, exclude_self=False, use_scipy=False, num_worker=1, match_batch_size=100000, query_sample_size=1000, **kwargs)#

Computes ranking evaluation metrics for a given DocumentArray. This function does embedding and matching in the same turn. Thus, you don’t need to call embed and match before it. Instead, it embeds the documents in self (and index_data when provided) and computes the nearest neighbours itself. This might be done in batches for the index_data object to reduce the memory consumption of the evaluation process. The evaluation itself can be done against a ground_truth DocumentArray or on the basis of labels, as with the evaluate() function.

Parameters:
  • metrics (List[Union[str, Callable[..., float]]]) – List of metric names or metric functions to be computed

  • index_data (Optional[DocumentArray]) – The other DocumentArray to match against. If not given, self will be matched against itself. This means that every document in self will be compared to all other documents in self to determine the nearest neighbors.

  • ground_truth (Optional[DocumentArray]) – The ground_truth DocumentArray that the DocumentArray compares to.

  • metric_names (Optional[List[str]]) – If provided, the results of the metrics computation will be stored in the evaluations field of each Document under these names. If not provided, the names will be derived from the metric function names.

  • strict (bool) – If set, then the left and right sides are required to be fully aligned: on the length, and on the semantics of the length. This prevents you from accidentally evaluating on irrelevant matches.

  • label_tag (str) – Specifies the tag which contains the labels.

  • embed_models (Union[AnyDNN, Tuple[AnyDNN, AnyDNN], None]) – One or two embedding models written in Keras/PyTorch/Paddle for embedding self and index_data.

  • embed_funcs (Union[Callable, Tuple[Callable, Callable], None]) – As an alternative to embedding models, custom embedding functions can be provided.

  • device (str) – the computational device for embed_models and the matching; can be either cpu or cuda.

  • batch_size (Union[int, Tuple[int, int]]) – Number of documents in a batch for embedding.

  • collate_fns (Union[CollateFnType, None, Tuple[Optional[CollateFnType], Optional[CollateFnType]]]) – For each embedding function the respective collate function creates a mini-batch of input(s) from the given DocumentArray. If not provided a default built-in collate_fn uses the tensors of the documents to create input batches.

  • distance (Union[str, Callable[[ArrayType, ArrayType], ndarray]]) – The distance metric.

  • limit (Union[int, float, None]) – The maximum number of matches, when not given defaults to 20.

  • normalization (Optional[Tuple[float, float]]) – A tuple (a, b) to be used with min-max normalization: the min distance will be rescaled to a, the max distance will be rescaled to b, and all values will be rescaled into the range [a, b].

  • exclude_self (bool) – If set, Documents in index_data with same id as the left-hand values will not be considered as matches.

  • use_scipy (bool) – if set, use scipy as the computation backend. Note, scipy does not support distance on sparse matrix.

  • num_worker (int) – Specifies the number of workers for the execution of the match function.

  • kwargs – Additional keyword arguments to be passed to the metric functions.

  • query_sample_size (int) – For a large number of documents in self the evaluation becomes infeasible, especially if index_data is large. Therefore, queries are sampled if the number of documents in self exceeds query_sample_size. Usually, this has only a small impact on the mean metric values returned by this function. To prevent sampling, you can set query_sample_size to None.

Param match_batch_size:

The number of documents which are embedded and matched at once. Set this value to a lower value, if you experience high memory consumption.

Return type:

Union[float, List[float], None]

Returns:

A dictionary which stores for each metric name the average evaluation score.

property embeddings: Optional[ArrayType]#

Return an ArrayType stacking all the embedding attributes as rows.

Return type:

Optional[ArrayType]

Returns:

an ArrayType of embeddings

classmethod empty(size=0, *args, **kwargs)#

Create a DocumentArray object with size empty Document objects.

Parameters:

size (int) – the number of empty Documents in this container

Return type:

T

Returns:

a DocumentArray object
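
For example:

from docarray import DocumentArray

da = DocumentArray.empty(3)
print(len(da))
3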

evaluate(metrics, ground_truth=None, hash_fn=None, metric_names=None, strict=True, label_tag='label', num_relevant_documents_per_label=None, **kwargs)#

Compute ranking evaluation metrics for a given DocumentArray when compared with a ground truth.

If one provides a ground_truth DocumentArray that is structurally identical to self, this function compares the matches of documents inside the DocumentArray to this ground_truth. Alternatively, one can directly annotate the documents by adding labels in the form of tags with the key specified in the label_tag attribute. Those tags need to be added to self as well as to the documents in the matches properties.

This method will fill the evaluations field of Documents inside this DocumentArray and will return the average of the computations.

Parameters:
  • metrics (List[Union[str, Callable[..., float]]]) – List of metric names or metric functions to be computed

  • ground_truth (Optional[DocumentArray]) – The ground_truth DocumentArray that the DocumentArray compares to.

  • hash_fn (Optional[Callable[[Document], str]]) – For the evaluation against a ground_truth DocumentArray, this function is used for generating hashes which are used to compare the documents. If not given, Document.id is used.

  • metric_names (Optional[List[str]]) – If provided, the results of the metrics computation will be stored in the evaluations field of each Document under these names. If not provided, the names will be derived from the metric function names.

  • strict (bool) – If set, then the left and right sides are required to be fully aligned: on the length, and on the semantics of the length. This prevents you from accidentally evaluating on irrelevant matches.

  • label_tag (str) – Specifies the tag which contains the labels.

  • num_relevant_documents_per_label (Optional[Dict[Any, int]]) – Some metrics, e.g., recall@k, require the number of relevant documents. To apply those to a labeled dataset, one can provide a dictionary which maps labels to the total number of documents with this label.

  • kwargs – Additional keyword arguments to be passed to the metric functions.

Return type:

Dict[str, float]

Returns:

A dictionary which stores for each metric name the average evaluation score.
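
A minimal label-based sketch, assuming matches have been computed beforehand (precision_at_k is one of docarray's built-in metric functions, and the kwarg k is forwarded to it):

import numpy as np
from docarray import Document, DocumentArray

da = DocumentArray(
    [Document(embedding=np.array([i, 1.0]), tags={'label': i % 2}) for i in range(4)]
)
da.match(da, metric='euclidean', exclude_self=True)
print(da.evaluate(metrics=['precision_at_k'], label_tag='label', k=2))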

extend(values)#

S.extend(iterable) – extend sequence by appending elements from the iterable

find(query=None, metric='cosine', limit=20, metric_name=None, exclude_self=False, filter=None, only_id=False, index='text', return_root=False, on=None, **kwargs)#

Returns matching Documents given an input query. If the query is a DocumentArray, Document or ArrayType, exhaustive or approximate nearest-neighbor search will be performed, depending on whether the storage backend supports ANN. Furthermore, if filter is not None, pre-filtering will be applied along with the vector search.

If the query is a dict object, or the query is None and filter is not None, Documents will be filtered and all Documents that match the filter will be returned. In this case, query (if it is a dict) or filter will be used for filtering. The object must follow the backend-specific filter format if the backend supports filtering, or DocArray’s query language format. In the latter case, filtering will be applied on the client side, not the backend side.

If the query is a string or a list of strings, a search by text will be performed if the backend supports indexing and searching text fields. If not, a NotImplementedError will be raised.

Parameters:
  • query (Union[DocumentArray, Document, ArrayType, Dict, str, List[str], None]) – the input query to search by

  • limit (Union[int, float, None]) – the maximum number of matches, when not given defaults to 20.

  • metric_name (Optional[str]) – if provided, then match result will be marked with this string.

  • metric (Union[str, Callable[[ArrayType, ArrayType], ndarray]]) – the distance metric.

  • exclude_self (bool) – if set, Documents in results with same id as the query values will not be considered as matches. This is only applied when the input query is Document or DocumentArray.

  • filter (Union[Dict, str, None]) – filter query used for pre-filtering or filtering

  • only_id (bool) – if set, then the returned matches will only contain id

  • index (str) – if the query is a string, text search will be performed on the index field, otherwise, this parameter is ignored. By default, the Document text attribute will be used for search, otherwise the tag field specified by index will be used. You can only use this parameter if the storage backend supports searching by text.

  • return_root (Optional[bool]) – if set, then the root-level DocumentArray will be returned

  • on (Optional[str]) – specifies a subindex to search on. If set, the returned DocumentArray will be retrieved from the given subindex.

  • kwargs – other kwargs.

Return type:

Union[DocumentArray, List[DocumentArray]]

Returns:

a list of DocumentArrays containing the closest Document objects for each of the queries in query.
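
A minimal sketch of vector search on the default in-memory storage, mirroring the AnnLite example at the top of this page (for a single query vector, the result is the DocumentArray of matches):

import numpy as np
from docarray import Document, DocumentArray

da = DocumentArray([Document(embedding=np.array([i, i])) for i in range(1, 5)])
np_query = np.ones(2) * 2
results = da.find(np_query, metric='euclidean', limit=2)
print(results[:, 'embedding'])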

flatten()#

Flatten all nested chunks and matches into one DocumentArray.

Note

Flattening an already flattened DocumentArray has no effect.

Return type:

DocumentArray

Returns:

a flattened DocumentArray object.
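
For example, the root Document and its chunks all end up in one flat DocumentArray:

from docarray import Document, DocumentArray

da = DocumentArray(
    [Document(text='root', chunks=[Document(text='c1'), Document(text='c2')])]
)
print(len(da.flatten()))
3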

classmethod from_base64(data, protocol='pickle-array', compress=None, _show_progress=False, *args, **kwargs)#
Return type:

T

classmethod from_bytes(data, protocol='pickle-array', compress=None, _show_progress=False, *args, **kwargs)#
Return type:

T

classmethod from_csv(*args, **kwargs)#

Import a DocumentArray from a CSV file.

Return type:

T

classmethod from_dataframe(df, *args, **kwargs)#

Import a DocumentArray from a pandas.DataFrame object.

Parameters:

df (DataFrame) – a pandas.DataFrame object.

Return type:

T

Returns:

a DocumentArray object
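
A minimal sketch, assuming the DataFrame columns map to Document attributes:

import pandas as pd
from docarray import DocumentArray

df = pd.DataFrame({'text': ['hello', 'world']})
da = DocumentArray.from_dataframe(df)
print(da.texts)
['hello', 'world']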

classmethod from_dict(values, protocol='jsonschema', **kwargs)#
Return type:

T

classmethod from_files(*args, **kwargs)#

Create a DocumentArray from a glob of local files.

Return type:

T

classmethod from_huggingface_datasets(*args, **kwargs)#

Import a DocumentArray from a Hugging Face dataset.

Return type:

T

classmethod from_json(file, protocol='jsonschema', **kwargs)#
Return type:

T

classmethod from_lines(*args, **kwargs)#

Import a DocumentArray from lines of text, one Document per line.

Return type:

T

classmethod from_list(values, protocol='jsonschema', **kwargs)#
Return type:

T

classmethod from_ndarray(*args, **kwargs)#

Import a DocumentArray from a numpy.ndarray, one Document per row.

Return type:

T

classmethod from_ndjson(*args, **kwargs)#

Import a DocumentArray from a newline-delimited JSON (ndjson) source, one Document per line.

Return type:

T

classmethod from_protobuf(pb_msg)#
Return type:

T

classmethod from_pydantic_model(model)#

Convert a list of PydanticDocument objects into a DocumentArray

Parameters:

model (List[BaseModel]) – the list of pydantic data model objects that represents a DocumentArray

Return type:

T

Returns:

a DocumentArray

classmethod from_strawberry_type(model)#

Convert a list of Strawberry type objects into a DocumentArray

Parameters:

model (List[StrawberryDocument]) – the list of strawberry type objects that represents a DocumentArray

Return type:

T

Returns:

a DocumentArray

classmethod get_json_schema(indent=2)#

Return a JSON Schema of DocumentArray class.

Return type:

str

get_vocabulary(min_freq=1, text_attrs=('text',))#

Get the text vocabulary from all Documents, as a dict that maps each word to an index.

Parameters:
  • text_attrs (Tuple[str, ...]) – the textual attributes where vocabulary will be derived from

  • min_freq (int) – the minimum word frequency to be considered into the vocabulary.

Return type:

Dict[str, int]

Returns:

a vocabulary dictionary where the key is the word and the value is the index. Indices start from 2: 0 is reserved for padding and 1 for the unknown token.
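
For example (the exact index assignment depends on traversal order):

from docarray import Document, DocumentArray

da = DocumentArray([Document(text='hello world'), Document(text='hello docarray')])
print(da.get_vocabulary())
{'hello': 2, 'world': 3, 'docarray': 4}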

index(value[, start[, stop]]) integer -- return first index of value.#

Raises ValueError if the value is not present.

Supporting start and stop arguments is optional, but recommended.

abstract insert(index, value)#

S.insert(index, value) – insert value before index

classmethod load(file, file_format='binary', encoding='utf-8', **kwargs)#

Load array elements from a JSON or a binary file, or a CSV file.

Parameters:
  • file (Union[str, TextIO, BinaryIO]) – File or filename from which the data is loaded.

  • file_format (str) – json or binary or csv. JSON and CSV files are human-readable, but the binary format gives much smaller size and faster save/load speed. A CSV file has very limited compatibility: a complex DocumentArray with a nested structure cannot be restored from a CSV file.

  • encoding (str) – encoding used to load data from a file (it only applies to JSON and CSV format). By default, utf-8 is used.

Return type:

T

Returns:

the loaded DocumentArray object

classmethod load_binary(file, protocol='pickle-array', compress=None, _show_progress=False, streaming=False, *args, **kwargs)#

Load array elements from a compressed binary file.

Parameters:
  • file (Union[str, BinaryIO, bytes, Path]) – File or filename or serialized bytes where the data is stored.

  • protocol (str) – protocol to use

  • compress (Optional[str]) – compress algorithm to use

  • _show_progress (bool) – show progress bar, only works when protocol is pickle or protobuf

  • streaming (bool) – if True returns a generator over Document objects.

In case protocol is pickle the Documents are streamed from disk to save memory usage.

Return type:

Union[DocumentArray, Generator[Document, None, None]]

Returns:

a DocumentArray object

Note

If file is str it can specify protocol and compress as file extensions. This functionality assumes file=file_name.$protocol.$compress where $protocol and $compress refer to a string interpolation of the respective protocol and compress methods. For example if file=my_docarray.protobuf.lz4 then the binary data will be loaded assuming protocol=protobuf and compress=lz4.

classmethod load_csv(file, field_resolver=None, encoding='utf-8')#

Load array elements from a CSV file.

Parameters:
  • file (Union[str, TextIO]) – File or filename from which the data is loaded.

  • field_resolver (Optional[Dict[str, str]]) – a map from field names defined in the file to the field names defined in Document.

  • encoding (str) – encoding used to read a CSV file. By default, utf-8 is used.

Return type:

T

Returns:

a DocumentArray object

classmethod load_json(file, protocol='jsonschema', encoding='utf-8', **kwargs)#

Load array elements from a JSON file.

Parameters:
  • file (Union[str, TextIO]) – File or filename or a JSON string from which the data is loaded.

  • protocol (str) – jsonschema or protobuf

  • encoding (str) – encoding used to load data from a JSON file. By default, utf-8 is used.

Return type:

T

Returns:

a DocumentArray object

map(func, backend='thread', num_worker=None, show_progress=False, pool=None)#

Return an iterator that applies func to every Document in self in parallel, yielding the results.

See also

To process Documents in batches, please use map_batch().

Parameters:
  • func (Callable[[Document], T]) – a function that takes Document as input and outputs anything. You can either modify elements in-place (only with thread backend) or work later on return elements.

  • backend (str) –

    thread for multi-threading and process for multi-processing. Defaults to thread. In general, if your func is IO-bound then thread is a good choice. If your func is CPU-bound, then you may use process. In practice, you should try both to figure out which works best. However, if you wish to modify the elements in-place, regardless of IO/CPU-bound, you should always use the thread backend.

    Warning

    When using process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes the variable via pickle and works in another process. The passed object and the original object do not share the same memory.

  • num_worker (Optional[int]) – the number of parallel workers. If not given, then the number of CPUs in the system will be used.

  • show_progress (bool) – show a progress bar

  • pool (Union[Pool, ThreadPool, None]) – use an existing/external process or thread pool. If given, backend is ignored and you will be responsible for closing the pool.

Yield:

anything returned from func

Return type:

Generator[T, None, None]
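
For example, computing a value per Document without modifying it:

from docarray import Document, DocumentArray

da = DocumentArray([Document(text='hello'), Document(text='docarray')])
print(list(da.map(lambda d: len(d.text))))
[5, 8]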

map_batch(func, batch_size, backend='thread', num_worker=None, shuffle=False, show_progress=False, pool=None)#

Return an iterator that applies func to every minibatch of self in parallel, yielding the results. Each element in the returned iterator is a DocumentArray.

See also

To process Documents one by one, please use map().

Parameters:
  • batch_size (int) – Size of each generated batch (except the last one, which might be smaller, default: 32)

  • shuffle (bool) – If set, shuffle the Documents before dividing into minibatches.

  • func (Callable[[DocumentArray], T]) – a function that takes DocumentArray as input and outputs anything. You can either modify elements in-place (only with thread backend) or work later on return elements.

  • backend (str) –

    whether to use multi-process or multi-thread as the parallelization backend. In general, if your func is IO-bound then thread is good enough. If your func is CPU-bound then you may use process. In practice, you should try both to figure out which works best. However, if you wish to modify the elements in-place, regardless of IO/CPU-bound, you should always use the thread backend.

    Warning

    When using process backend, you should not expect func to modify elements in-place. This is because the multiprocessing backend passes the variable via pickle and works in another process. The passed object and the original object do not share the same memory.

  • num_worker (Optional[int]) – the number of parallel workers. If not given, then the number of CPUs in the system will be used.

  • show_progress (bool) – show a progress bar

  • pool (Union[Pool, ThreadPool, None]) – use an existing/external pool. If given, backend is ignored and you will be responsible for closing the pool.

Yield:

anything returned from func

Return type:

Generator[T, None, None]
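
For example, a minimal sketch that processes the array in mini-batches of two:

from docarray import Document, DocumentArray

da = DocumentArray([Document(text='hi') for _ in range(4)])
print([len(b) for b in da.map_batch(lambda b: b, batch_size=2)])
[2, 2]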

match(darray, metric='cosine', limit=20, normalization=None, metric_name=None, batch_size=None, exclude_self=False, filter=None, only_id=False, use_scipy=False, device='cpu', num_worker=1, on=None, **kwargs)#

Compute embedding-based nearest neighbours in darray for each Document in self, and store the results in matches. For the purpose of evaluation, one can also directly use the embed_and_evaluate() function.

Note

'cosine', 'euclidean' and 'sqeuclidean' are supported natively without extra dependency. You can use other distance metrics provided by scipy, such as braycurtis, canberra, chebyshev, cityblock, correlation, cosine, dice, euclidean, hamming, jaccard, jensenshannon, kulsinski, mahalanobis, matching, minkowski, rogerstanimoto, russellrao, seuclidean, sokalmichener, sokalsneath, sqeuclidean, wminkowski, yule. To use a scipy metric, please set use_scipy=True.

  • To make all match values fall in [0, 1], use dA.match(dB, normalization=(0, 1)).

  • To invert the distance as a score and make all values fall in the range [0, 1], use dA.match(dB, normalization=(1, 0)). Note how this normalization differs from the previous one.

  • If a custom metric distance is provided, make sure that it returns scores as distances and not as similarities, meaning the smaller the better.

Parameters:
  • darray (DocumentArray) – the other DocumentArray to match against

  • metric (Union[str, Callable[[ArrayType, ArrayType], ndarray]]) – the distance metric

  • limit (Union[int, float, None]) – the maximum number of matches, when not given defaults to 20.

  • normalization (Optional[Tuple[float, float]]) – a tuple (a, b) to be used with min-max normalization: the min distance will be rescaled to a, the max distance will be rescaled to b, and all values will be rescaled into the range [a, b].

  • metric_name (Optional[str]) – if provided, then match result will be marked with this string.

  • batch_size (Optional[int]) – if provided, then darray is loaded in batches, where each of them is at most batch_size elements. When darray is big, this can significantly speedup the computation.

  • exclude_self (bool) – if set, Documents in darray with same id as the left-hand values will not be considered as matches.

  • filter (Optional[Dict]) – filter query used for pre-filtering

  • only_id (bool) – if set, then the returned matches will only contain id

  • use_scipy (bool) – if set, use scipy as the computation backend. Note, scipy does not support distance on sparse matrix.

  • device (str) – the computational device for .match(), can be either cpu or cuda.

  • num_worker (Optional[int]) –

    the number of parallel workers. If not given, then the number of CPUs in the system will be used.

    Note

    This argument is only effective when batch_size is set.

  • on (Optional[str]) – specifies a subindex to search on. If set, the returned DocumentArray will be retrieved from the given subindex.

  • kwargs – other kwargs.

Return type:

None
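
For example:

import numpy as np
from docarray import Document, DocumentArray

dA = DocumentArray([Document(embedding=np.array([0.0, 1.0]))])
dB = DocumentArray(
    [Document(embedding=np.array([1.0, 0.0])), Document(embedding=np.array([0.0, 1.0]))]
)
dA.match(dB, metric='euclidean', limit=1)
print(dA[0].matches[0].scores['euclidean'].value)
0.0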

plot_embeddings(title='MyDocumentArray', path=None, image_sprites=False, min_image_size=16, channel_axis=-1, start_server=True, host='127.0.0.1', port=None, image_source='tensor', exclude_fields_metas=None)#

Interactively visualize embeddings using the Embedding Projector and store the visualization information.

Parameters:
  • title (str) – the title of this visualization. If you want to compare multiple embeddings at the same time, make sure to give different names each time and set path to the same value.

  • host (str) – if set, bind the embedding-projector frontend to given host. Otherwise localhost is used.

  • port (Optional[int]) – if set, run the embedding-projector frontend at given port. Otherwise a random port is used.

  • image_sprites (bool) – if set, visualize the dots using uri and tensor.

  • path (Optional[str]) – if set, then append the visualization to an existing folder, where you can compare multiple embeddings at the same time. Make sure to use a different title each time.

  • min_image_size (int) – only used when image_sprites=True. the minimum size of the image

  • channel_axis (int) – only used when image_sprites=True. the axis id of the color channel, -1 indicates the color channel info at the last axis

  • start_server (bool) – if set, start an HTTP server and open the frontend directly. Otherwise, you need to rely on the returned path and serve it yourself.

  • image_source (str) – specify where the image comes from, can be uri or tensor; an empty tensor will fall back to uri

  • exclude_fields_metas (Optional[List[str]]) – specify the fields that you want to exclude from metadata tsv file

Return type:

str

Returns:

the path to the embeddings visualization info.

plot_image_sprites(output=None, canvas_size=512, min_size=16, channel_axis=-1, image_source='tensor', skip_empty=False, show_progress=False, show_index=False, fig_size=(10, 10), keep_aspect_ratio=False)#

Generate a sprite image for all image tensors in this DocumentArray-like object.

An image sprite is a collection of images put into a single image. It is always square-sized. Each sub-image is also square-sized and equally-sized.

Parameters:
  • output (Optional[str]) – Optional path to store the visualization. If not given, show in UI

  • canvas_size (int) – the size of the canvas

  • min_size (int) – the minimum size of the image

  • channel_axis (int) – the axis id of the color channel, -1 indicates the color channel info at the last axis

  • image_source (str) – specify where the image comes from, can be uri or tensor; an empty tensor will fall back to uri

  • skip_empty (bool) – skip Documents that have no .uri or .tensor.

  • show_index (bool) – show the index on the top-right corner of every image

  • fig_size (Optional[Tuple[int, int]]) – the size of the figure

  • show_progress (bool) – show a progress bar while plotting.

  • keep_aspect_ratio (bool) – preserve the aspect ratio of the image by using the aspect ratio of the first image in self.

Return type:

None

pop([index]) item -- remove and return item at index (default last).#

Raise IndexError if list is empty or index is out of range.

post(host, show_progress=False, batch_size=None, parameters=None, **kwargs)#

Post itself to a remote Flow/Sandbox and get the modified DocumentArray back.

Parameters:
  • host (str) – a host string. Can be one of the following:

    grpc://192.168.0.123:8080/endpoint
    ws://192.168.0.123:8080/endpoint
    http://192.168.0.123:8080/endpoint
    jinahub://Hello/endpoint
    jinahub+docker://Hello/endpoint
    jinahub+docker://Hello/v0.0.1/endpoint
    jinahub+docker://Hello/latest/endpoint
    jinahub+sandbox://Hello/endpoint

  • show_progress (bool) – whether to show a progress bar

  • batch_size (Optional[int]) – number of Documents in each request

  • parameters (Optional[Dict]) – parameters to send in the request

Return type:

DocumentArray

Returns:

the new DocumentArray returned from remote

classmethod pull(cls, name, show_progress=False, local_cache=True, *args, **kwargs)#

Pull a DocumentArray from Jina Cloud Service to local.

Parameters:
  • name (str) – the upload name set during push()

  • show_progress (bool) – whether to show a progress bar while pulling

  • local_cache (bool) – store the downloaded DocumentArray in a local folder

Return type:

T

Returns:

a DocumentArray object

push(name, show_progress=False, public=True, branding=None)#

Push this DocumentArray object to Jina Cloud, from where it can later be retrieved via pull()

Note

  • Pushing with the same name will override the existing content.

  • This is kinda like a public clipboard where everyone can override anyone’s content. So to make your content survive longer, you may want to use a longer and more complicated name.

  • The lifetime of the content is not guaranteed at the moment: it could be a day, it could be a week. Do not use it for persistence. Only use it for temporary transmission/storage.

Parameters:
  • name (str) – a name that can later be used to retrieve this DocumentArray.

  • show_progress (bool) – whether to show a progress bar while pushing

  • public (bool) – by default anyone can pull a DocumentArray if they know its name. Setting this to False will allow only the creator to pull it. This feature of course requires you to log in first.

  • branding (Optional[Dict]) – a dict of branding information to be sent to Jina Cloud, e.g. {“icon”: “emoji”, “background”: “#fff”}

Return type:

Dict
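
For example, a minimal push/pull round trip (my-da is an arbitrary example name; this requires network access to Jina Cloud):

from docarray import Document, DocumentArray

da = DocumentArray([Document(text='hello')])
da.push('my-da')
da_pulled = DocumentArray.pull('my-da')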

reduce(other)#

Reduces other and the current DocumentArray into one DocumentArray in-place. Changes are applied to the current DocumentArray. Reducing 2 DocumentArrays consists in adding Documents in the second DocumentArray to the first DocumentArray if they do not exist. If a Document exists in both DocumentArrays, the data properties are merged with priority to the first Document (that is, to the current DocumentArray’s Document). The matches and chunks are also reduced in the same way.

Parameters:

other (T) – DocumentArray

Return type:

T

Returns:

DocumentArray

reduce_all(others)#

Reduces a list of DocumentArrays and this DocumentArray into one DocumentArray. Changes are applied to this DocumentArray in-place.

Reduction consists in reducing this DocumentArray with every DocumentArray in others sequentially using reduce(). The resulting DocumentArray contains Documents of all DocumentArrays. If a Document exists in many DocumentArrays, data properties are merged with priority to the left-most DocumentArrays (that is, if a data attribute is set in a Document belonging to many DocumentArrays, the attribute value of the left-most DocumentArray is kept). Matches and chunks of a Document belonging to many DocumentArrays are also reduced in the same way. Other non-data properties are ignored.

Note

  • Matches are not kept in a sorted order when they are reduced. You might want to re-sort them in a later step.

  • The final result depends on the order of DocumentArrays when applying reduction.

Parameters:

others (List[T]) – List of DocumentArrays to be reduced

Return type:

T

Returns:

the resulting DocumentArray
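
For example, note how the left-most Document's data wins for the shared id:

from docarray import Document, DocumentArray

da1 = DocumentArray([Document(id='a', text='left')])
da2 = DocumentArray([Document(id='a', text='right'), Document(id='b', text='new')])
da1.reduce_all([da2])
print(len(da1), da1['a'].text)
2 left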

remove(value)#

S.remove(value) – remove first occurrence of value. Raise ValueError if the value is not present.

reverse()#

S.reverse() – reverse IN PLACE

sample(k, seed=None)#

Randomly sample k elements from the DocumentArray without replacement.

Parameters:
  • k (int) – Number of elements to sample from the document array.

  • seed (Optional[int]) – initializes the random number generator; None by default. If set, the random state is fixed so the sampling is reproducible.

Return type:

DocumentArray

Returns:

A sampled subset of Documents, represented as a DocumentArray.

save(file, file_format='binary', encoding='utf-8')#

Save array elements into a JSON, binary or CSV file.

Parameters:
  • file (Union[str, TextIO, BinaryIO]) – File or filename to which the data is saved.

  • file_format (str) – json or binary or csv. JSON and CSV files are human-readable, but the binary format gives much smaller size and faster save/load speed. Note that a CSV file has very limited compatibility: a complex DocumentArray with a nested structure cannot be restored from a CSV file.

  • encoding (str) – encoding used to save data into a file (it only applies to JSON and CSV format). By default, utf-8 is used.

Return type:

None
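
For example, a simple save/load round trip:

from docarray import Document, DocumentArray

da = DocumentArray([Document(text='hello')])
da.save('docs.bin', file_format='binary')
da_loaded = DocumentArray.load('docs.bin', file_format='binary')
print(da_loaded.texts)
['hello']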

save_binary(file, protocol='pickle-array', compress=None)#

Save array elements into a binary file.

Parameters:
  • file (Union[str, BinaryIO]) – File or filename to which the data is saved.

  • protocol (str) – protocol to use

  • compress (Optional[str]) –

    compress algorithm to use

    Note

    If file is str it can specify protocol and compress as file extensions. This functionality assumes file=file_name.$protocol.$compress where $protocol and $compress refer to a string interpolation of the respective protocol and compress methods. For example if file=my_docarray.protobuf.lz4 then the binary data will be created using protocol=protobuf and compress=lz4.

Compared to save_json(), it is faster and the file is smaller, but not human-readable.

Note

To get a binary representation in memory, use bytes(...).

Return type:

None

save_csv(file, flatten_tags=True, exclude_fields=None, dialect='excel', with_header=True, encoding='utf-8')#

Save array elements into a CSV file.

Parameters:
  • file (Union[str, TextIO]) – File or filename to which the data is saved.

  • flatten_tags (bool) – if set, then all fields in Document.tags will be flattened into tag__fieldname and stored as separate columns. It is useful when tags contain a lot of information.

  • exclude_fields (Optional[Sequence[str]]) – if set, those fields won’t show up in the output CSV

  • dialect (Union[str, Dialect]) – define a set of parameters specific to a particular CSV dialect. could be a string that represents predefined dialects in your system, or could be a csv.Dialect class that groups specific formatting parameters together.

  • encoding (str) – encoding used to save the data into a CSV file. By default, utf-8 is used.

Return type:

None

save_embeddings_csv(file, encoding='utf-8', **kwargs)#

Save embeddings to a CSV file

This function uses numpy.savetxt() internally.

Parameters:
  • file (Union[str, TextIO]) – File or filename to which the data is saved.

  • encoding (str) – encoding used to save the data into a file. By default, utf-8 is used.

  • kwargs – extra kwargs will be passed to numpy.savetxt().

Return type:

None

save_gif(output, channel_axis=-1, duration=200, size_ratio=1.0, inline_display=False, image_source='tensor', skip_empty=False, show_index=False, show_progress=False)#

Save a gif of the DocumentArray. Each frame corresponds to a Document.uri/.tensor in the DocumentArray.

Parameters:
  • output (str) – the file path to save the gif to.

  • channel_axis (int) – the color channel axis of the tensor.

  • duration (int) – the duration of each frame in milliseconds.

  • size_ratio (float) – the size ratio of each frame.

  • inline_display (bool) – whether to show the gif in a Jupyter notebook.

  • image_source (str) – the source of the image in the Document attribute.

  • skip_empty (bool) – whether to skip empty documents.

  • show_index (bool) – whether to show the index of the document in the top-right corner.

  • show_progress (bool) – whether to show a progress bar.

Return type:

None

save_json(file, protocol='jsonschema', encoding='utf-8', **kwargs)#

Save array elements into a JSON file.

Compared to save_binary(), it is human-readable but slower to save/load, and the file size is larger.

Parameters:
  • file (Union[str, TextIO]) – File or filename to which the data is saved.

  • protocol (str) – jsonschema or protobuf

  • encoding (str) – encoding used to save data into a JSON file. By default, utf-8 is used.

Return type:

None

shuffle(seed=None)#

Randomly shuffle documents within the DocumentArray.

Parameters:

seed (Optional[int]) – initializes the random number generator; None by default. If set, the random state is fixed so the shuffle is reproducible.

Return type:

DocumentArray

Returns:

The shuffled Documents, represented as a DocumentArray.

split_by_tag(tag)#

Split the DocumentArray into multiple DocumentArrays according to the tag value of each Document.

Parameters:

tag (str) – the name of the tag in tags to split by.

Return type:

Dict[Any, DocumentArray]

Returns:

a dict where Documents with the same value on tag are grouped together; their order is preserved from the original DocumentArray.

Note

If the tags of a Document do not contain the specified tag, an empty dict is returned.
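
For example:

from docarray import Document, DocumentArray

da = DocumentArray(
    [
        Document(tags={'category': 'a'}),
        Document(tags={'category': 'b'}),
        Document(tags={'category': 'a'}),
    ]
)
groups = da.split_by_tag('category')
print({k: len(v) for k, v in groups.items()})
{'a': 2, 'b': 1}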

summary()#

Print the structure and attribute summary of this DocumentArray object.

Warning

Calling summary() on a large DocumentArray can be slow.

property tensors: Optional[ArrayType]#

Return an ArrayType stacking all tensors.

The tensor attributes are stacked together along a newly created first dimension (as if you would stack using np.stack(X, axis=0)).

Warning

This operation assumes all tensors have the same shape and dtype. All dtype and shape values are assumed to be equal to the values of the first element in the DocumentArray

Return type:

Optional[ArrayType]

Returns:

an ArrayType of tensors

property texts: Optional[List[str]]#

Get text of all Documents

Return type:

Optional[List[str]]

Returns:

a list of texts

to_base64(protocol='pickle-array', compress=None, _show_progress=False)#
Return type:

str

to_bytes(protocol='pickle-array', compress=None, _file_ctx=None, _show_progress=False)#

Serialize itself into bytes.

For more Pythonic code, please use bytes(...).

Parameters:
  • _file_ctx (Optional[BinaryIO]) – File or filename or serialized bytes where the data is stored.

  • protocol (str) – protocol to use

  • compress (Optional[str]) – compress algorithm to use

  • _show_progress (bool) – show progress bar, only works when protocol is pickle or protobuf

Return type:

bytes

Returns:

the binary serialization in bytes
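
For example, a serialization round trip with from_bytes():

from docarray import Document, DocumentArray

da = DocumentArray([Document(text='hello')])
data = da.to_bytes(protocol='protobuf', compress='gzip')
da_restored = DocumentArray.from_bytes(data, protocol='protobuf', compress='gzip')
print(da_restored.texts)
['hello']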

to_dataframe(**kwargs)#

Export itself to a pandas.DataFrame object.

Parameters:

kwargs – the extra kwargs will be passed to pandas.DataFrame.from_dict().

Return type:

DataFrame

Returns:

a pandas.DataFrame object

to_dict(protocol='jsonschema', **kwargs)#

Convert the object into a Python list.

Parameters:

protocol (str) – jsonschema or protobuf

Return type:

List

Returns:

a Python list

to_json(protocol='jsonschema', **kwargs)#

Convert the object into a JSON string. Can be loaded via load_json().

Parameters:

protocol (str) – jsonschema or protobuf

Return type:

str

Returns:

a JSON string

to_list(protocol='jsonschema', **kwargs)#

Convert the object into a Python list.

Parameters:

protocol (str) – jsonschema or protobuf

Return type:

List

Returns:

a Python list

to_protobuf(ndarray_type=None)#

Convert DocumentArray into a Protobuf message.

Parameters:

ndarray_type (Optional[str]) – can be list or numpy; if set, it will force all ndarray-like objects from all Documents to List or numpy.ndarray.

Return type:

DocumentArrayProto

Returns:

the protobuf message

to_pydantic_model()#

Convert a DocumentArray object into a list of Pydantic models.

Return type:

List[PydanticDocument]

to_strawberry_type()#

Convert a DocumentArray object into a list of Strawberry types.

Return type:

List[StrawberryDocument]

traverse(traversal_paths, filter_fn=None)#

Return an Iterator of TraversableSequence of the leaves when applying the traversal_paths. Each TraversableSequence is either the root Documents, a ChunkArray or a MatchArray.

Parameters:
  • traversal_paths (str) – a comma-separated string that represents the traversal path

  • filter_fn (Optional[Callable[[Document], bool]]) – function to filter docs during traversal

Yield:

TraversableSequence of the leaves when applying the traversal_paths.

Example on traversal_paths:

  • r: docs in this TraversableSequence

  • m: all match-documents at adjacency 1

  • c: all child-documents at granularity 1

  • r.[attribute]: access attribute of a multi modal document

  • cc: all child-documents at granularity 2

  • mm: all match-documents at adjacency 2

  • cm: all match-documents at adjacency 1 and granularity 1

  • r,c: docs in this TraversableSequence and all child-documents at granularity 1

  • r[start:end]: access sub document array using slice

Return type:

Iterable[T]

traverse_flat(traversal_paths, filter_fn=None)#

Returns a single flattened TraversableSequence with all Documents that are reached via the traversal_paths.

Warning

When defining the traversal_paths with multiple paths, the returned Documents are determined at once and not on the fly. This is different behavior than in traverse() and traverse_flat_per_path()!

Parameters:
  • traversal_paths (str) – a comma-separated string that represents the traversal path

  • filter_fn (Optional[Callable[[Document], bool]]) – function to filter docs during traversal

Return type:

DocumentArray

Returns:

a single TraversableSequence containing the document of all leaves when applying the traversal_paths.
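
For example:

from docarray import Document, DocumentArray

da = DocumentArray([Document(text='root', chunks=[Document(text='chunk')])])
print(da.traverse_flat('r,c').texts)
['root', 'chunk']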

traverse_flat_per_path(traversal_paths, filter_fn=None)#

Returns a flattened TraversableSequence per path in traversal_paths with all Documents that are reached by the path.

Parameters:
  • traversal_paths (str) – a comma-separated string that represents the traversal path

  • filter_fn (Optional[Callable[[Document], bool]]) – function to filter docs during traversal

Yield:

TraversableSequence containing the document of all leaves per path.