Add New Document Store

DocumentArray can easily be extended to support a new document store. As we have seen in the previous chapters, a document store can be a SQL/NoSQL/vector database, or even an in-memory data structure.

For DocArray, the motivation for onboarding a new store is often:

  • having persistence that better fits the use case;

  • pulling from an existing data source;

  • supporting advanced query languages, e.g. nearest-neighbor retrieval.

For the database vendor, the motivation is often:

  • having a powerful, well-designed and well-maintained Python client for your document store;

  • plugging your document store into the Jina AI ecosystem (e.g. Jina, Hub, CLIP-as-service, Finetuner, etc.) and creating synergy with Jina AI.

After the extension, users can enjoy the convenient and powerful DocumentArray API on top of your document store. The user experience is the same as with a regular DocumentArray, so no extra learning is required.

This chapter walks you through adding a new document store. Specifically, we extend DocumentArray to support a new document store called mydocstore. The final usage looks like the following:

from docarray import DocumentArray

da = DocumentArray(storage='mydocstore', config={...})

Let’s get started!

Step 1: create the folder

Go to the docarray/array/storage folder and create a sub-folder for your document store. Let’s call it mydocstore. You need to create four empty files in that folder:

README.md
docarray
    |
    |--- array
            |
            |--- storage
                    |
                    |--- mydocstore
                            |
                            |--- __init__.py
                            |--- getsetdel.py
                            |--- seqlike.py
                            |--- backend.py

These four files contain the interfaces required to make the extension work with DocumentArray. Additionally, if your storage backend supports approximate nearest-neighbor search, you can include another file, find.py.

Step 2: implement getsetdel.py

Your getsetdel.py should look like the following:

from docarray.array.storage.base.getsetdel import BaseGetSetDelMixin
from docarray import Document


class GetSetDelMixin(BaseGetSetDelMixin):
    def _get_doc_by_id(self, _id: str) -> 'Document':
        # to be implemented
        ...

    def _del_doc_by_id(self, _id: str):
        # to be implemented
        ...

    def _set_doc_by_id(self, _id: str, value: 'Document'):
        # to be implemented
        ...

    def _load_offset2ids(self):
        # to be implemented
        ...

    def _save_offset2ids(self):
        # to be implemented
        ...

You need to implement the above five functions. Three of them cover getting, setting, and deleting items via a string .id; the other two load and save the offset2ids mapping. They are essential for DocumentArray to work.

Note that DocumentArray maintains an offset2ids mapping to allow list-like behaviour. This mapping is inherited from BaseGetSetDelMixin. Therefore, if you also want to persist the ordering of Documents inside the storage, you need to implement methods that persist this mapping. However, the list-like structure implemented by Offset2ID can introduce performance bottlenecks, so you can disable this feature by passing a flag when constructing the backend.
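
To make the role of this mapping concrete, here is a minimal sketch. ToyOffset2ID is a hypothetical stand-in written for this illustration, not the real Offset2ID class, and the plain dict plays the role of an ID-keyed store that has no notion of order.

```python
from typing import Dict, List, Optional


class ToyOffset2ID:
    """Hypothetical stand-in for the offset-to-ID mapping: an ordered list of IDs."""

    def __init__(self, ids: Optional[List[str]] = None):
        self.ids: List[str] = list(ids or [])

    def get_id(self, offset: int) -> str:
        # Translate a list position into a Document ID.
        return self.ids[offset]

    def index(self, _id: str) -> int:
        # Translate a Document ID back into its list position.
        return self.ids.index(_id)


# The store is keyed by ID; insertion order here is deliberately scrambled.
store: Dict[str, str] = {'c': 'third', 'a': 'first', 'b': 'second'}
offset2ids = ToyOffset2ID(['a', 'b', 'c'])

# Integer indexing resolves offset -> ID -> stored item.
first_item = store[offset2ids.get_id(0)]
last_item = store[offset2ids.get_id(-1)]
```

Every positional access pays for this extra lookup, and the mapping has to be kept in sync on every write, which is why disabling it can help on large stores.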

In step 4 you will see how to read this flag (list_like) from the user. Here you have to use it to adapt some operations.

In your backend implementation, in getsetdel.py, construct the _offset2ids member variable by passing the list_like flag as follows:

    def _load_offset2ids(self):
        if self._list_like:
            ids = self._get_offset2ids_meta()
            self._offset2ids = Offset2ID(ids, list_like=self._list_like)
        else:
            self._offset2ids = Offset2ID([], list_like=self._list_like)

Note that this flag should be stored in self._list_like, so that other parts of the DocumentArray implementation can leverage it.

Keep in mind that _del_doc_by_id and _set_doc_by_id must not update offset2ids; this is handled for you at a higher level. Also, make sure that _set_doc_by_id performs an upsert operation and removes the old ID (_id) in case value.id is different from _id.
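
The contract described above can be sketched with a dict-backed toy. Everything in this block is an assumption made for illustration: ToyDocument stands in for docarray.Document, ToyGetSetDel does not inherit from the real BaseGetSetDelMixin, and _saved_ids merely imitates persisted ordering.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyDocument:
    """Hypothetical stand-in for docarray.Document; only the fields used here."""
    id: str
    text: str = ''


class ToyGetSetDel:
    """Dict-backed sketch of the five essentials; not the real mixin."""

    def __init__(self):
        self._store: Dict[str, ToyDocument] = {}  # plays the role of the database
        self._saved_ids: List[str] = []           # plays the role of persisted ordering
        self._offset2ids: List[str] = []

    def _get_doc_by_id(self, _id: str) -> ToyDocument:
        return self._store[_id]  # raises KeyError for unknown IDs

    def _del_doc_by_id(self, _id: str):
        # Only the store is touched; offset2ids is updated by the caller.
        del self._store[_id]

    def _set_doc_by_id(self, _id: str, value: ToyDocument):
        # Upsert: if the new Document carries a different ID, drop the old entry.
        if _id != value.id:
            self._store.pop(_id, None)
        self._store[value.id] = value

    def _load_offset2ids(self):
        self._offset2ids = list(self._saved_ids)

    def _save_offset2ids(self):
        self._saved_ids = list(self._offset2ids)


s = ToyGetSetDel()
s._set_doc_by_id('a', ToyDocument('a', 'hello'))    # insert
s._set_doc_by_id('a', ToyDocument('b', 'renamed'))  # replace under a new ID
remaining_ids = sorted(s._store)
```

After the second call only 'b' remains in the store, which is the behaviour the upsert requirement asks for.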

Tip

Let’s call the above five functions the essentials.

If you aim for high performance, it is recommended to implement the following methods directly instead of relying on your essentials: _get_docs_by_ids, _del_docs_by_ids, _clear_storage, _set_doc_value_pairs, _set_doc_value_pairs_nested, _set_docs_by_ids. You can get their full signatures from BaseGetSetDelMixin. These functions define more fine-grained get/set/delete logic that is frequently used in DocumentArray.

Implementing them is fully optional, and you may implement only some of them. Any method you do not implement falls back to a generic but slow version based on your five essentials.
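
The trade-off can be illustrated with a small sketch. The classes below are hypothetical and do not reproduce BaseGetSetDelMixin; round_trips simply counts simulated calls to a backend so the difference between the fallback and a bulk override is visible.

```python
from typing import Dict, Iterable, List


class ToyBase:
    """Hypothetical base: the batch method falls back to the single-item essential."""

    def _get_doc_by_id(self, _id: str) -> str:
        raise NotImplementedError

    def _get_docs_by_ids(self, ids: Iterable[str]) -> List[str]:
        # Generic but slow: one round trip per ID.
        return [self._get_doc_by_id(_id) for _id in ids]


class ToyStore(ToyBase):
    """Implements only the essential and inherits the slow batch fallback."""

    def __init__(self, rows: Dict[str, str]):
        self._rows = rows
        self.round_trips = 0  # counts simulated calls to the backend

    def _get_doc_by_id(self, _id: str) -> str:
        self.round_trips += 1
        return self._rows[_id]


class ToyBatchedStore(ToyStore):
    """Overrides the batch method with one simulated bulk query."""

    def _get_docs_by_ids(self, ids: Iterable[str]) -> List[str]:
        self.round_trips += 1
        return [self._rows[_id] for _id in ids]


rows = {'a': 'A', 'b': 'B', 'c': 'C'}
slow, fast = ToyStore(rows), ToyBatchedStore(rows)
slow_result = slow._get_docs_by_ids(['a', 'b', 'c'])
fast_result = fast._get_docs_by_ids(['a', 'b', 'c'])
```

Both stores return the same Documents, but the fallback issues one call per ID while the override issues a single call.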

See also

As a reference, check out how we implement GetSetDelMixin for SQLite.

Step 3: implement seqlike.py

Your seqlike.py should look like the following:

from typing import Iterable, Iterator, Union, TYPE_CHECKING
from docarray.array.storage.base.seqlike import BaseSequenceLikeMixin

if TYPE_CHECKING:
    from docarray import Document


class SequenceLikeMixin(BaseSequenceLikeMixin):
    def __eq__(self, other):
        ...

    def __contains__(self, x: Union[str, 'Document']):
        ...

    def __repr__(self):
        ...

    def __add__(self, other: Union['Document', Iterable['Document']]):
        ...

    def __len__(self):
        ...

    def insert(self, index: int, value: 'Document'):
        # Optional. By default, this adds a new item and updates offset2id;
        # if you want to customize this, make sure to handle offset2id
        ...

    def _append(self, value: 'Document'):
        # Optional. Override this if you have a better implementation than inserting at the last position
        ...

    def _extend(self, values: Iterable['Document']) -> None:
        # Optional. Override this if you have better implementation than appending one by one
        ...

    def __iter__(self) -> Iterator['Document']:
        # Optional. By default, this relies on offset2id to iterate
        ...

Most of the interfaces come from Python’s standard MutableSequence.

See also

As a reference, check out how we implement SequenceLikeMixin for SQLite.

To support the list-like feature, the list-like APIs should check the flag only at the points where the offset2id structure is used, as follows:

def _extend(self, docs: Iterable['Document']):
    da = DocumentArray(docs)
    for batch_of_docs in da.batch(self._config.batch_size):
        self._upload_batch(batch_of_docs)
        if self._list_like:
            self._offset2ids.extend(batch_of_docs[:, 'id'])
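
The same pattern can be tried without a real backend. The following is a self-contained sketch under stated assumptions: Documents are reduced to hypothetical (id, text) tuples, batched is a helper defined here for the example, and ToySeq with its uploads counter is invented to imitate a store that accepts bulk writes.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

Doc = Tuple[str, str]  # hypothetical (id, text) pair standing in for a Document


def batched(items: Iterable[Doc], size: int) -> Iterator[List[Doc]]:
    """Yield consecutive chunks of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


class ToySeq:
    def __init__(self, list_like: bool = True, batch_size: int = 2):
        self._list_like = list_like
        self._batch_size = batch_size
        self._store: Dict[str, str] = {}
        self._offset2ids: List[str] = []
        self.uploads = 0  # counts simulated bulk writes

    def _upload_batch(self, batch: List[Doc]):
        self._store.update(batch)
        self.uploads += 1

    def _extend(self, docs: Iterable[Doc]):
        for batch in batched(docs, self._batch_size):
            self._upload_batch(batch)
            # Only maintain the ordering structure when list-like is enabled.
            if self._list_like:
                self._offset2ids.extend(_id for _id, _ in batch)


docs = [('a', 'x'), ('b', 'y'), ('c', 'z')]
ordered, unordered = ToySeq(list_like=True), ToySeq(list_like=False)
ordered._extend(docs)
unordered._extend(docs)
```

Both instances end up holding the same three items; only the ordered one pays for recording their positions.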

Step 4: implement backend.py

Your backend.py should look like the following:

from typing import Optional, TYPE_CHECKING, Union, Dict
from dataclasses import dataclass

from docarray.array.storage.base.backend import BaseBackendMixin

if TYPE_CHECKING:
    from docarray.typing import (
        DocumentArraySourceType,
    )


@dataclass
class MyDocStoreConfig:
    config1: str
    config2: str
    config3: Dict
    ...


class BackendMixin(BaseBackendMixin):
    def _init_storage(
        self,
        _docs: Optional['DocumentArraySourceType'] = None,
        config: Optional[Union[MyDocStoreConfig, Dict]] = None,
        **kwargs
    ):
        super()._init_storage(_docs, config, **kwargs)
        ...

    def _ensure_unique_config(
        self,
        config_root: dict,
        config_subindex: dict,
        config_joined: dict,
        subindex_name: str,
    ) -> dict:
        ...  # ensure unique identifiers here
        return config_joined

MyDocStoreConfig is a dataclass that holds the configuration. You can expose arguments of your document store in this dataclass and allow users to customize them. In the _init_storage function, you need to parse config from either a MyDocStoreConfig object or a Dict.

To allow the disabling of list-like features, your configuration should accept the flag list_like as follows:

@dataclass
class MyDocStoreConfig:
    config1: str
    config2: str
    config3: Dict
    # Fields with a default value must come after fields without one.
    list_like: bool = True
    ...

By default, this should be set to True.

Further, you have to store the value of this flag in self._list_like. Some methods that are outside of your control take the value from there and use it appropriately.
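
One possible way to handle both config forms and the flag is sketched below. ToyConfig, its field names, parse_config, and ToyBackend are all hypothetical and exist only for this example; the real _init_storage has a different signature and more responsibilities.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional, Union


@dataclass
class ToyConfig:
    """Hypothetical config; the field names are placeholders."""
    table_name: str = 'default_table'
    batch_size: int = 64
    list_like: bool = True


def parse_config(config: Optional[Union[ToyConfig, Dict]] = None) -> ToyConfig:
    """Accept None, a ToyConfig, or a dict and always return a ToyConfig."""
    if config is None:
        return ToyConfig()
    if isinstance(config, ToyConfig):
        return config
    if isinstance(config, dict):
        known = {f.name for f in fields(ToyConfig)}
        unknown = set(config) - known
        if unknown:
            raise ValueError(f'unknown config keys: {sorted(unknown)}')
        return ToyConfig(**config)
    raise TypeError(f'unsupported config type: {type(config).__name__}')


class ToyBackend:
    def _init_storage(self, config: Optional[Union[ToyConfig, Dict]] = None):
        self._config = parse_config(config)
        # Expose the flag where other parts of the implementation expect it.
        self._list_like = self._config.list_like


backend = ToyBackend()
backend._init_storage({'table_name': 'docs', 'list_like': False})
```

Rejecting unknown keys early is a design choice of this sketch; it surfaces typos in user-supplied dicts instead of silently ignoring them.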

_init_storage is a very important function that is called during DocumentArray construction. You need to handle the different construction and copy behaviors in this function.

_ensure_unique_config is needed to support DocArray’s subindex feature. A subindex inherits its configuration from the root index, unless a field of the configuration is explicitly provided to the subindex. Usually, however, each table in a database has to have a unique identifier (e.g. ‘name’, ‘table_name’, ‘data_path’, etc.). To avoid clashes, you need to make sure that this identifier is actually unique between the parent and its subindices, despite the inheritance of configurations.
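
As an illustration, the following sketch derives a unique identifier for a subindex. It assumes that 'table_name' is the identifier that must be unique in this hypothetical store and that config_joined is the root config overlaid with the subindex config; the naming scheme and the subindex name 'chunks' are invented for the example.

```python
def ensure_unique_config(
    config_root: dict,
    config_subindex: dict,
    config_joined: dict,
    subindex_name: str,
) -> dict:
    # Assumption: 'table_name' is the field that must not clash.
    if 'table_name' not in config_subindex:
        # The subindex inherited the root's name, so derive a distinct one.
        config_joined['table_name'] = (
            f"{config_root['table_name']}_subindex_{subindex_name}"
        )
    return config_joined


root = {'table_name': 'docs', 'batch_size': 64}

# Case 1: the subindex only overrides batch_size and inherits the name.
sub = {'batch_size': 8}
inherited = ensure_unique_config(root, sub, {**root, **sub}, 'chunks')

# Case 2: the subindex explicitly sets its own name, which is respected.
sub_named = {'table_name': 'my_chunks'}
explicit = ensure_unique_config(root, sub_named, {**root, **sub_named}, 'chunks')
```

In the first case the joined config gets a derived name while keeping the overridden batch size; in the second case the user-provided name is left untouched.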

See also

As a reference, you can check out how we implement for SQLite here: BackendMixin.

Step 5: implement find.py

If your storage backend supports approximate nearest neighbor search, you can allow users to use this feature within docarray. To do so, add a find.py file that looks like the following:

from typing import TYPE_CHECKING, TypeVar, List, Union

if TYPE_CHECKING:
    import numpy as np

    # Define the expected input type that your ANN search supports
    MyDocumentStoreArrayType = TypeVar('MyDocumentStoreArrayType', np.ndarray, ...)


class FindMixin:
    def _find_similar_vectors(
        self, query: 'MyDocumentStoreArrayType', limit=10
    ) -> 'DocumentArray':
        """Expects a MyDocumentStoreArrayType vector query and should return a DocumentArray of results retrieved from
        the storage backend"""
        ...

    def _find(
        self, query: 'MyDocumentStoreArrayType', limit: int = 10, **kwargs
    ) -> Union['DocumentArray', List['DocumentArray']]:
        """Returns `limit` approximate nearest neighbors given a batch of input queries.
        If the query is a single query, should return a DocumentArray, otherwise a list of DocumentArrays containing
        the closest Documents for each query.
        """
        ...

Make sure to store the distance scores in the .scores dictionary of the Documents that are being returned with the distance value as key.
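
The single-versus-batch contract of _find can be sketched as follows. Note that this toy performs an exact, exhaustive cosine-distance scan rather than approximate search, represents results as hypothetical (id, distance) tuples instead of Documents, and assumes non-zero query and stored vectors.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Hit = Tuple[str, float]  # hypothetical result: (document id, cosine distance)


def cosine_distance(a: Vector, b: Vector) -> float:
    # Assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


class ToyFind:
    def __init__(self, embeddings: Dict[str, Vector]):
        self._embeddings = embeddings

    def _find_similar_vectors(self, query: Vector, limit: int = 10) -> List[Hit]:
        # Exhaustive scan; a real backend would query its ANN index here.
        hits = [
            (_id, cosine_distance(query, emb))
            for _id, emb in self._embeddings.items()
        ]
        hits.sort(key=lambda hit: hit[1])  # smallest distance first
        return hits[:limit]

    def _find(self, query, limit: int = 10):
        # A flat sequence of numbers is one query; a sequence of sequences is a batch.
        if isinstance(query[0], (int, float)):
            return self._find_similar_vectors(query, limit)
        return [self._find_similar_vectors(q, limit) for q in query]


index = ToyFind({'x': [1.0, 0.0], 'y': [0.0, 1.0], 'z': [1.0, 1.0]})
single = index._find([1.0, 0.0], limit=2)
batch = index._find([[1.0, 0.0], [0.0, 1.0]], limit=1)
```

A single query yields one result list, while a batch yields one result list per query, mirroring the return type described in the docstring above.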

Step 6: summarize everything in __init__.py

Your __init__.py should look like the following:

from abc import ABC

from .backend import BackendMixin, MyDocStoreConfig
from .getsetdel import GetSetDelMixin
from .seqlike import SequenceLikeMixin

__all__ = ['StorageMixins', 'MyDocStoreConfig']


class StorageMixins(BackendMixin, GetSetDelMixin, SequenceLikeMixin, ABC):
    ...

Copying and pasting it as-is should work.

If you have implemented a find.py module, make sure to also import FindMixin from it (from .find import FindMixin) and inherit from it:

class StorageMixins(FindMixin, BackendMixin, GetSetDelMixin, SequenceLikeMixin, ABC):
    ...

Step 7: subclass from DocumentArray

Create a file mydocstore.py under docarray/array/:

README.md
docarray
    |
    |--- array
            |
            |--- mydocstore.py
            |--- storage
                    |
                    |--- mydocstore
                            |
                            |--- __init__.py
                            |--- getsetdel.py
                            |--- seqlike.py
                            |--- backend.py

The file content should look like the following:

from .document import DocumentArray

from .storage.mydocstore import StorageMixins, MyDocStoreConfig

__all__ = ['MyDocStoreConfig', 'DocumentArrayMyDocStore']


class DocumentArrayMyDocStore(StorageMixins, DocumentArray):
    def __new__(cls, *args, **kwargs):
        return super().__new__(cls)

Step 8: add entrypoint to DocumentArray

We are almost there! Now we need to add the entrypoint to the DocumentArray constructor so that users can select the mydocstore backend as follows:

from docarray import DocumentArray

da = DocumentArray(storage='mydocstore')

Go to docarray/array/document.py and add mydocstore there:

class DocumentArray(AllMixins, BaseDocumentArray):
    
    ...
    
    def __new__(cls, *args, storage: str = 'memory', **kwargs) -> 'DocumentArrayLike':
        if cls is DocumentArray:
            if storage == 'mydocstore':
                from .mydocstore import DocumentArrayMyDocStore

                instance = super().__new__(DocumentArrayMyDocStore)
            elif storage == 'memory':
                from .memory import DocumentArrayInMemory
                ...  

Done! Now DocumentArray(storage='mydocstore') should give you a DocumentArrayMyDocStore instance!
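
The dispatch mechanism in steps 7 and 8 relies on __new__ choosing the class of the object being created. The following self-contained sketch reproduces that pattern with toy classes; ToyArray, ToyArrayInMemory, and ToyArrayMyDocStore are hypothetical names, and the error message is an assumption rather than DocArray's actual wording.

```python
class ToyArray:
    """Hypothetical stand-in for DocumentArray."""

    def __new__(cls, *args, storage: str = 'memory', **kwargs):
        if cls is ToyArray:
            # Called as ToyArray(...): pick the concrete subclass from `storage`.
            if storage == 'mydocstore':
                instance = super().__new__(ToyArrayMyDocStore)
            elif storage == 'memory':
                instance = super().__new__(ToyArrayInMemory)
            else:
                raise ValueError(f'storage=`{storage}` is not supported.')
        else:
            # Called on a subclass directly: create that subclass.
            instance = super().__new__(cls)
        return instance

    def __init__(self, *args, **kwargs):
        # Accept and ignore constructor arguments in this sketch.
        pass


class ToyArrayInMemory(ToyArray):
    pass


class ToyArrayMyDocStore(ToyArray):
    def __new__(cls, *args, **kwargs):
        # Drop `storage` and other arguments before delegating upwards.
        return super().__new__(cls)


via_entrypoint = ToyArray(storage='mydocstore')
default = ToyArray()
direct = ToyArrayMyDocStore()
```

Because the returned object is an instance of a ToyArray subclass, Python still runs __init__ on it, so both the entrypoint and the direct subclass construction produce fully initialized objects.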

On pull request: add tests and type-hint

You are welcome to contribute your extension back to DocArray. You need to include DocumentArrayMyDocStore in at least the following tests:

tests/unit/array/test_advance_indexing.py
tests/unit/array/test_sequence.py
tests/unit/array/test_construct.py

Please also add an @overload type hint to docarray/array/document.py:

class DocumentArray(AllMixins, BaseDocumentArray):
    ...

    @overload
    def __new__(
        cls,
        _docs: Optional['DocumentArraySourceType'] = None,
        storage: str = 'mydocstore',
        config: Optional[Union['MyDocStoreConfig', Dict]] = None,
    ) -> 'DocumentArrayMyDocStore':
        """Create a MyDocStore-powered DocumentArray object."""
        ...

Now you are ready to commit the contribution and open a pull request.