Search over Nested Structure

To use find() on multimodal or nested Documents (a multimodal Document is intrinsically a nested Document), you need “subindices”. The name reflects what they do: each one adds a new sub-level of indexing to the DocumentArray and makes that level searchable.

Each subindex indexes and stores one nesting level (like '@c' or a custom modality like '@.[image]') and makes it directly searchable. Under the hood, subindices are fully fledged DocumentArrays with their own document store.

See also

To see subindices in action, check here.

Construct subindices

You can specify subindices when you create a DocumentArray by passing configuration for each desired subindex to the subindex_configs parameter:

from docarray import Document, DocumentArray, dataclass
from docarray.typing import Image, Text


@dataclass
class MyDocument:
    image: Image
    paragraph: Text


_docs = [
    Document(
        MyDocument(
            image='https://docs.docarray.org/_images/apple.png', paragraph='hello world'
        )
    )
    for _ in range(10)
]
da = DocumentArray(
    _docs,
    config={'n_dim': 256},
    storage='annlite',
    subindex_configs={'@.[image]': {'n_dim': 512}, '@.[paragraph]': {'n_dim': 128}},
)
╭───────────────────── Documents Summary ─────────────────────╮
│                                                             │
│   Length                    10                              │
│   Homogenous Documents      True                            │
│   Has nested Documents in   ('chunks',)                     │
│   Common Attributes         ('id', 'embedding', 'chunks')   │
│   Multimodal dataclass      True                            │
│                                                             │
╰─────────────────────────────────────────────────────────────╯
╭──────────────────────── Attributes Summary ────────────────────────╮
│                                                                    │
│   Attribute   Data type         #Unique values   Has empty value   │
│  ────────────────────────────────────────────────────────────────  │
│   chunks      ('ChunkArray',)   10               False             │
│   embedding   ('ndarray',)      10               False             │
│   id          ('str',)          10               False             │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
╭────── DocumentArrayAnnlite Config ──────╮
│                                         │
│   n_dim              256                │
│   metric             cosine             │
│   serialize_config   {}                 │
│   data_path          /tmp/tmp_w1yqmpc   │
│   ef_construction    None               │
│   ef_search          None               │
│   max_connection     None               │
│   columns            {}                 │
│                                         │
╰─────────────────────────────────────────╯

The same construction with chunk-based nesting ('@c') instead of dataclass modalities:

from docarray import Document, DocumentArray

_docs = [
    Document(
        text='hello world',
        chunks=[
            Document(
                uri='https://docs.docarray.org/_images/apple.png'
            ).load_uri_to_image_tensor()
        ],
    )
    for _ in range(10)
]
da = DocumentArray(
    _docs,
    config={'n_dim': 256},
    storage='annlite',
    subindex_configs={'@c': {'n_dim': 512}},
)
╭───────────────────────── Documents Summary ─────────────────────────╮
│                                                                     │
│   Length                    10                                      │
│   Homogenous Documents      True                                    │
│   Has nested Documents in   ('chunks',)                             │
│   Common Attributes         ('id', 'text', 'embedding', 'chunks')   │
│   Multimodal dataclass      False                                   │
│                                                                     │
╰─────────────────────────────────────────────────────────────────────╯
╭──────────────────────── Attributes Summary ────────────────────────╮
│                                                                    │
│   Attribute   Data type         #Unique values   Has empty value   │
│  ────────────────────────────────────────────────────────────────  │
│   chunks      ('ChunkArray',)   10               False             │
│   embedding   ('ndarray',)      10               False             │
│   id          ('str',)          10               False             │
│   text        ('str',)          1                False             │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
╭────── DocumentArrayAnnlite Config ──────╮
│                                         │
│   n_dim              256                │
│   metric             cosine             │
│   serialize_config   {}                 │
│   data_path          /tmp/tmp_iar4ofr   │
│   ef_construction    None               │
│   ef_search          None               │
│   max_connection     None               │
│   columns            {}                 │
│                                         │
╰─────────────────────────────────────────╯

The subindex_configs dictionary is structured as follows:

  • Keys: Each key in subindex_configs is the name of a subindex. It must be a valid DocumentArray access path (like '@.[image]', '@.[image, paragraph]', '@c', or '@cc').

  • Values: Each value in subindex_configs is the configuration of a subindex. It can be any valid configuration for the given DocumentArray type. Fields that are not given in the subindex configuration are inherited from the parent configuration.
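
The inheritance rule for values can be pictured as a dictionary merge. The following is a minimal sketch in plain Python, not DocArray's actual implementation: the helper name `resolve_subindex_configs` is hypothetical, and it assumes that a field set in a subindex configuration simply overrides the parent's value. Real backends may treat some fields, such as storage locations, specially.

```python
from typing import Any, Dict


def resolve_subindex_configs(
    parent_config: Dict[str, Any],
    subindex_configs: Dict[str, Dict[str, Any]],
) -> Dict[str, Dict[str, Any]]:
    """Hypothetical helper: fill in missing subindex fields from the parent."""
    resolved = {}
    for access_path, sub_config in subindex_configs.items():
        # Start from the parent's fields, then let the subindex override them.
        resolved[access_path] = {**parent_config, **sub_config}
    return resolved


parent = {'n_dim': 256, 'metric': 'cosine'}
subs = {'@.[image]': {'n_dim': 512}, '@.[paragraph]': {'n_dim': 128}}
effective = resolve_subindex_configs(parent, subs)
print(effective['@.[image]'])
print(effective['@.[paragraph]'])
```

In this sketch both subindices keep the parent's metric and differ only in n_dim, mirroring the configuration used in the examples above.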

Modify subindices

Once you’ve constructed a DocumentArray with subindices, modifying the parent DocumentArray automatically updates the subindices.

This means you can insert into it, extend it, delete from it, and so on, like any other DocumentArray:

import numpy as np

# construct DocumentArray with subindices
da = DocumentArray(
    config={'n_dim': 256},
    storage='annlite',
    subindex_configs={'@.[image]': {'n_dim': 512}, '@.[paragraph]': {'n_dim': 128}},
)
# extend with Documents, including embeddings
_docs = [
    Document(MyDocument(image='image.png', paragraph='hello world')) for _ in range(10)
]
for d in _docs:
    d.image.embedding = np.random.rand(512)
    d.paragraph.embedding = np.random.rand(128)
with da:
    da.extend(_docs)

The same with chunk-based nesting ('@c'):

import numpy as np

# construct DocumentArray with subindices
da = DocumentArray(
    config={'n_dim': 256},
    storage='annlite',
    subindex_configs={'@c': {'n_dim': 512}},
)
# extend with Documents, including embeddings
_docs = [
    Document(
        text='hello world',
        chunks=[Document(uri='image.png').load_uri_to_image_tensor()],
    )
    for _ in range(10)
]
for d in _docs:
    d.embedding = np.random.rand(256)
    d.chunks[0].embedding = np.random.rand(512)
with da:
    da.extend(_docs)
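
To make the idea of automatic updates concrete, here is a toy model in plain Python rather than DocArray itself. The classes `ToyDoc` and `ToyIndexedArray` are hypothetical and only mimic the behaviour described above for the '@c' path: extending or deleting on the parent is mirrored in the subindex.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Document with nested chunks."""

    id: str
    chunks: List['ToyDoc'] = field(default_factory=list)


class ToyIndexedArray:
    """Toy parent store that keeps a chunk-level subindex in sync."""

    def __init__(self) -> None:
        self.docs: Dict[str, ToyDoc] = {}
        # The subindex maps chunk id -> id of the top-level Document it belongs to.
        self.chunk_subindex: Dict[str, str] = {}

    def extend(self, docs: List[ToyDoc]) -> None:
        for doc in docs:
            self.docs[doc.id] = doc
            for chunk in doc.chunks:
                self.chunk_subindex[chunk.id] = doc.id

    def delete(self, doc_id: str) -> None:
        doc = self.docs.pop(doc_id)
        for chunk in doc.chunks:
            # Removing a parent also removes its chunks from the subindex.
            del self.chunk_subindex[chunk.id]


store = ToyIndexedArray()
store.extend([ToyDoc('a', [ToyDoc('a-0')]), ToyDoc('b', [ToyDoc('b-0'), ToyDoc('b-1')])])
store.delete('a')
print(sorted(store.chunk_subindex))
```

After the delete, only the chunks of Document 'b' remain in the toy subindex, without the subindex ever being touched directly.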

Search through subindices

You can search through a subindex using the on= keyword in find() and match():

# --- with dataclass modalities ---
# find best matching images using .find()
top_image_matches = da.find(query=np.random.rand(512), on='@.[image]')
# find best matching paragraphs using .match()
Document(embedding=np.random.rand(128)).match(da, on='@.[paragraph]')

# --- with chunks ---
# find best matching images using .find()
top_image_matches = da.find(query=np.random.rand(512), on='@c')
# find best matching images using .match()
Document(embedding=np.random.rand(512)).match(da, on='@c')
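
A query embedding needs the same dimensionality as the subindex it is searched on (512 for the image subindex and 128 for the paragraph subindex in the configuration above). The sketch below illustrates this with a brute-force cosine search in plain Python. `toy_find` is a hypothetical helper, not part of DocArray, and real backends such as AnnLite use approximate nearest-neighbour search instead.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_find(
    subindex: Dict[str, List[float]], query: Sequence[float], n_dim: int, limit: int = 2
) -> List[Tuple[str, float]]:
    """Hypothetical brute-force search over one subindex (id -> embedding)."""
    if len(query) != n_dim:
        raise ValueError(f'query has {len(query)} dims, subindex expects {n_dim}')
    scored = [(doc_id, cosine_similarity(query, emb)) for doc_id, emb in subindex.items()]
    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# Tiny 3-dimensional subindex, standing in for a 512-dimensional image subindex.
image_subindex = {
    'img-0': [1.0, 0.0, 0.0],
    'img-1': [0.0, 1.0, 0.0],
    'img-2': [0.9, 0.1, 0.0],
}
matches = toy_find(image_subindex, query=[1.0, 0.0, 0.0], n_dim=3)
print([doc_id for doc_id, _ in matches])
```

Passing a query of the wrong length raises a ValueError in this sketch, which is the toy equivalent of searching the paragraph subindex with an image-sized vector.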

Such a search will return Documents from the subindex. If you are interested in the top-level Documents associated with a match, you can retrieve them by setting return_root=True in find:

# with dataclass modalities
top_level_matches = da.find(query=np.random.rand(512), on='@.[image]', return_root=True)
# with chunks
top_level_matches = da.find(query=np.random.rand(512), on='@c', return_root=True)
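
Conceptually, return_root=True swaps each subindex match for the top-level Document it belongs to. The plain-Python sketch below illustrates that lookup using a `_root_id_` tag. It is an illustration under assumptions, not DocArray's implementation: `matches_to_roots` is a hypothetical helper and Documents are modelled as plain dictionaries.

```python
from typing import Any, Dict, List


def matches_to_roots(
    matches: List[Dict[str, Any]], roots_by_id: Dict[str, Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Hypothetical helper: swap each subindex match for its top-level Document."""
    roots = []
    for match in matches:
        root_id = match.get('tags', {}).get('_root_id_')
        if root_id is None:
            # Without a root reference there is nothing to map the match back to.
            raise ValueError(f"match {match['id']!r} has no '_root_id_' tag")
        roots.append(roots_by_id[root_id])
    return roots


roots_by_id = {'doc-0': {'id': 'doc-0', 'text': 'hello world'}}
chunk_matches = [{'id': 'chunk-0', 'tags': {'_root_id_': 'doc-0'}}]
print(matches_to_roots(chunk_matches, roots_by_id))
```

The lookup only works when every match carries a reference to its root, which is why that reference matters for Documents added to a subindex by hand.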

Note

When you add or change Documents directly on a subindex, the _root_id_ (or parent_id for DocumentArrayInMemory) of new Documents should be set manually for return_root=True to work:

# extend() expects an iterable of Documents, so wrap the single Document in a list
da['@c'].extend(
    [Document(embedding=np.random.random(512), tags={'_root_id_': 'your_root_id'})]
)