Elasticsearch#
You can use Elasticsearch as a document store for DocumentArray. It’s suitable for faster Document retrieval on embeddings, i.e. .match(), .find().
Tip
This feature requires elasticsearch. You can install it via pip install "docarray[elasticsearch]".
Usage#
Start Elastic service#
To use Elasticsearch as the storage backend, the Elasticsearch service must be running. Create docker-compose.yml as follows:
version: "3.3"
services:
  elastic:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.2.0
    environment:
      - xpack.security.enabled=false
      - discovery.type=single-node
    ports:
      - "9200:9200"
    networks:
      - elastic
networks:
  elastic:
    name: elastic
Then run:
pip install -U "docarray[elasticsearch]"
docker-compose up
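The container can take a little while to start accepting requests. The following is a minimal sketch of a readiness check that uses only the Python standard library. The function name wait_for_elasticsearch and its retries, delay and fetch parameters are hypothetical, and the check assumes security is disabled as in the compose file above, so that the root endpoint answers over plain HTTP.

```python
import time
import urllib.request


def wait_for_elasticsearch(url='http://localhost:9200', retries=30, delay=2.0,
                           fetch=urllib.request.urlopen):
    """Poll `url` until it answers with HTTP 200 or all attempts are used up.

    `fetch` is injectable so the function can be exercised without a server.
    """
    for attempt in range(retries):
        try:
            with fetch(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except OSError:
            # URLError and connection errors are OSError subclasses:
            # the service is not reachable yet.
            pass
        if attempt < retries - 1:
            time.sleep(delay)
    return False


# Example call once `docker-compose up` is running:
# wait_for_elasticsearch()
```

Calling wait_for_elasticsearch() before constructing the DocumentArray avoids a connection error while the service is still booting.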
Create DocumentArray with Elasticsearch backend#
Assuming the service is started using the default configuration (i.e. the server address is http://localhost:9200), you can instantiate a DocumentArray with Elasticsearch storage as follows:
from docarray import DocumentArray
da = DocumentArray(storage='elasticsearch', config={'n_dim': 128})
The usage is the same as for an ordinary DocumentArray, but the dimension of the embeddings must be provided at creation time.
Secure connection#
By default, the Elasticsearch server runs with a security layer that disables plain HTTP connections. You can pass the host with api_id or ca_certs inside es_config to the constructor. For example:
from docarray import DocumentArray

da = DocumentArray(
    storage='elasticsearch',
    config={
        'hosts': 'https://elastic:PRq7je_hJ4i4auh+Hq+*@localhost:9200',
        'n_dim': 128,
        'es_config': {'ca_certs': '/Users/hanxiao/http_ca.crt'},
    },
)
See the official Elasticsearch documentation for how to obtain the certificate, password, etc.
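To keep credentials out of source code, the config can be assembled from environment variables. This is only a sketch: the helper secure_config_from_env and the variable names ES_HOSTS and ES_CA_CERTS are hypothetical choices for illustration, while the keys hosts, n_dim, es_config and ca_certs are the ones used in the example above.

```python
import os


def secure_config_from_env(n_dim, env=os.environ):
    """Build a DocumentArray config dict from environment variables.

    ES_HOSTS and ES_CA_CERTS are hypothetical names chosen for this sketch.
    """
    # ES_HOSTS holds the full URL, e.g. https://user:password@localhost:9200
    config = {'hosts': env['ES_HOSTS'], 'n_dim': n_dim}
    ca_certs = env.get('ES_CA_CERTS')
    if ca_certs:
        # Only add es_config when a certificate path is actually provided.
        config['es_config'] = {'ca_certs': ca_certs}
    return config


# Usage (requires a running, secured Elasticsearch service):
# from docarray import DocumentArray
# da = DocumentArray(storage='elasticsearch', config=secure_config_from_env(128))
```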
To access a previously persisted DocumentArray, you can specify index_name and the hosts.
The following example builds a DocumentArray with previously stored data from old_stuff on http://localhost:9200:
from docarray import DocumentArray, Document

da = DocumentArray(
    storage='elasticsearch',
    config={'index_name': 'old_stuff', 'n_dim': 128},
)

with da:
    da.extend([Document() for _ in range(1000)])

da2 = DocumentArray(
    storage='elasticsearch',
    config={'index_name': 'old_stuff', 'n_dim': 128},
)

da2.summary()
Documents Summary
Length 2000
Homogenous Documents True
Common Attributes ('id', 'embedding')
Attributes Summary
Attribute Data type #Unique values Has empty value
─────────────────────────────────────────────────────────────
embedding ('ndarray',) 1000 False
id ('str',) 1000 False
Storage Summary
Backend ElasticSearch
Host http://localhost:9200
Distance cosine
Vector dimension 128
ES config {}
[0.14890289 0.3168339 0.03050802 0.06785086 0.94719299 0.32490566
...]
Other functions behave the same as in-memory DocumentArray.
Bulk request customization#
You can customize how bulk requests are sent to Elasticsearch when adding Documents by passing additional kwargs to the extend method call. See the official Elasticsearch documentation for more details. For example:
from docarray import Document, DocumentArray
import numpy as np

n_dim = 3

da = DocumentArray(
    storage='elasticsearch',
    config={'n_dim': n_dim, 'columns': {'price': 'int'}, 'distance': 'l2_norm'},
)

with da:
    da.extend(
        [
            Document(id=f'r{i}', embedding=i * np.ones(n_dim), tags={'price': i})
            for i in range(10)
        ],
        thread_count=4,
        chunk_size=500,
        max_chunk_bytes=104857600,
        queue_size=4,
    )
Note
The batch_size configuration will be overridden by the chunk_size kwarg if provided.
Tip
You can read more about the parallel bulk configuration and its default values in the Elasticsearch Python client documentation.
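As a rough illustration of what chunk_size controls, the sketch below splits a sequence of items into groups of at most chunk_size, which is how many Documents would go into one bulk request. The helper iter_chunks is hypothetical and is not the Elasticsearch client's implementation; the real helper also limits each request by max_chunk_bytes and spreads requests over thread_count threads.

```python
from itertools import islice


def iter_chunks(items, chunk_size):
    """Yield successive lists of at most `chunk_size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# 1200 items with chunk_size=500 are split into three groups.
chunk_lengths = [len(chunk) for chunk in iter_chunks(range(1200), chunk_size=500)]
print(chunk_lengths)  # [500, 500, 200]

# The max_chunk_bytes value used above is exactly 100 MiB.
print(104857600 == 100 * 1024 * 1024)  # True
```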
Vector search with filter query#
You can perform Approximate Nearest Neighbor Search and pre-filter results using a filter query that follows Elasticsearch’s DSL.
Consider Documents with embeddings [0,0,0] up to [9,9,9], where the Document with embedding [i,i,i] has a tag price with value i. We can create such an example with the following code:
from docarray import Document, DocumentArray
import numpy as np

n_dim = 3

da = DocumentArray(
    storage='elasticsearch',
    config={'n_dim': n_dim, 'columns': {'price': 'int'}, 'distance': 'l2_norm'},
)

with da:
    da.extend(
        [
            Document(id=f'r{i}', embedding=i * np.ones(n_dim), tags={'price': i})
            for i in range(10)
        ]
    )

print('\nIndexed Prices:\n')
for embedding, price in zip(da.embeddings, da[:, 'tags__price']):
    print(f'\tembedding={embedding},\t price={price}')
Suppose we want the nearest vectors to the embedding [8. 8. 8.], with the restriction that prices must satisfy a filter. As an example, let’s require that retrieved Documents have a price value lower than or equal to max_price. We can encode this information in Elasticsearch using filter = {'range': {'price': {'lte': max_price}}}.
Then the search with the proposed filter can be implemented and used with the following code.
Note
For Elasticsearch, the distance scores can be accessed in the Document’s .scores dictionary under the key 'score'.
max_price = 7
n_limit = 4
np_query = np.ones(n_dim) * 8
print(f'\nQuery vector: \t{np_query}')
filter = {'range': {'price': {'lte': max_price}}}
results = da.find(np_query, filter=filter, limit=n_limit)
print('\nEmbeddings Nearest Neighbours with "price" at most 7:\n')
for embedding, price in zip(results.embeddings, results[:, 'tags__price']):
    print(f'\tembedding={embedding},\t price={price}')
This would print:
Embeddings Nearest Neighbours with "price" at most 7:
embedding=[7. 7. 7.], price=7
embedding=[6. 6. 6.], price=6
embedding=[5. 5. 5.], price=5
embedding=[4. 4. 4.], price=4
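To see why these four Documents come back, the same query can be reproduced with an exact, brute-force search in plain Python. This is a sketch for checking the expected result only: the function filtered_nearest is hypothetical and does not reflect how Elasticsearch runs approximate kNN internally. It applies the price filter first and then ranks the remaining Documents by Euclidean distance, matching the l2_norm distance configured above.

```python
import math

# Same toy data as above: the embedding [i, i, i] carries price i.
docs = [{'id': f'r{i}', 'embedding': [float(i)] * 3, 'price': i} for i in range(10)]


def filtered_nearest(docs, query, max_price, limit):
    """Exact search: drop documents over max_price, then sort by Euclidean distance."""
    candidates = [d for d in docs if d['price'] <= max_price]
    candidates.sort(key=lambda d: math.dist(d['embedding'], query))
    return candidates[:limit]


nearest = filtered_nearest(docs, query=[8.0, 8.0, 8.0], max_price=7, limit=4)
print([d['price'] for d in nearest])  # [7, 6, 5, 4]
```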
Additionally, you can tune the approximate kNN for speed or accuracy by providing the num_candidates kwarg when calling the find method:
results = da.find(np_query, filter=filter, limit=n_limit, num_candidates=100)
Tip
You can read more about approximate kNN tuning in the Elasticsearch documentation.
Search by filter query#
You can search with user-defined query filters using the .find method. Such queries can be constructed following the guidelines in Elasticsearch’s documentation.
Suppose you store Documents with a certain tag price in Elasticsearch and you want to retrieve all Documents with price lower than or equal to some max_price value.
You can index such Documents as follows:
from docarray import Document, DocumentArray

n_dim = 3

da = DocumentArray(
    storage='elasticsearch',
    config={
        'n_dim': n_dim,
        'columns': {'price': 'float'},
    },
)

with da:
    da.extend([Document(id=f'r{i}', tags={'price': i}) for i in range(10)])

print('\nIndexed Prices:\n')
for price in da[:, 'tags__price']:
    print(f'\t price={price}')
Then you can retrieve all Documents whose price is lower than or equal to max_price by applying the following filter:
max_price = 3

filter = {
    'range': {
        'price': {
            'lte': max_price,
        }
    }
}
results = da.find(filter=filter)

print('\n Returned examples that satisfy condition "price at most 3":\n')
for price in results[:, 'tags__price']:
    print(f'\t price={price}')
This would print:
Returned examples that satisfy condition "price at most 3":
price=0
price=1
price=2
price=3
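Range filters like this one are plain dictionaries, so they can be built by a small helper. The sketch below uses a hypothetical function range_filter. The gte and lte keys are standard range bounds in Elasticsearch's query DSL; the example above only uses lte.

```python
def range_filter(field, gte=None, lte=None):
    """Return an Elasticsearch range query dict for `field`."""
    bounds = {}
    if gte is not None:
        bounds['gte'] = gte
    if lte is not None:
        bounds['lte'] = lte
    if not bounds:
        raise ValueError('at least one of gte or lte is required')
    return {'range': {field: bounds}}


print(range_filter('price', lte=3))  # {'range': {'price': {'lte': 3}}}
print(range_filter('price', gte=2, lte=5))  # {'range': {'price': {'gte': 2, 'lte': 5}}}
```

The returned dict can be passed as the filter argument, e.g. da.find(filter=range_filter('price', lte=3)).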
Search by .text field#
Text search can be easily leveraged in a DocumentArray with storage='elasticsearch'.
To do this, text needs to be indexed using the boolean flag 'index_text', which is set when the DocumentArray is created with config={'index_text': True, ...}.
The following example builds a DocumentArray with several Documents containing text and searches for those that have pizza in their text description.
from docarray import DocumentArray, Document

da = DocumentArray(storage='elasticsearch', config={'n_dim': 2, 'index_text': True})

with da:
    da.extend(
        [
            Document(text='Person eating'),
            Document(text='Person eating pizza'),
            Document(text='Pizza restaurant'),
        ]
    )

pizza_docs = da.find('pizza')
print(pizza_docs[:, 'text'])
This will print:
['Pizza restaurant', 'Person eating pizza']
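Note that the query pizza also finds 'Pizza restaurant', so matching is case-insensitive and works on whole words. The sketch below imitates that behaviour in plain Python as a mental model only. The functions tokens and matching_texts are hypothetical, the "at least one shared word" rule is an assumption about the query type used, and Elasticsearch's real text analysis and relevance scoring are more involved, so results come back here in input order rather than by relevance.

```python
import re


def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r'\w+', text.lower())


def matching_texts(texts, query):
    """Return the texts sharing at least one token with the query, in input order."""
    wanted = set(tokens(query))
    return [t for t in texts if wanted & set(tokens(t))]


texts = ['Person eating', 'Person eating pizza', 'Pizza restaurant']
print(matching_texts(texts, 'pizza'))  # ['Person eating pizza', 'Pizza restaurant']
```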
Config#
The following configs can be set:
Name | Description | Default
---|---|---
hosts | Hostname of the Elasticsearch server | http://localhost:9200
es_config | Other Elasticsearch configs, given as a dict and passed to the Elasticsearch client | None
index_name | Elasticsearch index name; the class name of Elasticsearch index object to set this DocumentArray | None
n_dim | Dimensionality of the embeddings | None
distance | Similarity metric in Elasticsearch | cosine
ef_construction | The size of the dynamic list for the nearest neighbors. | None (uses the Elasticsearch default)
m | The number of neighbors each node is connected to in the HNSW graph | None (uses the Elasticsearch default)
index_text | Boolean flag indicating whether to index .text | False
tag_indices | List of tags to index | False
batch_size | Batch size used to handle storage refreshes/updates | 64
list_like | Controls if ordering of Documents is persisted in the Database. Disabling this breaks list-like features, but can improve performance. | True
root_id | Boolean flag indicating whether to store root_id in the tags of chunk-level Documents | True
Tip
You can read more about HNSW parameters and their default values in the Elasticsearch documentation.
Tip
Note that it is plural hosts, not host, to comply with the Elasticsearch client’s interface.
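Two mistakes are easy to make with this config: writing host instead of hosts, and leaving out n_dim, which must be given at creation time. A small pre-flight check can catch both before the DocumentArray is constructed. This is a sketch with a hypothetical helper check_config; it is not part of docarray and does not validate the other options.

```python
def check_config(config):
    """Raise ValueError for two easy-to-make mistakes in a config dict."""
    if 'host' in config and 'hosts' not in config:
        raise ValueError("unknown key 'host'; did you mean 'hosts'?")
    n_dim = config.get('n_dim')
    if not isinstance(n_dim, int) or n_dim <= 0:
        raise ValueError("'n_dim' must be a positive integer")
    return config


# Usage (requires a running Elasticsearch service):
# from docarray import DocumentArray
# da = DocumentArray(storage='elasticsearch', config=check_config({'n_dim': 128}))
```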