Elasticsearch#
You can use Elasticsearch as a document store for DocumentArray. It’s suitable for faster Document retrieval on embeddings, i.e. .match(), .find().
Tip
This feature requires elasticsearch. You can install it via pip install "docarray[elasticsearch]".
Usage#
Start Elastic service#
To use Elasticsearch as the storage backend, it is required to have the Elasticsearch service started. Create docker-compose.yml as follows:
version: "3.3"
services:
elastic:
image: docker.elastic.co/elasticsearch/elasticsearch:8.2.0
environment:
- xpack.security.enabled=false
- discovery.type=single-node
ports:
- "9200:9200"
networks:
- elastic
networks:
elastic:
name: elastic
Then
pip install -U docarray[elasticsearch]
docker-compose up
Create DocumentArray with Elasticsearch backend#
Assuming service is started using the default configuration (i.e. server address is http://localhost:9200), you can instantiate a DocumentArray with Elasticsearch storage as such:
from docarray import DocumentArray
da = DocumentArray(storage='elasticsearch', config={'n_dim': 128})
The usage would be the same as the ordinary DocumentArray, but the dimension of an embedding for a Document must be provided at creation time.
Secure connection#
By default, Elasticsearch server runs with security layer that disables the plain HTTP connection. You can pass the host with api_id or ca_certs inside es_config to the constructor. For example,
from docarray import DocumentArray
da = DocumentArray(
storage='elasticsearch',
config={
'hosts': 'https://elastic:PRq7je_hJ4i4auh+Hq+*@localhost:9200',
'n_dim': 128,
'es_config': {'ca_certs': '/Users/hanxiao/http_ca.crt'},
},
)
Here is the official Documentation for you to get certificate, password etc.
To access a DocumentArray formerly persisted, you can specify index_name and the hosts.
The following example will build a DocumentArray with previously stored data from old_stuff on http://localhost:9200:
from docarray import DocumentArray, Document
da = DocumentArray(
storage='elasticsearch',
config={'index_name': 'old_stuff', 'n_dim': 128},
)
with da:
da.extend([Document() for _ in range(1000)])
da2 = DocumentArray(
storage='elasticsearch',
config={'index_name': 'old_stuff', 'n_dim': 128},
)
da2.summary()
Documents Summary
Length 2000
Homogenous Documents True
Common Attributes ('id', 'embedding')
Attributes Summary
Attribute Data type #Unique values Has empty value
─────────────────────────────────────────────────────────────
embedding ('ndarray',) 1000 False
id ('str',) 1000 False
Storage Summary
Backend ElasticSearch
Host http://localhost:9200
Distance cosine
Vector dimension 128
ES config {}
[0.14890289 0.3168339 0.03050802 0.06785086 0.94719299 0.32490566
...]
Other functions behave the same as in-memory DocumentArray.
Bulk request customization#
You can customize how bulk requests is being sent to Elasticsearch when adding documents by adding additional kwargs on extend method call. See the official Documentation for more details. See the following code for example:
from docarray import Document, DocumentArray
import numpy as np
n_dim = 3
da = DocumentArray(
storage='elasticsearch',
config={'n_dim': 3, 'columns': {'price': 'int'}, 'distance': 'l2_norm'},
)
with da:
da.extend(
[
Document(id=f'r{i}', embedding=i * np.ones(n_dim), tags={'price': i})
for i in range(10)
],
thread_count=4,
chunk_size=500,
max_chunk_bytes=104857600,
queue_size=4,
)
Note
batch_size configuration will be overriden by chunk_size kwargs if provided
Tip
You can read more about parallel bulk config and their default values here
Vector search with filter query#
You can perform Approximate Nearest Neighbor Search and pre-filter results using a filter query that follows ElasticSearch’s DSL.
Consider Documents with embeddings [0,0,0] up to [9,9,9] where the document with embedding [i,i,i]
has as tag price with value i. We can create such example with the following code:
from docarray import Document, DocumentArray
import numpy as np
n_dim = 3
da = DocumentArray(
storage='elasticsearch',
config={'n_dim': n_dim, 'columns': {'price': 'int'}, 'distance': 'l2_norm'},
)
with da:
da.extend(
[
Document(id=f'r{i}', embedding=i * np.ones(n_dim), tags={'price': i})
for i in range(10)
]
)
print('\nIndexed Prices:\n')
for embedding, price in zip(da.embeddings, da[:, 'tags__price']):
print(f'\tembedding={embedding},\t price={price}')
Consider we want the nearest vectors to the embedding [8. 8. 8.], with the restriction that
prices must follow a filter. As an example, let’s consider that retrieved documents must have price value lower
or equal than max_price. We can encode this information in ElasticSearch using filter = {'range': {'price': {'lte': max_price}}}.
Then the search with the proposed filter can be implemented and used with the following code.
Note
For Elasticsearch, the distance scores can be accessed in the Document’s .scores dictionary under the key 'score'.
max_price = 7
n_limit = 4
np_query = np.ones(n_dim) * 8
print(f'\nQuery vector: \t{np_query}')
filter = {'range': {'price': {'lte': max_price}}}
results = da.find(np_query, filter=filter, limit=n_limit)
print('\nEmbeddings Nearest Neighbours with "price" at most 7:\n')
for embedding, price in zip(results.embeddings, results[:, 'tags__price']):
print(f'\tembedding={embedding},\t price={price}')
This would print:
Embeddings Nearest Neighbours with "price" at most 7:
embedding=[7. 7. 7.], price=7
embedding=[6. 6. 6.], price=6
embedding=[5. 5. 5.], price=5
embedding=[4. 4. 4.], price=4
Additionally you can tune the approximate kNN for speed or accuracy by providing num_candidates kwarg when calling the find method:
results = da.find(np_query, filter=filter, limit=n_limit, num_candidates=100)
Tip
You can read more about approximate kNN tuning here
Search by filter query#
You can search with user-defined query filters using the .find method. Such queries can be constructed following the
guidelines in ElasticSearch’s Documentation.
Consider you store Documents with a certain tag price into ElasticSearch and you want to retrieve all Documents
with price lower or equal to some max_price value.
You can index such Documents as follows:
from docarray import Document, DocumentArray
n_dim = 3
da = DocumentArray(
storage='elasticsearch',
config={
'n_dim': n_dim,
'columns': {'price': 'float'},
},
)
with da:
da.extend([Document(id=f'r{i}', tags={'price': i}) for i in range(10)])
print('\nIndexed Prices:\n')
for price in da[:, 'tags__price']:
print(f'\t price={price}')
Then you can retrieve all documents whose price is lower than or equal to max_price by applying the following
filter:
max_price = 3
n_limit = 4
filter = {
'range': {
'price': {
'lte': max_price,
}
}
}
results = da.find(filter=filter)
print('\n Returned examples that verify filter "price at most 3":\n')
for price in results[:, 'tags__price']:
print(f'\t price={price}')
This would print
Returned examples that satisfy condition "price at most 3":
price=0
price=1
price=2
price=3
Search by .text field#
Text search can be easily leveraged in a DocumentArray with storage='elasticsearch'.
To do this text needs to be indexed using the boolean flag 'index_text' which is set when
the DocumentArray is created with config={'index_text': True, ...}.
The following example builds a DocumentArray with several documents containing text and searches
for those that have pizza in their text description.
from docarray import DocumentArray, Document
da = DocumentArray(storage='elasticsearch', config={'n_dim': 2, 'index_text': True})
with da:
da.extend(
[
Document(text='Person eating'),
Document(text='Person eating pizza'),
Document(text='Pizza restaurant'),
]
)
pizza_docs = da.find('pizza')
pizza_docs[:, 'text']
will print
['Pizza restaurant', 'Person eating pizza']
Config#
The following configs can be set:
Name |
Description |
Default |
|---|---|---|
|
Hostname of the Elasticsearch server |
|
|
Other ES configs in a Dict and pass to |
None |
|
Elasticsearch index name; the class name of Elasticsearch index object to set this DocumentArray |
None |
|
Dimensionality of the embeddings |
None |
|
Similarity metric in Elasticsearch |
|
|
The size of the dynamic list for the nearest neighbors. |
|
|
Similarity metric in Elasticsearch |
|
|
Boolean flag indicating whether to index |
False |
|
List of tags to index |
False |
|
Batch size used to handle storage refreshes/updates |
64 |
|
Controls if ordering of Documents is persisted in the Database. Disabling this breaks list-like features, but can improve performance. |
True |
|
Boolean flag indicating whether to store |
True |
Tip
You can read more about HNSW parameters and their default values here
Tip
Note that it is plural hosts not host, to comply with Elasticsearch client’s interface.