Qdrant#
You can use Qdrant as a document store for DocumentArray. It is well suited to fast Document retrieval on embeddings, i.e. .match() and .find().
Tip
This feature requires qdrant-client. You can install it with pip install "docarray[qdrant]".
Usage#
Start Qdrant service#
To use Qdrant as the storage backend, you need a running Qdrant server. You can create a docker-compose.yml file that uses the Qdrant Docker image:
---
version: '3.4'
services:
  qdrant:
    image: qdrant/qdrant:v0.10.1
    ports:
      - "6333:6333"
      - "6334:6334"
    ulimits: # Only required for tests, as there are a lot of collections created
      nofile:
        soft: 65535
        hard: 65535
...
Then start the service:
docker-compose up
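Before creating a DocumentArray, it can help to confirm that the service is accepting connections. The following is a minimal sketch using only the Python standard library. The helper name qdrant_is_reachable is hypothetical (it is not part of DocArray or qdrant-client), and the default host and port assume the port mapping from the compose file above.

```python
import socket


def qdrant_is_reachable(
    host: str = 'localhost', port: int = 6333, timeout: float = 2.0
) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it only checks that something is listening on
    the port, not that the listener is actually Qdrant.
    """
    try:
        # create_connection raises OSError on refusal, timeout or DNS failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == '__main__':
    print('Qdrant reachable:', qdrant_is_reachable())
```

If this prints False, check that the container is still running and that port 6333 is not already taken by another process.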
Create DocumentArray with Qdrant backend#
Assuming you start the service with the default configuration (i.e. server address is http://localhost:6333), you can
instantiate a DocumentArray with Qdrant storage like so:
from docarray import DocumentArray
da = DocumentArray(storage='qdrant', config={'n_dim': 10})
The usage is the same as an ordinary DocumentArray.
To access a previously persisted DocumentArray, you can specify the collection_name, host and port:
from docarray import DocumentArray
da = DocumentArray(
    storage='qdrant',
    config={
        'collection_name': 'persisted',
        'host': 'localhost',
        'port': 6333,  # port number as an integer
        'n_dim': 10,
    },
)
da.summary()
Note that you must specify n_dim before using Qdrant as a backend for DocumentArray.
Other functions behave the same as an in-memory DocumentArray.
Configuration#
| Name | Description | Default |
|---|---|---|
| n_dim | Number of dimensions of embeddings to be stored and retrieved | This is always required |
| collection_name | Qdrant collection name client | Random collection name generated |
| distance | Distance metric to use during search. Can be ‘cosine’ (similarity), ‘dot’ or ‘euclidean’ | |
| host | Hostname of the Qdrant server | |
| port | Port of the Qdrant server | |
| | Port of the Qdrant gRPC interface | |
| | Set | |
| | API key for authentication in Qdrant Cloud | |
| | Set | |
| | | |
| | Batch size used when scrolling over the storage | |
| | Number of neighbours to consider during the index building. Larger = more accurate search, more time to build index | |
| | Minimal size (in KiloBytes) of vectors for additional payload-based indexing | |
| | Number of edges per node in the index graph. Larger = more accurate search, more space required | |
| columns | Other fields to store in Document | |
| | Controls if ordering of Documents is persisted in the Database. Disabling this breaks list-like features, but can improve performance. | True |
| | Boolean flag indicating whether to store | True |
*You can read more about the HNSW parameters and their default values in the Qdrant documentation.
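As an illustration of how these options fit together, here is a small sketch that assembles a config dictionary from the keys used in this page's examples (n_dim, distance, collection_name, columns). The helper make_qdrant_config is hypothetical and not part of DocArray. It only enforces two rules stated above: n_dim is always required, and distance must be one of the three listed metrics. Options that are not set are left out so that the backend defaults apply.

```python
from typing import Any, Dict, Optional

# Metrics listed in the configuration table
ALLOWED_DISTANCES = {'cosine', 'dot', 'euclidean'}


def make_qdrant_config(
    n_dim: int,
    distance: Optional[str] = None,
    collection_name: Optional[str] = None,
    columns: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Build a config dict, leaving unset options to the backend defaults."""
    if not isinstance(n_dim, int) or n_dim <= 0:
        raise ValueError('n_dim is always required and must be a positive integer')
    if distance is not None and distance not in ALLOWED_DISTANCES:
        raise ValueError(f'distance must be one of {sorted(ALLOWED_DISTANCES)}')

    config: Dict[str, Any] = {'n_dim': n_dim}
    # Only include the optional keys that were set explicitly
    optional = {
        'distance': distance,
        'collection_name': collection_name,
        'columns': columns,
    }
    config.update({key: value for key, value in optional.items() if value is not None})
    return config


config = make_qdrant_config(3, distance='euclidean', columns={'price': 'float'})
print(config)
```

The resulting dictionary could then be passed as DocumentArray(storage='qdrant', config=config).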
Minimum example#
Create docker-compose.yml:
---
version: '3.4'
services:
  qdrant:
    image: qdrant/qdrant:v0.10.1
    ports:
      - "6333:6333"
      - "6334:6334"
    ulimits: # Only required for tests, as there are a lot of collections created
      nofile:
        soft: 65535
        hard: 65535
...
pip install -U "docarray[qdrant]"
docker-compose up
import numpy as np
from docarray import DocumentArray
N, D = 100, 128
da = DocumentArray.empty(
    N, storage='qdrant', config={'n_dim': D, 'distance': 'cosine'}
)  # init
da.embeddings = np.random.random([N, D])
print(da.find(np.random.random(D), limit=10))
<DocumentArray (length=10) at 4917906896>
Vector search with filter#
Search with .find can be restricted by user-defined filters. The supported tag types for filters are 'int', 'float', 'bool', 'str', 'text' and 'geo', as in Qdrant. Such filters can be constructed following the guidelines in Qdrant’s documentation.
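Qdrant filters are plain nested dictionaries, so they can be assembled with small helper functions. The sketch below builds range conditions of the kind used in the examples on this page. The helpers range_condition and must are hypothetical names, not part of DocArray or qdrant-client. The bound names gt, gte, lt and lte follow Qdrant's range condition, and price is the tag used in the examples that follow.

```python
from typing import Any, Dict, Optional, Union

Number = Union[int, float]


def range_condition(
    key: str,
    *,
    gt: Optional[Number] = None,
    gte: Optional[Number] = None,
    lt: Optional[Number] = None,
    lte: Optional[Number] = None,
) -> Dict[str, Any]:
    """Build a single range condition on a numeric tag."""
    bounds = {'gt': gt, 'gte': gte, 'lt': lt, 'lte': lte}
    # Drop the bounds that were not supplied
    bounds = {name: value for name, value in bounds.items() if value is not None}
    if not bounds:
        raise ValueError('at least one of gt, gte, lt, lte is required')
    return {'key': key, 'range': bounds}


def must(*conditions: Dict[str, Any]) -> Dict[str, Any]:
    """Combine conditions so that all of them have to hold."""
    return {'must': list(conditions)}


# Documents whose price lies between 2 and 7, both inclusive
price_filter = must(range_condition('price', gte=2, lte=7))
print(price_filter)
```

A dictionary built this way can be passed as the filter argument of .find or .filter.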
Example of .find with filter#
Let’s create Documents with embeddings [0,0,0] up to [9,9,9], where each Document (which has an embedding [i,i,i])
has a tag price with value i:
from docarray import Document, DocumentArray
import numpy as np
n_dim = 3
distance = 'euclidean'
da = DocumentArray(
    storage='qdrant',
    config={'n_dim': n_dim, 'columns': {'price': 'float'}, 'distance': distance},
)
print(f'\nDocumentArray distance: {distance}')
with da:
    da.extend(
        [
            Document(id=f'r{i}', embedding=i * np.ones(n_dim), tags={'price': i})
            for i in range(10)
        ]
    )
print('\nIndexed Prices:\n')
for embedding, price in zip(da.embeddings, da[:, 'tags__price']):
    print(f'\tembedding={embedding},\t price={price}')
We want the nearest vectors to the embedding [8. 8. 8.], with the restriction that prices must follow a filter. For example, retrieved Documents must have price value lower than or equal to max_price. You can encode this information in Qdrant using filter = {'must': [{'key': 'price', 'range': {'lte': max_price}}]}. You can also pass additional search_params following Qdrant’s Search API.
You can then implement and search with the proposed filter:
max_price = 7
n_limit = 4
np_query = np.ones(n_dim) * 8
print(f'\nQuery vector: \t{np_query}')
filter = {'must': [{'key': 'price', 'range': {'lte': max_price}}]}
results = da.find(np_query, filter=filter, limit=n_limit, search_params={"hnsw_ef": 64})
print('\nEmbeddings Nearest Neighbours with "price" at most 7:\n')
for embedding, price in zip(results.embeddings, results[:, 'tags__price']):
    print(f'\tembedding={embedding},\t price={price}')
This prints:
Query vector: [8. 8. 8.]
Embeddings Nearest Neighbours with "price" at most 7:
embedding=[7. 7. 7.], price=7
embedding=[6. 6. 6.], price=6
embedding=[5. 5. 5.], price=5
embedding=[4. 4. 4.], price=4
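To see why these four Documents are returned, the same query can be reproduced without a server. The sketch below is a brute-force illustration in plain Python, not how Qdrant works internally (Qdrant relies on an HNSW index rather than a linear scan). It keeps the points that satisfy the price condition, then ranks them by Euclidean distance to the query. The list-of-dicts layout is an assumption made for the illustration.

```python
import math

n_dim = 3
# Same data as the example: embedding [i, i, i] with tag price = i
points = [
    {'id': f'r{i}', 'embedding': [float(i)] * n_dim, 'price': i} for i in range(10)
]

query = [8.0] * n_dim
max_price = 7
n_limit = 4

# 1. Keep only the points that satisfy the condition price <= max_price
candidates = [p for p in points if p['price'] <= max_price]
# 2. Rank the remaining points by Euclidean distance to the query
candidates.sort(key=lambda p: math.dist(p['embedding'], query))
# 3. Return the n_limit closest ones
nearest = candidates[:n_limit]

for p in nearest:
    print(f"id={p['id']}, price={p['price']}")
```

The Document with price 8 would be the exact match for the query, but the filter excludes it, so the closest remaining candidates have prices 7, 6, 5 and 4.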
Note
For Qdrant, the distance scores can be accessed in the Document’s .scores dictionary by the key f'{distance_metric}_similarity'. For example, for distance = 'euclidean' the key would be 'euclidean_similarity'.
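The key naming rule from the note can be captured in a one-line helper. This is a sketch: score_key is a hypothetical name, and the set of accepted metrics is taken from the configuration table on this page.

```python
SUPPORTED_DISTANCES = ('cosine', 'dot', 'euclidean')


def score_key(distance: str) -> str:
    """Return the key under which the score is stored in a Document's .scores."""
    if distance not in SUPPORTED_DISTANCES:
        raise ValueError(f'distance must be one of {SUPPORTED_DISTANCES}')
    return f'{distance}_similarity'


print(score_key('euclidean'))
# With results returned by .find, a score could then be read along the lines of
#   doc.scores[score_key('euclidean')].value
# (this assumes the score object exposes a .value attribute).
```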
Example of .filter with a filter#
The following example shows how to use DocArray with the Qdrant document store to filter Documents by a tag value, without a query vector.
Let’s create Documents with the tag price with a value of i:
from docarray import Document, DocumentArray
import numpy as np
n_dim = 3
da = DocumentArray(
    storage='qdrant',
    config={'n_dim': n_dim, 'columns': {'price': 'float'}},
)
with da:
    da.extend(
        [
            Document(id=f'r{i}', embedding=i * np.ones(n_dim), tags={'price': i})
            for i in range(10)
        ]
    )
print('\nIndexed Prices:\n')
for embedding, price in zip(da.embeddings, da[:, 'tags__price']):
    print(f'\tembedding={embedding},\t price={price}')
If you want to keep only results with a price less than or equal to max_price, you can encode this information using filter = {'must': [{'key': 'price', 'range': {'lte': max_price}}]}.
You can then apply the proposed filter:
max_price = 7
n_limit = 4
filter = {'must': [{'key': 'price', 'range': {'lte': max_price}}]}
results = da.filter(filter=filter, limit=n_limit)
print('\nPoints with "price" at most 7:\n')
for embedding, price in zip(results.embeddings, results[:, 'tags__price']):
    print(f'\tembedding={embedding},\t price={price}')
This prints:
Points with "price" at most 7:
embedding=[6. 6. 6.], price=6
embedding=[7. 7. 7.], price=7
embedding=[1. 1. 1.], price=1
embedding=[2. 2. 2.], price=2