Qdrant#
You can use Qdrant as a document store for DocumentArray. It’s suitable for faster Document retrieval on embeddings, i.e. .match() and .find().
Tip
This feature requires qdrant-client. You can install it with pip install "docarray[qdrant]".
Usage#
Start Qdrant service#
To use Qdrant as the storage backend, you need a running Qdrant server. You can create a docker-compose.yml file that uses the Qdrant Docker image:
---
version: '3.4'
services:
  qdrant:
    image: qdrant/qdrant:v0.10.1
    ports:
      - "6333:6333"
      - "6334:6334"
    ulimits: # Only required for tests, as there are a lot of collections created
      nofile:
        soft: 65535
        hard: 65535
...
Then start the service:
docker-compose up
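Once the container is running, it can help to confirm that the server is reachable before going further. The sketch below is one way to do that with only the Python standard library. It assumes the REST port 6333 mapped in the compose file above and queries Qdrant's /collections REST endpoint. The helper name qdrant_is_up is illustrative and is not part of DocArray or qdrant-client.

```python
import http.client
import urllib.request


def qdrant_is_up(host: str = "localhost", port: int = 6333, timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET on /collections answers with status 200."""
    url = f"http://{host}:{port}/collections"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection refused, timeouts, DNS failures and HTTP error statuses.
        return False


if __name__ == "__main__":
    print("Qdrant reachable:", qdrant_is_up())
```

If this prints False, check that the container is up and that port 6333 is not blocked or already taken.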
Create DocumentArray with Qdrant backend#
Assuming you start the service with the default configuration (i.e. server address is http://localhost:6333), you can instantiate a DocumentArray with Qdrant storage like so:
from docarray import DocumentArray
da = DocumentArray(storage='qdrant', config={'n_dim': 10})
The usage is the same as an ordinary DocumentArray.
To access a previously persisted DocumentArray, you can specify the collection_name, host and port:
from docarray import DocumentArray
da = DocumentArray(
    storage='qdrant',
    config={
        'collection_name': 'persisted',
        'host': 'localhost',
        'port': '6333',
        'n_dim': 10,
    },
)
da.summary()
Note that you must specify n_dim before using Qdrant as a backend for DocumentArray.
Other functions behave the same as an in-memory DocumentArray.
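The connection settings above can be collected in a small helper so that n_dim is never forgotten. This is only a sketch: make_qdrant_config is a hypothetical function, not a DocArray API, and the localhost and 6333 defaults simply mirror the default server address mentioned earlier.

```python
from typing import Any, Dict, Optional


def make_qdrant_config(
    n_dim: int,
    collection_name: Optional[str] = None,
    host: str = "localhost",
    port: int = 6333,
) -> Dict[str, Any]:
    """Build a config dict for DocumentArray(storage='qdrant', config=...)."""
    if not isinstance(n_dim, int) or n_dim <= 0:
        raise ValueError("n_dim must be a positive integer")
    config: Dict[str, Any] = {"n_dim": n_dim, "host": host, "port": port}
    if collection_name is not None:
        # Only needed when reconnecting to a persisted collection.
        config["collection_name"] = collection_name
    return config


config = make_qdrant_config(10, collection_name="persisted")
print(config)
# With a running server: da = DocumentArray(storage='qdrant', config=config)
```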
Configuration#
| Name | Description | Default |
|---|---|---|
| n_dim | Number of dimensions of embeddings to be stored and retrieved | This is always required |
| collection_name | Qdrant collection name | Random collection name generated |
| distance | Distance metric to use during search. Can be 'cosine' (similarity), 'dot' or 'euclidean' | |
| host | Hostname of the Qdrant server | |
| port | Port of the Qdrant server | |
| | Port of the Qdrant gRPC interface | |
| | Set | |
| | API key for authentication in Qdrant Cloud | |
| | Set | |
| | | |
| | Batch size used when scrolling over the storage | |
| | Number of neighbours to consider during the index building. Larger = more accurate search, more time to build index | |
| | Minimal size (in KiloBytes) of vectors for additional payload-based indexing | |
| | Number of edges per node in the index graph. Larger = more accurate search, more space required | |
| columns | Other fields to store in Document | |
| | Controls if ordering of Documents is persisted in the Database. Disabling this breaks list-like features, but can improve performance. | True |
| | Boolean flag indicating whether to store | True |
*You can read more about the HNSW parameters and their default values in Qdrant’s documentation.
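To make the distance option concrete, the following sketch computes the three measures named in the table for two small vectors using plain Python. It illustrates the underlying mathematics only. How Qdrant computes and reports scores internally is not shown here, and the function names are illustrative.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the normalized vectors; undefined for zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 0.0]
b = [1.0, 1.0]
print(dot(a, b))                          # 1.0
print(round(cosine_similarity(a, b), 4))  # 0.7071
print(euclidean_distance(a, b))           # 1.0
```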
Minimum example#
Create docker-compose.yml:
---
version: '3.4'
services:
  qdrant:
    image: qdrant/qdrant:v0.10.1
    ports:
      - "6333:6333"
      - "6334:6334"
    ulimits: # Only required for tests, as there are a lot of collections created
      nofile:
        soft: 65535
        hard: 65535
...
pip install -U "docarray[qdrant]"
docker-compose up
import numpy as np
from docarray import DocumentArray
N, D = 100, 128
da = DocumentArray.empty(
    N, storage='qdrant', config={'n_dim': D, 'distance': 'cosine'}
)  # init
da.embeddings = np.random.random([N, D])
print(da.find(np.random.random(D), limit=10))
<DocumentArray (length=10) at 4917906896>
Vector search with filter#
Search with .find can be restricted by user-defined filters. The supported tag types for filters are 'int', 'float', 'bool', 'str', 'text' and 'geo', as in Qdrant. Such filters can be constructed following the guidelines in Qdrant’s documentation.
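Since only a fixed set of tag types is supported, a columns mapping can be checked before it is passed in the config. The sketch below is a minimal illustration: check_columns is a hypothetical helper, and the column names in_stock and title are made up for the example.

```python
from typing import Dict

# Tag types listed above as supported for filtering.
SUPPORTED_COLUMN_TYPES = {"int", "float", "bool", "str", "text", "geo"}


def check_columns(columns: Dict[str, str]) -> Dict[str, str]:
    """Return columns unchanged, or raise if any declared type is unsupported."""
    unsupported = {
        name: type_name
        for name, type_name in columns.items()
        if type_name not in SUPPORTED_COLUMN_TYPES
    }
    if unsupported:
        raise ValueError(f"Unsupported column types: {unsupported}")
    return columns


columns = check_columns({"price": "float", "in_stock": "bool", "title": "text"})
config = {"n_dim": 3, "columns": columns}
print(config)
# With a running server: da = DocumentArray(storage='qdrant', config=config)
```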
Example of .find with filter#
Let’s create Documents with embeddings [0,0,0] up to [9,9,9], where each Document (which has an embedding [i,i,i]) has a tag price with value i:
from docarray import Document, DocumentArray
import numpy as np
n_dim = 3
distance = 'euclidean'
da = DocumentArray(
    storage='qdrant',
    config={'n_dim': n_dim, 'columns': {'price': 'float'}, 'distance': distance},
)
print(f'\nDocumentArray distance: {distance}')
with da:
    da.extend(
        [
            Document(id=f'r{i}', embedding=i * np.ones(n_dim), tags={'price': i})
            for i in range(10)
        ]
    )
print('\nIndexed Prices:\n')
for embedding, price in zip(da.embeddings, da[:, 'tags__price']):
    print(f'\tembedding={embedding},\t price={price}')
We want the nearest vectors to the embedding [8. 8. 8.], with the restriction that prices must follow a filter. For example, retrieved Documents must have a price value lower than or equal to max_price. You can encode this information in Qdrant using filter = {'must': [{'key': 'price', 'range': {'lte': max_price}}]}. You can also pass additional search_params following Qdrant’s Search API.
You can then implement and search with the proposed filter:
max_price = 7
n_limit = 4
np_query = np.ones(n_dim) * 8
print(f'\nQuery vector: \t{np_query}')
filter = {'must': [{'key': 'price', 'range': {'lte': max_price}}]}
results = da.find(np_query, filter=filter, limit=n_limit, search_params={"hnsw_ef": 64})
print('\nEmbeddings Nearest Neighbours with "price" at most 7:\n')
for embedding, price in zip(results.embeddings, results[:, 'tags__price']):
    print(f'\tembedding={embedding},\t price={price}')
This prints:
Query vector: [8. 8. 8.]
Embeddings Nearest Neighbours with "price" at most 7:
embedding=[7. 7. 7.], price=7
embedding=[6. 6. 6.], price=6
embedding=[5. 5. 5.], price=5
embedding=[4. 4. 4.], price=4
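The result can be double-checked without a server. The sketch below reproduces the same query in plain Python: keep only the Documents whose price satisfies the lte condition, then sort the survivors by Euclidean distance to the query. It is a brute-force illustration of the semantics under those assumptions, not a description of how Qdrant executes the search.

```python
import math

n_dim = 3
max_price = 7
n_limit = 4

# Same data as above: embedding [i, i, i] with price i.
docs = [{"embedding": [float(i)] * n_dim, "price": i} for i in range(10)]
query = [8.0] * n_dim

# Step 1: apply the payload filter (price <= max_price).
candidates = [d for d in docs if d["price"] <= max_price]

# Step 2: rank the remaining candidates by Euclidean distance to the query.
candidates.sort(key=lambda d: math.dist(d["embedding"], query))

nearest_prices = [d["price"] for d in candidates[:n_limit]]
print(nearest_prices)  # [7, 6, 5, 4]
```

The Documents with prices 8 and 9 are closer to the query but are excluded by the filter, which is why the nearest match returned is [7. 7. 7.].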
Note
For Qdrant, the distance scores can be accessed in the Document’s .scores dictionary by the key f'{distance_metric}_similarity'. For example, for distance = 'euclidean' the key would be 'euclidean_similarity'.
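As a small illustration of the key format described in the note, the sketch below builds the key from the configured distance. The helper score_key is hypothetical. The commented lines show how it might be used on the results of the search above, assuming the score objects expose a .value attribute as in DocArray's NamedScore; verify this against your DocArray version.

```python
def score_key(distance: str) -> str:
    """Key under which the Qdrant score is stored in a Document's .scores."""
    return f"{distance}_similarity"


print(score_key("euclidean"))  # euclidean_similarity

# With the results from the search above (needs a running server):
# for doc in results:
#     print(doc.id, doc.scores[score_key("euclidean")].value)
```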
Example of .filter with a filter#
The following example shows how to use DocArray with the Qdrant document store to filter Documents.
Let’s create Documents with the tag price with a value of i:
from docarray import Document, DocumentArray
import numpy as np
n_dim = 3
da = DocumentArray(
    storage='qdrant',
    config={'n_dim': n_dim, 'columns': {'price': 'float'}},
)
with da:
    da.extend(
        [
            Document(id=f'r{i}', embedding=i * np.ones(n_dim), tags={'price': i})
            for i in range(10)
        ]
    )
print('\nIndexed Prices:\n')
for embedding, price in zip(da.embeddings, da[:, 'tags__price']):
    print(f'\tembedding={embedding},\t price={price}')
If you want to filter only for results with a price less than or equal to max_price, you can encode this information using filter = {'must': [{'key': 'price', 'range': {'lte': max_price}}]}.
You can then implement and search with the proposed filter:
max_price = 7
n_limit = 4
filter = {'must': [{'key': 'price', 'range': {'lte': max_price}}]}
results = da.filter(filter=filter, limit=n_limit)
print('\nPoints with "price" at most 7:\n')
for embedding, price in zip(results.embeddings, results[:, 'tags__price']):
    print(f'\tembedding={embedding},\t price={price}')
This prints:
Points with "price" at most 7:
embedding=[6. 6. 6.], price=6
embedding=[7. 7. 7.], price=7
embedding=[1. 1. 1.], price=1
embedding=[2. 2. 2.], price=2
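Range conditions like the one above can also be built programmatically. The sketch below uses a hypothetical helper, range_filter, that produces a dict in the same must/key/range shape, plus a tiny local matches function to show which payloads such a filter accepts. Only the gte and lte bounds are handled, and both helpers are illustrative rather than part of DocArray or qdrant-client.

```python
from typing import Any, Dict, Optional


def range_filter(
    key: str, gte: Optional[float] = None, lte: Optional[float] = None
) -> Dict[str, Any]:
    """Build a Qdrant-style filter requiring gte <= payload[key] <= lte."""
    bounds: Dict[str, float] = {}
    if gte is not None:
        bounds["gte"] = gte
    if lte is not None:
        bounds["lte"] = lte
    if not bounds:
        raise ValueError("At least one of gte or lte is required")
    return {"must": [{"key": key, "range": bounds}]}


def matches(payload: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """Local stand-in for the server-side check: every 'must' condition holds."""
    for condition in flt["must"]:
        value = payload.get(condition["key"])
        if value is None:
            return False
        bounds = condition["range"]
        if "gte" in bounds and value < bounds["gte"]:
            return False
        if "lte" in bounds and value > bounds["lte"]:
            return False
    return True


flt = range_filter("price", lte=7)
print(flt)  # {'must': [{'key': 'price', 'range': {'lte': 7}}]}
kept = [i for i in range(10) if matches({"price": i}, flt)]
print(kept)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Eight Documents satisfy the condition, but limit=4 returns only four of them. Because .filter does not rank by distance to a query vector, the order seen in the output above does not follow the prices.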