Access Attributes#
A DocumentArray itself has no attributes. Accessing attributes in this context means accessing attributes of the contained Documents in bulk.
In the last chapter, we got a taste of DocumentArray’s powerful element selector. This chapter continues with the attribute selector.
Attribute selector#
da[element_selector, attribute_selector]
Here element_selector can be any element selector introduced in the last chapter. attribute_selector can be a string, or a list/tuple of strings, naming the attributes to access.
As with element selectors, you can use attribute selectors to get/set/delete attributes in a DocumentArray.
Example | Return |
---|---|
… | all … |
… | all … |
… | all … |
… | a list of two lists, the first is all … |
… | all … |
… | an NdArray-like object of the first three Documents' embeddings |
… | an NdArray-like object of all top-level Documents' tensors |
Let’s see an example:
from docarray import DocumentArray

da = DocumentArray.empty(3)
for d in da:
    # give every top-level Document two chunks and two matches
    d.chunks = DocumentArray.empty(2)
    d.matches = DocumentArray.empty(2)

print(da[:, 'id'])
['8d41ce5c6f0d11eca2181e008a366d49', '8d41cfa66f0d11eca2181e008a366d49', '8d41cff66f0d11eca2181e008a366d49']
Of course you can use it with the path-string selector:
print(da['@c', 'id'])
['db60ab8a6f0d11ec99511e008a366d49', 'db60abda6f0d11ec99511e008a366d49', 'db60c12e6f0d11ec99511e008a366d49', 'db60c1886f0d11ec99511e008a366d49', 'db60c4266f0d11ec99511e008a366d49', 'db60c46c6f0d11ec99511e008a366d49']
print(da[..., 'id'])
['285db6586f0e11ec99401e008a366d49', '285db6b26f0e11ec99401e008a366d49', '285dbff46f0e11ec99401e008a366d49', '285dc0586f0e11ec99401e008a366d49', '285db3606f0e11ec99401e008a366d49', '285dcc746f0e11ec99401e008a366d49', '285dccce6f0e11ec99401e008a366d49', '285dce0e6f0e11ec99401e008a366d49', '285dce5e6f0e11ec99401e008a366d49', '285db4fa6f0e11ec99401e008a366d49', '285dcf946f0e11ec99401e008a366d49', '285dcfda6f0e11ec99401e008a366d49', '285dd1066f0e11ec99401e008a366d49', '285dd16a6f0e11ec99401e008a366d49', '285db55e6f0e11ec99401e008a366d49']
Let’s set the mime_type field for the top-level Documents. We have three top-level Documents, so:
da[:, 'mime_type'] = ['image/jpg', 'image/png', 'image/jpg']
da.summary()
Documents Summary

Length                     3
Homogenous Documents       True
Has nested Documents in    ('chunks', 'matches')
Common Attributes          ('id', 'mime_type', 'chunks', 'matches')

Attributes Summary

Attribute    Data type          #Unique values    Has empty value
────────────────────────────────────────────────────────────────
chunks       ('ChunkArray',)    3                 False
id           ('str',)           3                 False
matches      ('MatchArray',)    3                 False
mime_type    ('str',)           2                 False
You can see that mime_type is set for each Document.
To set an attribute of all Documents to the same value without looping, assign a single value:
da[:, 'mime_type'] = 'hello'
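You can also select multiple attributes in one shot. The result is a list with one inner list per attribute (shown here with the mime_type values from before the 'hello' assignment):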
da[:, ['mime_type', 'id']]
[['image/jpg', 'image/png', 'image/jpg'], ['095cd76a6f0f11ec82211e008a366d49', '095cd8d26f0f11ec82211e008a366d49', '095cd92c6f0f11ec82211e008a366d49']]
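Conceptually, the get and set forms above behave like a loop over the selected Documents. The following is only a rough pure-Python sketch of that behavior, not DocArray's implementation: `bulk_get`, `bulk_set` and the `SimpleNamespace` stand-ins are hypothetical names introduced for illustration, and the length check is an assumption of the sketch.

```python
from types import SimpleNamespace


def bulk_get(docs, names):
    """Return a flat list for one name, or one inner list per name."""
    if isinstance(names, str):
        return [getattr(d, names) for d in docs]
    return [[getattr(d, n) for d in docs] for n in names]


def bulk_set(docs, name, value):
    """Assign one value per Document, or broadcast a single value to all."""
    values = value if isinstance(value, (list, tuple)) else [value] * len(docs)
    # Assumption of this sketch: a list must supply exactly one value per Document.
    if len(values) != len(docs):
        raise ValueError('need exactly one value per Document')
    for d, v in zip(docs, values):
        setattr(d, name, v)


# Hypothetical stand-ins for three top-level Documents
docs = [SimpleNamespace(id=f'doc{i}', mime_type='') for i in range(3)]

bulk_set(docs, 'mime_type', ['image/jpg', 'image/png', 'image/jpg'])
per_doc = bulk_get(docs, 'mime_type')       # one value per Document
both = bulk_get(docs, ['mime_type', 'id'])  # one inner list per attribute

bulk_set(docs, 'mime_type', 'hello')        # single value, broadcast to all
broadcast = bulk_get(docs, 'mime_type')

print(per_doc, both, broadcast, sep='\n')
```

The point of the sketch is the shape of the results: a single attribute name yields one flat list, while a list of names yields one inner list per attribute, in the order the names were given.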
Now let’s remove the mime_type attribute from all Documents:
del da[:, 'mime_type']
da.summary()
Documents Summary

Length                     3
Homogenous Documents       True
Has nested Documents in    ('chunks', 'matches')
Common Attributes          ('id', 'chunks', 'matches')

Attributes Summary

Attribute    Data type          #Unique values    Has empty value
────────────────────────────────────────────────────────────────
chunks       ('ChunkArray',)    3                 False
id           ('str',)           3                 False
matches      ('MatchArray',)    3                 False
Auto-ravel on NdArray#
The tensor and embedding attribute selectors behave a little differently. Instead of taking and returning a Python list on get/set, they automatically ravel/unravel the ndarray-like object [1] for you.
Here’s an example. You might expect da[:, 'embedding'] to give you a list of three (1, 10) COO matrices, but it auto-ravels the results and returns them as a single (3, 10) COO matrix:
import numpy as np
import scipy.sparse
from docarray import DocumentArray

# build a sparse (3, 10) matrix: zero out every entry above 0.1
sp_embed = np.random.random([3, 10])
sp_embed[sp_embed > 0.1] = 0
sp_embed = scipy.sparse.coo_matrix(sp_embed)

da = DocumentArray.empty(3)
da[:, 'embedding'] = sp_embed  # auto-unravel: one row per Document

print(type(da[:, 'embedding']), da[:, 'embedding'].shape)
for d in da:
    print(type(d.embedding), d.embedding.shape)
<class 'scipy.sparse.coo.coo_matrix'> (3, 10)
<class 'scipy.sparse.coo.coo_matrix'> (1, 10)
<class 'scipy.sparse.coo.coo_matrix'> (1, 10)
<class 'scipy.sparse.coo.coo_matrix'> (1, 10)
Auto-unravel works in a similar way: we assign a (3, 10) COO matrix to da[:, 'embedding'] (or, equivalently, to .embeddings), and it is automatically split into three (1, 10) rows, one for each Document.
Of course, this isn’t limited to SciPy sparse matrices. Any ndarray-like [1] object will work. The same logic also applies to the .tensors attribute.
Dunder syntax for nested attributes#
Some attributes are nested by nature, like .tags and .scores. Accessing a deeply nested value is easy thanks to the dunder (double underscore) expression. You can access .tags['key1'] of every Document via da[:, 'tags__key1']:
import numpy as np
from docarray import DocumentArray
da = DocumentArray.empty(3)
da.embeddings = np.random.random([3, 2])
da.match(da)
Now print the id and the cosine score of every match:
print(da['@m', ('id', 'scores__cosine__value')])
[['5164d792709a11ec9ae71e008a366d49', '5164d986709a11ec9ae71e008a366d49', '5164d922709a11ec9ae71e008a366d49', '5164d922709a11ec9ae71e008a366d49', '5164d986709a11ec9ae71e008a366d49', '5164d792709a11ec9ae71e008a366d49', '5164d986709a11ec9ae71e008a366d49', '5164d792709a11ec9ae71e008a366d49', '5164d922709a11ec9ae71e008a366d49'],
[0.0, 0.006942970007385196, 0.48303283924326845, 0.0, 0.3859268166910603, 0.48303283924326845, 2.220446049250313e-16, 0.006942970007385196, 0.3859268166910603]]
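To make the dunder expression concrete, here is a simplified pure-Python model of how a path such as 'scores__cosine__value' can be resolved one segment at a time. It is a sketch under assumptions, not DocArray's own code: `resolve_dunder` and the `SimpleNamespace` stand-in for a matched Document are hypothetical names used only for illustration.

```python
from types import SimpleNamespace


def resolve_dunder(obj, path):
    """Walk 'a__b__c' segment by segment: dict key for dicts, attribute otherwise."""
    for part in path.split('__'):
        obj = obj[part] if isinstance(obj, dict) else getattr(obj, part)
    return obj


# Hypothetical stand-in for a matched Document with nested tags and scores
match = SimpleNamespace(
    id='m1',
    tags={'key1': 'value1'},
    scores={'cosine': SimpleNamespace(value=0.25)},
)

key1 = resolve_dunder(match, 'tags__key1')               # match.tags['key1']
cosine = resolve_dunder(match, 'scores__cosine__value')  # match.scores['cosine'].value

print(key1, cosine)
```

Each double underscore marks one step deeper into the nested structure, which is why a single flat string is enough to address a value several levels down.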
Content and embedding sugary attributes#
DocumentArray provides the .texts, .blobs, .tensors, .contents and .embeddings sugary attributes for quickly accessing the content and embeddings of Documents. You can use them to get/set/delete the corresponding attributes of all top-level Documents.
from docarray import DocumentArray
da = DocumentArray.empty(2)
da.texts = ['hello', 'world']
print(da.texts)
['hello', 'world']
This is the same as da[:, 'text'] = ['hello', 'world'] followed by print(da[:, 'text']), but more compact and probably more Pythonic.
It’s the same for .tensors and .embeddings:
import numpy as np
from docarray import DocumentArray

# build a dense (3, 10) matrix
embed = np.random.random([3, 10])

da = DocumentArray.empty(3)
da.embeddings = embed  # each Document receives one row of shape (10,)

print(type(da.embeddings), da.embeddings.shape)
for d in da:
    print(type(d.embedding), d.embedding.shape)
<class 'numpy.ndarray'> (3, 10)
<class 'numpy.ndarray'> (10,)
<class 'numpy.ndarray'> (10,)
<class 'numpy.ndarray'> (10,)
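One way to picture a sugary attribute is as a property layered on top of bulk attribute access. The sketch below illustrates that idea in plain Python only. `TinyArray` is a hypothetical stand-in rather than the real DocumentArray class, and nothing here is meant to describe how DocArray implements .texts internally.

```python
from types import SimpleNamespace


class TinyArray(list):
    """Hypothetical stand-in for a DocumentArray holding Document-like objects."""

    @property
    def texts(self):
        # plays the role of the attribute selector da[:, 'text']
        return [d.text for d in self]

    @texts.setter
    def texts(self, values):
        # plays the role of da[:, 'text'] = values
        for d, v in zip(self, values):
            d.text = v


arr = TinyArray(SimpleNamespace(text='') for _ in range(2))
arr.texts = ['hello', 'world']

print(arr.texts)
```

The property adds no new capability over the attribute selector; it only gives the most common bulk accesses a shorter, more readable spelling.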