Construct#
Initializing a Document object is easy. This chapter introduces the ways of constructing both empty and filled Documents. You can also construct Documents from bytes, JSON, or Protobuf message as introduced in the next chapter.
Construct an empty Document#
from docarray import Document
d = Document()
<Document ('id',) at 5dd542406d3f11eca3241e008a366d49>
Each Document has a unique random id
to identify it. It can be used to access the Document inside a DocumentArray.
Tip
The random id
is the hex value of UUID1. To convert it into the a UUID string:
import uuid
str(uuid.UUID(d.id))
Though possible, we don’t recommended modifying the .id
of a Document frequently, as this leads to unexpected behavior.
Construct with attributes#
This is the constructor’s most common use: initializing a Document object with the given attributes:
from docarray import Document
import numpy
d1 = Document(text='hello')
d2 = Document(blob=b'\f1')
d3 = Document(tensor=numpy.array([1, 2, 3]))
d4 = Document(
uri='https://docs.docarray.org',
mime_type='text/plain',
granularity=1,
adjacency=3,
tags={'foo': 'bar'},
)
Don’t forget to leverage autocomplete in your IDE.
<Document ('id', 'mime_type', 'text') at a14effee6d3e11ec8bde1e008a366d49>
<Document ('id', 'blob') at a14f00986d3e11ec8bde1e008a366d49>
<Document ('id', 'tensor') at a14f01a66d3e11ec8bde1e008a366d49>
<Document ('id', 'granularity', 'adjacency', 'mime_type', 'uri', 'tags') at a14f023c6d3e11ec8bde1e008a366d49>
Tip
When you print()
a Document, you get a string representation like <Document ('id', 'tensor') at a14f01a66d3e11ec8bde1e008a366d49>
. This shows the Document’s non-empty attributes as well as its id
. All of this helps you understand the content of that Document.
<Document ('id', 'tensor') at a14f01a66d3e11ec8bde1e008a366d49>
^^^^^^^^^^^^^^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| |
non-empty fields |
Document.id
You can also wrap keyword arguments into a dict
. The following ways of initialization have the same effect:
d1 = Document(
uri='https://docs.docarray.org', mime_type='text/plain', granularity=1, adjacency=3
)
d2 = Document(
dict(
uri='https://docs.docarray.org',
mime_type='text/plain',
granularity=1,
adjacency=3,
)
)
d3 = Document(
{
'uri': 'https://docs.docarray.org',
'mime_type': 'text/plain',
'granularity': 1,
'adjacency': 3,
}
)
Nested Document#
See also
This section describes how to manually construct a nested Document, for example to hold different modalities, like text and image.
To construct multimodal Documents in a more comfortabe, readable, and idiomatic way you should use DocArray’s dataclass API.
To learn more about nested Documents, please read Nested Structure.
Documents can be nested inside .chunks
and .matches
. You can specify this nested structure directly during construction:
from docarray import Document
d = Document(
id='d0',
chunks=[Document(id='d1', chunks=Document(id='d2'))],
matches=[Document(id='d3')],
)
print(d)
<Document ('id', 'chunks', 'matches') at d0>
For a nested Document, printing its root doesn’t give much information. Instead, you can use summary()
– for example, d.summary()
gives a more intuitive overview of the Document’s structure.
<Document ('id', 'chunks', 'matches') at d0>
└─ matches
└─ <Document ('id',) at d3>
└─ chunks
└─ <Document ('id', 'chunks') at d1>
└─ chunks
└─ <Document ('id', 'parent_id', 'granularity') at d2>
When using in Jupyter notebook/Google Colab, Documents are automatically prettified.
Unknown attribute handling#
If you give an unknown attribute (i.e. not one of the built-in Document attributes), it is automatically “caught” into the .tags
attribute. For example:
from docarray import Document
d = Document(hello='world')
print(d, d.tags)
<Document ('id', 'tags') at f957e84a6d4311ecbea21e008a366d49>
{'hello': 'world'}
You can change this catch
behavior to drop
(silently drop unknown attributes) or raise
(raise an AttributeError
) by specifying unknown_fields_handler
.
Resolve unknown attributes with rules#
You can resolve external fields into built-in attributes by specifying a mapping in field_resolver
. For example, to resolve the field hello
as the id
attribute:
from docarray import Document
d = Document(hello='world', field_resolver={'hello': 'id'})
print(d)
<Document ('id',) at world>
You can see id
of the Document object is set to world
.
Copy from another Document#
To make a deep copy of a Document, use copy=True
:
from docarray import Document
d = Document(text='hello')
d1 = Document(d, copy=True)
print(d == d1, id(d) == id(d1))
True False
This indicates d
and d1
have identical content, but they are different objects in memory.
If you want to keep the memory address of a Document object while only copying the content from another Document, you can use copy_from()
.
from docarray import Document
d1 = Document(text='hello')
d2 = Document(text='world')
print(id(d1))
d1.copy_from(d2)
print(d1.text)
print(id(d1))
4479829968
world
4479829968
What’s next?#
You can also construct Documents from bytes, JSON, and Protobuf message. These methods are introduced in the next chapter.