Access Attributes#

Use . expression to get/set the value of an attribute as you would with any Python object:

from docarray import Document

d = Document()
d.text = 'hello world'

print(d.text)
hello world

To unset an attribute, assign it to None:

d.text = None

Or use pop():

d.pop('text')

You can unset multiple attributes with .pop():

d.pop('text', 'id', 'mime_type')

You can check which attributes are set by .non_empty_fields.

Content attributes#

Among all attributes, the most important are content attributes, namely .text, .tensor, and .blob. They contain the actual content.

See also

If you’re working with a Document that was created through DocArray’s dataclass API, you can not only access the attributes described here, but also attributes that you defined yourself.

To see how to do that, see here.

They correspond to string-like data (e.g. for natural language), ndarray-like data (e.g. for image/audio/video data), and binary data (for general purpose), respectively.

Attribute

Accept type

Use case

doc.text

Python string

Contain text

doc.tensor

A Python (nested) list/tuple of numbers, Numpy ndarray, SciPy sparse matrix (spmatrix), TensorFlow dense & sparse tensor, PyTorch dense & sparse tensor, PaddlePaddle dense tensor

Contain image/video/audio

doc.blob

Binary string

Contain intermediate IO buffer

Each Document can contain only one type of content. That means these three attributes are mutually exclusive. Let’s see an example:

import numpy as np
from docarray import Document

d = Document(text='hello')
d.tensor = np.array([1, 2, 3])

print(d)
<Document ('id', 'tensor', 'mime_type') at 7623808c6d6211ec9cf21e008a366d49>

As you can see, the text field is reset to empty.

But what if you want to represent more than one kind of information? Say, to fully represent a PDF page you need to store both image and text. In this case, you can use nested Documents and store the image in one sub-Document, and text in another sub-Document.

from docarray import Document

d = Document(chunks=[Document(tensor=...), Document(text=...)])

The principle is: Each Document contains only one modality of information. In practice, this makes your full solution clearer and easier to maintain.

There’s also a .content getter/setter for the content fields. Content is automatically grabbed or assigned to either the text, blob, or tensor field, based on the given type.

from docarray import Document

d = Document(content='hello')
print(d)
<Document ('id', 'mime_type', 'text') at b4d675466d6211ecae8d1e008a366d49>
d.content = [1, 2, 3]
print(d)
<Document ('id', 'tensor', 'mime_type') at 2808eeb86d6311ecaddb1e008a366d49>

You can also check which content field is set by .content_type.

Load content from a URI#

A common pattern is loading content from a URI instead of assigning it directly in the code.

You can do this with the .uri attribute. The value of .uri can point to either a local URI, remote URI or data URI.

from docarray import Document

d1 = Document(uri='apple.png').load_uri_to_image_tensor()
print(d1.content_type, d1.content)
tensor [[[255 255 255]
  [255 255 255]
  [255 255 255]
  ...
from docarray import Document

d1 = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').load_uri_to_text()

print(d1.content_type, d1.content)
text The Project Gutenberg eBook of Pride and Prejudice, by Jane Austen

This eBook is for the use of anyone anywhere in the United States and
most other parts of the wor
from docarray import Document

d1 = Document(
    uri='''data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==
'''
).load_uri_to_image_tensor()

print(d1.content_type, d1.content)
tensor [[[255 255 255]
  [255   0   0]
  [255   0   0]
  [255   0   0]
  [255 255 255]]
  ...

There are more .load_uri_to_* functions that allow you to read text, image, video, 3D mesh, audio and tabular data.

../../../_images/doc-load-autocomplete.png

Convert content to data URI

An inline data URI is helpful when you need a quick visualization in HTML, as it embeds all resources directly into that HTML.

You can convert a URI to a data URI using doc.convert_uri_to_datauri(). This fetches the resource and makes it inline.