Serialization#
DocArray is designed to be “ready-to-wire”: it assumes you always want to send/receive Documents over the network across microservices. This chapter introduces a Document’s multiple serialization methods.
Tip
You should use DocumentArray for serializing multiple Documents, instead of looping over Documents one by one. The former is much faster and yields more compact serialization.
Hint
You may wonder: why isn’t serialization part of the constructor? The two look similar. Nonetheless, serialization often involves concerns that don’t really belong in a constructor: input and output models, data schema, compression, extra dependencies. In DocArray we decided to separate the constructor from serialization for the sake of clarity and maintainability.
From/to JSON#
Tip
If you’re building a webservice and want to use JSON for passing DocArray objects, then data validation and field-filtering can be crucial. In this case, we highly recommend checking out FastAPI/Pydantic and following the methods there.
Important
Depending on which protocol you use, this feature requires the pydantic or protobuf dependency. You can run pip install "docarray[full]" to install both.
You can serialize a Document as a JSON string with to_json(), and then read from it with from_json().
from docarray import Document
import numpy as np
d_as_json = Document(text='hello, world', embedding=np.array([1, 2, 3])).to_json()
d = Document.from_json(d_as_json)
print(d_as_json, d)
{"id": "641032d677b311ecb67a1e008a366d49", "parent_id": null, "granularity": null, "adjacency": null, "blob": null, "tensor": null, "mime_type": "text/plain", "text": "hello, world", "weight": null, "uri": null, "tags": null, "offset": null, "location": null, "embedding": [1, 2, 3], "modality": null, "evaluations": null, "scores": null, "chunks": null, "matches": null}
<Document ('id', 'mime_type', 'text', 'embedding') at 641032d677b311ecb67a1e008a366d49>
By default, Documents use JSON Schema and the pydantic model for serialization, i.e. protocol='jsonschema'. To use Protobuf as the JSON serialization backend, pass protocol='protobuf' to the method:
from docarray import Document
import numpy as np
d = Document(text='hello, world', embedding=np.array([1, 2, 3]))
d.to_json(protocol='protobuf')
{
  "id": "db66bc2e77b311eca5f51e008a366d49",
  "text": "hello, world",
  "mimeType": "text/plain",
  "embedding": {
    "dense": {
      "buffer": "AQAAAAAAAAACAAAAAAAAAAMAAAAAAAAA",
      "shape": [
        3
      ],
      "dtype": "<i8"
    },
    "clsName": "numpy"
  }
}
When using a RESTful API, you should use protocol='jsonschema', as the resulting JSON follows a pre-defined schema. This is highly valued in modern webservice engineering.
Note that you can pass extra arguments to control field inclusion/exclusion and letter casing. For example, you can remove fields that are empty or None with:
from docarray import Document
d = Document(text='hello, world', embedding=[1, 2, 3])
d.to_json(exclude_none=True)
{"id": "cdbc4f7a77b411ec96ad1e008a366d49", "mime_type": "text/plain", "text": "hello, world", "embedding": [1, 2, 3]}
This is easier on the eyes. But when building a RESTful API, you don’t need to explicitly do this – the pydantic model handles everything for you. More information can be found in FastAPI/Pydantic.
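For illustration, here is a minimal FastAPI sketch. It assumes the pydantic model is importable as docarray.document.pydantic_model.PydanticDocument and that Document exposes from_pydantic_model()/to_pydantic_model(); see FastAPI/Pydantic for the authoritative usage.
from docarray import Document
from docarray.document.pydantic_model import PydanticDocument
from fastapi import FastAPI

app = FastAPI()

# response_model_exclude_none plays the same role as exclude_none above
@app.post('/doc', response_model=PydanticDocument, response_model_exclude_none=True)
async def create_doc(pd: PydanticDocument):
    d = Document.from_pydantic_model(pd)  # FastAPI has already validated pd
    return d.to_pydantic_model()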
See also
To find out what extra parameters you can pass to to_json()/to_dict(), check out the API reference.
From/to arbitrary JSON#
Arbitrary JSON is unschema-ed JSON. It often comes from handcrafted JSON or an export file from another library. Its schema is unknown to DocArray, so in principle you can’t load it.
But principles be damned. To load arbitrary JSON, set protocol=None.
As the JSON is arbitrary, don’t expect the loading to work smoothly. DocArray tries its best to parse the fields: it first loads the JSON into a dict object, then builds a Document with Document(dict); when it encounters unknown attributes, it follows the behavior described here.
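For example, a minimal sketch (the JSON below, including the unknown field foo, is made up for illustration):
from docarray import Document

# hand-crafted JSON whose schema DocArray does not know
raw_json = '{"id": "my-doc", "text": "hello", "foo": "bar"}'
# protocol=None triggers the best-effort parsing described above;
# the unknown attribute foo is handled per the linked behavior
d = Document.from_json(raw_json, protocol=None)
print(d.text)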
As a rule of thumb, if you only work inside DocArray’s ecosystem, always use schema-ed JSON (.to_json(protocol='jsonschema') or .to_json(protocol='protobuf')) over unschema-ed JSON. If you’re exporting DocArray’s JSON to other ecosystems, also use schema-ed JSON. Your engineer friends will appreciate it, as it’s easier to integrate. In fact, DocArray does not offer unschema-ed JSON export, so your engineer friends will never be upset.
Read more about JSON Schema support in DocArray.
From/to bytes#
Important
Depending on your protocol and compress argument values, this feature may require the protobuf and lz4 dependencies. You can run pip install "docarray[full]" to install them.
Bytes, binary, or buffer, however you want to call it, is probably the most common and compact wire format. DocArray provides to_bytes() and from_bytes() to serialize Document objects into bytes.
from docarray import Document
import numpy as np
d = Document(text='hello, world', embedding=np.array([1, 2, 3]))
d_bytes = d.to_bytes()
d_r = Document.from_bytes(d_bytes)
print(d_bytes, d_r)
b'\x80\x03cdocarray.document\nDocument\nq\x00)\x81q\x01}q\x02X\x05\x00\x00\x00_dataq\x03cdocarray.document.data\nDocumentData\nq\x04)\x81q\x05}q\x06(X\x0e\x00\x00\x00_reference_docq\x07h\x01X\x02\x00\x00\x00idq\x08X \x00\x00\x005d29a9f26d5911ec88d51e008a366d49q\tX\t\x00\x00\x00parent_...
<Document ('id', 'mime_type', 'text', 'embedding') at 3644c0fa6d5a11ecbb081e008a366d49>
The default serialization protocol is pickle; you can change it to protobuf by specifying .to_bytes(protocol='protobuf'). You can also add compression to make the resulting bytes smaller:
from docarray import Document
import numpy as np
d = Document(text='hello, world', embedding=np.array([1, 2, 3]))
print(len(d.to_bytes(protocol='protobuf', compress='gzip')))
gives:
110
whereas the default .to_bytes() gives 666.
Note that when deserializing from a non-default binary serialization, you need to specify the same protocol and compress arguments that were used at serialization time:
d = Document.from_bytes(d_bytes, protocol='protobuf', compress='gzip')
Tip
If you go with the default protocol and compress settings, you can simply use bytes(d), which is more Pythonic.
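A quick sketch of that round trip, assuming both sides stick to the defaults:
from docarray import Document

d = Document(text='hello, world')
# with default protocol and compress, bytes(d) should be
# equivalent to d.to_bytes()
assert bytes(d) == d.to_bytes()
d_r = Document.from_bytes(bytes(d))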
From/to base64#
Important
Depending on your protocol and compress argument values, this feature may require the protobuf and lz4 dependencies. You can run pip install "docarray[full]" to install them.
Sometimes, such as with RESTful APIs, you can only send/receive strings, not bytes. You can serialize a Document into a base64 string with to_base64() and load it back with from_base64().
from docarray import Document
d = Document(text='hello', embedding=[1, 2, 3])
print(d.to_base64())
gANjZG9jYXJyYXkuZG9jdW1lbnQKRG9jdW1lbnQKcQApgXEBfXECWAUAAABfZGF0YXEDY2RvY2FycmF5LmRvY3VtZW50LmRhdGEKRG9jdW1lbnREYXRhCnEEKYFxBX1xBihYDgAAAF9yZWZlcmVuY2VfZG9jcQdoAVgCAAAAaWRxCFggAAAAZmZjNTY3ODg3MzAyMTFlY2E4NjMxZTAwOGEzNjZkNDlxCVgJAAAAcGFyZW50X2lkcQpOWAsAAABncmFudWxhcml0eXELTlgJAAAAYWRqYWNlbmN5cQxOWAYAAABidWZmZXJxDU5YBAAAAGJsb2JxDk5YCQAAAG1pbWVfdHlwZXEPWAoAAAB0ZXh0L3BsYWlucRBYBAAAAHRleHRxEVgFAAAAaGVsbG9xElgHAAAAY29udGVudHETTlgGAAAAd2VpZ2h0cRROWAMAAAB1cmlxFU5YBAAAAHRhZ3NxFk5YBgAAAG9mZnNldHEXTlgIAAAAbG9jYXRpb25xGE5YCQAAAGVtYmVkZGluZ3EZXXEaKEsBSwJLA2VYCAAAAG1vZGFsaXR5cRtOWAsAAABldmFsdWF0aW9uc3EcTlgGAAAAc2NvcmVzcR1OWAYAAABjaHVua3NxHk5YBwAAAG1hdGNoZXNxH051YnNiLg==
You can set protocol and compress to get a more compact string:
from docarray import Document
d = Document(text='hello', embedding=[1, 2, 3])
print(len(d.to_base64()))
print(len(d.to_base64(protocol='protobuf', compress='lz4')))
664
156
Note that you must pass the same protocol and compress arguments when using .from_base64().
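For example:
from docarray import Document

d = Document(text='hello', embedding=[1, 2, 3])
s = d.to_base64(protocol='protobuf', compress='lz4')
# deserialization must mirror the serialization settings
d_r = Document.from_base64(s, protocol='protobuf', compress='lz4')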
From/to dict#
Important
This feature requires the protobuf or pydantic dependency. You can run pip install "docarray[full]" to install it.
You can serialize a Document as a Python dict with to_dict(), and then read from it with from_dict().
from docarray import Document
import numpy as np
d_as_dict = Document(text='hello, world', embedding=np.array([1, 2, 3])).to_dict()
d = Document.from_dict(d_as_dict)
print(d_as_dict, d)
{'id': '5596c84c77b711ecafed1e008a366d49', 'parent_id': None, 'granularity': None, 'adjacency': None, 'blob': None, 'tensor': None, 'mime_type': 'text/plain', 'text': 'hello, world', 'weight': None, 'uri': None, 'tags': None, 'offset': None, 'location': None, 'embedding': [1, 2, 3], 'modality': None, 'evaluations': None, 'scores': None, 'chunks': None, 'matches': None}
<Document ('id', 'mime_type', 'text', 'embedding') at 5596c84c77b711ecafed1e008a366d49>
As the intermediate step of to_json()/from_json(), it’s unlikely you’ll use dict IO directly. Nonetheless, you can pass the same protocol and kwargs as described in From/to JSON to control the serialization behavior.
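For example, a small sketch assuming the kwargs from the JSON section apply here unchanged:
from docarray import Document

d = Document(text='hello, world', embedding=[1, 2, 3])
# exclude_none works here just like in to_json()
d_as_dict = d.to_dict(exclude_none=True)
d_r = Document.from_dict(d_as_dict)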
From/to Protobuf#
Important
This feature requires the protobuf dependency. You can run pip install "docarray[full]" to install it.
You can also serialize a Document object into a Protobuf Message object. This is used less frequently, as it’s often just an intermediate step when serializing into bytes, much like to_dict() is for JSON. However, if you work with Python’s Protobuf API, having a Python Protobuf Message object at hand can be useful.
from docarray import Document
d_proto = Document(uri='apple.jpg').to_protobuf()
print(type(d_proto), d_proto)
d = Document.from_protobuf(d_proto)
print(d)
<class 'docarray_pb2.DocumentProto'>
id: "d66463b46d6a11ecbf891e008a366d49"
uri: "apple.jpg"
mime_type: "image/jpeg"
<Document ('id', 'mime_type', 'uri') at e4b215106d6a11ecb28b1e008a366d49>
Refer to the Protobuf specification of Document for details.
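For instance, the returned object is a regular Protobuf Message, so the standard Protobuf API applies; a small sketch:
from docarray import Document

d_proto = Document(uri='apple.jpg').to_protobuf()
# standard Protobuf Message methods work as usual
wire = d_proto.SerializeToString()
d_proto_r = type(d_proto)()
d_proto_r.ParseFromString(wire)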
When .tensor or .embedding contains a framework-specific ndarray-like object, you can use .to_protobuf(..., ndarray_type='numpy') or .to_protobuf(..., ndarray_type='list') to cast it into a numpy.ndarray or list automatically. This helps ensure maximum compatibility between different microservices.
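For example, a sketch assuming PyTorch is installed:
import torch
from docarray import Document

# a framework-specific tensor that other microservices may not understand
d = Document(embedding=torch.tensor([1.0, 2.0, 3.0]))
# cast the embedding to a plain numpy ndarray inside the Protobuf message
d_proto = d.to_protobuf(ndarray_type='numpy')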
What’s next?#
Serializing a single Document can be useful, but often you want to do things in bulk, say one hundred or one million Documents at once. In that case, looping over each Document and serializing it one by one is inefficient. For this, DocumentArray introduces similar interfaces to_bytes(), to_json(), and to_list() that let you serialize multiple Documents more quickly and compactly.
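As a teaser, a minimal sketch (assuming DocumentArray.from_bytes() mirrors Document.from_bytes()):
from docarray import Document, DocumentArray

da = DocumentArray(Document(text=f'hello {i}') for i in range(100))
# one call serializes all Documents at once, faster and more compact
# than looping over each Document
da_bytes = da.to_bytes()
da_r = DocumentArray.from_bytes(da_bytes)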