# PyTorch/Deep Learning Frameworks
DocArray integrates easily with PyTorch, TensorFlow, and PaddlePaddle.
The `.embedding` and `.tensor` attributes of a `Document` can hold a PyTorch sparse/dense tensor, a TensorFlow sparse/dense tensor, or a PaddlePaddle dense tensor.
This means that when you store a Document on disk in pickle or protobuf (with or without compression), or transmit it over the network in the same formats, the data types of `.embedding` and `.tensor` are preserved.
```python
import numpy as np
import paddle
import torch

from docarray import Document, DocumentArray

emb = np.random.random([10, 3])

da = DocumentArray(
    [
        Document(embedding=emb),
        Document(embedding=torch.tensor(emb).to_sparse()),
        Document(embedding=torch.tensor(emb)),
        Document(embedding=paddle.to_tensor(emb)),
    ]
)

da.save_binary('test.protobuf.gz')
```
Now let’s load them again and check the data type:
```python
from docarray import DocumentArray

for d in DocumentArray.load_binary('test.protobuf.gz'):
    print(type(d.embedding))
```
```text
<class 'numpy.ndarray'>
<class 'torch.Tensor'>
<class 'torch.Tensor'>
<class 'paddle.Tensor'>
```
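The same guarantee holds for a single `Document`. Below is a minimal sketch, assuming `Document.to_bytes()` and `Document.from_bytes()`, which serialize one Document in pickle or protobuf, optionally compressed:

```python
import torch
from docarray import Document

d = Document(embedding=torch.randn(10, 3).to_sparse())

# Round-trip one Document through protobuf bytes; the embedding should come
# back as a torch.Tensor rather than being downcast to a plain numpy array.
d2 = Document.from_bytes(d.to_bytes(protocol='protobuf'))
print(type(d2.embedding))
```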
## Load, map, batch in one-shot
There is a very common pattern in deep learning engineering: load a big dataset, map it through some preprocessing function on the CPU, then batch it onto the GPU for the compute-intensive deep learning work.
There are many pitfalls in this pattern when it is not implemented carefully, to name a few:

- The data may not fit into memory.
- Mapping on the CPU utilizes only a single core.
- Data draining: the GPU sits under-utilized because it is starved by the slow CPU preprocessing step.

DocArray provides a high-level function `dataloader()` that lets you do all of this in one shot while avoiding these pitfalls. The following figure illustrates this function:
Say we have one million 32×32 color images, which take up 3.14GB on disk with `protocol='protobuf'` and `compress='gz'`.
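Such a file could be produced along the lines of the sketch below (shown here with a toy number of images; the `tensor` field and `save_binary()` call follow the earlier example):

```python
import numpy as np
from docarray import Document, DocumentArray

# Toy stand-in for the one-million-image dataset: random 32x32 RGB arrays.
da = DocumentArray(
    Document(tensor=np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8))
    for _ in range(10_000)
)

# Save with protobuf + gzip compression, matching the file used below.
da.save_binary('da.protobuf.gz', protocol='protobuf', compress='gz')
```

To process it: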
```python
import time

from docarray import DocumentArray


def cpu_job(da):
    time.sleep(2)
    print('cpu job done')
    return da


def gpu_job(da):
    time.sleep(1)
    print('gpu job done')


for da in DocumentArray.dataloader(
    'da.protobuf.gz', func=cpu_job, batch_size=64, num_worker=4
):
    gpu_job(da)
```
```text
cpu job done
cpu job done
cpu job done
cpu job done
gpu job done
cpu job done
cpu job done
gpu job done
cpu job done
cpu job done
gpu job done
cpu job done
gpu job done
cpu job done
cpu job done
```
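In a real pipeline, `gpu_job` would move each preprocessed batch onto the GPU and run the model there. A rough sketch, assuming PyTorch and that every Document in the batch carries a same-shaped `.tensor` (accessed in bulk via the DocumentArray `.tensors` property):

```python
import torch


def gpu_job(da):
    # Stack the per-Document tensors into one batch and move it to the GPU,
    # falling back to CPU when no GPU is available.
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    batch = torch.as_tensor(da.tensors, device=device).float()
    # ... run the model's forward pass on `batch` here ...
    print('gpu job done')
```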