Embed via Neural Network
Important: embed() supports both CPU & GPU.
When a DocumentArray has .tensors set, you can use a neural network to embed() it into vector representations, i.e. to fill .embeddings. For example, let’s assume we have the following DocumentArray:
from docarray import DocumentArray
import numpy as np
docs = DocumentArray.empty(10)
docs.tensors = np.random.random([10, 128]).astype(np.float32)
Let’s use a simple MLP as our embedding model; you can define it in PyTorch, Keras, ONNX, or PaddlePaddle:
PyTorch:

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(in_features=128, out_features=128),
    torch.nn.ReLU(),
    torch.nn.Linear(in_features=128, out_features=32),
)
Keras:

import tensorflow as tf

model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(32),
    ]
)
ONNX:

Preliminary: you first need to export a DNN model to ONNX via API/CLI. For example, let’s export the PyTorch model defined above:

data = torch.rand(1, 128)
torch.onnx.export(
    model,
    data,
    'mlp.onnx',
    do_constant_folding=True,  # whether to execute constant folding for optimization
    input_names=['input'],  # the model's input names
    output_names=['output'],  # the model's output names
    dynamic_axes={
        'input': {0: 'batch_size'},  # variable-length axes
        'output': {0: 'batch_size'},
    },
)

Then load it as an InferenceSession:

import onnxruntime

model = onnxruntime.InferenceSession('mlp.onnx')
PaddlePaddle:

import paddle

model = paddle.nn.Sequential(
    paddle.nn.Linear(in_features=128, out_features=128),
    paddle.nn.ReLU(),
    paddle.nn.Linear(in_features=128, out_features=32),
)
Now, you can create the embeddings:
docs.embed(model)
print(docs.embeddings)
tensor([[-0.1234,  0.0506, -0.0015,  0.1154, -0.1630, -0.2376,  0.0576, -0.4109,
          0.0052,  0.0027,  0.0800, -0.0928,  0.1326, -0.2256,  0.1649, -0.0435,
         -0.2312, -0.0068, -0.0991,  0.0767, -0.0501, -0.1393,  0.0965, -0.2062,
         ...
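Each Document gets one embedding row. With the MLP defined above, a quick sanity check (the expected shape in the comment follows from 10 Documents and out_features=32):

print(docs.embeddings.shape)  # torch.Size([10, 32]) for the PyTorch model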
By default, the filled .embeddings are in the given model framework’s format. If you want them to always be numpy.ndarray, use .embed(..., to_numpy=True).
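Continuing with the model above, a minimal sketch:

docs.embed(model, to_numpy=True)
print(type(docs.embeddings))  # <class 'numpy.ndarray'>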
You can specify .embed(..., device='cuda') when working with a GPU. The device name identifier depends on the model framework you’re using. On large DocumentArrays that don’t fit into GPU memory, you can set batch_size with .embed(..., batch_size=128).
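Both options can be combined; here 'cuda' is the PyTorch-style device name, other frameworks use their own identifiers:

docs.embed(model, device='cuda', batch_size=128)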
You can use a pretrained model from Keras/PyTorch/PaddlePaddle/ONNX for embedding:
import torchvision
model = torchvision.models.resnet50(pretrained=True)
docs.embed(model)
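Keep in mind that a pretrained model expects input tensors in its own format. ResNet-50, for example, consumes image batches, so the .tensors need a matching shape. A minimal sketch (the shapes and sizes here are illustrative assumptions, not part of the original example):

import numpy as np
import torchvision
from docarray import DocumentArray

# ResNet-50 expects image batches of shape (N, 3, 224, 224)
image_docs = DocumentArray.empty(4)
image_docs.tensors = np.random.random([4, 3, 224, 224]).astype(np.float32)

model = torchvision.models.resnet50(pretrained=True)
image_docs.embed(model)
print(image_docs.embeddings.shape)  # torch.Size([4, 1000]) with the default classification head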
After getting .embeddings, you can visualize them using plot_embeddings(); you can find more details here.
Note that .embed() only works when you have .tensors set. If you have .texts set and your model function supports strings as input, you can generate embeddings like this:
from docarray import DocumentArray
da = DocumentArray(...)
da.embeddings = my_text_model(da.texts)
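Here my_text_model stands for any callable that maps a list of strings to a 2D array; it is not part of DocArray. A toy stand-in, purely for illustration (a real setup would use an actual text encoder):

import hashlib
import numpy as np
from docarray import Document, DocumentArray

def my_text_model(texts, dim=32):
    # toy "encoder": hash each text into a deterministic pseudo-random vector
    vecs = []
    for t in texts:
        seed = int.from_bytes(hashlib.md5(t.encode()).digest()[:4], 'little')
        vecs.append(np.random.default_rng(seed).random(dim, dtype=np.float32))
    return np.stack(vecs)

da = DocumentArray([Document(text='hello'), Document(text='world')])
da.embeddings = my_text_model(da.texts)
print(da.embeddings.shape)  # (2, 32)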