Process Modality#
So far we’ve learned to construct and select multimodal Documents. Now we’re ready to leverage the DocArray API, Jina, or Hub Executors to process the modalities.
In a nutshell, you need to convert a multimodal dataclass into a Document object (or DocumentArray) before processing it. This is because the DocArray API, Jina, and Hub Executors always take Document/DocumentArray as their basic IO unit.
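As a minimal sketch (the Page dataclass, its field names, and photo.jpg are made up for illustration; the image file is assumed to exist on disk), the conversion and modality selection look like this:

from docarray import dataclass, Document, DocumentArray
from docarray.typing import Image, Text


@dataclass
class Page:  # hypothetical multimodal dataclass, for illustration only
    headline: Text
    photo: Image


# wrap the dataclass instance in a Document before any processing
doc = Document(Page(headline='hello world', photo='photo.jpg'))
da = DocumentArray([doc])

# each field is now a sub-Document that can be selected per modality
print(da['@.[headline]'].texts)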
Embed image and text with CLIP#
Developed by OpenAI, CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It is also a perfect model to showcase multimodal dataclass processing.
Take the code snippet from the original CLIP repository as an example:
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

print(image_features, text_features)
tensor([[-7.3285e-02, -1.6554e-01, ..., -1.3394e-01, -5.5605e-01, 1.2397e-01]])
tensor([[ 0.0547, -0.0061, 0.0495, ..., -0.6638, -0.1281, -0.4950],
[ 0.1447, 0.0225, -0.2909, ..., -0.4472, -0.3420, 0.1798],
[ 0.1981, -0.2040, -0.1533, ..., -0.4514, -0.5664, 0.0596]])
Let’s refactor it with dataclass:
import clip
import torch
from PIL import Image as PILImage

from docarray import dataclass, DocumentArray, Document
from docarray.typing import Image, Text

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


@dataclass
class MMDoc:
    title: Text
    banner: Image = None


m1 = MMDoc(banner='CLIP.png', title='a diagram')
m2 = MMDoc(banner='CLIP.png', title='a dog')
m3 = MMDoc(banner='CLIP.png', title='a cat')

da = DocumentArray([Document(m1), Document(m2), Document(m3)])

# CLIP's preprocess expects PIL images, so open each banner from its stored URI
image = torch.stack(
    [preprocess(PILImage.open(d.uri)) for d in da['@.[banner]']]
).to(device)
text = clip.tokenize(da['@.[title]'].texts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

print(image_features, text_features)
tensor([[-7.3285e-02, -1.6554e-01, ..., -1.3394e-01, -5.5605e-01],
[-7.3285e-02, -1.6554e-01, ..., -1.3394e-01, -5.5605e-01],
[-7.3285e-02, -1.6554e-01, ..., -1.3394e-01, -5.5605e-01]])
tensor([[ 0.0547, -0.0061, 0.0495, ..., -0.6638, -0.1281, -0.4950],
[ 0.1447, 0.0225, -0.2909, ..., -0.4472, -0.3420, 0.1798],
[ 0.1981, -0.2040, -0.1533, ..., -0.4514, -0.5664, 0.0596]])
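If you also want to keep these features on the Documents themselves, one option is to assign them back to the selected modality. This is a small sketch, assuming the selector returns references to the underlying sub-documents:

# store the CLIP features on the corresponding sub-documents
da['@.[banner]'].embeddings = image_features.cpu().numpy()
da['@.[title]'].embeddings = text_features.cpu().numpy()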
Embed with CLIP-as-service#
CLIP-as-service is a low-latency, high-scalability service for embedding images and text. You can easily integrate it into neural search solutions as a microservice.
Using CLIP-as-service to process a dataclass object is straightforward. It also illustrates how you can use existing Executors or services without touching their codebase.
Construct the dataclass. Note that banner uses field() with a custom setter and getter, so the image is kept as a URI instead of being loaded into a tensor locally; CLIP-as-service can work with URIs directly:
from docarray import dataclass, field, Document, DocumentArray
from docarray.typing import Text, Image


@dataclass
class MMDoc:
    title: Text
    banner: Image = field(setter=lambda v: Document(uri=v), getter=lambda d: d.uri)
Create multimodal dataclass objects:
m1 = MMDoc(banner='CLIP.png', title='a diagram')
m2 = MMDoc(banner='CLIP.png', title='a dog')
m3 = MMDoc(banner='CLIP.png', title='a cat')
Convert them into a DocumentArray:
da = DocumentArray([Document(m1), Document(m2), Document(m3)])
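To double-check what the conversion produced, you can inspect the nested structure; this is an optional sanity check, not required for the workflow:

da.summary()  # prints an overview of the Documents and their sub-document chunks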
Select each modality via the selector syntax and send it with the client:
from clip_client import Client

c = Client('grpc://demo-cas.jina.ai:51000')

print(c.encode(da['@.[banner]']).embeddings)
print(c.encode(da['@.[title]']).embeddings)
[[ 0.3137 -0.1458 0.303 ... 0.8877 -0.2546 -0.11365]
[ 0.3137 -0.1458 0.303 ... 0.8877 -0.2546 -0.11365]
[ 0.3137 -0.1458 0.303 ... 0.8877 -0.2546 -0.11365]]
[[ 0.05466 -0.005997 0.0498 ... -0.663 -0.1274 -0.4944 ]
[ 0.1442 0.02275 -0.291 ... -0.4468 -0.3416 0.1798 ]
[ 0.1985 -0.204 -0.1534 ... -0.4507 -0.5664 0.0598 ]]
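As a follow-up sketch (plain NumPy, not part of CLIP-as-service itself), you can compare the returned embeddings directly, for example scoring each title against the banner with cosine similarity:

import numpy as np

banner_emb = c.encode(da['@.[banner]']).embeddings
title_emb = c.encode(da['@.[title]']).embeddings

# cosine similarity between every title and the first banner
scores = (title_emb @ banner_emb[0]) / (
    np.linalg.norm(title_emb, axis=1) * np.linalg.norm(banner_emb[0])
)
print(scores)  # the highest score indicates the best-matching title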