Process Modality#
So far we’ve learned to construct and select multimodal Documents. Now we’re ready to leverage the DocArray API, Jina, or Hub Executors to process the modalities.
In a nutshell, you need to convert a multimodal dataclass into a Document object (or DocumentArray) before processing it. This is because the DocArray API, Jina, and Hub Executors always take Document/DocumentArray as their basic IO unit.
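As a minimal sketch (the Page dataclass, its field names, and photo.jpg are made up for illustration; the image file is assumed to exist on disk), the conversion and modality selection look like this:

from docarray import dataclass, Document, DocumentArray
from docarray.typing import Image, Text


@dataclass
class Page:  # hypothetical multimodal dataclass, for illustration only
    headline: Text
    photo: Image


# wrap the dataclass instance in a Document before any processing
doc = Document(Page(headline='hello world', photo='photo.jpg'))
da = DocumentArray([doc])

# each field is now a sub-Document that can be selected per modality
print(da['@.[headline]'].texts)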
Embed image and text with CLIP#
Developed by OpenAI, CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It is also a perfect model to showcase multimodal dataclass processing.
Take the code snippet from the original CLIP repository as an example:
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

print(image_features, text_features)
tensor([[-7.3285e-02, -1.6554e-01, ..., -1.3394e-01, -5.5605e-01, 1.2397e-01]])
tensor([[ 0.0547, -0.0061, 0.0495, ..., -0.6638, -0.1281, -0.4950],
[ 0.1447, 0.0225, -0.2909, ..., -0.4472, -0.3420, 0.1798],
[ 0.1981, -0.2040, -0.1533, ..., -0.4514, -0.5664, 0.0596]])
Let’s refactor it with dataclass:
import clip
import torch
from PIL import Image as PILImage

from docarray import dataclass, DocumentArray, Document
from docarray.typing import Image, Text

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


@dataclass
class MMDoc:
    title: Text
    banner: Image = None


m1 = MMDoc(banner='CLIP.png', title='a diagram')
m2 = MMDoc(banner='CLIP.png', title='a dog')
m3 = MMDoc(banner='CLIP.png', title='a cat')

da = DocumentArray([Document(m1), Document(m2), Document(m3)])

# CLIP's preprocess expects PIL images, so open each banner from its stored URI
image = torch.stack(
    [preprocess(PILImage.open(d.uri)) for d in da['@.[banner]']]
).to(device)
text = clip.tokenize(da['@.[title]'].texts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

print(image_features, text_features)
tensor([[-7.3285e-02, -1.6554e-01, ..., -1.3394e-01, -5.5605e-01],
[-7.3285e-02, -1.6554e-01, ..., -1.3394e-01, -5.5605e-01],
[-7.3285e-02, -1.6554e-01, ..., -1.3394e-01, -5.5605e-01]])
tensor([[ 0.0547, -0.0061, 0.0495, ..., -0.6638, -0.1281, -0.4950],
[ 0.1447, 0.0225, -0.2909, ..., -0.4472, -0.3420, 0.1798],
[ 0.1981, -0.2040, -0.1533, ..., -0.4514, -0.5664, 0.0596]])
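If you also want to keep these features on the Documents themselves, one option is to assign them back to the selected modality. This is a small sketch, assuming the selector returns references to the underlying sub-documents:

# store the CLIP features on the corresponding sub-documents
da['@.[banner]'].embeddings = image_features.cpu().numpy()
da['@.[title]'].embeddings = text_features.cpu().numpy()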
Embed with CLIP-as-service#
CLIP-as-service is a low-latency, high-scalability service for embedding images and text. You can easily integrate it into neural search solutions as a microservice.
Using CLIP-as-service to process a dataclass object is straightforward. It also illustrates how you can use existing Executors or services without touching their codebase.
Construct the dataclass. Note that banner uses field() with a custom setter and getter, so the image is kept as a URI instead of being loaded into a tensor locally; CLIP-as-service can work with URIs directly:
from docarray import dataclass, field, Document, DocumentArray
from docarray.typing import Text, Image


@dataclass
class MMDoc:
    title: Text
    banner: Image = field(setter=lambda v: Document(uri=v), getter=lambda d: d.uri)
Create multimodal dataclass objects:
m1 = MMDoc(banner='CLIP.png', title='a diagram')
m2 = MMDoc(banner='CLIP.png', title='a dog')
m3 = MMDoc(banner='CLIP.png', title='a cat')
Convert them into a DocumentArray:
da = DocumentArray([Document(m1), Document(m2), Document(m3)])
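To double-check what the conversion produced, you can inspect the nested structure; this is an optional sanity check, not required for the workflow:

da.summary()  # prints an overview of the Documents and their sub-document chunks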
Select each modality via the selector syntax and send it with the client:
from clip_client import Client

c = Client('grpc://demo-cas.jina.ai:51000')

print(c.encode(da['@.[banner]']).embeddings)
print(c.encode(da['@.[title]']).embeddings)
[[ 0.3137 -0.1458 0.303 ... 0.8877 -0.2546 -0.11365]
[ 0.3137 -0.1458 0.303 ... 0.8877 -0.2546 -0.11365]
[ 0.3137 -0.1458 0.303 ... 0.8877 -0.2546 -0.11365]]
[[ 0.05466 -0.005997 0.0498 ... -0.663 -0.1274 -0.4944 ]
[ 0.1442 0.02275 -0.291 ... -0.4468 -0.3416 0.1798 ]
[ 0.1985 -0.204 -0.1534 ... -0.4507 -0.5664 0.0598 ]]
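As a follow-up sketch (plain NumPy, not part of CLIP-as-service itself), you can compare the returned embeddings directly, for example scoring each title against the banner with cosine similarity:

import numpy as np

banner_emb = c.encode(da['@.[banner]']).embeddings
title_emb = c.encode(da['@.[title]']).embeddings

# cosine similarity between every title and the first banner
scores = (title_emb @ banner_emb[0]) / (
    np.linalg.norm(title_emb, axis=1) * np.linalg.norm(banner_emb[0])
)
print(scores)  # the highest score indicates the best-matching title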