# Text

Representing text in DocArray is as easy as:

```python
from docarray import Document

Document(text='hello, world.')
```
If your text data is large and can't be written inline, or comes from a URI, you can also define the `uri` first and load the text into the Document later:
```python
from docarray import Document

d = Document(uri='https://www.w3.org/History/19921103-hypertext/hypertext/README.html')
d.load_uri_to_text()
d.summary()
```

```text
<Document ('id', 'mime_type', 'text', 'uri') at 3c128f326fbf11ec90821e008a366d49>
```
And of course, you can use characters from different languages:

```python
from docarray import Document

d = Document(text='👋 नमस्ते दुनिया! 你好世界!こんにちは世界! Привет мир!')
```
## Segment long Documents
Oftentimes when you index or search textual Documents, you don't want to consider thousands of words as one huge Document – some finer granularity would be nice. You can achieve this by leveraging the Document's `.chunks`. For example, let's split this simple Document at each `!` mark:
```python
from docarray import Document

d = Document(text='👋 नमस्ते दुनिया! 你好世界!こんにちは世界! Привет мир!')
d.chunks.extend([Document(text=c) for c in d.text.split('!')])
d.summary()
```

```text
<Document ('id', 'mime_type', 'text', 'chunks') at 5a12d7a86fbf11ec99a21e008a366d49>
└─ chunks
   ├─ <Document ('id', 'mime_type', 'text') at 5a12e2346fbf11ec99a21e008a366d49>
   ├─ <Document ('id', 'mime_type', 'text') at 5a12e2f26fbf11ec99a21e008a366d49>
   ├─ <Document ('id', 'mime_type', 'text') at 5a12e3886fbf11ec99a21e008a366d49>
   ├─ <Document ('id', 'mime_type', 'text') at 5a12e41e6fbf11ec99a21e008a366d49>
   └─ <Document ('id',) at 5a12e4966fbf11ec99a21e008a366d49>
```
This creates five sub-Documents and stores them under the original Document's `.chunks`.
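Since `.chunks` is itself a DocumentArray, you can read the chunk texts back in bulk, for example with the same attribute selector syntax used later in this chapter:

```python
# read all chunk texts at once via the attribute selector
print(d.chunks[:, 'text'])
# e.g. ['👋 नमस्ते दुनिया', ' 你好世界', 'こんにちは世界', ' Привет мир', '']
```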
## Convert text to `ndarray`
Sometimes you need to encode the text into a `numpy.ndarray` before further computation. We provide some helper functions in Document and DocumentArray that allow you to do that easily. For example, we have a DocumentArray with three Documents:
```python
from docarray import DocumentArray, Document

da = DocumentArray(
    [
        Document(text='hello world'),
        Document(text='goodbye world'),
        Document(text='hello goodbye'),
    ]
)
```

To get the vocabulary, you can use:

```python
vocab = da.get_vocabulary()
```

```text
{'hello': 2, 'world': 3, 'goodbye': 4}
```
The vocabulary is 2-indexed, as `0` is reserved for the padding symbol and `1` for the unknown symbol.
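To see the unknown symbol in action, you can convert a text that contains an out-of-vocabulary word (a small sketch; it assumes that words missing from `vocab` map to `1`, as described above):

```python
d = Document(text='hello stranger')
d.convert_text_to_tensor(vocab)
# 'stranger' is not in the vocabulary, so it should map to the
# unknown symbol 1, giving something like [2 1]
print(d.tensor)
```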
You can further use this vocabulary to convert the `.text` field into `.tensor`:
```python
for d in da:
    d.convert_text_to_tensor(vocab)
    print(d.tensor)
```

```text
[2 3]
[4 3]
[2 4]
```
When you have texts of different lengths and want the output `.tensor`s to have the same length, you can define `max_length` during conversion:
```python
from docarray import Document, DocumentArray

da = DocumentArray(
    [
        Document(text='a short phrase'),
        Document(text='word'),
        Document(text='this is a much longer sentence'),
    ]
)

vocab = da.get_vocabulary()
for d in da:
    d.convert_text_to_tensor(vocab, max_length=10)
    print(d.tensor)
```

```text
[0 0 0 0 0 0 0 2 3 4]
[0 0 0 0 0 0 0 0 0 5]
[ 0  0  0  0  6  7  2  8  9 10]
```
You can also use a DocumentArray's `.tensors` to get all tensors in one `ndarray`:
```python
print(da.tensors)
```

```text
[[ 0  0  0  0  0  0  0  2  3  4]
 [ 0  0  0  0  0  0  0  0  0  5]
 [ 0  0  0  0  6  7  2  8  9 10]]
```
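There is a matching bulk accessor for the text itself; assuming the `.texts` property, you can grab all `.text` fields in one list:

```python
print(da.texts)
# ['a short phrase', 'word', 'this is a much longer sentence']
```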
## Convert `ndarray` back to text
As a bonus, you can also easily convert an integer `ndarray` back to text based on a given vocabulary. This is often termed "decoding".
```python
from docarray import Document, DocumentArray

da = DocumentArray(
    [
        Document(text='a short phrase'),
        Document(text='word'),
        Document(text='this is a much longer sentence'),
    ]
)

vocab = da.get_vocabulary()

# encoding
for d in da:
    d.convert_text_to_tensor(vocab, max_length=10)

# decoding
for d in da:
    d.convert_tensor_to_text(vocab)
    print(d.text)
```

```text
a short phrase
word
this is a much longer sentence
```
## Simple text matching with feature hashing
Let's search for `"she entered the room"` in *Pride and Prejudice*:
> **SciPy**
>
> The example below uses SciPy to speed up computation. To install SciPy, run `pip install scipy`, or install it together with other optional dependencies using `pip install "docarray[full]"`. Alternatively, you can run the example below without SciPy by setting `use_scipy=False` in the `.match()` method.
```python
from docarray import Document, DocumentArray

d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').load_uri_to_text()
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())
da.apply(lambda d: d.embed_feature_hashing())

q = (
    Document(text='she entered the room')
    .embed_feature_hashing()
    .match(da, limit=5, exclude_self=True, metric='jaccard', use_scipy=True)
)

print(q.matches[:, ('text', 'scores__jaccard')])
```
```text
[['staircase, than she entered the breakfast-room, and congratulated',
  'of the room.',
  'She entered the room with an air more than usually ungracious,',
  'entered the breakfast-room, where Mrs. Bennet was alone, than she',
  'those in the room.'],
 [{'value': 0.6, 'ref_id': 'f47f7448709811ec960a1e008a366d49'},
  {'value': 0.6666666666666666, 'ref_id': 'f47f7448709811ec960a1e008a366d49'},
  {'value': 0.6666666666666666, 'ref_id': 'f47f7448709811ec960a1e008a366d49'},
  {'value': 0.6666666666666666, 'ref_id': 'f47f7448709811ec960a1e008a366d49'},
  {'value': 0.7142857142857143, 'ref_id': 'f47f7448709811ec960a1e008a366d49'}]]
```
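For intuition: feature hashing turns any text into a fixed-size vector by hashing each token into one of a fixed number of buckets, so no vocabulary has to be stored. The following is a minimal sketch of that idea; it is not DocArray's actual `embed_feature_hashing()` implementation, and `toy_feature_hash` is a hypothetical helper:

```python
import re

import numpy as np


def toy_feature_hash(text: str, n_dim: int = 256) -> np.ndarray:
    """Toy feature hashing: each token increments the bucket its hash
    falls into. Note: Python's built-in hash() for strings is randomized
    per process unless PYTHONHASHSEED is set."""
    v = np.zeros(n_dim)
    for token in re.findall(r'\w+', text.lower()):
        v[hash(token) % n_dim] += 1
    return v
```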
## Searching at chunk level with subindex
You can create applications that search at the chunk level using a subindex. Imagine you want an application that searches at sentence granularity and returns the title of the Document containing the sentence closest to the query. For example, you have a database of song lyrics and want to find a song's title based on a small part of its lyrics that you remember (like the chorus).
> **Multi-modal Documents**
>
> Modelling nested Documents is often more convenient using DocArray's dataclass API, especially when multiple modalities are involved. You can find the corresponding example here.
```python
song1_title = 'Old MacDonald Had a Farm'
song1 = """
Old MacDonald had a farm, E-I-E-I-O
And on that farm he had some dogs, E-I-E-I-O
With a bow-wow here, and a bow-wow there,
Here a bow, there a bow, everywhere a bow-wow.
"""

song2_title = 'Ode an die Freude'
song2 = """
Freude, schöner Götterfunken,
Tochter aus Elisium,
Wir betreten feuertrunken
Himmlische, dein Heiligthum.
Deine Zauber binden wieder,
was der Mode Schwerd getheilt;
Bettler werden Fürstenbrüder,
wo dein sanfter Flügel weilt.
"""
```
We can create one Document for each song, containing the song’s lines as chunks:
```python
from docarray import Document, DocumentArray

doc1 = Document(
    chunks=[Document(text=line) for line in song1.split('\n')], song_title=song1_title
)
doc2 = Document(
    chunks=[Document(text=line) for line in song2.split('\n')], song_title=song2_title
)

da = DocumentArray()
da.extend([doc1, doc2])
```
Now we can build a feature vector for each line of each song. Here we use a very simple Bag of Words descriptor as the feature vector.
```python
import re

import numpy as np


def build_tokenizer(token_pattern=r"(?u)\b\w\w+\b"):
    token_pattern = re.compile(token_pattern)
    return token_pattern.findall


def bow_feature_vector(d, vocab, tokenizer):
    embedding = np.zeros(len(vocab) + 2)
    tokens = tokenizer(d.text)
    for token in tokens:
        if token in vocab:
            embedding[vocab.get(token)] += 1
    return embedding


tokenizer = build_tokenizer()
vocab = da['@c'].get_vocabulary()
for d in da['@c']:
    d.embedding = bow_feature_vector(d, vocab, tokenizer)
```
Once we’ve prepared the data, we can store it in a DocumentArray that supports a subindex:
```python
n_features = len(vocab) + 2
n_dim = 3

da_backend = DocumentArray(
    storage='annlite',
    config={
        'data_path': './annlite_data',
        'n_dim': n_dim,
        'metric': 'Cosine',
    },
    subindex_configs={'@c': {'n_dim': n_features}},
)

with da_backend:
    da_backend.extend(da)
```
Given a snippet of lyrics as a query, we want to find the song that contains the most similar line. For example, let's query with `farm`:
```python
def find_song_name_from_song_snippet(query: Document, da_backend) -> dict:
    similar_items = da_backend.find(query=query, on='@c', limit=10)[0]
    most_similar_doc = similar_items[0]
    return da_backend[most_similar_doc.parent_id].tags


query = Document(text='farm')
query.embedding = bow_feature_vector(query, vocab, tokenizer)

similar_items = find_song_name_from_song_snippet(query, da_backend)
print(similar_items)
```
This prints:

```text
{'song_title': 'Old MacDonald Had a Farm'}
```

The `song_title` shows up in `.tags` because the extra keyword argument passed to the `Document` constructor above was stored as a tag.