# Text

Representing text in DocArray is as easy as:

```python
from docarray import Document

Document(text='hello, world.')
```
If your text data is large and can't be written inline, or comes from a URI, you can also define the `uri` first and load the text into the Document later:
```python
from docarray import Document

d = Document(uri='https://www.w3.org/History/19921103-hypertext/hypertext/README.html')
d.load_uri_to_text()
d.summary()
```

```text
<Document ('id', 'mime_type', 'text', 'uri') at 3c128f326fbf11ec90821e008a366d49>
```
And of course, you can use characters from different languages:

```python
from docarray import Document

d = Document(text='👋 नमस्ते दुनिया! 你好世界!こんにちは世界! Привет мир!')
```
## Segment long Documents
Oftentimes when you index or search textual Documents, you don't want to consider thousands of words as one huge Document – some finer granularity would be nice. You can achieve this by leveraging the Document's `.chunks`. For example, let's split this simple Document at each `!` mark:
```python
from docarray import Document

d = Document(text='👋 नमस्ते दुनिया! 你好世界!こんにちは世界! Привет мир!')
d.chunks.extend([Document(text=c) for c in d.text.split('!')])
d.summary()
```

```text
<Document ('id', 'mime_type', 'text', 'chunks') at 5a12d7a86fbf11ec99a21e008a366d49>
└─ chunks
   ├─ <Document ('id', 'mime_type', 'text') at 5a12e2346fbf11ec99a21e008a366d49>
   ├─ <Document ('id', 'mime_type', 'text') at 5a12e2f26fbf11ec99a21e008a366d49>
   ├─ <Document ('id', 'mime_type', 'text') at 5a12e3886fbf11ec99a21e008a366d49>
   ├─ <Document ('id', 'mime_type', 'text') at 5a12e41e6fbf11ec99a21e008a366d49>
   └─ <Document ('id',) at 5a12e4966fbf11ec99a21e008a366d49>
```
This creates five sub-Documents and stores them under the original Document's `.chunks`.
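Since `.chunks` is itself a DocumentArray, you can read the chunk texts back in bulk, for example with the same attribute selector syntax used later in this chapter:

```python
# read all chunk texts at once via the attribute selector
print(d.chunks[:, 'text'])
# e.g. ['👋 नमस्ते दुनिया', ' 你好世界', 'こんにちは世界', ' Привет мир', '']
```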
## Convert text to `ndarray`
Sometimes you need to encode the text into a `numpy.ndarray` before further computation. We provide some helper functions in Document and DocumentArray that allow you to do that easily. For example, we have a DocumentArray with three Documents:
```python
from docarray import DocumentArray, Document

da = DocumentArray(
    [
        Document(text='hello world'),
        Document(text='goodbye world'),
        Document(text='hello goodbye'),
    ]
)
```

To get the vocabulary, you can use:

```python
vocab = da.get_vocabulary()
```

```text
{'hello': 2, 'world': 3, 'goodbye': 4}
```
The vocabulary is 2-indexed, as `0` is reserved for the padding symbol and `1` for the unknown symbol.
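To see the unknown symbol in action, you can convert a text that contains an out-of-vocabulary word (a small sketch; it assumes that words missing from `vocab` map to `1`, as described above):

```python
d = Document(text='hello stranger')
d.convert_text_to_tensor(vocab)
# 'stranger' is not in the vocabulary, so it should map to the
# unknown symbol 1, giving something like [2 1]
print(d.tensor)
```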
You can further use this vocabulary to convert the `.text` field into `.tensor`:
```python
for d in da:
    d.convert_text_to_tensor(vocab)
    print(d.tensor)
```

```text
[2 3]
[4 3]
[2 4]
```
When you have texts of different lengths and want the output `.tensor`s to have the same length, you can define `max_length` during conversion:
```python
from docarray import Document, DocumentArray

da = DocumentArray(
    [
        Document(text='a short phrase'),
        Document(text='word'),
        Document(text='this is a much longer sentence'),
    ]
)

vocab = da.get_vocabulary()
for d in da:
    d.convert_text_to_tensor(vocab, max_length=10)
    print(d.tensor)
```

```text
[0 0 0 0 0 0 0 2 3 4]
[0 0 0 0 0 0 0 0 0 5]
[ 0  0  0  0  6  7  2  8  9 10]
```
You can also use a DocumentArray's `.tensors` to get all tensors in one `ndarray`:
```python
print(da.tensors)
```

```text
[[ 0  0  0  0  0  0  0  2  3  4]
 [ 0  0  0  0  0  0  0  0  0  5]
 [ 0  0  0  0  6  7  2  8  9 10]]
```
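There is a matching bulk accessor for the text itself; assuming the `.texts` property, you can grab all `.text` fields in one list:

```python
print(da.texts)
# ['a short phrase', 'word', 'this is a much longer sentence']
```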
## Convert `ndarray` back to text
As a bonus, you can also easily convert an integer `ndarray` back to text based on a given vocabulary. This is often termed "decoding".
```python
from docarray import Document, DocumentArray

da = DocumentArray(
    [
        Document(text='a short phrase'),
        Document(text='word'),
        Document(text='this is a much longer sentence'),
    ]
)

vocab = da.get_vocabulary()

# encoding
for d in da:
    d.convert_text_to_tensor(vocab, max_length=10)

# decoding
for d in da:
    d.convert_tensor_to_text(vocab)
    print(d.text)
```

```text
a short phrase
word
this is a much longer sentence
```
## Simple text matching with feature hashing
Let's search for `"she entered the room"` in *Pride and Prejudice*:
> **SciPy**
>
> The example below uses SciPy to speed up computation. To install SciPy, run `pip install scipy`, or install it together with other optional dependencies using `pip install "docarray[full]"`. Alternatively, you can run the example below without SciPy by setting `use_scipy=False` in the `.match()` method.
```python
from docarray import Document, DocumentArray

d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').load_uri_to_text()
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())
da.apply(lambda d: d.embed_feature_hashing())

q = (
    Document(text='she entered the room')
    .embed_feature_hashing()
    .match(da, limit=5, exclude_self=True, metric='jaccard', use_scipy=True)
)

print(q.matches[:, ('text', 'scores__jaccard')])
```
```text
[['staircase, than she entered the breakfast-room, and congratulated',
  'of the room.',
  'She entered the room with an air more than usually ungracious,',
  'entered the breakfast-room, where Mrs. Bennet was alone, than she',
  'those in the room.'],
 [{'value': 0.6, 'ref_id': 'f47f7448709811ec960a1e008a366d49'},
  {'value': 0.6666666666666666, 'ref_id': 'f47f7448709811ec960a1e008a366d49'},
  {'value': 0.6666666666666666, 'ref_id': 'f47f7448709811ec960a1e008a366d49'},
  {'value': 0.6666666666666666, 'ref_id': 'f47f7448709811ec960a1e008a366d49'},
  {'value': 0.7142857142857143, 'ref_id': 'f47f7448709811ec960a1e008a366d49'}]]
```
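For intuition: feature hashing turns any text into a fixed-size vector by hashing each token into one of a fixed number of buckets, so no vocabulary has to be stored. The following is a minimal sketch of that idea; it is not DocArray's actual `embed_feature_hashing()` implementation, and `toy_feature_hash` is a hypothetical helper:

```python
import re

import numpy as np


def toy_feature_hash(text: str, n_dim: int = 256) -> np.ndarray:
    """Toy feature hashing: each token increments the bucket its hash
    falls into. Note: Python's built-in hash() for strings is randomized
    per process unless PYTHONHASHSEED is set."""
    v = np.zeros(n_dim)
    for token in re.findall(r'\w+', text.lower()):
        v[hash(token) % n_dim] += 1
    return v
```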
## Searching at chunk level with subindex
You can create applications that search at the chunk level using a subindex. Imagine you want an application that searches at sentence granularity and returns the title of the Document containing the sentence closest to the query. For example, you have a database of song lyrics and want to find a song's title based on a small part of its lyrics that you remember (like the chorus).
> **Multi-modal Documents**
>
> Modelling nested Documents is often more convenient using DocArray's dataclass API, especially when multiple modalities are involved. You can find the corresponding example here.
```python
song1_title = 'Old MacDonald Had a Farm'
song1 = """
Old MacDonald had a farm, E-I-E-I-O
And on that farm he had some dogs, E-I-E-I-O
With a bow-wow here, and a bow-wow there,
Here a bow, there a bow, everywhere a bow-wow.
"""

song2_title = 'Ode an die Freude'
song2 = """
Freude, schöner Götterfunken,
Tochter aus Elisium,
Wir betreten feuertrunken
Himmlische, dein Heiligthum.
Deine Zauber binden wieder,
was der Mode Schwerd getheilt;
Bettler werden Fürstenbrüder,
wo dein sanfter Flügel weilt.
"""
```
We can create one Document for each song, containing the song’s lines as chunks:
```python
from docarray import Document, DocumentArray

doc1 = Document(
    chunks=[Document(text=line) for line in song1.split('\n')], song_title=song1_title
)
doc2 = Document(
    chunks=[Document(text=line) for line in song2.split('\n')], song_title=song2_title
)

da = DocumentArray()
da.extend([doc1, doc2])
```
Now we can build a feature vector for each line of each song. Here we use a very simple Bag of Words descriptor as the feature vector.
```python
import re

import numpy as np


def build_tokenizer(token_pattern=r"(?u)\b\w\w+\b"):
    token_pattern = re.compile(token_pattern)
    return token_pattern.findall


def bow_feature_vector(d, vocab, tokenizer):
    embedding = np.zeros(len(vocab) + 2)
    tokens = tokenizer(d.text)
    for token in tokens:
        if token in vocab:
            embedding[vocab.get(token)] += 1
    return embedding


tokenizer = build_tokenizer()
vocab = da['@c'].get_vocabulary()
for d in da['@c']:
    d.embedding = bow_feature_vector(d, vocab, tokenizer)
```
Once we’ve prepared the data, we can store it in a DocumentArray that supports a subindex:
```python
n_features = len(vocab) + 2
n_dim = 3

da_backend = DocumentArray(
    storage='annlite',
    config={
        'data_path': './annlite_data',
        'n_dim': n_dim,
        'metric': 'Cosine',
    },
    subindex_configs={'@c': {'n_dim': n_features}},
)

with da_backend:
    da_backend.extend(da)
```
Given a snippet of lyrics as a query, we want to find the song that contains the most similar line. For example, let's query with `farm`:
```python
def find_song_name_from_song_snippet(query: Document, da_backend) -> dict:
    similar_items = da_backend.find(query=query, on='@c', limit=10)[0]
    most_similar_doc = similar_items[0]
    return da_backend[most_similar_doc.parent_id].tags


query = Document(text='farm')
query.embedding = bow_feature_vector(query, vocab, tokenizer)

similar_items = find_song_name_from_song_snippet(query, da_backend)
print(similar_items)
```
This prints:

```text
{'song_title': 'Old MacDonald Had a Farm'}
```

The `song_title` shows up in `.tags` because the extra keyword argument passed to the `Document` constructor above was stored as a tag.