docarray.document.mixins.text module
- class docarray.document.mixins.text.TextDataMixin
Bases: object
Provide helper functions for Document to support text data.
- load_uri_to_text(charset='utf-8', **kwargs)
Convert uri to text in place.
- Parameters:
charset (str) – charset may be any character set registered with IANA
kwargs – keyword arguments to pass to _uri_to_blob, such as timeout
- Return type:
T
- Returns:
the Document itself, after processing
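The behaviour described above (fetch the bytes behind a URI, then decode them with the given charset) can be sketched with the standard library alone. This is only an illustration, not docarray's implementation: `read_uri_as_text` is a hypothetical helper, and its `timeout` argument merely stands in for the keyword arguments that the real method forwards.

```python
import tempfile
import urllib.request
from pathlib import Path
from typing import Optional


def read_uri_as_text(uri: str, charset: str = "utf-8",
                     timeout: Optional[float] = None) -> str:
    """Hypothetical helper (not docarray API): fetch bytes from `uri` and decode them."""
    with urllib.request.urlopen(uri, timeout=timeout) as response:
        return response.read().decode(charset)


# Write a small UTF-8 file, then read it back through a file:// URI.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "greeting.txt"
    path.write_text("héllo wörld", encoding="utf-8")
    loaded = read_uri_as_text(path.as_uri())
```

Decoding with the wrong charset would either raise `UnicodeDecodeError` or silently produce different characters, which is why the charset is an explicit parameter.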
- get_vocabulary(text_attrs=('text',))
Get the text vocabulary as a counter dict that maps each word to its frequency across all text_attrs.
- Parameters:
text_attrs (Tuple[str, ...]) – the textual attributes from which the vocabulary is derived
- Return type:
Dict[str, int]
- Returns:
a vocabulary dictionary in which each key is a word and each value is the frequency of that word across all text fields.
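A word-to-frequency mapping of this kind can be sketched with `collections.Counter`. The tokenisation below (lowercasing and splitting on runs of word characters) is an assumption made for the example, and `count_vocabulary` is a hypothetical name; the library's own tokeniser may differ.

```python
import re
from collections import Counter
from typing import Dict, Iterable


def count_vocabulary(texts: Iterable[str]) -> Dict[str, int]:
    """Hypothetical helper: count word frequencies over several text fields."""
    counter: Counter = Counter()
    for text in texts:
        # Assumed tokenisation: lowercase, then take runs of word characters.
        counter.update(re.findall(r"\w+", text.lower()))
    return dict(counter)


# Two text fields of one document, e.g. a title and a body.
vocab_counts = count_vocabulary(["Hello world", "hello again"])
```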
- convert_text_to_tensor(vocab, max_length=None, dtype='int64')
Convert text to tensor in place.
In the end, tensor will be a 1D array whose length is max_length.
To get the vocab of a DocumentArray, you can use jina.types.document.converters.build_vocab.
- Parameters:
vocab (Dict[str, int]) – a dictionary that maps a word to an integer index. 0 is reserved for padding and 1 is reserved for unknown words in text, so you should not include these two entries in vocab.
max_length (Optional[int]) – the maximum length of the sequence. Sequences longer than this are cut off from the beginning. Sequences shorter than this are padded with 0 on the right-hand side.
dtype (str) – the dtype of the generated tensor
- Return type:
T
- Returns:
the Document itself, after processing
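The index conventions documented above (0 for padding, 1 for unknown words, truncation from the beginning, padding on the right) can be sketched in plain Python. `text_to_indices` is a hypothetical helper that returns a list rather than a tensor, and its whitespace tokenisation is an assumption, so treat it as a reading of the parameter descriptions rather than the library's code.

```python
from typing import Dict, List, Optional

PAD_INDEX = 0      # reserved for padding
UNKNOWN_INDEX = 1  # reserved for words missing from the vocabulary


def text_to_indices(text: str, vocab: Dict[str, int],
                    max_length: Optional[int] = None) -> List[int]:
    """Hypothetical helper: map words to indices, then truncate or pad."""
    # Assumed tokenisation: lowercase and split on whitespace.
    sequence = [vocab.get(word, UNKNOWN_INDEX) for word in text.lower().split()]
    if max_length is not None:
        if len(sequence) > max_length:
            # Cut off from the beginning: keep the last max_length entries.
            sequence = sequence[-max_length:]
        else:
            # Pad with 0 on the right-hand side, as documented above.
            sequence = sequence + [PAD_INDEX] * (max_length - len(sequence))
    return sequence


vocab = {"hello": 2, "world": 3}
padded = text_to_indices("hello big world", vocab, max_length=5)
truncated = text_to_indices("hello big world", vocab, max_length=2)
```

Here `padded` is `[2, 1, 3, 0, 0]` and `truncated` is `[1, 3]`: the word "big" is not in the vocabulary, so it maps to 1.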
- convert_tensor_to_text(vocab, delimiter=' ')
Convert tensor to text in place.
- Parameters:
- Return type:
T
- Returns:
the Document itself, after processing
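The reverse direction can be sketched by inverting the vocabulary and joining the recovered words with the delimiter. `indices_to_text` is a hypothetical helper. Skipping padding entries and rendering unknown indices as `<UNK>` are assumptions for this example, not behaviour confirmed by this page.

```python
from typing import Dict, Iterable


def indices_to_text(indices: Iterable[int], vocab: Dict[str, int],
                    delimiter: str = " ") -> str:
    """Hypothetical helper: turn a sequence of vocabulary indices back into text."""
    index_to_word = {index: word for word, index in vocab.items()}
    words = []
    for index in indices:
        index = int(index)
        if index == 0:
            continue  # assumed: padding entries are dropped
        # Assumed placeholder for index 1 and any index not in the vocabulary.
        words.append(index_to_word.get(index, "<UNK>"))
    return delimiter.join(words)


vocab = {"hello": 2, "world": 3}
restored = indices_to_text([2, 1, 3, 0, 0], vocab)
```

Because every unknown word collapses to index 1, the round trip is lossy: the original word "big" cannot be recovered from the index sequence.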
- convert_text_to_datauri(charset='utf-8', base64=False)
- Parameters:
charset (str) – charset may be any character set registered with IANA
base64 (bool) – used to encode arbitrary octet sequences into a form that satisfies the rules of 7bit. Designed to be efficient for non-text 8 bit and binary data. Sometimes used for text data that frequently uses non-US-ASCII characters.
- Return type:
T
- Returns:
the Document itself, after processing
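To show what the charset and base64 options change in the resulting data URI, here is a minimal standard-library sketch. `text_to_datauri` and its `mimetype` parameter are hypothetical; the exact URI layout produced by docarray may differ.

```python
import base64
import urllib.parse


def text_to_datauri(text: str, charset: str = "utf-8",
                    use_base64: bool = False,
                    mimetype: str = "text/plain") -> str:
    """Hypothetical helper: build a data URI that carries `text`."""
    header = f"data:{mimetype};charset={charset}"
    if use_base64:
        # Base64 keeps arbitrary bytes within a 7-bit-safe alphabet.
        payload = base64.b64encode(text.encode(charset)).decode("ascii")
        return f"{header};base64,{payload}"
    # Otherwise percent-encode the characters that are not URI-safe.
    payload = urllib.parse.quote(text, encoding=charset)
    return f"{header},{payload}"


plain_uri = text_to_datauri("hello world")
b64_uri = text_to_datauri("hello world", use_base64=True)
```

Percent-encoding stays readable for mostly-ASCII text, while base64 tends to be more compact when the text contains many non-ASCII characters.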