docarray.array.mixins.text module

class docarray.array.mixins.text.TextToolsMixin

Bases: object

Helper functions used in NLP for DA and DAM.

get_vocabulary(min_freq=1, text_attrs=('text',))

Build the text vocabulary from all Documents as a dict that maps each word to its index.

Parameters:
  • text_attrs (Tuple[str, ...]) – the textual attributes from which the vocabulary is derived.

  • min_freq (int) – the minimum frequency a word must have to be included in the vocabulary.

Return type:

Dict[str, int]

Returns:

A dictionary where each key is a word and each value is its index. Indices start at 2: 0 is reserved for padding and 1 is reserved for the unknown token.
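
The index convention described above can be illustrated with a minimal pure-Python sketch. It is not docarray's implementation: `build_vocabulary` and `encode` are hypothetical helpers, and lowercased whitespace tokenization and first-seen ordering are assumptions made only for this example.

```python
from collections import Counter
from typing import Dict, Iterable, List

PAD_INDEX = 0  # reserved for padding
UNK_INDEX = 1  # reserved for unknown tokens


def build_vocabulary(texts: Iterable[str], min_freq: int = 1) -> Dict[str, int]:
    """Map each sufficiently frequent word to an index starting at 2."""
    counts = Counter()
    for text in texts:
        # Assumption: lowercase and split on whitespace.
        counts.update(text.lower().split())
    # Counter keeps first-seen order, so indices follow first appearance.
    kept = (word for word, freq in counts.items() if freq >= min_freq)
    return {word: idx for idx, word in enumerate(kept, start=2)}


def encode(text: str, vocab: Dict[str, int], length: int) -> List[int]:
    """Convert text to indices, truncating or padding to a fixed length."""
    ids = [vocab.get(word, UNK_INDEX) for word in text.lower().split()]
    ids = ids[:length]
    return ids + [PAD_INDEX] * (length - len(ids))


texts = ["hello world", "hello docarray"]
vocab = build_vocabulary(texts, min_freq=1)
print(vocab)
print(encode("hello there", vocab, length=4))
```

Because 0 and 1 never appear as word indices, a downstream model can tell padding and out-of-vocabulary positions apart from real words.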