docarray.document.generators module#

docarray.document.generators.from_ndarray(array, axis=0, size=None, shuffle=False, *args, **kwargs)[source]#

Create a generator that yields sub-arrays along a given axis of a numpy array.

Parameters:
  • array (ndarray) – the numpy ndarray data source

  • axis (int) – the axis to iterate over

  • size (Optional[int]) – the maximum number of sub-arrays to yield

  • shuffle (bool) – shuffle the numpy data source beforehand

Yield:

documents

Return type:

Generator[Document, None, None]

docarray.document.generators.from_files(patterns, recursive=True, size=None, sampling_rate=None, read_mode=None, to_dataturi=False, exclude_regex=None, *args, **kwargs)[source]#

Create a generator over the file paths matching the given patterns, or over the content of those files.

Parameters:
  • patterns (Union[str, List[str]]) – The pattern may contain simple shell-style wildcards, e.g. ‘*.py’, ‘[*.zip, *.gz]’

  • recursive (bool) – If recursive is true, the pattern ‘**’ will match any files and zero or more directories and subdirectories

  • size (Optional[int]) – the maximum number of files to yield

  • sampling_rate (Optional[float]) – the sampling rate between [0, 1]

  • read_mode (Optional[str]) – specifies the mode in which the file is opened. ‘r’ for reading in text mode, ‘rb’ for reading in binary mode. If read_mode is None, the generator yields file paths instead of file contents.

  • to_dataturi (bool) – if set, Document.uri will be filled with a DataURI instead of the plain URI

  • exclude_regex (Optional[str]) – if set, filenames that match this pattern are excluded.

Yield:

file paths or binary content

Note

This function should not be used directly; use Flow.index_files() or Flow.search_files() instead.

Return type:

Generator[Document, None, None]

docarray.document.generators.from_csv(file, field_resolver=None, size=None, sampling_rate=None, dialect='excel', encoding='utf-8', *args, **kwargs)[source]#

Generator function for CSV. Yields documents.

Parameters:
  • file (Union[str, TextIO]) – file paths or file handler

  • field_resolver (Optional[Dict[str, str]]) – a map from field names defined in the JSON or dict to the field names defined in Document.

  • size (Optional[int]) – the maximum number of documents to yield

  • sampling_rate (Optional[float]) – the sampling rate between [0, 1]

  • dialect (Union[str, Dialect]) – defines a set of parameters specific to a particular CSV dialect. It can be a string naming one of the dialects predefined on your system, or a csv.Dialect class that groups specific formatting parameters together. If you don’t know the dialect and the default one does not work for you, try setting it to ‘auto’.

  • encoding (str) – encoding used to read the CSV file. By default, utf-8 is used.

Yield:

documents

Return type:

Generator[Document, None, None]

docarray.document.generators.from_huggingface_datasets(dataset_path, field_resolver=None, size=None, sampling_rate=None, filter_fields=False, **datasets_kwargs)[source]#

Generator function for Hugging Face Datasets. Yields documents.

This function loads datasets from the Hugging Face Datasets Hub (https://huggingface.co/datasets) in Jina. Additional parameters can be passed to the datasets library via keyword arguments; the load_dataset function from the datasets library is used to load the datasets.

Parameters:
  • dataset_path (str) – a valid dataset path for the Hugging Face Datasets library.

  • field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.

  • size (Optional[int]) – the maximum number of documents to yield

  • sampling_rate (Optional[float]) – the sampling rate between [0, 1]

  • filter_fields (bool) – specifies whether to filter the dataset fields down to those given in the field_resolver argument.

  • **datasets_kwargs

    additional arguments for load_dataset method from Datasets library. More details at https://huggingface.co/docs/datasets/package_reference/loading_methods.html#datasets.load_dataset

Yield:

documents

Return type:

Generator[Document, None, None]

docarray.document.generators.from_ndjson(fp, field_resolver=None, size=None, sampling_rate=None, *args, **kwargs)[source]#

Generator function for line-separated JSON (ndjson). Yields documents.

Parameters:
  • fp (Iterable[str]) – an iterable of JSON strings, one document per line (e.g. an open file object)

  • field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.

  • size (Optional[int]) – the maximum number of documents to yield

  • sampling_rate (Optional[float]) – the sampling rate between [0, 1]

Yield:

documents

Return type:

Generator[Document, None, None]

docarray.document.generators.from_lines(lines=None, filepath=None, read_mode='r', line_format='json', field_resolver=None, size=None, sampling_rate=None)[source]#

Generator function for lines, JSON and CSV. Yields documents or strings.

Parameters:
  • lines (Optional[Iterable[str]]) – a list of strings, each of which is considered a document

  • filepath (Optional[str]) – a text file in which each line contains a document

  • read_mode (str) – specifies the mode in which the file is opened. ‘r’ for reading in text mode, ‘rb’ for reading in binary mode

  • line_format (str) – the format of each line: ‘json’ or ‘csv’

  • field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.

  • size (Optional[int]) – the maximum number of documents to yield

  • sampling_rate (Optional[float]) – the sampling rate between [0, 1]

Yield:

documents

Return type:

Generator[Document, None, None]