docarray.document.generators module#

docarray.document.generators.from_ndarray(array, axis=0, size=None, shuffle=False, *args, **kwargs)[source]#

Create a generator that yields sub-arrays along a given axis of a numpy array.

Parameters:
  • array (ndarray) – the numpy ndarray data source

  • axis (int) – the axis to iterate over

  • size (Optional[int]) – the maximum number of sub-arrays to yield

  • shuffle (bool) – shuffle the numpy data source beforehand

Yield:

documents

Return type:

Generator[Document, None, None]

docarray.document.generators.from_files(patterns, recursive=True, size=None, sampling_rate=None, read_mode=None, to_dataturi=False, exclude_regex=None, *args, **kwargs)[source]#

Create a generator over the file paths matching the given patterns, or over the content of those files.

Parameters:
  • patterns (Union[str, List[str]]) – The pattern may contain simple shell-style wildcards, e.g. ‘*.py’, ‘[*.zip, *.gz]’

  • recursive (bool) – If recursive is true, the pattern ‘**’ will match any files and zero or more directories and subdirectories

  • size (Optional[int]) – the maximum number of files to yield

  • sampling_rate (Optional[float]) – the sampling rate between [0, 1]

  • read_mode (Optional[str]) – specifies the mode in which the file is opened. ‘r’ for reading in text mode, ‘rb’ for reading in binary mode. If read_mode is None, the generator yields file paths instead of file contents.

  • to_dataturi (bool) – if set, Document.uri will be filled with a DataURI instead of the plain URI

  • exclude_regex (Optional[str]) – if set, filenames that match this pattern are excluded.

Yield:

file paths or binary content

Note

This function should not be used directly; use Flow.index_files() or Flow.search_files() instead.

Return type:

Generator[Document, None, None]

docarray.document.generators.from_csv(file, field_resolver=None, size=None, sampling_rate=None, dialect='excel', encoding='utf-8', *args, **kwargs)[source]#

Generator function for CSV. Yields documents.

Parameters:
  • file (Union[str, TextIO]) – file paths or file handler

  • field_resolver (Optional[Dict[str, str]]) – a map from field names defined in the JSON or dict to the field names defined in Document.

  • size (Optional[int]) – the maximum number of documents to yield

  • sampling_rate (Optional[float]) – the sampling rate between [0, 1]

  • dialect (Union[str, Dialect]) – defines a set of parameters specific to a particular CSV dialect. It can be a string naming one of the dialects predefined on your system, or a csv.Dialect class that groups specific formatting parameters together. If you don’t know the dialect and the default one does not work for you, try setting it to ‘auto’.

  • encoding (str) – encoding used to read the CSV file. By default, utf-8 is used.

Yield:

documents

Return type:

Generator[Document, None, None]

docarray.document.generators.from_huggingface_datasets(dataset_path, field_resolver=None, size=None, sampling_rate=None, filter_fields=False, **datasets_kwargs)[source]#

Generator function for Hugging Face Datasets. Yields documents.

This function loads datasets from the Hugging Face Datasets Hub (https://huggingface.co/datasets) in Jina. Additional parameters can be passed to the datasets library via keyword arguments; the load_dataset function from the datasets library is used to load the datasets.

Parameters:
  • dataset_path (str) – a valid dataset path for the Hugging Face Datasets library.

  • field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.

  • size (Optional[int]) – the maximum number of documents to yield

  • sampling_rate (Optional[float]) – the sampling rate between [0, 1]

  • filter_fields (bool) – specifies whether to filter the dataset fields down to those given in the field_resolver argument.

  • **datasets_kwargs

    additional arguments for load_dataset method from Datasets library. More details at https://huggingface.co/docs/datasets/package_reference/loading_methods.html#datasets.load_dataset

Yield:

documents

Return type:

Generator[Document, None, None]

docarray.document.generators.from_ndjson(fp, field_resolver=None, size=None, sampling_rate=None, *args, **kwargs)[source]#

Generator function for line-separated JSON (ndjson). Yields documents.

Parameters:
  • fp (Iterable[str]) – an iterable of JSON strings, one document per line (e.g. an open file object)

  • field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.

  • size (Optional[int]) – the maximum number of documents to yield

  • sampling_rate (Optional[float]) – the sampling rate between [0, 1]

Yield:

documents

Return type:

Generator[Document, None, None]

docarray.document.generators.from_lines(lines=None, filepath=None, read_mode='r', line_format='json', field_resolver=None, size=None, sampling_rate=None)[source]#

Generator function for lines, JSON and CSV. Yields documents or strings.

Parameters:
  • lines (Optional[Iterable[str]]) – a list of strings, each of which is considered a document

  • filepath (Optional[str]) – a text file in which each line contains a document

  • read_mode (str) – specifies the mode in which the file is opened. ‘r’ for reading in text mode, ‘rb’ for reading in binary mode

  • line_format (str) – the format of each line: ‘json’ or ‘csv’

  • field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.

  • size (Optional[int]) – the maximum number of documents to yield

  • sampling_rate (Optional[float]) – the sampling rate between [0, 1]

Yield:

documents

Return type:

Generator[Document, None, None]