docarray.document.generators module#
- docarray.document.generators.from_ndarray(array, axis=0, size=None, shuffle=False, *args, **kwargs)[source]#
Create a generator for a given dimension of a numpy array.
- Parameters:
  - array (ndarray) – the numpy ndarray data source
  - axis (int) – iterate over that axis
  - size (Optional[int]) – the maximum number of the sub-arrays
  - shuffle (bool) – shuffle the numpy data source beforehand
- Yield:
documents
- Return type:
Generator[Document,None,None]
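A minimal usage sketch (not part of the original docstring): it slices a 2-D array along axis 0 and collects the yielded documents into a DocumentArray, assuming the usual docarray pattern of constructing a DocumentArray from any iterable of Documents.

```python
import numpy as np

from docarray import DocumentArray
from docarray.document.generators import from_ndarray

# a (100, 128) array: iterating over axis 0 yields 100 sub-arrays of shape (128,)
data = np.random.random([100, 128])

# `size` caps the number of yielded documents; `shuffle` randomizes the source first
da = DocumentArray(from_ndarray(data, axis=0, size=10, shuffle=True))
print(len(da))  # -> 10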
- docarray.document.generators.from_files(patterns, recursive=True, size=None, sampling_rate=None, read_mode=None, to_dataturi=False, exclude_regex=None, *args, **kwargs)[source]#
Creates an iterator over a list of file paths or the content of the files.
- Parameters:
  - patterns (Union[str, List[str]]) – the pattern may contain simple shell-style wildcards, e.g. '*.py', '[*.zip, *.gz]'
  - recursive (bool) – if recursive is true, the pattern '**' will match any files and zero or more directories and subdirectories
  - size (Optional[int]) – the maximum number of the files
  - sampling_rate (Optional[float]) – the sampling rate between [0, 1]
  - read_mode (Optional[str]) – specifies the mode in which the file is opened. 'r' for reading in text mode, 'rb' for reading in binary mode. If read_mode is None, will iterate over filenames.
  - to_dataturi (bool) – if set, then Document.uri will be filled with a DataURI instead of the plain URI
  - exclude_regex (Optional[str]) – if set, then filenames that match this pattern are not included
- Yield:
file paths or binary content
Note
This function should not be used directly; use Flow.index_files() or Flow.search_files() instead.
- Return type:
Generator[Document,None,None]
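A small sketch of standalone use (the Flow helpers mentioned in the note live in Jina, not in this module). It assumes, as the to_dataturi description implies, that each yielded Document records the matched file location in Document.uri when read_mode is None; the glob and exclude patterns are illustrative.

```python
from docarray import DocumentArray
from docarray.document.generators import from_files

# match Python files under the current directory and its subdirectories
# (recursive=True by default); keep at most the first 5 matches
da = DocumentArray(from_files('**/*.py', size=5, exclude_regex=r'.*_test\.py'))

for doc in da:
    print(doc.uri)  # with read_mode=None, only the file location is recorded
```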
- docarray.document.generators.from_csv(file, field_resolver=None, size=None, sampling_rate=None, dialect='excel', encoding='utf-8', *args, **kwargs)[source]#
Generator function for CSV. Yields documents.
- Parameters:
  - file (Union[str, TextIO]) – a file path or file handle
  - field_resolver (Optional[Dict[str, str]]) – a map from the field names defined in the data (JSON, dict) to the field names defined in Document
  - size (Optional[int]) – the maximum number of the documents
  - sampling_rate (Optional[float]) – the sampling rate between [0, 1]
  - dialect (Union[str, Dialect]) – defines a set of parameters specific to a particular CSV dialect. It can be a string naming a dialect predefined on your system, or a csv.Dialect class that groups specific formatting parameters together. If you don't know the dialect and the default one does not work for you, you can try setting it to auto.
  - encoding (str) – encoding used to read the CSV file. By default, utf-8 is used.
- Yield:
documents
- Return type:
Generator[Document,None,None]
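A minimal sketch (not from the original docs), assuming the CSV's header row supplies the field names and that field_resolver can map a header column onto a Document field such as text; the column name sentence is purely illustrative.

```python
import io

from docarray import DocumentArray
from docarray.document.generators import from_csv

# an in-memory CSV with a header row; a plain file path works as well
csv_file = io.StringIO('sentence\nhello world\nneural search is fun\n')

# map the illustrative CSV column `sentence` onto the Document field `text`
da = DocumentArray(from_csv(csv_file, field_resolver={'sentence': 'text'}))
print(da[0].text)  # -> 'hello world'
```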
- docarray.document.generators.from_huggingface_datasets(dataset_path, field_resolver=None, size=None, sampling_rate=None, filter_fields=False, **datasets_kwargs)[source]#
Generator function for Hugging Face Datasets. Yields documents.
This function helps to load datasets from the Hugging Face Datasets Hub (https://huggingface.co/datasets) in Jina. Additional parameters can be passed to the datasets library using keyword arguments. The load_dataset method from the datasets library is used to load the datasets.
- Parameters:
  - dataset_path (str) – a valid dataset path for the Hugging Face Datasets library
  - field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.
  - size (Optional[int]) – the maximum number of the documents
  - sampling_rate (Optional[float]) – the sampling rate between [0, 1]
  - filter_fields (bool) – specifies whether to filter the dataset with the fields given in the field_resolver argument
  - **datasets_kwargs – additional arguments for the load_dataset method from the Datasets library. More details at https://huggingface.co/docs/datasets/package_reference/loading_methods.html#datasets.load_dataset
- Yield:
documents
- Return type:
Generator[Document,None,None]
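A hedged example; it requires the datasets package and network access, and the dataset name 'squad', the question-to-text mapping, and the split keyword are illustrative choices (split is simply forwarded to datasets.load_dataset).

```python
from docarray import DocumentArray
from docarray.document.generators import from_huggingface_datasets

da = DocumentArray(
    from_huggingface_datasets(
        'squad',                              # illustrative dataset path on the Hub
        field_resolver={'question': 'text'},  # map the dataset's `question` field to Document.text
        size=100,                             # stop after 100 documents
        filter_fields=True,                   # drop fields not listed in field_resolver
        split='train',                        # forwarded to datasets.load_dataset
    )
)
print(len(da), da[0].text)
```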
- docarray.document.generators.from_ndjson(fp, field_resolver=None, size=None, sampling_rate=None, *args, **kwargs)[source]#
Generator function for line-separated JSON. Yields documents.
- Parameters:
  - fp (Iterable[str]) – file paths
  - field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.
  - size (Optional[int]) – the maximum number of the documents
  - sampling_rate (Optional[float]) – the sampling rate between [0, 1]
- Yield:
documents
- Return type:
Generator[Document,None,None]
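A small sketch, assuming (as the Iterable[str] annotation suggests) that fp accepts any iterable of newline-delimited JSON strings, such as an open file handle or a plain list of lines.

```python
from docarray import DocumentArray
from docarray.document.generators import from_ndjson

# each element plays the role of one line in an .ndjson file;
# an open file handle, which iterates over its lines, works the same way
lines = [
    '{"text": "hello world"}',
    '{"text": "neural search"}',
]

da = DocumentArray(from_ndjson(lines))
print(da[0].text)  # -> 'hello world'
```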
- docarray.document.generators.from_lines(lines=None, filepath=None, read_mode='r', line_format='json', field_resolver=None, size=None, sampling_rate=None)[source]#
Generator function for lines, json and csv. Yields documents or strings.
- Parameters:
  - lines (Optional[Iterable[str]]) – a list of strings, each of which is considered a document
  - filepath (Optional[str]) – a text file in which each line contains a document
  - read_mode (str) – specifies the mode in which the file is opened. 'r' for reading in text mode, 'rb' for reading in binary mode
  - line_format (str) – the format of each line: json or csv
  - field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.
  - size (Optional[int]) – the maximum number of the documents
  - sampling_rate (Optional[float]) – the sampling rate between [0, 1]
- Yield:
documents
- Return type:
Generator[Document,None,None]
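A closing sketch of the in-memory input mode (illustrative only; one would normally pass either lines or filepath, and the filepath value shown in the comment is hypothetical).

```python
from docarray import DocumentArray
from docarray.document.generators import from_lines

# in-memory JSON lines; alternatively pass e.g. `filepath='docs.ndjson'` to read from disk
json_lines = ['{"text": "hello world"}', '{"text": "neural search"}']

da = DocumentArray(from_lines(lines=json_lines, line_format='json'))
print(da[0].text)  # -> 'hello world'

# the same entry point handles CSV input when line_format='csv'
```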