docarray.document.generators module

- docarray.document.generators.from_ndarray(array, axis=0, size=None, shuffle=False, *args, **kwargs)
Create a generator for a given dimension of a numpy array.
- Parameters:
  - array (ndarray) – the numpy ndarray data source
  - axis (int) – iterate over that axis
  - size (Optional[int]) – the maximum number of sub-arrays to yield
  - shuffle (bool) – shuffle the numpy data source beforehand
- Yield:
  documents
- Return type:
  Generator[Document, None, None]
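
A minimal usage sketch, where the array shape and the size/shuffle values are illustrative rather than prescribed by the API:

```python
import numpy as np

from docarray.document.generators import from_ndarray

# a hypothetical data source: 100 rows of 5-dimensional vectors
data = np.random.random([100, 5])

# yield at most 10 Documents, one per row along axis 0, in shuffled order
for doc in from_ndarray(data, axis=0, size=10, shuffle=True):
    print(doc)
```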
- docarray.document.generators.from_files(patterns, recursive=True, size=None, sampling_rate=None, read_mode=None, to_dataturi=False, exclude_regex=None, *args, **kwargs)
Creates an iterator over a list of file paths or the content of the files.
- Parameters:
  - patterns (Union[str, List[str]]) – the pattern may contain simple shell-style wildcards, e.g. '*.py', '[*.zip, *.gz]'
  - recursive (bool) – if true, the pattern '**' will match any files and zero or more directories and subdirectories
  - size (Optional[int]) – the maximum number of files to yield
  - sampling_rate (Optional[float]) – the sampling rate between [0, 1]
  - read_mode (Optional[str]) – specifies the mode in which the file is opened: 'r' for reading in text mode, 'rb' for reading in binary mode. If read_mode is None, iterates over file names only.
  - to_dataturi (bool) – if set, Document.uri is filled with a data URI instead of the plain URI
  - exclude_regex (Optional[str]) – if set, file names that match this pattern are not included
- Yield:
  file paths or binary content
Note
This function should not be used directly; use Flow.index_files() or Flow.search_files() instead.
- Return type:
  Generator[Document, None, None]
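
A sketch of both modes, URI-only iteration and content reading; the glob pattern and directory are hypothetical, and the assumption that text-mode content lands in Document.text is typical for this generator but worth verifying in your version:

```python
from docarray.document.generators import from_files

# iterate over file names only (read_mode=None), at most 5 matches;
# 'data/**/*.txt' is a hypothetical pattern
for doc in from_files('data/**/*.txt', recursive=True, size=5):
    print(doc.uri)

# read file contents in text mode, skipping hidden files
for doc in from_files('data/**/*.txt', read_mode='r', exclude_regex=r'.*/\..*'):
    print(doc.text[:80])  # assumes text content is stored in Document.text
```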
- docarray.document.generators.from_csv(file, field_resolver=None, size=None, sampling_rate=None, dialect='excel', encoding='utf-8', *args, **kwargs)
Generator function for CSV. Yields documents.
- Parameters:
  - file (Union[str, TextIO]) – file path or file handle
  - field_resolver (Optional[Dict[str, str]]) – a map from field names defined in the JSON/dict to the field names defined in Document
  - size (Optional[int]) – the maximum number of documents to yield
  - sampling_rate (Optional[float]) – the sampling rate between [0, 1]
  - dialect (Union[str, Dialect]) – defines a set of parameters specific to a particular CSV dialect. It can be a string naming a predefined dialect on your system, or a csv.Dialect class that groups specific formatting parameters together. If you don't know the dialect and the default one does not work for you, you can try setting it to 'auto'.
  - encoding (str) – encoding used to read the CSV file. By default, 'utf-8' is used.
- Yield:
  documents
- Return type:
  Generator[Document, None, None]
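
A sketch assuming field_resolver maps CSV column names onto Document attributes; the file name and its 'sentence' column are hypothetical:

```python
from docarray.document.generators import from_csv

# 'toy.csv' and its 'sentence' column are hypothetical; the resolver maps
# that column onto Document.text
with open('toy.csv', encoding='utf-8') as fp:
    for doc in from_csv(fp, field_resolver={'sentence': 'text'}, size=100):
        print(doc.text)
```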
- docarray.document.generators.from_huggingface_datasets(dataset_path, field_resolver=None, size=None, sampling_rate=None, filter_fields=False, **datasets_kwargs)
Generator function for Hugging Face Datasets. Yields documents.
This function helps to load datasets from the Hugging Face Datasets Hub (https://huggingface.co/datasets) in Jina. Additional parameters can be passed to the datasets library using keyword arguments. The load_dataset method from the datasets library is used to load the datasets.
- Parameters:
  - dataset_path (str) – a valid dataset path for the Hugging Face Datasets library
  - field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.
  - size (Optional[int]) – the maximum number of documents to yield
  - sampling_rate (Optional[float]) – the sampling rate between [0, 1]
  - filter_fields (bool) – specifies whether to filter the dataset to the fields given in the field_resolver argument
  - **datasets_kwargs – additional arguments for the load_dataset method from the Datasets library. More details at https://huggingface.co/docs/datasets/package_reference/loading_methods.html#datasets.load_dataset
- Yield:
  documents
- Return type:
  Generator[Document, None, None]
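
A sketch of streaming a public dataset into Documents; the dataset path, config, split, and field mapping are illustrative, with name= and split= forwarded to datasets.load_dataset as keyword arguments:

```python
from docarray.document.generators import from_huggingface_datasets

# 'glue'/'cola' and the 'sentence' -> 'text' mapping are illustrative
for doc in from_huggingface_datasets(
    'glue',
    field_resolver={'sentence': 'text'},
    size=10,
    name='cola',    # forwarded to datasets.load_dataset
    split='train',  # forwarded to datasets.load_dataset
):
    print(doc.text)
```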
- docarray.document.generators.from_ndjson(fp, field_resolver=None, size=None, sampling_rate=None, *args, **kwargs)
Generator function for line-separated JSON (ndjson). Yields documents.
- Parameters:
  - fp (Iterable[str]) – file paths
  - field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.
  - size (Optional[int]) – the maximum number of documents to yield
  - sampling_rate (Optional[float]) – the sampling rate between [0, 1]
- Yield:
  documents
- Return type:
  Generator[Document, None, None]
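
A sketch where an open file handle serves as the iterable of JSON lines; the file name and its 'title' field are hypothetical:

```python
from docarray.document.generators import from_ndjson

# each line of 'docs.ndjson' (hypothetical) is a JSON object; the resolver
# maps its 'title' field onto Document.text
with open('docs.ndjson') as fp:
    for doc in from_ndjson(fp, field_resolver={'title': 'text'}, size=50):
        print(doc.text)
```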
- docarray.document.generators.from_lines(lines=None, filepath=None, read_mode='r', line_format='json', field_resolver=None, size=None, sampling_rate=None)
Generator function for lines, JSON and CSV. Yields documents or strings.
- Parameters:
  - lines (Optional[Iterable[str]]) – a list of strings, each of which is considered a document
  - filepath (Optional[str]) – a text file in which each line contains a document
  - read_mode (str) – specifies the mode in which the file is opened: 'r' for reading in text mode, 'rb' for reading in binary mode
  - line_format (str) – the format of each line, json or csv
  - field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.
  - size (Optional[int]) – the maximum number of documents to yield
  - sampling_rate (Optional[float]) – the sampling rate between [0, 1]
- Yield:
  documents
- Return type:
  Generator[Document, None, None]
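
A sketch feeding in-memory JSON lines; the sample strings are illustrative, and the commented variant shows the equivalent file-based call with a hypothetical path:

```python
from docarray.document.generators import from_lines

# each string is one JSON document; 'text' maps directly to Document.text
json_lines = ['{"text": "hello"}', '{"text": "world"}']

for doc in from_lines(lines=json_lines, line_format='json'):
    print(doc.text)

# equivalent, reading one document per line from a (hypothetical) file:
# for doc in from_lines(filepath='corpus.ndjson', line_format='json', size=100):
#     ...
```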