docarray.array.mixins.match module
- class docarray.array.mixins.match.MatchMixin
Bases: object
A mixin that provides match functionality to DocumentArrays
- match(darray, metric='cosine', limit=20, normalization=None, metric_name=None, batch_size=None, exclude_self=False, filter=None, only_id=False, use_scipy=False, device='cpu', num_worker=1, on=None, **kwargs)
Compute the embedding-based nearest neighbours in another DocumentArray for each Document in self, and store the results in matches. For evaluation purposes, one can also use the embed_and_evaluate() function directly.
Note: 'cosine', 'euclidean' and 'sqeuclidean' are supported natively without extra dependencies. Other distance metrics provided by scipy can also be used, such as `braycurtis`, `canberra`, `chebyshev`, `cityblock`, `correlation`, `cosine`, `dice`, `euclidean`, `hamming`, `jaccard`, `jensenshannon`, `kulsinski`, `mahalanobis`, `matching`, `minkowski`, `rogerstanimoto`, `russellrao`, `seuclidean`, `sokalmichener`, `sokalsneath`, `sqeuclidean`, `wminkowski`, `yule`. To use a scipy metric, set use_scipy=True.
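As a rough illustration of what embedding-based matching computes, the sketch below runs a brute-force nearest-neighbour search in plain Python using the three natively supported metrics. It is not docarray's implementation: the helper names (`nearest_neighbours`, `METRICS`) and the handling of zero vectors are assumptions made for this example.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def sqeuclidean(x: Vector, y: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x: Vector, y: Vector) -> float:
    """Euclidean distance."""
    return math.sqrt(sqeuclidean(x, y))


def cosine(x: Vector, y: Vector) -> float:
    """Cosine distance, i.e. 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    # Assumption for this sketch: a zero vector gets distance 1.0.
    return 1.0 if norm == 0 else 1.0 - dot / norm


# Hypothetical lookup table for this sketch; smaller values mean closer.
METRICS: Dict[str, Callable[[Vector, Vector], float]] = {
    "cosine": cosine,
    "euclidean": euclidean,
    "sqeuclidean": sqeuclidean,
}


def nearest_neighbours(
    queries: Sequence[Vector],
    candidates: Sequence[Vector],
    metric: str = "cosine",
    limit: int = 20,
) -> List[List[Tuple[int, float]]]:
    """For each query, return up to `limit` (candidate_index, distance)
    pairs ordered from closest to farthest."""
    dist = METRICS[metric]
    results = []
    for q in queries:
        scored = sorted((dist(q, c), j) for j, c in enumerate(candidates))
        results.append([(j, d) for d, j in scored[:limit]])
    return results


queries = [[1.0, 0.0]]
candidates = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
top = nearest_neighbours(queries, candidates, metric="cosine", limit=2)
print([idx for idx, _ in top[0]])  # [1, 2]
```

The identical vector (index 1) comes first with distance 0, followed by the diagonal vector (index 2); the orthogonal vector is cut off by `limit=2`.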
- To make all match values fall in [0, 1], use dA.match(dB, normalization=(0, 1)).
- To invert the distance into a score and make all values fall in the range [0, 1], use dA.match(dB, normalization=(1, 0)). Note how normalization differs from the previous case.
- If a custom distance metric is provided, make sure that it returns distances and not similarities, meaning the smaller the better.
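The effect of the two normalization tuples can be sketched with a small min-max rescaling function. This is an illustrative sketch only: `minmax_normalize` here is a hypothetical standalone helper, and returning `a` for all values when every distance is equal is an assumption made for the example.

```python
from typing import List, Sequence, Tuple


def minmax_normalize(
    distances: Sequence[float], t_range: Tuple[float, float]
) -> List[float]:
    """Rescale so the smallest distance maps to a and the largest to b."""
    a, b = t_range
    lo, hi = min(distances), max(distances)
    if hi == lo:
        # Assumption for this sketch: no spread, so map everything to a.
        return [float(a) for _ in distances]
    return [a + (d - lo) * (b - a) / (hi - lo) for d in distances]


distances = [2.0, 4.0, 6.0]

# normalization=(0, 1): still distances, the closest match gets 0.
print(minmax_normalize(distances, (0, 1)))  # [0.0, 0.5, 1.0]

# normalization=(1, 0): inverted into scores, the closest match gets 1.
print(minmax_normalize(distances, (1, 0)))  # [1.0, 0.5, 0.0]
```

With `(1, 0)` the ordering flips, so larger values mean better matches, whereas with `(0, 1)` smaller values remain better.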
- Parameters:
darray (DocumentArray) – the other DocumentArray to match against.
metric (Union[str, Callable[[ArrayType, ArrayType], ndarray]]) – the distance metric.
limit (Union[int, float, None]) – the maximum number of matches; when not given, defaults to 20.
normalization (Optional[Tuple[float, float]]) – a tuple [a, b] to be used with min-max normalization: the min distance will be rescaled to a, the max distance will be rescaled to b, and all values will be rescaled into the range [a, b].
metric_name (Optional[str]) – if provided, then the match result will be marked with this string.
batch_size (Optional[int]) – if provided, then darray is loaded in batches, where each of them is at most batch_size elements. When darray is big, this can significantly speed up the computation.
exclude_self (bool) – if set, Documents in darray with the same id as the left-hand values will not be considered as matches.
filter (Optional[Dict]) – filter query used for pre-filtering.
only_id (bool) – if set, then the returned matches will only contain id.
use_scipy (bool) – if set, use scipy as the computation backend. Note that scipy does not support distance on sparse matrices.
device (str) – the computational device for .match(), can be either cpu or cuda.
num_worker (Optional[int]) – the number of parallel workers. If not given, then the number of CPUs in the system will be used. Note: this argument is only effective when batch_size is set.
on (Optional[str]) – specifies a subindex to search on. If set, the returned DocumentArray will be retrieved from the given subindex.
kwargs – other kwargs.
- Return type:
None
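To make the interplay of limit, batch_size and exclude_self more concrete, here is a minimal stdlib sketch that processes the right-hand side in batches and keeps a running top-`limit` per query. It is an assumption-laden illustration rather than the library's code: `match_in_batches` is a hypothetical name, embeddings are plain lists keyed by id, and squared Euclidean distance is used for simplicity.

```python
import heapq
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def sqeuclidean(x: Vector, y: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def match_in_batches(
    left: Dict[str, Vector],
    right: Dict[str, Vector],
    limit: int = 20,
    batch_size: int = 2,
    exclude_self: bool = False,
) -> Dict[str, List[Tuple[str, float]]]:
    """Return, for each id in `left`, up to `limit` (id, distance) pairs
    from `right`, closest first."""
    right_items = list(right.items())
    best: Dict[str, List[Tuple[float, str]]] = {qid: [] for qid in left}
    # Visit the right-hand side in chunks of at most `batch_size` elements.
    for start in range(0, len(right_items), batch_size):
        batch = right_items[start:start + batch_size]
        for qid, qvec in left.items():
            scored = [
                (sqeuclidean(qvec, cvec), cid)
                for cid, cvec in batch
                # exclude_self: skip candidates sharing the query's id.
                if not (exclude_self and cid == qid)
            ]
            # Merge with the running result and keep the `limit` smallest.
            best[qid] = heapq.nsmallest(limit, best[qid] + scored)
    return {qid: [(cid, d) for d, cid in hits] for qid, hits in best.items()}


docs = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [3.0, 0.0]}
result = match_in_batches(docs, docs, limit=1, batch_size=2, exclude_self=True)
print(result["c"])  # [('b', 4.0)]
```

When a collection is matched against itself, every Document is its own closest neighbour at distance 0, which is why excluding candidates with the same id is useful in that case.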