Evaluate Matches#

After executing match(), every Document in your DocumentArray receives a .matches attribute. You can evaluate these matches against the ground truth via evaluate(). The ground truth describes which matches are relevant and which are not, and can be provided in two formats: as a ground truth array or as labels.

To demonstrate this, let’s create a DocumentArray with random embeddings and match it to itself:

import numpy as np
from docarray import DocumentArray

da_original = DocumentArray.empty(10)
da_original.embeddings = np.random.random([10, 3])
da_original.match(da_original, exclude_self=True)

da_original.summary()
                     Documents Summary                      
                                                            
  Length                    10                              
  Homogenous Documents      True                            
  Has nested Documents in   ('matches',)                    
  Common Attributes         ('id', 'embedding', 'matches')  
                                                            
                        Attributes Summary                        
                                                                  
  Attribute   Data type         #Unique values   Has empty value  
 ──────────────────────────────────────────────────────────────── 
  embedding   ('ndarray',)      10               False            
  id          ('str',)          10               False            
  matches     ('MatchArray',)   10               False                

Now each Document's .matches in da_original contains its nearest neighbors. To make this more interesting, let's mix ten "noise Documents" into every d.matches:

da_prediction = DocumentArray(da_original, copy=True)

for d in da_prediction:
    d.matches.extend(DocumentArray.empty(10))
    d.matches = d.matches.shuffle()

da_prediction['@m'].summary()
                              Documents Summary                               
                                                                              
  Length                          190                                         
  Homogenous Documents            False                                       
  100 Documents have attributes   ('id', 'adjacency')                         
  90 Documents have attributes    ('id', 'adjacency', 'embedding', 'scores')  
                                                                              
                            Attributes Summary                            
                                                                          
  Attribute   Data type                 #Unique values   Has empty value  
 ──────────────────────────────────────────────────────────────────────── 
  adjacency   ('int',)                  1                False            
  embedding   ('ndarray', 'NoneType')   190              True             
  id          ('str',)                  110              False            
  scores      ('defaultdict',)          190              False            

Evaluation against a ground truth array#

To evaluate the matches against a ground-truth array, you pass a DocumentArray (like da_groundtruth) to the evaluate() method:

da_prediction.evaluate(ground_truth=da_groundtruth, metrics=['...'], **kwargs)

Here, da_groundtruth should contain the same Documents as da_prediction, and each Document's matches attribute should contain exactly those Documents that are relevant to the respective root Document.

You define the metrics you want to use for your evaluation (e.g. precision_at_k) with the metrics parameter.

Let’s evaluate the da_prediction DocumentArray (with the noisy matches) against da_original:

da_prediction.evaluate(ground_truth=da_original, metrics=['precision_at_k'], k=10)
{'precision_at_k': 0.45}

This returns the average value for the precision_at_k metric, calculated over all Documents of da_prediction. To see the individual evaluation values, check the evaluations attribute:

for d in da_prediction:
    print(d.evaluations['precision_at_k'].value)
0.5
0.5
0.5
0.6
0.3
0.4
0.5
0.4
0.5
0.3

Document identifier#

Note that evaluating a DocumentArray against a ground truth DocumentArray only works if both have the same length and nested structure. It makes no sense to evaluate with a completely different DocumentArray.

While evaluating, Document pairs are recognized as correct if they share the same identifier. By default, this is just id. You can customize this by specifying hash_fn.

Let’s see an example by creating two DocumentArrays. Each DocumentArray has matches that are identical to each other, but differ from the matches of the other DocumentArray:

from docarray import DocumentArray, Document

p_da = DocumentArray.empty(3)

for d in p_da:
    d.matches.append(Document(text='my predict'))

g_da = DocumentArray.empty(3)
for d in g_da:
    d.matches.append(Document(text='my ground truth'))

Now when you evaluate, you’ll receive an error:

p_da.evaluate('average_precision', ground_truth=g_da)
ValueError: Document <Document ('id', 'matches') at 42dc84b26fab11ecbc181e008a366d49> from the left-hand side and <Document ('id', 'matches') at 42dc98086fab11ecbc181e008a366d49> from the right-hand are not hashed to the same value. This means your left and right DocumentArray may not be aligned; or it means your `hash_fn` is badly designed.

This says that, based on .id (the default identifier), the two DocumentArrays are too different to be evaluated against each other. This is a fair complaint, because our two DocumentArrays have completely random .ids.

If we override the hash function, the evaluation can proceed:

p_da.evaluate('average_precision', ground_truth=g_da, hash_fn=lambda d: d.text[:2])
{'average_precision': 1.0}

This gives a perfect score, because with this hash function two Documents count as matching if the first two characters of their .text (here, my) are the same.

Evaluation via labels#

Alternatively, you can evaluate your Documents by adding labels. A match is considered relevant to its root Document if it has the same label:

import numpy as np
from docarray import Document, DocumentArray

example_da = DocumentArray([Document(tags={'label': (i % 2)}) for i in range(10)])
example_da.embeddings = np.random.random([10, 3])

example_da.match(example_da)

example_da.evaluate(metrics=['precision_at_k'])
{'precision_at_k': 0.5}

Here too, the results are stored in the .evaluations attribute of each Document.
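To inspect the individual values, you can iterate over the array just as before (a minimal sketch reusing example_da from above):

# print each Document's label next to its individual metric value
for d in example_da:
    print(d.tags['label'], d.evaluations['precision_at_k'].value)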

Metric functions#

DocArray provides common metrics used in the information retrieval community to evaluate nearest-neighbor matches. Some of those metrics accept additional arguments as kwargs which you can add to the call of the evaluate() method:

  Metric                Accepts kwargs
 ──────────────────────────────────────
  r_precision()         None
  average_precision()   None
  reciprocal_rank()     None
  precision_at_k()      k
  hit_at_k()            k
  recall_at_k()         max_rel, k
  f1_score_at_k()       max_rel, k
  dcg_at_k()            method, k
  ndcg_at_k()           method, k

Danger

These metric scores might change if you set the limit argument of the match method differently.

Note: Not all of these metrics can be applied to a top-K result. For example, ndcg_at_k and r_precision are only calculated correctly if limit is set equal to or higher than the number of Documents in the DocumentArray passed to the match method.
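For example, here is a minimal sketch that reuses da_original from the beginning and retrieves the full ranking before evaluating such metrics:

# raise limit so that every other Document appears in the matches,
# giving metrics like r_precision and ndcg_at_k the complete ranking
da_original.match(da_original, exclude_self=True, limit=len(da_original))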

You can evaluate multiple metric functions at once:

da_prediction.evaluate(
    ground_truth=da_original, metrics=['precision_at_k', 'reciprocal_rank'], k=10
)
{'precision_at_k': 0.45, 'reciprocal_rank': 0.8166666666666667}

In this case, the keyword argument k is passed to all metric functions, even though it has no effect on the reciprocal rank calculation.

The max_rel parameter#

Some metric functions shown in the table above require a max_rel parameter. This parameter should be set to the number of relevant Documents in the Document collection. Without the knowledge of this number, metrics like recall_at_k and f1_score_at_k cannot be calculated.

In the evaluate function, you can provide the keyword argument max_rel, which is then used for all queries. In the example below, we reuse the da_prediction and da_original datasets from the beginning, where each query has nine relevant Documents. Therefore, we set max_rel=9.

da_prediction.evaluate(ground_truth=da_original, metrics=['recall_at_k'], max_rel=9)
{'recall_at_k': 1.0}

Since all relevant Documents appear in the matches, the recall is one. However, a single max_rel value only makes sense if the number of relevant Documents is the same for every query. If you provide a ground_truth argument to the evaluate function, max_rel is set for each query to the number of matches of the corresponding ground-truth Document, so you can omit it:

da_prediction.evaluate(ground_truth=da_original, metrics=['recall_at_k'])
{'recall_at_k': 1.0}

For labeled datasets, this is not possible. Here, you can set the num_relevant_documents_per_label parameter of evaluate. It accepts a dictionary that contains the number of relevant Documents for each label. In this way, the function can set max_rel to the correct value for each query Document.

example_da.evaluate(
    metrics=['recall_at_k'], num_relevant_documents_per_label={0: 5, 1: 5}
)
{'recall_at_k': 1.0}

Custom metrics#

If pre-defined metrics don’t fit your use case, you can define a custom metric function, taking as input a list of binary relevance judgements of a query (1 and 0 values). The evaluate function already calculates this binary list from the matches attribute so that each number represents the relevancy of a match.

Let’s write a custom metric function, which counts the number of relevant Documents per query:

def count_relevant(binary_relevance):
    return sum(binary_relevance)


da_prediction.evaluate(ground_truth=da_original, metrics=[count_relevant])
{'count_relevant': 9.0}

For inspiration when writing your own metric function, see DocArray's evaluation module, which contains the implementations of the predefined metric functions.
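As another sketch of the same idea (the helper name precision_at_5 and the fixed cutoff are purely illustrative choices), a rank cutoff can be implemented directly on the binary relevance list:

def precision_at_5(binary_relevance):
    # fraction of relevant Documents among the first five matches
    return sum(binary_relevance[:5]) / 5


da_prediction.evaluate(ground_truth=da_original, metrics=[precision_at_5])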

Custom names#

By default, metrics are stored with the name of the metric function. Alternatively, you can customize those names with the metric_names argument of the evaluate method:

da_prediction.evaluate(
    ground_truth=da_original,
    metrics=[count_relevant, 'precision_at_k'],
    metric_names=['#Relevant', 'Precision@K'],
)
{'#Relevant': 9.0, 'Precision@K': 0.47368421052631576}

Embed, match and evaluate at once#

Instead of executing the methods embed(), match(), and evaluate() separately, you can execute them all at once with embed_and_evaluate().

To demonstrate this, let's construct two labeled DocumentArrays, example_queries and example_index. We want to match example_queries against example_index and then evaluate the reciprocal rank based on the labels of the resulting matches.

import numpy as np
from docarray import Document, DocumentArray

example_queries = DocumentArray([Document(tags={'label': (i % 2)}) for i in range(10)])
example_index = DocumentArray([Document(tags={'label': (i % 2)}) for i in range(10)])


def embedding_function(da):
    da[:, 'embedding'] = np.random.random((len(da), 5))


result = example_queries.embed_and_evaluate(
    'reciprocal_rank', index_data=example_index, embed_funcs=embedding_function
)
print(result)
{'reciprocal_rank': 0.7583333333333333}

For metric functions that require a max_rel parameter, embed_and_evaluate automatically constructs the num_relevant_documents_per_label dictionary from the labels in the index_data argument.
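For instance, a minimal sketch reusing example_queries, example_index, and embedding_function from above; no max_rel-related arguments need to be passed:

# num_relevant_documents_per_label is derived automatically from the labels in index_data
result = example_queries.embed_and_evaluate(
    'recall_at_k', index_data=example_index, embed_funcs=embedding_function
)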

Batch-wise matching#

embed_and_evaluate is especially useful for evaluating queries against a Document collection (like example_index) that is too large to fit all of its embeddings in main memory. In this case, the method matches the queries against batches of the Document collection and deletes the embeddings after processing each batch.

By default, the batch size for the matching (match_batch_size) is set to 100_000. To reduce the memory footprint, you can set it to a lower value.
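For example, a hedged sketch that reuses the objects from the previous example and matches in smaller chunks:

# match against the index in batches of 1,000 Documents to reduce peak memory usage
result = example_queries.embed_and_evaluate(
    'reciprocal_rank',
    index_data=example_index,
    embed_funcs=embedding_function,
    match_batch_size=1_000,
)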

Sampling queries#

To evaluate a large dataset, it might be useful to sample query Documents. Since the metric values returned by embed_and_evaluate are mean values, sampling shouldn't significantly change the result if the sample is large enough. By default, sampling is applied to DocumentArrays with more than 1,000 Documents. However, it is only applied to the DocumentArray itself and not to the Documents provided via index_data.

To change the number of samples, you can adjust the query_sample_size argument. In the following code block an evaluation is performed with 100 samples:

import numpy as np
from docarray import Document, DocumentArray


def emb_func(da):
    for d in da:
        np.random.seed(int(d.text))
        d.embedding = np.random.random(5)


da = DocumentArray(
    [Document(text=str(i), tags={'label': i % 10}) for i in range(1_000)]
)

da.embed_and_evaluate(
    metrics=['precision_at_k'], embed_funcs=emb_func, query_sample_size=100
)
{'precision_at_k': 0.13649999999999998}

Note that in this way, only Documents which are actually evaluated obtain an .evaluations attribute.
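A quick way to verify this, assuming Documents skipped by the sampling keep an empty .evaluations collection:

# count the query Documents that actually received evaluation scores
print(sum(1 for d in da if d.evaluations))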

To check how close this sampled result is to the exact one, you can execute the function again with query_sample_size set to 1_000:

da.embed_and_evaluate(
    metrics=['precision_at_k'], embed_funcs=emb_func, query_sample_size=1_000
)
{'precision_at_k': 0.14245}