A Performance Evaluation Protocol for Content-Based 

Image Retrieval Algorithms/Systems

Ground Truth
Keyword Annotation

Direct Labeling

Performance Metrics
Individual Query
Relevance Feedback

Download
Ground Truth DB
Readme File
MiAlbum Beta1

Other info
TREC
Corel

We address an important yet controversial issue in content-based image retrieval (CBIR): performance evaluation (PE). The objective is to construct an objective and comprehensive protocol for quantitative evaluation and comparison of CBIR performances among different approaches. The protocol consists of two elements: acquisition of appropriate ground truth data, and definition of quantitative performance metrics.  A large image data set with human-label ground truths has been built as part of the protocol and made available for public use. 

1     Ground Truth Database Acquisition

The first step to build such ground truth database is to select the image collection. In our PE protocol for CBIR, we have selected 10,009 images in 79 semantic classes from the Corel Image Gallery [1] due to their publicity and availability to a large audience. As we discussed earlier, we use two ways to obtain ground truth: keyword annotation and direct ground truth labeling.

In order to increase ground truth acquisition efficiency in both processes, we build a computer-aided tool based on the semi-automatic annotation strategy implemented in our successful CBIR system ‘MiAlbum’. In order to avoid subjectivity of single annotators, we ask 2 annotators for keyword annotation and 7 annotators for direct ground truth labeling. All these annotators are college students with average knowledge of using computers. The final ground truths are based on the statistical results of their work over the selected image database. 

1.1     Using Keyword Annotation

The Corel Gallery has more than one million image collections. However, most of them are quite different no matter in size or quality. Moreover, most of images have been assigned to very abstracted semantic categories, such as ‘power’, ‘Asia’, etc.  The image collections should be filtered for PE on CBIR. Hence, we selected 79 classes with concise and precise semantic meanings, which are therefore suitable to constitute the ground truth database of PE on CBIR.  By default, we automatically annotate each image with its category name. In addition to the 79 categories, we manually add 24 new keywords as we see they are also relevant to certain images in the image collection we selected. As a result, we have 103 keywords in total for the annotators to annotate each of the images. The reason why we predefine the keyword list is that we could reduce the diversity of the annotation results from different people.  One thing we want to emphasize is that all keywords in the list are at the most specific level of semantics, e.g., “girl” (instead of “people”, or even higher, “animal”), tiger (instead of “animal”), etc. This makes things simpler and avoids the issue of different annotations with same meaning. The detail process is as follows.

1. We ask the annotator to select a keyword from the keyword list and input it into the keyword query textbox of MiAlbum [14] and Hit the button “Search”.

2. In the first batch of retrieval result, those images containing the keyword query (as their category name) will be ranked high and displayed on top of page in the image pane of MiAlbum. Others without such annotation will be displayed randomly after those annotated images. The annotator can click several images that he/she thinks are relevant to the keyword query as positive examples and hit “Refine” button. After several iterations of such feedback, most of the promising images that are relevant to the keywords will be displayed on top ranks among the total 10,009 images. Hence, the annotator only need to check the front part of the total list carefully and the rest less carefully to increase the annotation efficiency.

3. All positive examples are annotated with the keyword query. Actually, this is the idea of semi-automatic annotation [14] .

4. Repeating the above process for each keyword in the keyword list, we can obtain all ground truths based on annotation.

1.2     Direct Ground Truth Labeling

Another way to obtain ground truth is based on manual specification, that is, for each query image, manually collect or specify the list of images that should be retrieved.

Similarly to keyword annotation in the previous sub-section, we can use an image as the query and use the same semi-automatic annotation method to collect those direct ground truth, which is judged by the annotator as visually similar to the query image.  However, since there are 10,009 images, the efforts to obtain such direct ground truths for all images are too huge for us to afford, we only select 200 images as the queries.

Part of the ground truth database of our PE protocol for CBIR is available at our website (http://www.research.microsoft.com/ users/wyliu/PE_CBIR/). It consists of an index file for filenames of 10,009 Corel Images, an index file of keyword annotation for these images and a file containing 160 image queries and their corresponding ground truth. Research groups can get the ground truth database and use it for PE of their own CBIR algorithms/systems by themselves. We keep the other 40 queries and their ground truths secret such that we can provide third-party evaluation service to other research groups.

2     Definitions of Performance Metrics

2.1     Performance Metrics for Individual Query

We firstly define several terms related to our metrics definition.  The image set is denoted as C={c1,…,cTotal_Count}. After keyword annotation, for each image ci?C there is a corresponding list of keywords and associated weights Ki={(ki1, wi1)…,(ki m(I), wim(i))}, where m(i) refers to the number of keyword associated with image ci.  For direct ground truth labeling, we have a standard query example set Q={q1,…,qTotal_Query_Count}. For each query example qi, there is a corresponding list of images and associated weights Ri={(ri1, wi1),…,(rin(i), wi n(i))}, where n(i) denotes the number of the images should be retrieved.  Notice that we present all the result list by the descensive order, which means rij is more similar than ri,j+1 to the query qi. The weight to each result or keyword could be calculated based on the statistics of the annotation results. Here we use a very simple strategy. Suppose there are di people using keyword kis to annotate image ci , so

         (1)

For direct ground truth labeling, it is quite easy to use the widely used PE metrics from TREC.  For  qi?Q, the retrieval result ranking list L(qi )= {r’i1,…,r’iTotal_Count}, where r’i1 is the most similar image in C with the qi, r’i2 is the second most similar image to qi  and so on.  We also define a function

.

 The following metrics are presented in our proposal.

               PN – the un-weighted precision after the first N documents/images are retrieved

                (2)   where  

              PwN – the weighted precision after the first N documents are retrieved

              (3)     where .

               Average Precision – the average precision of un-weighted precision after the first 10 to 100 documents are retrieved

                           (4)

               Weighted Average Precision – the average precision of weighted precision after the first 10 to 100 documents are retrieved

                     (5)

                R 100 – un-weighted recall after 100 documents are retrieved

    (6) where .    

          ANMRR , – Average Normalized Modified Retrieval Rank

  (7)    where , , and .

After keyword annotation, we have got keyword annotation information of each image from C. So we will use every image in C as the query to do retrieval and get the result. We will evaluate the algorithm by averaging all the results we got.  For ci?C, consider ci as the query example, suppose the retrieval result ranking list L(ci )= {c’i1,…,c’iTotal_Query_Count}, where c1 is the most similar image in C with the ci, c2 is the second most similar image to ci  and so on. We can also get the methods similar with previous metrics. However, we have no precise result list to the specific query, we have to define the similarity measure between any two images by the annotation information first.

                     (8)

         (9)

Based on the similarity measure definition, we can define all of the previous defined metrics. For example, the following metrics of PN, PwN and PrN can be defined as the following.

              PN, – the un-weighted precision after the first N documents are retrieved

                            (10)  where

     PwN, – the weighted precision after the first N documents are retrieved

                          (11) where

      PrN, – the weighted precision after the first N documents are retrieved with rank consideration.

                     (12)   where

2.2     Performance Metrics on Multi Queries (Relevance Feedback)

Relevance feedback is a supervised learning and active technique used to improve the effectiveness of information retrieval systems. The process of relevance feedback in CBIR is as follows.  For a given query, the system first retrieves a list of ranked images according to a predefined similarity metric, which is often defined by the distance between the query vector and the feature vectors of images in a database.  Then, the user is asked to select/provide a set of positive and/or negative examples from the retrieved images, and the system then refines the retrieval result

To measure the effectiveness of multi queries/relevance feedback, we also define a set of performance metrics to evaluate relevance feedback algorithms for CBIR.

 Positive Feedback

For  qi?Q, suppose the retrieval result ranking list L(qi )= {r’i1,…,r’iTotal_Count}, The positive feedback process will go on in this way: put ri,1,…, ri,m as the positive examples into the CBIR system, where m is the number of feedback images,  then re-calculate the performance of the algorithm’s output. We then put ri,m+1,…, ri,2m as the second positive example set, and so on. We will use the improvement of the Average Precision, Weighted Average Precision, and R 100 to evaluate the performance of the positive feedback.

 Negative Feedback

For negative feedback, suppose the retrieval result ranking list to qi is L(qi )= {r’i1,…,r’iTotal_Count},  where r’i1 is the most similar image in C with qi, r’i2 is the second and so on. We choose the image r’ik as the negative example and provide it to the system, where 0<o<k, r’io ? Ri, and r’ik  Ri. Then we re-calculate the Average Precision, Weighted Average Precision and R100 and to observe its improvements. Similarly, after the first round of negative feedback, we can choose the first two negative examples from the result list to the system to get the new results.  Then choose 3,4 and so on to get more results.