|
A Performance Evaluation Protocol for Content-Based Image Retrieval Algorithms/Systems |
||||
| Ground
Truth Keyword Annotation Direct Labeling Performance
Metrics Download |
We address an important yet controversial issue in
content-based image retrieval (CBIR): performance evaluation (PE). The
objective
is
to construct an objective and comprehensive protocol for
quantitative
evaluation
and comparison
of
CBIR performances among
different approaches. The protocol consists of two
elements:
acquisition
of appropriate
ground truth data,
and definition of quantitative
performance
metrics.
A
large image data set with human-label ground truths has been built as part
of the protocol and made
available for public use.
1
Ground Truth Database Acquisition
The first step to build such ground truth database is to select the
image collection. In our PE protocol for CBIR, we have selected 10,009 images in 79 semantic classes from the Corel Image Gallery [1]
due to their publicity and availability to a large audience. As we
discussed earlier, we use two ways to obtain ground truth: keyword
annotation and direct ground truth labeling. In order to increase ground truth acquisition
efficiency in both processes, we build a computer-aided tool based on the
semi-automatic annotation strategy implemented in our successful CBIR
system ‘MiAlbum’. In order to avoid subjectivity of single
annotators, we ask 2 annotators for keyword annotation and 7 annotators
for direct ground truth labeling. All these annotators are college
students with average knowledge of using computers. The final ground
truths are based on the statistical results of their work over the
selected image database. 1.1
Using Keyword Annotation
The Corel Gallery has more than one million
image collections. However, most of them are quite different no matter in
size or quality. Moreover, most of images have been assigned to very
abstracted semantic categories, such as ‘power’, ‘Asia’, etc. The image collections should be filtered for PE on CBIR. Hence,
we selected 79 classes with
concise and precise semantic meanings, which are
therefore suitable to constitute the ground truth database of PE on CBIR. By
default, we automatically annotate each image with its category name. In
addition to the 79 categories, we manually add 24 new keywords as we see
they are also relevant to certain images in the image collection we
selected. As a result, we have 103 keywords in total for the
annotators to annotate each of the images. The
reason why we predefine the keyword list is that we could reduce the
diversity of the annotation results from different people. One thing we want to emphasize is that all
keywords in the list are at the most specific level of semantics, e.g.,
“girl” (instead of “people”, or even higher, “animal”), tiger
(instead of “animal”), etc. This makes things simpler and avoids the
issue of different annotations with same meaning. The detail process is as follows. 1.
We ask the annotator to select a keyword from the keyword list and input
it into the keyword query textbox of MiAlbum [14]
and Hit the button “Search”. 2.
In the first batch of retrieval result, those images containing the
keyword query (as their category name) will be ranked high and displayed
on top of page in the image pane of MiAlbum. Others without such
annotation will be displayed randomly after those annotated images. The
annotator can click several images that he/she thinks are relevant to the
keyword query as positive examples and hit “Refine” button. After
several iterations of such feedback, most of the promising images that are
relevant to the keywords will be displayed on top ranks among the total
10,009 images. Hence, the annotator only need to check the front part of
the total list carefully and the rest less carefully to increase the
annotation efficiency. 3. All positive examples are annotated with the keyword query. Actually,
this is the idea of semi-automatic annotation [14]
. 4.
Repeating the above process for each keyword in the keyword list, we can
obtain all ground truths based on annotation. 1.2
Direct Ground Truth Labeling
Another way to obtain ground truth
is
based on manual specification, that
is, for each query image, manually
collect or specify the list of images
that should be retrieved. Similarly to keyword annotation in the previous sub-section, we can use
an image as the query and use the same semi-automatic
annotation method to collect those direct ground truth, which is judged by
the annotator as visually similar to the query image. However, since there are 10,009 images, the efforts to obtain
such direct ground truths for all images are too huge for us to afford, we
only select 200 images as the queries. 2 Definitions of Performance Metrics2.1
Performance Metrics for Individual Query
We firstly define several terms related to
our metrics definition. The
image set is denoted as C={c1,…,cTotal_Count}.
After keyword annotation, for
each image ci?C
there is a corresponding list of keywords
and
associated weights Ki={(ki1, wi1)…,(ki
m(I), wim(i))}, where m(i) refers to the number of
keyword associated with image ci.
For direct ground truth
labeling, we have a standard query example set Q={q1,…,qTotal_Query_Count}.
For each query example qi, there is a corresponding list
of images and associated weights Ri={(ri1, wi1),…,(rin(i),
wi n(i))}, where n(i) denotes the number of
the images should be retrieved.
Notice that we present all the result list by the descensive order,
which means rij is more similar than ri,j+1
to the query qi. The weight to each result or keyword
could be calculated based on the statistics of the annotation results.
Here we use a very simple strategy. Suppose there are di
people using keyword kis to annotate image ci , so
For direct ground truth labeling, it is quite easy to use the widely used PE metrics from TREC. For qi?Q, the retrieval result ranking list L(qi )= {r’i1,…,r’iTotal_Count}, where r’i1 is the most similar image in C with the qi, r’i2 is the second most similar image to qi and so on. We also define a function
The
following metrics are presented in our proposal.
PN
– the un-weighted precision after the first N documents/images
are retrieved
PwN – the weighted precision
after the first N documents are retrieved
Average Precision – the average precision
of un-weighted precision after the first 10 to 100 documents are retrieved
Weighted Average Precision – the average
precision of weighted precision after the first 10 to 100 documents are
retrieved
R 100 – un-weighted recall after
100 documents are retrieved
ANMRR
, – Average Normalized Modified Retrieval Rank
After keyword annotation, we have got keyword
annotation information of each image from C. So we will use every
image in C as the query to do retrieval and get the result. We will
evaluate the algorithm by averaging all the results we got.
For ci?C, consider ci as the query example, suppose the
retrieval result ranking list L(ci )= {c’i1,…,c’iTotal_Query_Count},
where c1 is the most similar image in C with
the ci, c2 is the second most similar image
to ci and
so on. We can also get the methods similar with previous metrics.
However, we have no precise result list to the specific query, we have to
define the similarity measure between any two images by the annotation
information first.
Based on the similarity measure definition,
we can define all of the previous defined metrics. For example, the
following metrics of PN, PwN and PrN can
be defined as the following.
PN, – the un-weighted precision after the first N documents are retrieved
PwN, – the weighted precision
after the first N documents are retrieved
PrN, – the weighted precision
after the first N documents are retrieved with rank consideration.
2.2
Performance Metrics on Multi Queries (Relevance Feedback)
Relevance
feedback is a supervised learning and active technique used to improve the
effectiveness of information retrieval systems. The process of relevance
feedback in CBIR is as follows. For
a given query, the system first retrieves a list of ranked images
according to a predefined similarity metric, which is often defined by the
distance between the query vector and the feature vectors of images in a
database. Then, the user is
asked to select/provide a set of positive and/or negative examples from
the retrieved images, and the system then refines the retrieval result To measure the
effectiveness of multi queries/relevance feedback, we also define a set of
performance metrics to evaluate relevance feedback algorithms for CBIR. Positive Feedback For qi?Q, suppose the retrieval result ranking list L(qi )= {r’i1,…,r’iTotal_Count},
The positive feedback process will go on in this way: put ri,1,…,
ri,m as the positive examples into the CBIR system, where m
is the number of feedback images, then
re-calculate the performance of the algorithm’s output. We then put ri,m+1,…,
ri,2m as the second positive example set, and so on. We
will use the improvement of the Average Precision, Weighted Average
Precision, and R 100 to evaluate the performance of the
positive feedback. Negative Feedback |
|