The method was applied to the whole set of
mammograms
contained in the MIAS database [169]. Three expert
mammographic readers classified all the images in the MIAS
database according to the BIRADS categories (the correlation
between the original triple MIAS classification and the BIRADS
classification is discussed in [123]).
In screening mammography, it is common to obtain expert agreement;
a similar approach is used here, taking the consensus between the
individual expert classifications.
Table shows the confusion
matrix for the classifications of the three radiologists and the
consensus opinion. This consensus is determined by selecting as
the final class the class on which two or three radiologists agreed
(majority vote). If the three experts assigned the mammogram to
three different classes, the median value is selected as the
consensus opinion. The results in Table
show divergence in the radiologists' opinions, directly
illustrating the difficulty of the problem being addressed.
This disagreement indicates the need to remove inter-observer
(inter-operator) variability by the development of automatic
methods.
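As an illustration, the consensus rule can be sketched in a few lines
of Python (a minimal sketch, assuming the BIRADS classes are encoded
as integers; the function name is hypothetical):

    def consensus_birads(a, b, c):
        """Consensus BIRADS class for one mammogram from three expert readings."""
        votes = [a, b, c]
        for v in set(votes):
            if votes.count(v) >= 2:
                return v             # majority vote: two or three experts agree
        return sorted(votes)[1]      # all three differ: take the median value

For example, consensus_birads(2, 2, 4) returns 2 by majority vote,
while consensus_birads(1, 2, 4) returns the median value 2.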
Using the kappa values, the agreement of Experts A and C with the
consensus opinion falls in the Substantial category, whilst
the agreement between Expert B and the consensus opinion belongs to
the Almost Perfect category (i.e. the classification by Expert
B is almost identical to the consensus). Compared to the consensus,
Expert C shows a slight bias towards higher BIRADS classes
than the other two experts, while Expert A shows a slight bias
towards lower BIRADS classes.
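As a sketch of how such agreement figures can be computed, assuming
the readings are stored as integer-coded lists (the example data
below is purely illustrative), Cohen's kappa is available in
scikit-learn and can be mapped to the standard Landis and Koch
categories:

    from sklearn.metrics import cohen_kappa_score

    def landis_koch(kappa):
        """Map a kappa value to its Landis and Koch agreement category."""
        for upper, label in [(0.00, "Poor"), (0.20, "Slight"), (0.40, "Fair"),
                             (0.60, "Moderate"), (0.80, "Substantial")]:
            if kappa <= upper:
                return label
        return "Almost Perfect"

    # Illustrative integer-coded BIRADS readings for a handful of mammograms.
    expert_b = [1, 2, 3, 2, 4, 1, 3, 2]
    consensus = [1, 2, 3, 2, 4, 2, 3, 2]

    kappa = cohen_kappa_score(expert_b, consensus)
    print(f"kappa = {kappa:.2f} ({landis_koch(kappa)})")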
Instead of using the majority vote to provide the consensus
classification, it is possible to use an expectation-maximization
approach such as STAPLE [187]. In this case, STAPLE
produced a consensus that was very close to the majority vote
results, with only two mammograms being classified differently. This
has a minimal effect on the results: the maximum difference in the
overall classification results being
, while for the
individual BIRADS classes this increases to
(here,
positive changes for one BIRADS class are matched by negative
changes for another BIRADS class). For the remainder of
the chapter we use the majority vote results as the
consensus classification results.
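For completeness, the expectation-maximization idea behind such
label fusion can be sketched as follows; this is not the STAPLE
implementation of [187], but a minimal Dawid and Skene style sketch
for categorical labels, with an assumed array layout:

    import numpy as np

    def em_consensus(labels, n_classes, n_iter=50):
        """EM label fusion for categorical readings (STAPLE-like sketch).

        labels : (n_cases, n_readers) array of integer-coded classes.
        Returns per-case class posteriors and the MAP consensus labels.
        """
        n_cases, n_readers = labels.shape
        # Initialise posteriors from the simple vote proportions.
        post = np.stack([(labels == k).mean(axis=1)
                         for k in range(n_classes)], axis=1)
        for _ in range(n_iter):
            # M-step: class priors and one confusion matrix per reader,
            # conf[r, j, k] = P(reader r reports class k | true class j).
            prior = post.mean(axis=0)
            conf = np.zeros((n_readers, n_classes, n_classes))
            for r in range(n_readers):
                for k in range(n_classes):
                    conf[r, :, k] = post[labels[:, r] == k].sum(axis=0)
                conf[r] /= conf[r].sum(axis=1, keepdims=True) + 1e-12
            # E-step: recompute posteriors given the reader reliabilities.
            log_post = np.tile(np.log(prior + 1e-12), (n_cases, 1))
            for r in range(n_readers):
                log_post += np.log(conf[r][:, labels[:, r]].T + 1e-12)
            post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
            post /= post.sum(axis=1, keepdims=True)
        return post, post.argmax(axis=1)

Readers who consistently agree with the emerging consensus receive
more weight through their estimated confusion matrices, which is how
such a fusion can depart from a plain majority vote.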
In order to test the proposed method, we performed two experiments related to the experts' classifications. First, by training the classifiers using the ground truth provided by each individual expert, we can evaluate the correlation between the method and each radiologist. In the second experiment, the classifier was trained using the consensus between all three experts as the ground truth. In this case, we would expect improved agreement, as the inter-observer variability is minimized.
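As an outline of the two experiments, assuming precomputed feature
vectors and integer-coded BIRADS labels (the synthetic data and the
choice of classifier below are placeholders, not the method actually
evaluated):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import cohen_kappa_score
    from sklearn.model_selection import train_test_split

    # Placeholder features and per-target labels; the real experiments use
    # mammographic features and the expert/consensus MIAS annotations.
    rng = np.random.default_rng(0)
    features = rng.normal(size=(100, 5))
    targets = {name: rng.integers(1, 5, size=100)
               for name in ["Expert A", "Expert B", "Expert C", "Consensus"]}

    for name, y in targets.items():
        X_tr, X_te, y_tr, y_te = train_test_split(features, y, random_state=0)
        clf = KNeighborsClassifier().fit(X_tr, y_tr)        # placeholder classifier
        kappa = cohen_kappa_score(y_te, clf.predict(X_te))  # agreement with target
        print(f"{name}: kappa = {kappa:.2f}")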