The method was applied to the whole set of
mammograms
contained in the MIAS database [169]. Three expert
mammographic readers classified all the images in the MIAS
database according to the BIRADS categories (the correlation
between the original triple MIAS classification and the BIRADS
classification is discussed in [123]).
In screening mammography, it is common to obtain expert agreement;
a similar approach is used here, taking the consensus between the
individual expert classifications.
Table shows the confusion
matrix for the classifications of the three radiologists and the
consensus opinion. This consensus is determined by selecting as
the final class the class on which two or three radiologists agreed
(majority vote). If the three experts assigned the mammogram to
three different classes, the median value is selected as the
consensus opinion. The results in Table
show divergence in the radiologists' opinions, directly
illustrating the difficulty of the problem being addressed.
This disagreement indicates the need to remove inter-observer
(inter-operator) variability by the development of automatic
methods.
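As an illustration, the consensus rule can be sketched in a few lines
of Python (a minimal sketch, assuming the BIRADS classes are encoded
as integers; the function name is hypothetical):

    def consensus_birads(a, b, c):
        """Consensus BIRADS class for one mammogram from three expert readings."""
        votes = [a, b, c]
        for v in set(votes):
            if votes.count(v) >= 2:
                return v             # majority vote: two or three experts agree
        return sorted(votes)[1]      # all three differ: take the median value

For example, consensus_birads(2, 2, 4) returns 2 by majority vote,
while consensus_birads(1, 2, 4) returns the median value 2.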
Using the kappa values, the agreement of Experts A and C with the
consensus opinion falls in the Substantial category, whilst
the agreement between Expert B and the consensus opinion belongs to
the Almost Perfect category (i.e. the classification by Expert
B is almost identical to the consensus). Compared to the consensus,
Expert C shows a slight bias towards higher BIRADS classes
than the other two experts, while Expert A shows a slight bias
towards lower BIRADS classes.
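As a sketch of how such agreement figures can be computed, assuming
the readings are stored as integer-coded lists (the example data
below is purely illustrative), Cohen's kappa is available in
scikit-learn and can be mapped to the standard Landis and Koch
categories:

    from sklearn.metrics import cohen_kappa_score

    def landis_koch(kappa):
        """Map a kappa value to its Landis and Koch agreement category."""
        for upper, label in [(0.00, "Poor"), (0.20, "Slight"), (0.40, "Fair"),
                             (0.60, "Moderate"), (0.80, "Substantial")]:
            if kappa <= upper:
                return label
        return "Almost Perfect"

    # Illustrative integer-coded BIRADS readings for a handful of mammograms.
    expert_b = [1, 2, 3, 2, 4, 1, 3, 2]
    consensus = [1, 2, 3, 2, 4, 2, 3, 2]

    kappa = cohen_kappa_score(expert_b, consensus)
    print(f"kappa = {kappa:.2f} ({landis_koch(kappa)})")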
Instead of using the majority vote to provide the consensus
classification, it is possible to use an expectation-maximization
approach such as STAPLE [187]. In this case, STAPLE
produced a consensus that was very close to the majority vote
results, with only two mammograms being classified differently. This
has a minimal effect on the results: the maximum difference in the
overall classification results being
, while for the
individual BIRADS classes this increases to
(here,
positive changes for one BIRADS class are matched by negative
changes for another BIRADS class). For the remainder of
the chapter we use the majority vote results as the
consensus classification results.
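For completeness, the expectation-maximization idea behind such
label fusion can be sketched as follows; this is not the STAPLE
implementation of [187], but a minimal Dawid and Skene style sketch
for categorical labels, with an assumed array layout:

    import numpy as np

    def em_consensus(labels, n_classes, n_iter=50):
        """EM label fusion for categorical readings (STAPLE-like sketch).

        labels : (n_cases, n_readers) array of integer-coded classes.
        Returns per-case class posteriors and the MAP consensus labels.
        """
        n_cases, n_readers = labels.shape
        # Initialise posteriors from the simple vote proportions.
        post = np.stack([(labels == k).mean(axis=1)
                         for k in range(n_classes)], axis=1)
        for _ in range(n_iter):
            # M-step: class priors and one confusion matrix per reader,
            # conf[r, j, k] = P(reader r reports class k | true class j).
            prior = post.mean(axis=0)
            conf = np.zeros((n_readers, n_classes, n_classes))
            for r in range(n_readers):
                for k in range(n_classes):
                    conf[r, :, k] = post[labels[:, r] == k].sum(axis=0)
                conf[r] /= conf[r].sum(axis=1, keepdims=True) + 1e-12
            # E-step: recompute posteriors given the reader reliabilities.
            log_post = np.tile(np.log(prior + 1e-12), (n_cases, 1))
            for r in range(n_readers):
                log_post += np.log(conf[r][:, labels[:, r]].T + 1e-12)
            post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
            post /= post.sum(axis=1, keepdims=True)
        return post, post.argmax(axis=1)

Readers who consistently agree with the emerging consensus receive
more weight through their estimated confusion matrices, which is how
such a fusion can depart from a plain majority vote.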
In order to test the proposed method, we performed two experiments related to the experts' classifications. First, by training the classifiers using the ground truth provided by each individual expert, we can evaluate the correlation between the method and each radiologist. In the second experiment, the classifier was trained using the consensus between all three experts as the ground truth. In this case, we would expect improved agreement, as the inter-observer variability is minimized.
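As an outline of the two experiments, assuming precomputed feature
vectors and integer-coded BIRADS labels (the synthetic data and the
choice of classifier below are placeholders, not the method actually
evaluated):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import cohen_kappa_score
    from sklearn.model_selection import train_test_split

    # Placeholder features and per-target labels; the real experiments use
    # mammographic features and the expert/consensus MIAS annotations.
    rng = np.random.default_rng(0)
    features = rng.normal(size=(100, 5))
    targets = {name: rng.integers(1, 5, size=100)
               for name in ["Expert A", "Expert B", "Expert C", "Consensus"]}

    for name, y in targets.items():
        X_tr, X_te, y_tr, y_te = train_test_split(features, y, random_state=0)
        clf = KNeighborsClassifier().fit(X_tr, y_tr)        # placeholder classifier
        kappa = cohen_kappa_score(y_te, clf.predict(X_te))  # agreement with target
        print(f"{name}: kappa = {kappa:.2f}")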