Training and Testing using Different Databases

We analyze in this section the performance of the algorithm when using different databases for learning the size and shape of the masses and for detecting them in the images. For such task, we used the MIAS database to test the system and the DDSM [68] one for training. Due to the large mass variability of DDSM database we used six different sizes to train the system:

, and the number of masses in each interval was respectively,

, and

masses. Moreover, for the false positive reduction step,

normal RoIs for each mass RoI were included in each size-cluster.

**Figure 5.9:** Influence of the training database. MIAS is referred to training and testing the systems using the same database (the MIAS one), while DDSM is referred to training and testing using different databases (the DDSM for training and the MIAS for testing).
$\includegraphics[width=10.5 cm]{images/frocTraining.eps}$

Figure

shows the performance of the algorithm . The grey lines show the performance of the proposal without the false positive reduction step, while the black ones including it. The lines with squares are obtained when training and testing with the same database, while the lines with pentagrams when training and testing using different databases. We can see that the Bayesian pattern matching has more false positives per image when is trained with different database. This is basically due to the fact that we are now training with more sizes and, further, smaller patterns. Thus, there is a large set of small regions being normal tissue but detected as suspicious by the algorithm. However, the false positive reduction step allows to greatly reduce this number, although the performance is slightly worst compared to training and testing with the same database. For instance, at a sensitivity of

the number of false positives per image when training and testing using different databases was

without the false positive reduction algorithm and

when including it, while when training and testing using the same database the false positives were

and

, respectively.

Figure

shows the comparison of the algorithm trained with DDSM when testing the set of

mammograms from MIAS database and algorithms d1 and d2. Note that the performance of the proposal Eig is similar to algorithm d2 at sensitivities around

. In contrast, is clearly better at higher sensitivities and worst at intermediate sensitivities. Note that when including the false positive reduction step the performance is clearly better.

**Figure 5.10:** FROC analysis comparison using algorithms d1, d2, and the proposal without (*Eig*) and with (*Eig $\&$ FPRed)*) false positive reduction but using different databases for training and testing.
$\includegraphics[width=10.5 cm]{images/frocTrainingComparison.eps}$

On the other hand, using ROC analysis for the set of

mammograms containing masses, we found that mean

without false positive reduction was $86.6\pm 6.9$ , while including it was $86.2\pm 7.3$ . This results are slightly worse compared to the algorithm trained and tested using the same database ( $A_z = 89.3 \pm 5.9$ and $A_z = 89.1 \pm 4.1$ , respectively) and also compared to algorithm d2 ( $A_z = 88.1 \pm 8.4$ ). However, note that this algorithm is still trained and tested using the same database. On the other hand, both proposals outperforms algorithm d1 ( $A_z = 84.1 \pm 7.9$ ).

Table

shows the mean

values detailed per mass size when the training and testing was done using the same database or using different databases. Note that the main performance drop is for the smaller masses, where the mean

is reduced around

units. This is basically due to the number of false positives detected by the template matching algorithm at small template sizes. The false positive reduction step allows to decrease the number of false positives, although in the cases where this algorithm increases the number of false negatives (classifying a true mass as normal tissue) the mean

of the system is drastically reduced. For the other sizes, the performance is similar when the training and testing was done using the same database or using different databases, and also in some cases, the performance is better when using different databases.

Table 5.5: Influence of the lesion size (in

) for the proposed algorithms when training and testing using the same (MIAS) or different (DDSM) databases. The results show the mean and the standard deviation

values.

		Lesion Size (in )

		1.20	1.20-1.80	1.80-3.60	3.60

-\|\|--	*Eig (MIAS)*	$91.3\pm 7.4$	$90.3\pm 3.3$	$89.6\pm 4.7$	$85.5\pm 5.5$
	-\|\|--	*Eig $\&$ FPRed (MIAS)*	$89.9\pm 3.1$	$91.4\pm 2.1$	$88.5\pm 5.0$
	-\|\|--	*Eig (DDSM)*	$80.6\pm 9.8$	$91.1\pm 3.8$	$86.2\pm 4.5$
	-\|\|--	*Eig $\&$ FPRed (DDSM)*	$81.4\pm 9.4$	$89.4\pm 3.9$	$86.0\pm 5.5$
-\|\|--