Digital subtraction of temporally sequential mammograms for improved detection and classification of microcalcifications

Background Our aim was to demonstrate that automated detection and classification of breast microcalcifications, according to Breast Imaging Reporting and Data System (BI-RADS) categorisation, can be improved with the subtraction of sequential mammograms as opposed to using the most recent image only. Methods One hundred pairs of mammograms were retrospectively collected from two temporally sequential rounds. Fifty percent of the images included no (BI-RADS 1) or benign (BI-RADS 2) microcalcifications. The remaining exhibited suspicious findings (BI-RADS 4-5) in the recent image. Mammograms cannot be directly subtracted, due to tissue changes over time and breast deformation during mammography. To overcome this challenge, optimised preprocessing, image registration, and postprocessing procedures were developed. Machine learning techniques were employed to eliminate false positives (normal tissue misclassified as microcalcifications) and to classify the true microcalcifications as BI-RADS benign or suspicious. Ninety-six features were extracted and nine classifiers were evaluated with and without temporal subtraction. The performance was assessed by measuring sensitivity, specificity, accuracy, and area under the curve (AUC) at receiver operator characteristics analysis. Results Using temporal subtraction, the contrast ratio improved ~ 57 times compared to the most recent mammograms, enhancing the detection of the radiologic changes. Classifying as BI-RADS benign versus suspicious microcalcifications, resulted in 90.3% accuracy and 0.87 AUC, compared to 82.7% and 0.81 using just the most recent mammogram (p = 0.003). Conclusion Compared to using the most recent mammogram alone, temporal subtraction is more effective in the microcalcifications detection and classification and may play a role in automated diagnosis systems. Supplementary Information The online version contains supplementary material available at 10.1186/s41747-021-00238-w.


Background
Breast cancer screening with mammography is an effective approach to reduce breast cancer mortality. However, the large population involved and the use of double reading increases the workload and can limit the efficiency of the screening process [1]. To further exacerbate the challenge, various types of abnormalities are associated with breast cancer, including microcalcifications [2]. They can be benign or form microcalcification clusters possibly suspicious, to be managed appropriately. For the BI-RADS classification of microcalcifications as benign or suspicious, morphology, distribution, and change over time are key parameters [3]. Computeraided diagnosis (CAD) systems are being explored as a means to improve the specificity of the classification of mammographic anomalies without compromising the sensitivity [4]. Several groups [5][6][7] have assessed the use of CAD systems for the detection of microcalcifications present in the most recent mammographic views, with sensitivity and specificity in the range of 82−89% and 87−88%, respectively, when discriminating between benign and suspicious microcalcifications. The main drawback of those systems is the considerable number of false positives (FPs) per image, that can range up to 1-3, which reduces their clinical applicability [8]. In addition, they provide no information regarding the presence of those abnormalities in previous mammographic sessions.
In temporal analysis, mammograms from multiple prior examinations are utilised. When prior information is available for direct comparison, abnormalities can be detected at an earlier stage and the radiologists feel more confident of their assessment [9]. Some studies have attempted to combine information from prior and recent mammographic views to reduce the FP and recall rates [9][10][11][12][13][14]. Prior and recent images were coarsely registered based on anatomical features (e.g., nipple, skin, centre of mass which is the mean value across each dimension), and the locations of recently identified microcalcifications in the prior image were identified by regional registration, i.e., searching around the location of the recent finding. Combining the features from both images resulted in an improved specificity, specificity, and reduced FPs rates. However, temporal analysis offers no benefit, over using just the recent mammographic view, when the findings are new with no traces of abnormality in the prior screening [11].
Going a step further, temporal subtraction begins with using both global and local features to register the entire breast areas of the recent and prior mammographic views. This allows direct subtraction of the images, digitally removing unchanged regions from the recent mammographic view, further delineating subtle recent changes, including microcalcifications [15,16]. The objective of this study was to evaluate the effect of the subtraction of temporally sequential mammograms to eliminate unchanged features and improve the detection and classification of microcalcifications into benign and suspicious, based on their BI-RADS category. For comparison, the detection and classification methodologies were also applied to the most recent images without temporal subtraction.

Study population
The current study expands on prior published work [16]. From the 100 participants that were eventually included, 80 were previously used for the purposes of describing the technical details of the algorithms employed. For this retrospective study, 100 pairs of fullfield digital mammograms were collected, between 2012 and 2020, from various local hospitals (Nicosia General Hospital, Limassol General Hospital, Cyprus population screening program Aglantzia and Linopetra), performed by women (38 to 83 years of age, 60.07 ± 7.09, mean ± standard deviation) with either no microcalcifications or BI-RADS benign (normal population) or BI-RADS suspicious microcalcifications (suspicious population) in their current mammograms. A BI-RADS normal or benign prior mammogram (average interval of 2.2 years) was required for inclusion in the study (Fig. 1). The normal population was selected to form an age-matched group compared to the patients with BI-RADS suspicious microcalcifications. The study was approved by the appropriate Institutional Review Board (Cyprus National Bioethics Committee #EEBK ΕΠ 2020.01.144) and informed consent was retrospectively collected.
For every participant, two mammographic views (cranio-caudal and medio-lateral oblique) of the breast showing the presence of micorcalcifications, from two sequential screening rounds, were included for a total of 400 images. A radiologist (C.N. with 10 years of experience) identified the patients to be included, according to  the criteria specified above, and along with a second radiologist (G.S. with 2 years of experience), assessed the mammograms for assigning the BI-RADS category and marked the location of the microcalcifications. Interobserver agreement, i.e., microcalcifications marked by both observers, was estimated at 97.1% for benign (BI-RADS 2) and 98.2% for suspicious (BI-RADS 4-5) microcalcifications; differences were resolved by consensus. A summary of the study population is shown in Table 1. Fifty percent of the mammograms came from healthy participants (28 with only BI-RADS benign microcalcifications in the prior and recent mammographic views and 22 with no visible microcalcifications). In the remaining 50%, at least one new BI-RADS suspicious microcalcification was present in the most recent mammographic view. The ground truth was based on the BI-RADS category, as evaluated by the radiologists, without any confirmation by follow-up (for "benign" lesions), or biopsy (for "suspicious" lesions). This data set not only included temporally sequential mammograms, but also precise annotation of each individual microcalcification to be used as a reference (Fig. 2). The data set included a total of 629 microcalcifications, 515 BI-RADS 2 and 114 BI-RADS 4 or 5. The size of the mammographic views was 4096 × 3328 pixels, in an 8-bit DICOM format. This data is publicly available (https://doi.org/10. 5281/zenodo.5036062).

Image registration, subtraction, and segmentation
Supplemental Fig. S1 shows the diagram of the proposed methodology for detection and BI-RADS classification of microcalcifications. The procedure began with image preprocessing for normalisation, border removal [17], and gamma correction [18]. Next, each prior mammographic view was co-registered to the most recent one. Image registration is a critical step in temporal subtraction, since it corrects for the changes that occur in the breast over time and due to deformation between mammograms. Demons registration [19], a non-rigid method based on local flows, was employed due to superior performance compared to other common approaches [20]. The prior registered mammographic view was subsequently subtracted from the recent, effectively removing the regions that have remained unchanged since the previous exam. The contrast ratio of the subtracted image, i.e., the ratio of the maximum divided by the average intensity, was compared to that of the recent mammographic view to evaluate the effectiveness of the removal of the background. Furthermore, the capability of the subtraction to remove unchanged microcalcifications, thus reducing the FP rate, was also assessed. After range filtering [21], the intensity values of each image were converted to binary, i.e., 0 or 1, using an intensity threshold, obtained by optimising the global BI-RADS classification rate. The binary image was further processed morphologically. The operations of closing (removing small unconnected regions) and opening (filling small gaps) were applied and the remaining regions were considered as possible microcalcifications.

Feature extraction and selection for classification
To differentiate the true microcalcifications from other tissue and, subsequently, the BI-RADS benign from the BI-RADS suspicious microcalcifications, using machine learning algorithms, several features were estimated from every possible region containing microcalcifications. Ninety-six, shape, intensity first order statistic, and grey-level co-occurrence matrix (GLCM) features were extracted [22][23][24]. The GLCM was calculated at 0°, 45°, 90°, and 135°and 5-, 15-, and 25-pixel offset. The mean and standard deviation of each GLCM texture property were obtained, resulting in 72 features. Hypothesis test (t-test) [25] and feature importance [26] were employed to identify the most significant features and further evaluate their contribution to the classification.

Training and comparison of classifier designs
For the BI-RADS classification, 9 classifiers were evaluated: 9-nearest neighbors [1], decision trees [27], random forest [27], multi-layer perceptron [28], adaptive boosting [29], bagging [30], gradient boosting [31], and ensemble voting [32]. Rectified linear unit was used as an activation function and adaptive dropout regularisation was included every two hidden layers. Gaussian noise was added after dropout, as a regularisation term, in order to increase the robustness of the network. Batch size was set to 128 and the network was trained for 100 epochs. Traditional classifiers were selected based on their prior application to mammography. The complete dataset was used during the training stage with leave-one-patient-out (LOPO) crossvalidation. This cross-validation approach was critical in order to avoid bias from including images of the same patient in both test and training sets. K-fold crossvalidation was also considered, again by dividing the patients into folds. Initially, the possible microcalcifications were classified as normal tissue or true microcalcifications and, subsequently, the true microcalcifications were classified as BI-RADS benign or suspicious.
Detection and classification using the most recent mammogram alone For comparison purposes, the same classification approach was optimised and applied to the most recent mammograms, without temporal subtraction, to verify the benefit of temporal subtraction.

Statistical analysis
The classification performance was evaluated by computing the sensitivity, specificity, accuracy, and the area under the curve (AUC) at receiver operator characteristics analysis. The cut-off values for calculating sensitivity and specificity were selected optimally so that BI-RADS false positive and false negative numbers are minimised, i.e., the cross point of the positive and negative distributions. In the comparison of the results from using just the most recent mammogram, the Fisher test was used with a level of statistical significance set to p = 0.05.

Image registration, subtraction, and segmentation
Image registration and subtraction yielded an average 72% reduction of image intensity, a result of removing structures that have remained unchanged between screenings. The average contrast ratio of the subtracted images was~57 times higher compared to the recent mammographic view (Fig. 3). Eighteen percent of old BI-RADS benign microcalcifications were effectively removed ( Table 2). It is also important to note that none of the BI-RADS suspicious microcalcifications were removed by this process (Fig. 4). The processing time for these operations was an average of~15 min per image pair (Intel® Core™ i7 2 GHz; Intel Corp., Santa Clara, CA, USA).

Feature extraction and selection for classification
Features of microcalcifications were extracted from the images as described in the previous section. Based on the results of t-test and feature importance, the features with the most significant contribution to classification were identified for each classification round. The details of the features that have been selected in each case can be found in Supplemental Table S1.

Detection and classification using temporal subtraction
The results of the detection of the microcalcifications are summarised in Table 3. The sensitivity, specificity, accuracy, and AUC of the different methods were in the range of 72.8−82.3%, 83.7−97%, 83.5−95%, and 0.82 −0.88%, respectively. Based on the AUC, the most successful classification scheme was the ensemble voting with 81.4% sensitivity, 95.5% specificity, 94.1% accuracy, and 0.88 AUC. The application of k-fold cross-validation (Fig. 5a), using k = 4, 5, and 10, verified that the performance remained approximately at the same level.
The optimisation of the various classifiers for the classification of microcalcifications as benign or suspicious according to their BI-RADS category, using LOPO cross-validation, resulted in the outcomes summarised in Table 4. The sensitivity, specificity, accuracy and AUC were in the range of 57.9−84.2%, 77.8−92.2%, 79.0 −90.3% and, 0.70−0.87%, respectively. Based on the AUC, the most successful classification scheme was again ensemble voting with 81.6% sensitivity, 92.2% specificity, 90.3% accuracy, and 0.87 AUC. For this classification round, only 13-fold cross-validation was applied (Fig. 6a) due to the smaller number of patients with microcalcifications (i.e., 78 patients with true microcalcifications, 50 BI-RADS suspicious, and 28 BI-RADS normal with benign microcalcifications). Example images visually demonstrating the classification outcome, using temporal subtraction, where BI-RADS benign and suspicious microcalcifications were correctly identified, are shown in Supplemental Fig. S2.
Detection and classification using the most recent mammogram alone Table 3 shows the classification results for the identification of true microcalcifications using features selected only from the most recent mammographic view (Supplemental Table S1) and the same classifiers as before optimised for these features. The best classification performance was achieved using ensemble voting with 72.7% sensitivity, 96.5% specificity, 94.1% accuracy, and 0.83 AUC. Subsequently, the true microcalcifications were classified as BI-RADS benign or suspicious and the results are presented in Table 4. Again, the ensemble voting was the most successful method, providing 78.9% sensitivity, 83.5% specificity, 82.7% accuracy, and 0.81 AUC. As before, k-fold cross-validation was also performed, using the same values of k (Figs. 5b and 6b) showing that the algorithm was also stable and robust.

Discussion
A method for the detection and classification of microcalcifications, according to their BI-RADS category, from the subtraction of temporally sequential mammographic views was developed. The aim of this work was to combine temporal subtraction with machine learning in order to enhance the contrast ratio, eliminate the radiologically unchanged microcalcifications and, most importantly, improve the classification accuracy of microcalcifications as benign or suspicious, based on their BI-RADS categories. For effective and efficient subtraction of the prior from the most recent mammographic view, preprocessing, registration, and postprocessing procedures were applied. Machine learning techniques were then used to eliminate the FPs, i.e., normal tissue misclassified as microcalcifications, and, furthermore, to classify the true microcalcifications as BI-    RADS benign or suspicious. Bagging, gradient boosting, ensemble voting, and neural networks are more recent additions to machine learning and were chosen for their potential to provide improved classification. Ideally, the proposed method should have also been verified on an independent external dataset. However, publicly available datasets do not provide sequential mammograms or images annotated at the level of individual microcalcifications.
Demons registration [19] was very effective in matching the prior to the recent mammographic views, since it accounted for the complex transformations and distortions that appear between screenings. Using the proposed technique, the contrast ratio improved~57 times enhancing the contrast of the recent changes in the images. The elimination of most of the background and unchanged BI-RADS benign microcalcifications can make the radiologic evaluation of mammograms and the detection of even subtle abnormalities, easier and faster. This reduces the effort and time expended by the radiologist by enhancing the new and, most probably, more diagnostically useful information. With automated BI-RADS classification, the proposed method achieved 81.4% sensitivity, 95.5% specificity and 94.1% accuracy for the detection of true microcalcifications and 81.6% sensitivity, 92.2% specificity, and 90.3% accuracy for the classification of microcalcifications as BI-RADS benign or suspicious, both using an optimised feature set and an ensemble voting model. This high accuracy demonstrates the effectiveness of the algorithm, which provided statistically superior performance over using only the most recent mammograms. The average classification accuracy for the characterisation of true microcalcifications as BI-RADS benign or suspicious improved by 7% with the introduction of temporal subtraction (p = 0.003). To evaluate the robustness of the method, k-fold cross-validation was implemented, in addition to LOPO cross-validation. The algorithm performed at approximately the same level in all the crossvalidation scenarios, indicating that the proposed method is robust and should be able to function equally well as new data becomes available.
The main limitation of this study is the relatively limited dataset acquired from local hospitals with a single protocol. Even though the results presented here are promising, more sequential pairs are required to definitively prove the generalizability of the proposed algorithm. Unfortunately, publicly available databases cannot be exploited for the purposes of this project, since they neither contain sequential mammograms nor they include images annotated at the level of individual microcalcifications as in this study. Another limitation is the fact that although the suspicious microcalcifications were identified by two expert radiologists, the BI-RADS classification of clusters of microcalcifications not only varies from one radiologist to another, but might also be disproved by follow-up or pathology. In addition, the adoption of the BI-RADS classification as the ground truth, without any confirmation by follow-up or pathology, limits the generalizability and the ultimate utility of the tool. This is a consequence of mimicking the human reader rather than offering an unbiased opinion based on the true and confirmed state of the microcalcifications.
The results of this study cannot be readily and directly compared to other state-of-the-art techniques described in the literature for various reasons. The existing and freely available image databases contain only one mammogram per patient (i.e., no prior information). In addition, in several cases, entire images are classified rather than individual microcalcifications [34]. Furthermore, in most state-of-the-art algorithms, crossvalidation is implemented by randomly dividing the microcalcifications into training and test sets, or by using part of the same image in the test and another part in the training set [35]. Such approaches can introduce bias, which results in artificially improved classification results. In general, most studies in the literature report accuracy and AUC, which in the case of benign versus malignant classification of microcalcifications range Table 4 Comparison of the classification results of the true microcalcifications as BI-RADS benign or suspicious using temporal subtraction (TS) of mammograms and using only the most recent mammograms (RM), in a leave-one-patient-out cross-validation scheme between 80−89% and 0.86−0.92% (Supplemental Table  S2). In this study, a more appropriate approach was selected, which assigned the entire data set corresponding to a patient either to the training or the test set, and performed LOPO cross-validation. In order to prove the effectiveness of temporal subtraction in a fair manner, the results were compared to an optimised algorithm using only the most recent mammograms of the same dataset.
There are no examples of temporal analysis for microcalcification identification and classification in the literature. Furthermore, there are only a few studies on the use of temporal analysis for the characterisation of mass lesions [9][10][11]13]. Rather than image subtraction, their approach was to extract several features from the recent and prior mammograms separately and, then, combine them to improve the classification accuracy. Their results confirmed that the use of prior information could improve the detection and classification of mass lesions. However, temporal analysis offers no benefit when there is no abnormality in the prior screening.
In conclusion, the proposed technique demonstrates that temporal subtraction achieves superior performance in the detection and classification of microcalcifications, based on their BI-RADS category, compared to using only the most recent mammograms. The inclusion of more patients as well as the extension of the approach to detect and classify other abnormalities in mammograms (e.g., breast masses or distortions) can further enhance the diagnostic potential of temporal subtraction. In the future, the proposed methodology has the