Automatic segmentation and classification of breast lesions through identification of informative multiparametric PET/MRI features

Background Multiparametric positron emission tomography/magnetic resonance imaging (mpPET/MRI) shows clinical potential for detection and classification of breast lesions. Yet, the contribution of features for computer-aided segmentation and diagnosis (CAD) needs to be better understood. We proposed a data-driven machine learning approach for a CAD system combining dynamic contrast-enhanced (DCE)-MRI, diffusion-weighted imaging (DWI), and 18F-fluorodeoxyglucose (18F-FDG)-PET. Methods The CAD incorporated a random forest (RF) classifier combined with mpPET/MRI intensity-based features for lesion segmentation and with shape, kinetic, and spatio-temporal texture features for lesion classification. The CAD pipeline detected and segmented suspicious regions and classified lesions as benign or malignant. The inherent feature selection method of RF and alternatively the minimum-redundancy-maximum-relevance feature ranking method were used. Results In 34 patients, we report a detection rate of 10/12 (83.3%) and 22/22 (100%) for benign and malignant lesions, respectively, a Dice similarity coefficient of 0.665 for segmentation, and a classification performance with an area under the curve at receiver operating characteristics analysis of 0.978, a sensitivity of 0.946, and a specificity of 0.936. Segmentation but not classification performance of DCE-MRI improved with information from DWI and FDG-PET. Feature ranking revealed that kinetic and spatio-temporal texture features had the highest contribution for lesion classification. 18F-FDG-PET and morphologic features were less predictive. Conclusion Our CAD enables the assessment of the relevance of mpPET/MRI features on segmentation and classification accuracy. It may serve as a novel computational tool for exploring different modalities/features and their contributions for the detection and classification of breast lesions.
Electronic supplementary material The online version of this article (10.1186/s41747-019-0096-3) contains supplementary material, which is available to authorized users.


Key points
The positron emission tomography/magnetic resonance imaging (PET/MRI) computer-aided segmentation and diagnosis (CAD) system automatically detects, segments, and classifies breast lesions. Automatic lesion segmentation was accurate and improved with information from all modalities. A small number of features mainly from dynamic contrast-enhanced MRI achieves high classification accuracies. The PET/MRI-CAD system allows exploring the value of different imaging modalities and features.

Background
Breast cancer is the most common cancer and the second most common cause of mortality from cancer in women [1]. Early detection and precise diagnosis are important for effective treatment [2], and breast imaging plays a pivotal role in the detection, characterisation, and staging of breast cancer. Recently, multimodal, multiparametric imaging (mpI) including dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI), diffusion-weighted imaging (DWI), and positron emission tomography (PET) has been investigated for an improved differentiation of benign and malignant breast lesions [3]. Such imaging constitutes complex protocols but is promising for a more comprehensive measurement of morphology (MRI), neoangiogenesis (DCE-MRI), tumour metabolism (PET), and microstructure (DWI) in cancerous and benign tissue [3] (Fig. 1).
Due to the increased complexity of the information captured by mpI, computational approaches that enable the quantitative assessment of multivariate measurements have been gaining relevance. Recently, computer-aided detection and diagnosis systems have been proposed to reduce inter- and intra-reader variability and to aid radiologists in the detection and diagnosis of breast cancer [4]. These systems are able to analyse large amounts of imaging data in a short time, detect and visualise complex correlations and patterns, and provide objective and repeatable measurements [5] to increase the accuracy of diagnosis [6]. Computer-aided detection (CADe) systems assist radiologists in localising suspicious regions in medical images, whereas computer-aided diagnosis (CADx) systems support the radiologist in the diagnosis of suspicious regions by providing and analysing information extracted from these regions [7]. These systems show potential to be advantageous in the current clinical scenario [7] where, despite guidelines for DCE-MRI such as the Breast Imaging-Reporting and Data System (BI-RADS®) MRI lexicon [8], inter- and intra-reader variability remains an issue and the human analysis of complex relationships observed in images and the underlying disease remains limited [9]. As yet, the information provided by individual imaging techniques as part of mpI remains poorly understood. To identify the diagnostically relevant parameters captured across DCE-MRI, DWI, and 18F-fluorodeoxyglucose (18F-FDG)-PET, we propose a novel automated data-driven approach: a combined breast lesion segmentation and classification system for mpI data where the system automatically identifies the information in the imaging data that contributes to an accurate segmentation and classification.

Patients
The data used in this retrospective analysis were acquired from an institutional review board-approved prospective, single-institution study [25]. All patients gave written informed consent. At the time of the prospective study, only prototypic PET/MRI scanners were in existence and these were not available at the study centre. Thus, 46 patients were included in this prospective study, in which MRI and a combined computed tomography (CT)/18F-FDG-PET were acquired. All tumours were histopathologically verified. In our retrospective analysis, the CT image was used only as morphologic information for the registration and was purposely not part of segmentation and classification. After applying our automatic CT to MRI registration method, as described below, twelve patients had to be removed from analysis due to registration errors. All excluded cases were patients with large breasts that were considerably compressed, or deformed, in one of the modalities during image acquisition. Misalignments were detected visually by overlaying MRI and CT images. Of the remaining 34 patients, 12 had benign lesions and 22 had malignant lesions (2 patients had multifocal or multicentric cancer). Characteristics of the lesions are listed in Table 1.

Image acquisition
Patients underwent 3T MRI (Tim Trio, Siemens, Erlangen, Germany) in prone position using a four-channel breast coil (InVivo, Orlando, FL, USA) and a combined whole-body PET/CT in-line system (Biograph 64 True-Point®; Siemens, Erlangen, Germany) in prone position.
For DCE-MRI, a split dynamics protocol that combined high-spatial and high-temporal resolution was used [11]. First, a high spatial resolution, pre-contrast coronal T1-weighted turbo three-dimensional fast low angle shot (FLASH) sequence with water-excitation and fat-suppression was acquired with matrix 320 × 320 × 120 and 1-mm isotropic voxel (DCE-MRI pre-contrast imaging, I dce-pre). Subsequently, a DCE coronal T1-weighted volumetric interpolated breath-hold-examination (VIBE) sequence with 17 acquisitions (13.2 s per acquisition) was acquired with matrix 192 × 192 × 72 and 1.7-mm isotropic voxel (DCE-MRI, I dce). Seventy-five seconds after the beginning of the sequence, gadoterate meglumine (Gd-DOTA, Dotarem®, Guerbet, Paris, France) was injected as a bolus at a dose of 0.1 mmol/kg at a rate of 4 mL/s and followed by a 20-mL saline flush at the same injection rate. Then, a FLASH sequence was acquired to capture the peak enhancement of lesions (DCE-MRI peak-contrast imaging, I dce-peak), followed by a VIBE sequence with the same parameters as described above. Finally, a FLASH sequence with the same parameters as described above was acquired (DCE-MRI post-contrast imaging, I dce-post) to depict delayed enhancement lesion morphology. DWI sequences were acquired in the same session, with b values of 50 and 850 s/mm², resulting in two datasets, I dwi b0 and I dwi b850, as well as the derived apparent diffusion coefficient (ADC) mapping, I adc [12] (matrix 172 × 86 × 24, pixel 2.09 × 2.09 mm, slice thickness 5.5 mm). 18F-FDG-PET (matrix 168 × 168 × 74, pixel 4 × 4 mm, slice thickness 3 mm) and CT images (matrix 512 × 512 × 74, pixel 1.37 × 1.37 mm, slice thickness 3 mm) of the thorax were acquired in a hybrid PET/CT scanner and were aligned by the scanner software.

CAD pipeline
We developed a novel automated data-driven combined CADx system for mpI data with MRI and PET. The system enabled automatic detection and segmentation of potentially cancerous regions and classified lesions as benign or malignant. The algorithm first aligned multimodal breast imaging data from DCE-MRI, DWI, and 18F-FDG-PET non-rigidly, and segmented the breast. Then, the system extracted local textural, kinetic, and intensity-based image features from the fused information and detected and classified lesions using a random forest (RF) classifier [10].

Alignment
To collect information at individual positions across modalities, all images were aligned to I dce-pre, which served as the reference coordinate system. Images were registered with the software package Advanced Normalisation Tools (ANTs) [13] using an affine transformation with mutual information as the similarity metric, followed by a non-rigid deformation with symmetric normalisation (SyN) [13] and windowed normalised cross-correlation as the similarity metric (Fig. 3a). As I pet does not provide morphologic information, we registered the corresponding CT image to I dce-pre [14] and subsequently applied the obtained transformation to I pet.

Lesion segmentation
We treated lesion segmentation as a voxel-wise classification problem, where a machine learning algorithm assigned a binary label 1 (lesion) or 0 (non-lesion) to each voxel based on imaging features extracted at that location. As ground truth for training and validation, we used manual annotations by an expert radiologist (3 years of experience) performed on the registered I dce-peak or I dce-post, depending on where the lesion borders were better visible. Annotations were validated by a second expert radiologist with 9 years of experience. All computations were restricted to the breast area, which was segmented using an intensity-based region-growing algorithm [15]. All MRI intensity values were standardised to zero mean and unit standard deviation estimated from the breast area on the pre-contrast images, I dce-pre and I dce. We computed intensity features from all imaging data, from changes of the contrast over time, and from the summed contrast in the DCE-MRI sequence, as specified in Table 2.
An RF classifier model was trained on features extracted from 1000 randomly selected samples per class and patient. The trained model was then used to predict, for a new patient who was not part of the training data set, the segmentation label of each voxel x of the breast based on the computed features (Fig. 3b).
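The class-balanced sampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_voxels_per_class`, the flat label list, and the choice to take all voxels when a class has fewer than 1000 are hypothetical.

```python
import random

def sample_voxels_per_class(labels, n_per_class=1000, seed=0):
    """Pick up to n_per_class voxel indices for each label (0 = non-lesion, 1 = lesion).

    labels: flat list of 0/1 ground-truth labels for the breast voxels of one patient.
    Assumption: if a class has fewer voxels than n_per_class, all of them are used.
    """
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    picked = {}
    for lab, indices in by_class.items():
        k = min(n_per_class, len(indices))
        picked[lab] = sorted(rng.sample(indices, k))
    return picked

# Toy patient: 5000 breast voxels, the first 1200 of which are lesion voxels.
toy_labels = [1] * 1200 + [0] * 3800
picked = sample_voxels_per_class(toy_labels)
```

Sampling the same number of voxels per class and patient keeps the much larger non-lesion class from dominating the training set.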

Lesion classification
After segmentation, the lesion was classified as either benign or malignant based on features extracted per lesion. Intensity-based, kinetic, morphological, and textural features were considered to train a lesion class prediction model, and the obtained model was used to predict malignancy for lesions in the new patient who was not part of the training data set.
Intensity-based features were calculated from DCE-MRI, DWI ADC, and the 18F-FDG-PET map. We tackled the lesion inhomogeneities in the contrast enhancement of DCE-MRI by the method described by Chen et al. [16], where the signal-to-time curves within a lesion were clustered by the fuzzy c-means algorithm and the curve with the highest contrast enhancement rate, the characteristic kinetic curve, was chosen for classification. We used the 25 time points beginning with contrast enhancement (f l-ckc) and the change over time (f l-δckc) calculated by forward difference (four frames) as intensity features. Analogously, I adc and I pet intensities were partitioned into five clusters, and the cluster centres with the lowest ADC value and the highest 18F-FDG uptake were used as features f l-adc and f l-pet.
To capture contrast enhancement kinetics, we fitted a regression function C(t), an asymmetric generalised logistic function multiplied with an exponential term, to the characteristic kinetic curve. In this function, G defines the scaling, α the asymmetry parameter, τ the steepness, and t 1/2 the time of half maximum of the sigmoid function; k defines the terminal slope and β the scaling factor of the exponential term (Additional file 1: Figure S1). We used the parameters α, τ, β, and k as features (f l-kinetic). In addition, we computed summary measures of the curve within a 7-min interval, beginning at the start of contrast enhancement: area under the curve (AuC), maximum enhancement (C max), time to maximum enhancement (T max), time to half maximum enhancement (T 1/2), and maximum analytical derivative δC/δt of the regression function C(t) (MDER).
To obtain textural features, f l-texture, we used a volumetric texture analysis approach based on the grey-level co-occurrence matrix (GLCM) and Haralick texture features [17,18]. We computed the GLCM with 128 grey-value bins and 26 neighbours within the lesion and used its 13 second-order statistics [17]. f l-tex-pre, f l-tex-peak, and f l-tex-post contained the Haralick features obtained from the I dce-pre, I dce-peak, and I dce-post intensity values, respectively.
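The summary measures named above can be illustrated on a sampled enhancement curve. This is a hedged sketch: it works on discrete samples rather than on the fitted regression function, uses the trapezoidal rule and linear interpolation as assumptions, and omits MDER because that requires the analytical form of C(t). The function name `kinetic_summary` and the toy values are hypothetical.

```python
def kinetic_summary(times, curve):
    """Return AuC, C_max, T_max and T_half for a sampled enhancement curve.

    times, curve: equally long lists, times strictly increasing,
    starting at the onset of contrast enhancement.
    """
    # Area under the curve by the trapezoidal rule (assumption).
    auc = sum(
        (times[i + 1] - times[i]) * (curve[i] + curve[i + 1]) / 2.0
        for i in range(len(times) - 1)
    )
    c_max = max(curve)
    i_max = curve.index(c_max)
    t_max = times[i_max]

    # Time at which the curve first reaches half of its maximum,
    # linearly interpolated between neighbouring samples (assumption).
    half = c_max / 2.0
    t_half = None
    for i in range(i_max + 1):
        if curve[i] >= half:
            if i == 0:
                t_half = times[0]
            else:
                frac = (half - curve[i - 1]) / (curve[i] - curve[i - 1])
                t_half = times[i - 1] + frac * (times[i] - times[i - 1])
            break
    return {"AuC": auc, "C_max": c_max, "T_max": t_max, "T_half": t_half}

# Toy curve in arbitrary units: fast wash-in, peak at t = 2, then a plateau.
summary = kinetic_summary([0, 1, 2, 3, 4], [0, 2, 4, 3, 3])
```

For the toy curve, the enhancement peaks at the third sample and crosses half of its maximum at the second sample.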
In addition to the spatial texture analysis, we used a novel temporal texture analysis inspired by the works of Agner et al. [19] and Woods et al. [20]. With this analysis, we characterised the temporal properties of contrast uptake within a lesion, e.g., homogeneity of contrast uptake. To compute the GLCMs, we considered voxel pairs at the same spatial position x but at different time points in the contrast enhancement. We computed the Haralick features from pixel pairs from (I dce-pre , I dce-peak ), (I dce-pre , I dce-post ), and (I dce-peak , I dce-post ), resulting in the feature vectors f l-tex-peak/pre , f l-tex-post/pre , and f l-tex-post/peak .
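A temporal co-occurrence matrix of this kind can be sketched as follows. The code is an illustration under assumptions, not the authors' implementation: the equal-width binning, the value range passed as `lo` and `hi`, the non-symmetrised matrix, and the function names are hypothetical, and only one Haralick statistic (energy, the angular second moment) is shown.

```python
def temporal_glcm(values_a, values_b, n_bins, lo, hi):
    """Normalised co-occurrence matrix of voxel pairs at the same position.

    values_a, values_b: intensities of the same lesion voxels at two time
    points (for example peak-contrast and post-contrast), in the same order.
    """
    def to_bin(v):
        b = int((v - lo) / (hi - lo) * n_bins)
        return min(max(b, 0), n_bins - 1)  # clamp to the valid bin range

    glcm = [[0.0] * n_bins for _ in range(n_bins)]
    for a, b in zip(values_a, values_b):
        glcm[to_bin(a)][to_bin(b)] += 1.0
    total = float(len(values_a))
    return [[count / total for count in row] for row in glcm]

def glcm_energy(glcm):
    """Haralick energy (angular second moment): sum of squared entries."""
    return sum(p * p for row in glcm for p in row)

# Toy lesion with uniform temporal behaviour: every voxel changes the same way.
uniform = temporal_glcm([0.1] * 4, [0.9] * 4, n_bins=4, lo=0.0, hi=1.0)
# Toy lesion with heterogeneous temporal behaviour.
mixed = temporal_glcm([0.1, 0.3, 0.6, 0.9], [0.9, 0.6, 0.3, 0.1],
                      n_bins=4, lo=0.0, hi=1.0)
```

Energy is 1 when all voxel pairs fall into a single matrix cell and decreases as the temporal behaviour within the lesion becomes more varied.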
To obtain morphological feature candidates, f l-morph, we used shape descriptors, as utilised previously in the literature [19,21,22]. Definitions of the shape descriptors are given in Additional file 1: Table S1.

Evaluation of lesion segmentation and classification
To evaluate lesion segmentation, we performed experiments in a leave-one-out cross-validation (LOOCV) fashion, training the segmentation algorithm and feature rankings on all but one example, and applying them to the remaining example not included in the training. The quality of the segmentation was measured on a pixel level by comparing the predicted segmentation with the manually annotated data using the Dice similarity coefficient (DSC) [23] as a similarity measure and sensitivity (true-positive rate) describing the probability of detection. As the RF provides probabilities, we determined the RF threshold as the one that maximises DSC on the training set. Overall performance was obtained by computing the mean of all test DSC scores.
To evaluate lesion classification, we classified lesions into the two classes: benign and malignant. Evaluation was performed in an LOOCV fashion for both ranking the features and determining accuracy. Accuracy was reported as receiver operating characteristic (ROC) area under the curve (AUC) and sensitivity/specificity. The RF threshold was chosen within the training set as the one maximising the F1 score, which is the harmonic mean of precision and sensitivity. All experiments were repeated 20 times, and averages for AUC and sensitivity/specificity are reported. To study the impact of segmentation accuracy on classification, we performed classification on both manually delineated lesions and automatically segmented lesions.
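The DSC and the threshold selection on the training set can be sketched as below. This is a minimal illustration: the candidate threshold grid, the convention of returning 1.0 when both masks are empty, and the function names are assumptions.

```python
def dice(pred, truth):
    """Dice similarity coefficient of two equally long binary label lists."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumption: two empty masks count as perfect agreement.
    return 2.0 * inter / size if size else 1.0

def best_threshold(probs, truth, candidates):
    """Return the candidate threshold that maximises DSC, and that DSC."""
    best_t, best_dsc = None, -1.0
    for t in candidates:
        dsc = dice([p >= t for p in probs], truth)
        if dsc > best_dsc:
            best_t, best_dsc = t, dsc
    return best_t, best_dsc

# Toy training voxels: classifier probabilities and ground-truth labels.
probs = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
truth = [1, 1, 1, 0, 0, 0]
threshold, dsc = best_threshold(probs, truth, candidates=[0.3, 0.5, 0.7])
```

With this definition, a lesion that is missed entirely yields a DSC of 0, which lowers the mean over all test cases.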
In a post-processing step, false-positive blobs were removed by computing connected-components from the segmentations using a six-neighbourhood, and only blobs that partially overlapped with the manual annotation were selected. This step mimics the manual selection of a suspicious region that a radiologist wants to investigate further. For the two benign cases where the lesion was not detected, manual segmentation was used instead of the automatic segmentation. This post-processing step allowed us to evaluate classification accuracy independent of the segmentation performance.
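The blob filtering with a six-neighbourhood can be sketched as follows. The representation of masks as sets of (z, y, x) coordinates and the function name are assumptions made for this illustration.

```python
from collections import deque

def keep_overlapping_blobs(pred, truth):
    """Keep only predicted blobs that share at least one voxel with the annotation.

    pred, truth: sets of (z, y, x) integer voxel coordinates.
    Blobs are connected components under the six-neighbourhood (face neighbours).
    """
    neighbours = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    unvisited = set(pred)
    kept = set()
    while unvisited:
        seed = unvisited.pop()
        blob = {seed}
        queue = deque([seed])
        # Breadth-first flood fill over face-adjacent predicted voxels.
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in neighbours:
                n = (z + dz, y + dy, x + dx)
                if n in unvisited:
                    unvisited.remove(n)
                    blob.add(n)
                    queue.append(n)
        if blob & truth:
            kept |= blob
    return kept

# Toy prediction: one blob touching the annotation and one distant false positive.
pred = {(0, 0, 0), (0, 0, 1), (0, 1, 1), (5, 5, 5), (5, 5, 6)}
truth = {(0, 1, 1), (0, 1, 2)}
filtered = keep_overlapping_blobs(pred, truth)
```

Voxels that touch only diagonally are treated as separate blobs under this neighbourhood.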

Evaluation of feature contribution
We then evaluated the contribution of features collected across the mpI data and ranked their contribution to segmentation and classification based on two measures: (1) RF Gini importance (GI) [10] and (2) minimum-redundancy-maximum-relevance (mRMR) [24]. The GI measures the average amount of information gain using the Gini index splitting criterion during RF training and ranks the contribution of each feature as part of a multivariate pattern. If features are redundant but informative, it ranks all of them highly [25]; the mRMR provides a ranking based on relevance and redundancy of the features. Then, we successively increased the number of features for training and validation, beginning with the top-ranked feature, and measured the performance of each model, thus allowing us to assess the contribution of each individual feature in a multimodal, multiparametric setup. In addition, the benefits of multiparametric and multimodal features were evaluated by training models using only DCE-MRI features and combined DCE-MRI, DWI, and/or 18F-FDG-PET features.
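The idea of ranking by relevance and redundancy can be illustrated with a greedy sketch on discretised features. This is an assumption-laden toy, not the ranking code used in the study: it uses the difference criterion (relevance minus mean redundancy), plug-in mutual information estimates on discrete values, and hypothetical feature names.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) of two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items()
    )

def mrmr_rank(features, target):
    """Greedy ranking: relevance to target minus mean redundancy with chosen features.

    features: dict mapping a feature name to its list of discrete values.
    """
    remaining = list(features)
    ranked = []
    while remaining:
        def score(name):
            relevance = mutual_information(features[name], target)
            if not ranked:
                return relevance
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in ranked
            ) / len(ranked)
            return relevance - redundancy
        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked

# Toy data: f_b duplicates f_a, f_c is weaker but less redundant with f_a.
target = [0, 0, 0, 0, 1, 1, 1, 1]
toy_features = {
    "f_a": [0, 0, 0, 1, 1, 1, 1, 1],
    "f_b": [0, 0, 0, 1, 1, 1, 1, 1],
    "f_c": [0, 1, 0, 1, 0, 1, 1, 1],
}
ranking = mrmr_rank(toy_features, target)
```

In the toy data the duplicate feature is ranked last even though it is as relevant as the first pick, whereas an importance measure that scores features individually would place both copies near the top.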

Lesion segmentation
We report the segmentation results in Table 3 and illustrate them in Additional file 1: Figure S2. As shown in Fig. 4, the missed benign lesions had a very low contrast uptake and thus were missed by the prediction models. The performance of the GI and mRMR feature selection models with an increasing number of highest-ranked features is shown in Fig. 5a. The performance of the GI feature selection model peaked at only three features whereas the performance of the mRMR feature selection model peaked at six features. Table 4 shows the ranking of the features according to GI and mRMR. Both algorithms ranked f dwi, f nsum-dce, and I dce-post highly. However, mRMR tended to pick more varied features than GI, which selected six potentially correlated features from f dce as part of the top 10 features. The features capturing changes in the contrast, f δdce and f δmri, received a lower ranking in GI (see also Fig. 5b) compared with mRMR.

Lesion classification
In Table 5, we list the results for the models showing the highest ROC AUC score after GI and mRMR feature selection. Overall, for manually annotated lesions, mRMR feature selection yielded the highest AUC (0.978) using only two features, with a sensitivity of 94.6% and specificity of 93.6%. The performance of the GI and mRMR feature selection models with an increasing number of highest-ranked features is shown in Fig. 6a. The mRMR feature selection model peaked at only two features whereas the GI feature selection model peaked at four features, with a subsequent decrease in AUC performance. A closer look at the ranking of the features (Table 6 and Fig. 6b) indicates that features from the pool of kinetic (f l-kinetic) and textural (f l-texture) features were top-ranked by GI and mRMR models. Morphologic (f l-morph) and PET (f l-pet) features received a low ranking by GI and mRMR models. The DWI ADC feature (f l-adc) was ranked as an important feature by GI in automatic segmentation only.

Discussion
We present a novel data-driven combined breast lesion segmentation and classification system for mpI data with combined 18F-FDG-PET/MRI. This system automatically detects and segments potentially cancerous regions and classifies lesions as benign or malignant. Our results showed that automatic lesion segmentation was accurate and improved with information from all modalities, but even a small number of features were sufficient to achieve the reported maximum accuracy. On the other hand, our results showed that lesion classification largely drew on information from DCE-MRI, without benefitting from information from other modalities and parameters. The results are consistent with previous findings but add insights into the feasibility of a completely automated lesion segmentation and of classification from mpI data. The results were obtained by quantifying the information captured across multimodal mpI data and features, enabling the assessment of imaging protocols in this context. Using combined mpI based on DCE-MRI, DWI, and 18F-FDG-PET in a CADe or CADx system is a novel promising approach for improving diagnostic accuracy [26]. Previously, CADe and CADx systems have been proposed for digital mammography to increase the rather moderate sensitivity [27] and to help in classifying lesions as benign or malignant [28]. Semi-automatic methods have been proposed for classifying each pixel as cancerous or non-cancerous using fuzzy c-means clustering [29] or Markov random field-based clustering of the time-series [30]. Moreover, methods designed to outline lesions using the active contour framework (i.e., autonomous and adaptive search of object contours based on image features and user interaction) have also been presented [31,32]. Automatic segmentation methods, which may also be seen as CADe systems, have been proposed using machine-learning approaches based on intensity and textural features (co-occurrence, run-length) [20,33,34,35].
Recently, an automated localisation of breast cancer lesions based on DCE-MRI was proposed by Gubern-Mérida et al. [36]. Multimodal approaches combining several modalities have been reported for PET/CT breast images: Han et al. [37] segmented lesions by applying a graph-based Markov random field method on a combined PET/CT image, taking advantage of both the high spatial resolution of CT and the functional information of PET. Lastly, several CADx methods that classify breast lesions as benign or malignant by exploring the DCE-MRI data have been proposed using morphology [38], lesion texture [39], contrast enhancement [16,40], a combination of morphology and contrast enhancement [41], or a combination of morphology and texture [19,21,31,42,43]. State-of-the-art DCE-MRI CADx methods have been reported using various performance metrics, different datasets (e.g., malignant cases only), and differing aims (i.e., segmentation versus detection).
Using our system, we detected all malignant cases and missed two benign lesions. Detected lesions were classified as malignant with a sensitivity of 95%. Using texture features, Woods et al. [20] and Yao et al. [35] previously reported an ROC-AUC of 0.999 and 0.984, respectively. However, Woods et al. performed the evaluation on the same subjects as used in training, and both these studies were conducted in a small set of malignant lesions only. Twellmann et al. [33] reported an ROC-AUC of 0.99 for lesion detection using LOOCV and DCE-MRI information. Vignati et al. [34] reported the performance of a fully automated system as a detection rate of 0.89 and a sensitivity of 0.98 at four false-positive cases per breast. In their study, the performance measure did not include false-positive areas. Gubern-Mérida et al. [36] used an automated method and achieved a sensitivity of 89% at four false positives per normal case. As normal cases, they included patients with a BI-RADS rating of 1 or 2, who were healthy subjects with benign findings.
Table 6 The ten top-ranked classification features according to Gini importance and minimum-redundancy-maximum-relevance
For the task of automatic lesion segmentation, our study showed that mpI is beneficial, as evidenced by the increase of the DSC from 0.584 to 0.665. The high ranking of DWI features in both GI and mRMR feature selection models indicates that the addition of DWI to DCE-MRI is especially beneficial in segmentation. We also found that lesion segmentation benefitted from the addition of PET, although to a lesser extent than from the addition of DWI. When both DWI and PET were added, the DSC was further improved; thus, our results suggest that PET has a complementary relationship with DWI. Interestingly, features describing the change of contrast between time-steps (f δdce and f δmri) received a good ranking in the mRMR feature selection model overall but a low ranking in the GI feature selection model. A likely reason is that while they contribute less information than the higher-ranked GI features, their contribution is orthogonal to the higher-ranked features. In our study, mRMR as a feature selection model provided slightly better results than GI. The moderate mean DSC score for lesion segmentation has several causes. First, the two undetected benign lesions exhibited very low contrast enhancement with a DSC of 0, leading to a drop in the mean value. However, we kept these two benign cases in the dataset to evaluate whether additional parameters may allow the system to segment these challenging cases, which was not the case as reported. Second, additional areas of contrast uptake, such as vessels and enhancing parenchymal tissues, resulted in an increased false-positive rate. While DWI and 18F-FDG-PET image modalities increased automatic segmentation accuracy, mainly by reducing the false-positive cases, lesions with low contrast uptake could not be detected automatically.
As good segmentation is important for the accurate classification of a lesion, we aim to improve the segmentation performance, e.g., by introducing heuristics that filter false-positive cases in a post-processing step in a future study, as proposed for instance by Vignati et al. [34] and Gubern-Mérida et al. [36] where morphologic and kinetic descriptors were used in a second step.
In our study, a high accuracy in lesion classification was achieved for both expert and automatic segmentation. However, the highest accuracy was achieved with manual segmentation and mRMR feature selection from DCE imaging data. Top-ranked features largely overlapped between GI and mRMR feature selection models; the exception was that f l-adc was ranked highly by the GI feature selection model following only automatic segmentation. While the addition of DWI and 18F-FDG-PET to DCE-MRI was beneficial overall for lesion segmentation, lesion classification only improved slightly with these two modalities for GI feature selection following manual segmentation. Lesion classification for mRMR feature selection was best without these two modalities. f l-pet received a low ranking, consistent with recent findings by Magometschnigg et al. [44] that indicate that quantitative 18F-FDG-PET values are not helpful for breast cancer classification. On the other hand, the kinetic feature f l-kinetic received a high GI as well as a high mRMR ranking. Textural features were top-ranked, mostly from f l-tex-post/peak. The top-ranked feature, GLCM energy, measures the uniformity of lesion texture, reflecting the uniformity of contrast enhancement within the lesion during a later stage. The morphologic features f l-morph scored very low, although they are an integral part of the BI-RADS® lexicon for lesion classification, being discriminative features for clinical diagnosis, as shown by Pinker-Domenig et al. [45]. This suggests that binary segmentation and shape descriptors are not precise enough to describe the shape and margin of the lesion, and that feature extraction from a soft margin around the hard segmentation border (e.g., textural features) may better capture the BI-RADS margin descriptors (circumscribed, non-circumscribed, irregular, spiculated).
Alternatively, digital mammography or digital breast tomosynthesis may be used as an additional higher resolution modality to assess the morphology of the lesion more accurately. To summarise, mRMR slightly outperformed GI as a feature selection method for breast lesion classification. Novel DCE-MRI features that describe the kinetics and spatio-temporal texture of the contrast uptake were highly predictive for the classification of benign and malignant lesions, whereas DWI and PET did not provide additional information. Whereas we used data from separate MRI and PET/CT scanners, the methods, results, and findings can be directly transferred to images obtained at combined PET/MRI scanners, as the CT information was used for alignment only and was not part of the decision models.
One limitation of the study is that only subjects with suspicious findings on mammography or breast ultrasonography were included. As a consequence, an assessment of false-positive cases in healthy subjects was not possible. However, the majority of tissue in the breast consists of healthy tissue, on which the classifier was trained, and was classified as healthy tissue in our study. A second limitation is the small number of subjects. Even though cross-validation allowed us to estimate the generalisation of the model to some degree, statistical significance can only be obtained from a larger cohort. Thus, we aim to confirm our preliminary findings on a larger number of patients in a future study.
In conclusion, we used an entirely data-driven approach in combination with the assessment of the contribution of individual imaging parameters to provide a means for