Diagnostic performance of machine learning applied to texture analysis-derived features for breast lesion characterisation at automated breast ultrasound: a pilot study

Background Our aims were to determine if features derived from texture analysis (TA) can distinguish normal, benign, and malignant tissue on automated breast ultrasound (ABUS); to evaluate whether machine learning (ML) applied to TA can categorise ABUS findings; and to compare ML to the analysis of single texture features for lesion classification. Methods This ethically approved retrospective pilot study included 54 women with benign (n = 38) and malignant (n = 32) solid breast lesions who underwent ABUS. After manual region of interest placement along the lesions’ margin as well as in the surrounding fat and glandular breast tissue, 47 texture features (TFs) were calculated for each category. Statistical analysis (ANOVA) and a support vector machine (SVM) algorithm were applied to the texture features to evaluate the accuracy in distinguishing (i) lesions versus normal tissue and (ii) benign versus malignant lesions. Results Skewness and kurtosis were the only TFs significantly different among all four categories (p < 0.000001). In subsets (i) and (ii), maximum areas under the curve of 0.86 (95% confidence interval [CI] 0.82–0.88) for energy and 0.86 (95% CI 0.82–0.89) for entropy, respectively, were obtained. Using the SVM algorithm, a maximum area under the curve of 0.98 was obtained for both subsets, with a maximum accuracy of 94.4% in subset (i) and 90.7% in subset (ii). Conclusions TA in combination with ML might represent a useful diagnostic tool in the evaluation of breast imaging findings in ABUS. Applying ML techniques to TFs might be superior to the analysis of single TFs. Electronic supplementary material The online version of this article (10.1186/s41747-019-0121-6) contains supplementary material, which is available to authorized users.


Background
In women with dense breast tissue, the combined use of mammography and hand-held ultrasound (HHUS) in breast cancer screening increases the breast cancer detection rate by an additional 2-4 cancers per 1,000 women [1][2][3][4]. However, the use of HHUS in the screening setting remains controversial due to its inherent limitations, including the lack of standardisation and the necessary level of operator experience [4,5]. In recent years, automated breast ultrasound (ABUS) has been introduced to overcome some of the limitations of HHUS. ABUS provides technique standardisation via the acquisition of standardised views as well as scanning parameters and resolves the issue of operator subjectivity and variation [6]. Nevertheless, interpretation of imaging findings remains highly dependent on reader skills and experience. Standardised acquisition in terms of scanning parameters (e.g., focus, gain) offers the opportunity to apply tools for image analysis that can support the characterisation of imaging findings.
Texture analysis (TA) is an integral part of the emerging field of radiomics and allows a quantitative and objective assessment of tissue heterogeneity by evaluating the distribution and relationship of pixel or voxel grey levels in the image [7,8]. In most cases, methods based on statistical analysis are used to represent the interdependence of grey-level values. TA applied to computed tomography and magnetic resonance imaging has already shown promising results in predicting pathologic features, prognosis and response to therapy for various diseases and body compartments and can potentially be used in ABUS imaging for lesion analysis [9][10][11][12][13][14][15][16]. Moreover, machine learning (ML) can be applied to data from TA such that algorithms are trained to learn specific patterns and categorise the imaging findings [17].
In this context, the primary purpose of our study was to determine if features derived from TA can be used to distinguish normal tissue, malignant and benign solid lesions in ABUS. Second, we evaluated whether ML applied to TA data can accurately categorise ABUS findings. Third, we compared ML to the analysis of single texture features to categorise ABUS findings based on TA.

Study subjects
The local ethics board approved this retrospective study ("Kantonale Ethikkommission Zurich"; Approval Number: 2016-00064). The need for informed consent was waived. Between December 2015 and June 2017, all women with at least one histologically proven malignant lesion (n = 27; median age 54 years; range 30-85 years) who underwent ABUS imaging were identified from the hospital database (University Hospital Zurich). An equal number of women (n = 27) with at least one benign solid lesion (median age 44 years; age range 27-73 years) who underwent ABUS during the same study period were also included. In case of a malignant lesion, the histological type was collected. All benign solid lesions had to be either histopathologically proven fibroadenomas or stable lesions with a follow-up of at least 24 months. ABUS was performed in addition to mammography in 39 women with American College of Radiology breast density category c or d [18] undergoing screening examination and as the only imaging examination in 15 women younger than 40 years undergoing routine controls. None of the patients was symptomatic or had a strong family history of breast cancer (i.e., no BRCA1 or BRCA2 mutation carriers, no first-degree relatives of BRCA1 or BRCA2 mutation carriers, and no women with three or more events of ovarian cancer or male breast cancer or breast cancer in women younger than 60 years in first- or second-degree relatives in either maternal or paternal line). The maximum diameter in ABUS was annotated for all lesions.

ABUS examination
Images were acquired with ABUS (Invenia™ Automated Breast Ultrasound System, General Electric Healthcare, Sunnyvale, CA, USA) using a C 15-6XW reverse curve, 5-14 MHz transducer with an aperture length of 15.3 cm, a transducer travel distance of 16.9 cm, and a depth up to 5 cm. An abundant layer of water-based lotion was applied to the breast in order to maximise the coupling between the transducer and the skin. The standard acquisition included three volumes per breast (so-called anteroposterior, lateral, and medial) in order to guarantee coverage of the entire breast. Slices had a thickness of 0.5 mm. Volume acquisitions were performed in the axial plane, and the 3D reconstructions in the sagittal and coronal planes were automatically provided using a dedicated workstation.

Image selection and texture analysis
All axial images encompassing the lesion in the three volumes were analysed separately. Images in which the visibility of the lesion was altered because of artefacts (i.e., inadequate compression during the volume acquisition or inadequate lotion with impaired acoustic coupling at the contact surface between the transducer and the skin) were excluded from the analysis (n = 63). These images were in general only part of a patient examination (e.g., two to three images in one of the volumes) and did not lead to the complete exclusion of any patient. Normal fat and fibroglandular tissue were evaluated in two additional, arbitrarily selected images for each patient, usually from the upper outer quadrant (in patients with malignant lesions in the contralateral breast) in order to evaluate the texture features of normal breast tissue. The image selection was performed by a radiologist with 8 years of experience in breast imaging and 3 years of experience in ABUS imaging.
TA was performed in MATLAB (v2016b, The MathWorks Inc., Natick, MA, USA) with an established routine-based procedure, as already described [19,20]. A region of interest (ROI) was drawn freehand by a radiologist (with 8 years of experience in breast imaging) who delineated the outer edge of the lesion or the maximal continuous area of fibroglandular or fat tissue included in a single image. A second radiologist (with 7 years of experience in breast imaging) performed the same evaluation in five benign and five malignant lesions. In order to minimise intrascanner effects, ROI content normalisation between the mean and three standard deviations (μ ± 3σ) was performed as a first step of the TA [21,22]. Subsequently, 47 features were computed [9] (Table 1). The first order features (entropy, variance, skewness and kurtosis) were directly extracted from the histogram of all grey levels in the ROI. The second and high-order features were derived from the respective grey-level matrices (i.e., grey-level co-occurrence matrix [GLCM]; grey-level run length matrix [GLRLM] and grey-level size zone matrix [GLSZM]) and included more information concerning grey-level distribution by accounting for the relative position of each pixel with respect to the other pixels of the image [9,23].
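As an illustration of the first order features named above, the following is a minimal Python sketch, not the MATLAB routine used in the study. It assumes that the μ ± 3σ normalisation clips grey levels to that range and rescales them to a fixed number of bins; the bin count of 64, the function names, and the use of non-excess kurtosis are hypothetical choices made only for this example.

```python
import math
from collections import Counter


def normalise_roi(pixels, levels=64):
    """Clip grey levels to mean +/- 3 SD and rescale to integer bins 0..levels-1.

    The number of bins (64) is an assumption of this sketch.
    """
    n = len(pixels)
    mu = sum(pixels) / n
    sigma = math.sqrt(sum((p - mu) ** 2 for p in pixels) / n)
    lo, hi = mu - 3 * sigma, mu + 3 * sigma
    if hi == lo:  # constant ROI: every pixel falls into the same bin
        return [0] * n
    binned = []
    for p in pixels:
        clipped = min(max(p, lo), hi)
        index = int((clipped - lo) / (hi - lo) * levels)
        binned.append(min(index, levels - 1))  # keep the upper edge in the last bin
    return binned


def first_order_features(grey_levels):
    """Histogram-based entropy, variance, skewness and (non-excess) kurtosis."""
    n = len(grey_levels)
    mu = sum(grey_levels) / n
    variance = sum((g - mu) ** 2 for g in grey_levels) / n
    sd = math.sqrt(variance)
    if sd > 0:
        skewness = sum((g - mu) ** 3 for g in grey_levels) / n / sd ** 3
        kurtosis = sum((g - mu) ** 4 for g in grey_levels) / n / sd ** 4
    else:
        skewness = kurtosis = 0.0
    counts = Counter(grey_levels)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "entropy": entropy,
        "variance": variance,
        "skewness": skewness,
        "kurtosis": kurtosis,
    }
```

In use, the pixel values inside a freehand ROI would first be passed through `normalise_roi` and the resulting bins through `first_order_features`; the second and high-order matrix features are not covered by this sketch.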

Machine learning
Data preparation
Preprocessing and preparation of the dataset for ML were performed with routines written in Python and Scikit-learn (www.scikit-learn.org, release 0.18.1). All features obtained from texture analyses were standardised for the whole dataset using the Scikit-learn-embedded "StandardScaler" class, by removing the mean and scaling the data to unit variance. To account for multiclass classification, the dataset with four classes (malignant lesions, benign solid lesions, fat tissue, glandular tissue) was split into two balanced sub-datasets, each consisting of two classes: (i) solid lesions versus normal fat and glandular tissue and (ii) malignant lesions versus benign solid lesions. To measure the unbiased performance of the classifier, each sub-dataset was randomly shuffled and split in a stratified manner into training and validation partitions, with a ratio of 0.8:0.2. The validation partition was excluded from the training process, serving as "unseen" real-world data. Special attention was paid to ensuring that each TA dataset in each validation partition was acquired from an individual patient.
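The preparation steps described above can be sketched with current Scikit-learn calls as follows. This is a minimal illustration on synthetic data, not the study's routines: the array sizes, the random seeds, and the variable names are assumptions, and the random matrix merely stands in for the 47 texture features of one two-class sub-dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one balanced two-class sub-dataset (assumption):
# 200 ROIs x 47 texture features, 100 ROIs per class.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 47))
y = np.repeat([0, 1], 100)

# Standardise every feature to zero mean and unit variance over the whole
# dataset, as stated in the text.
X_std = StandardScaler().fit_transform(X)

# Shuffled, stratified 0.8:0.2 split into training and validation partitions.
X_train, X_val, y_train, y_val = train_test_split(
    X_std, y, test_size=0.2, stratify=y, shuffle=True, random_state=0
)
```

Stratification keeps the class ratio identical in both partitions. Keeping all images of one patient within a single partition would additionally require a group-aware splitter such as Scikit-learn's `GroupShuffleSplit`, which is not shown here.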

Support vector machine classifier
An ML model based on the support vector machine (SVM) algorithm with radial basis decision function and fivefold cross-validation was implemented using Scikit-learn. In order to determine the optimal hyperparameters for the SVM, a nested grid search on each fold was implemented by specifying the parameters gamma and C on a logarithmic scale from 0.00001 to 0.001 and from 1 to 1,000, respectively. On the training partition, for each sub-dataset, the mean cross-validation accuracy of the classifier for each combination of the specified parameter values was calculated across the folds and depicted as a heatmap as a function of C and gamma. The parameter combination reaching the highest validation accuracy for the corresponding sub-dataset was chosen for the classification task on the test dataset.
Table 1 First order and second and high order texture features
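A hyperparameter search of the kind described in this section can be sketched as below. The sketch uses a plain (non-nested) grid search with fivefold cross-validation on synthetic data; the decade-wise grid steps, the synthetic dataset, and the random seeds are assumptions, since the text specifies only the ranges of C and gamma.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic two-class data standing in for a training partition (assumption).
X, y = make_classification(
    n_samples=200, n_features=47, n_informative=10, random_state=0
)

# Logarithmic grids over the ranges given in the text; decade steps are assumed.
Cs = [1, 10, 100, 1000]
gammas = [1e-5, 1e-4, 1e-3]

search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": Cs, "gamma": gammas},
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

# Mean cross-validation accuracy per (C, gamma) pair, arranged as the
# matrix behind a heatmap: rows follow Cs, columns follow gammas.
heatmap = np.zeros((len(Cs), len(gammas)))
for params, score in zip(
    search.cv_results_["params"], search.cv_results_["mean_test_score"]
):
    heatmap[Cs.index(params["C"]), gammas.index(params["gamma"])] = score

best_C = search.best_params_["C"]
best_gamma = search.best_params_["gamma"]
```

The `heatmap` array can be passed to any plotting routine; the combination stored in `best_C` and `best_gamma` is the one with the highest mean cross-validation accuracy.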

Feature selection
To select the reduced feature set (RFS) of optimal features with superior discriminative power from the full feature set (FFS), a recursive feature elimination with cross-validation (RFECV) was performed on each of the sub-datasets. Each individual feature was ranked, and the best set of features according to the classification accuracy was selected. This selection process initially included all 47 features of the dataset and then, with each iteration, gradually removed those features that contributed least to the classifier performance. The feature ranking was generated according to the iteration at which the corresponding feature was removed, and an optimal number of features was determined [24]. Subsequently, the three previously defined data subsets in the training and validation partition were reduced to the optimal features obtained by RFECV, and the SVM classifier was trained and tested again on the RFS applying the same preprocessing steps and hyperparameter tuning as for the FFS.
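The elimination procedure described above can be illustrated with Scikit-learn's `RFECV` class. The sketch below is an assumption-laden example rather than the study's configuration: the text does not state which estimator ranked the features, and because `RFECV` needs per-feature weights, a linear-kernel SVM is used here as a stand-in. The synthetic data, the feature count of 20, and the seeds are likewise hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic dataset (assumption): 150 samples, 20 features.
X, y = make_classification(
    n_samples=150, n_features=20, n_informative=5, n_redundant=5, random_state=0
)
X = StandardScaler().fit_transform(X)

# Linear kernel is an assumption of this sketch: it exposes coef_, which
# RFECV uses to decide which feature to drop at each iteration.
selector = RFECV(
    estimator=SVC(kernel="linear", C=1.0),
    step=1,  # remove one feature per iteration
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
selector.fit(X, y)

X_reduced = selector.transform(X)            # data restricted to the selected features
selected = np.flatnonzero(selector.support_)  # column indices of the selected features
ranking = selector.ranking_                   # 1 = selected; larger = removed earlier
```

A classifier can then be retrained on `X_reduced` with the same preprocessing and hyperparameter tuning as for the full feature set.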

Statistical analysis
Normally distributed data are reported as means with standard deviations; otherwise, data are reported as medians with interquartile ranges (IQR). Normal distribution was assessed by using the Kolmogorov-Smirnov test. A one-way analysis of variance was performed for comparison of all texture features among malignant lesions, benign solid lesions and fat and fibroglandular tissue with post hoc Bonferroni correction (only p values less than 0.0001 were considered significant). An unpaired t test was used to compare all texture features between lesions (benign and malignant) and normal tissue (fibroglandular and fat tissue). The receiver operating characteristic (ROC) curve was computed in the case of features with significant differences. The linear relationship between the different texture features in the FFS was graphically reported via a correlation matrix. For each data subset and corresponding set of features (FFS, RFS) of the validation partition, the overall and tissue-specific performance of the SVM classifier were quantified in terms of classification accuracy and metrics of the confusion matrix [25]. From the generated classification probabilities and confusion matrices, sensitivity and 1-specificity were extracted, and the area under the curve (AUC) was calculated. AUCs     Table S1).
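The confusion-matrix metrics and the AUC mentioned in the statistical analysis can be written out explicitly. The following is a minimal sketch with hypothetical function names; it computes the AUC as the probability that a randomly chosen positive case receives a higher classifier score than a randomly chosen negative case, which equals the area under the empirical ROC curve. It does not compute the confidence intervals or the comparison of ROC curves reported in the study.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # 1 - false-positive rate
    }


def auc_from_scores(scores_pos, scores_neg):
    """Empirical AUC: share of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

With malignant lesions taken as the positive class, a malignant lesion classified as benign counts as a false negative and lowers the sensitivity, whereas a benign lesion classified as malignant counts as a false positive and lowers the specificity.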

Machine learning
Correlation matrices for each sub-dataset (lesion versus tissue and benign versus malignant) with the FFS are displayed in Additional file 1: Figure S1A and S1B, respectively, showing significant co-correlation of several features among the higher-order features in A.

Sub-dataset (i): solid lesions versus normal tissue
The validation dataset included 105 images (54 images of lesions and 51 images of normal tissue). For the classification of lesions versus normal tissue, the optimal hyperparameters for the FFS were 1,000 and 0.001 for C and gamma, respectively (Additional file 1: Figure S2A). Classification accuracies of 92.8% on the training set and of 93.3% on the validation set (Table 4) were reached, with 3.8% of all images in the validation partition being falsely classified as normal tissue and 2.9% as lesion instead of normal tissue (Table 5). ROC analyses revealed an AUC of 0.96 (95% CI 0.89-0.98) for the validation set (Fig. 2). After training and validating the SVM classifier on the FFS, a recursive feature elimination with cross-validation was performed, determining 14 features (Fig. 4a) as optimal features, composing the RFS. For the RFS, a correlation matrix was generated and the optimal hyperparameters were determined as C = 1,000 and gamma = 0.00001 (Additional file 1: Figures S1C and S2B).
Training and validation accuracies were 91.3% and 94.4%, respectively, with 1.9% of all images being falsely classified as lesions and 3.8% as normal tissue (Tables 4 and 5). The AUC for the RFS measured 0.98 (Fig. 2). For all texture feature-derived ROC curves shown (only features with AUC values equal to or greater than 0.80) compared with the ML-derived ROC curve, p values were < 0.05 (ranging from 0.003 to 0.02), indicating a significant difference between the areas. The two lesions incorrectly classified as normal tissue were one malignant and one benign (Fig. 5).

Sub-dataset (ii): malignant versus benign solid lesions
The validation dataset included 54 images (27 images of lesions and 27 images of normal tissue). For the classification of malignant versus benign solid lesions, the optimal hyperparameters for the full feature set were 100 and 0.001 for C and gamma, respectively (Additional file 1: Figure S2C). The accuracy measured 89.0% on the training set and 90.7% on the validation set, with 7.4% of all lesions being falsely classified as benign and 1.9% falsely as malignant (Tables 4 and 5, Fig. 5). The AUC measured 0.98 (Fig. 3). After RFECV, a correlation matrix for the reduced feature set of 25 features (Fig. 4b) was generated applying the optimal hyperparameters of C = 1,000 and gamma = 0.001 (Additional file 1: Figures S1D and S2D). The classification accuracy for the RFS was 89.0% on the training and 87.1% on the validation partition (Table 4). After feature reduction, the proportion of malignant lesions falsely classified as benign increased to 9.2% and the AUC decreased to 0.96 (Fig. 3). The ROC curve for entropy, derived from texture analysis, was significantly different (p = 0.003) from the ML-derived ROC curve.

Discussion
In the current study, we demonstrated that texture feature analysis of breast imaging findings in ABUS examinations might be used to differentiate malignant and benign solid lesions as well as normal tissue of the breast with high accuracy. We also showed that ML applied to texture data might be superior to the statistical analysis of single texture features. Although the interrelation between the data derived from TA and potential underlying biological properties has not yet been resolved, a number of previous works have investigated the use of TA to quantify spatial heterogeneity of benign and malignant lesions in images acquired with different modalities [9][10][11][12][13][14][15][16]. A limited number of studies explored the use of TA or ML in ultrasound imaging for characterisation of breast lesions [28][29][30]. Indeed, the application of TA in conventional B-mode imaging is hindered by variations of scanning parameters that can cause unwanted variations in the assessment of texture features. Standardised acquisitions in ABUS can in part overcome these limitations. In our study, a number of texture features exhibited significant differences when used to distinguish solid breast lesions from normal tissue as well as malignant from benign solid lesions, with a relatively high AUC of up to 0.86 in both cases. ML offers the possibility to train algorithms to recognise patterns of data derived from the analysis of multiple texture features instead of referring to a single feature. The use of an ML model based on the SVM algorithm with radial basis function increased the AUC to a maximum of 0.96 in the differentiation of lesions versus normal tissue as well as in the differentiation of malignant versus benign lesions, with a maximal accuracy of 94.4% and 90.7%, respectively.
The use of recursive feature selection in the test datasets for differentiation of lesions versus normal tissue resulted in an increase in the AUC to 0.98, whereas for malignant versus benign lesions the AUC slightly decreased to 0.96. Moreover, application of the reduced feature sets resulted in nearly the same accuracies for the training data and even a slightly higher accuracy of 94.4% for the test dataset differentiating lesions versus normal tissue. These excellent performances for the full as well as for the reduced feature sets and the associated low amount of overfitting emphasise the robustness and stability of the applied ML model. In many cases, overfitting occurs when the ML algorithm is fitted too closely to details and noise in the training data, which negatively affects the performance on real-world data. In order to minimise overfitting, the SVM in our study was trained via cross-validation, dividing the training data into subsets of equal size, which also provided advantages with respect to the limited number of data points. In addition, the robustness can be attributed, to some extent, to the special attention paid to acquiring balanced datasets; no oversampling techniques were applied to synthetically generate data [31].
Previous studies reported that the use of supplemental ABUS in breast cancer screening programmes causes an increase of the recall rate [6,32]. Moreover, misinterpretation of lesions along with the presence of multiple distracting lesions are determining factors in the case of undiagnosed cancers at supplemental screening ultrasonography [33]. Although computer-aided-detection (CAD) software for ABUS offers the potential to improve radiologists' performances in detecting breast cancer, characterisation of the imaging findings remains a major issue [34,35]. In a recent study, van Zelst et al. [35] showed that the AUCs of conventional ABUS reading and computer-aided-detection-based reading performed by eight radiologists with variable years of ABUS experience were not significantly different (0.82 and 0.83, respectively). The combined use of CAD software with algorithms that enable TA combined with ML might overcome the relative limitations of the two approaches (i.e., the limited specificity of CAD and the need for aided detection in TA combined with ML). Although the differentiation of breast lesions from normal breast tissue was quite straightforward in our cases, we decided to include this evaluation as well, considering the potential role of ML algorithms integrated in the software for ABUS image evaluation. A maximal accuracy of 94.4% was observed when comparing normal tissue versus breast lesions. More importantly, in our study, a very high specificity (maximal 96.3%) was achieved in the comparison of benign versus malignant lesions using ML.
Our study has some major limitations. First, the analysis is underpowered owing to the limited number of cases. Nevertheless, the purpose of our pilot study was to present a possible approach for the evaluation of breast imaging findings in ABUS and to highlight some differences when TA information is used alone or in conjunction with ML. A study, possibly prospective, including a higher number of cases is necessary to confirm our results. Second, the high number of evaluated images was derived from a relatively low number of different lesions, which could have biased the results. Nevertheless, both malignant and benign solid lesions were collected from the general female population referred to our department for screening or follow-up examination of known lesions, presumably forming a sufficiently representative group of solid breast lesions. Third, we did not compare the performance of TA and ML with the performance of radiologists with different levels of experience, which was beyond the scope of this study. Also, although the inter-reader agreement for the assessment of the texture feature measurements was evaluated only in part of the lesions, we could demonstrate a high reproducibility of the measurements for all features.
In conclusion, our pilot study demonstrated that TA in combination with ML might represent a useful diagnostic tool in the evaluation of ABUS findings. Applying ML techniques to texture features might be superior to the analysis of single texture features. A prospective study including a higher number of cases is necessary to confirm our results.

present in particular among the higher order features in A (e.g., SRE[GLCM] and HGRE[GLCM]) as possible reflection of underlying common biological properties.
Figure S2. Heatmaps depicting the optimal hyperparameters for the full feature (A, B) and the reduced feature training datasets (C, D). The hyperparameter tuning was implemented via nested grid search on the SVM classifier by specifying the parameters gamma and C on a logarithmic scale from 0.00001 to 0.001 and from 1 to 1,000, respectively. (DOCX 4241 kb)

Received: 18 April 2019 Accepted: 28 August 2019