Prediction of lipomatous soft tissue malignancy on MRI: comparison between machine learning applied to radiomics and deep learning
European Radiology Experimental volume 6, Article number: 41 (2022)
Malignancy of lipomatous soft-tissue tumours is suspected on magnetic resonance imaging (MRI), and the diagnosis requires a biopsy. The aim of this study is to compare the performance of MRI radiomic machine learning (ML) analysis with that of deep learning (DL) for predicting malignancy in patients with lipomas or atypical lipomatous tumours.
The cohort included 145 patients with lipomatous soft tissue tumours for whom histology and a fat-suppressed gadolinium contrast-enhanced T1-weighted MRI pulse sequence were available. Images were collected between 2010 and 2019 from 78 centres with non-uniform protocols: three magnetic field strengths (1.0, 1.5 and 3.0 T) on 16 MR systems commercialised by four vendors (General Electric, Siemens, Philips, Toshiba).
Two approaches were compared: (i) ML from radiomic features with and without batch correction; and (ii) DL from images. Performance was assessed using 10 cross-validation folds on a test set and then on external validation data.
The best DL model was obtained using ResNet50, resulting in an area under the curve (AUC) of 0.87 ± 0.11 (95% CI 0.65−1). For ML/radiomics, AUCs reached 0.83 ± 0.12 (95% CI 0.59−1) and 0.99 ± 0.02 (95% CI 0.95−1) on the test cohort using gradient boosting without and with batch effect correction, respectively. On the external cohort, the AUC of the gradient boosting model was 0.80 and, for an optimised decision threshold, sensitivity and specificity were 100% and 32%, respectively.
In this context of limited observations, batch-effect corrected ML/radiomics approaches outperformed DL-based models.
Machine learning (ML) applied to magnetic resonance imaging (MRI) radiomics could help to characterise malignancy of lipomatous soft tissue tumours.
ML/radiomics analysis outperformed DL for the benign/malignant differentiation of lipomatous soft tissue tumours on MRI in a data-limited context.
Statistical harmonisation using batch effect correction (ComBat method) improved performances when heterogeneous, multicentre data are used.
Lipomatous soft tissue tumours are very common neoplasms stemming from fat cells. These tumours are divided into several subgroups; most are benign and referred to as lipomas, while the rare malignant tumours are referred to as liposarcomas. In practice, lipoma and high-grade liposarcoma are easily distinguishable using magnetic resonance imaging (MRI) [2, 3]. Unfortunately, a low-grade liposarcoma subtype called atypical lipomatous tumours (ALTs), representing about 40 to 45% of liposarcomas, has overlapping MRI characteristics and is highly similar to lipomas [2,3,4,5]. The differential diagnosis between lipomas and ALTs is essential for the therapeutic strategy and is based on histology after tissue biopsy. Lipomas are removed by marginal excision if they cause discomfort or pain to the patient, while liposarcomas must be removed by wide margin resection [1, 2]. However, given the time, financial cost and invasiveness of biopsy, there is a medical need for non-invasive methods. In addition, because benign mesenchymal tumours outnumber liposarcomas by a factor of at least 100, most of these biopsies could be avoided.
Radiomics is a recent field of medical imaging analysis in cancer [5, 6]. It consists of converting medical images into mineable, high-dimensional quantitative data (referred to as radiomics) using mathematical descriptors. Radiomics are then used to train machine learning (ML) algorithms to predict an outcome such as malignancy. In parallel, deep learning (DL), based on convolutional neural networks (CNNs), is emerging as a promising field due to its capacity for image classification. However, CNNs often require training on a very large dataset to be accurate.
The aim of this study is to compare MRI radiomics/ML analysis with DL to predict malignancy in patients with lipomatous soft tissue tumours (ALTs versus lipomas).
Our institutional review board approved this retrospective study and the requirement to obtain informed consent was waived. The training set was extracted from a labelled database of the radiology department of the comprehensive cancer centre Léon Bérard. This database records patients with lipomatous soft tissue tumours whose histology and fat-suppressed gadolinium contrast-enhanced T1-weighted MRI sequences were available. From December 2010 to January 2018, a total of 85 patients were included (40 with lipomas and 45 with ALTs). Images were collected from 43 different centres with non-uniform protocols and centralised in the Picture Archiving and Communication System of our institution. Acquisitions were performed at three different magnetic field strengths (1.0, 1.5 and 3.0 T) on 16 MR systems commercialised by four vendors (General Electric, Siemens, Philips, Toshiba).
The validation cohort was extracted from a labelled database of the radiology department of CHU Cochin, including patients from July 2012 to July 2019 with lipomatous soft tissue tumours whose histology and MRI scans were available. This cohort included 60 patients (28 with lipomas and 32 with ALTs) with a fat-suppressed gadolinium contrast-enhanced T1-weighted pulse sequence. Images were collected from 35 different centres with non-uniform protocols and centralised in the Picture Archiving and Communication System of our institution. Acquisitions were performed at two different field strengths (1.5 and 3.0 T) on 15 MRI systems commercialised by four vendors. For both the training cohort and the external validation cohort, the most commonly used contrast agent was Dotarem (Guerbet, Villepinte, France) at a dose of 0.2 mL.kg−1. Population characteristics for both the training and validation sets are provided in Table 1.
Images were automatically loaded in in-house software developed in Matlab R2019a (The MathWorks, Natick, USA). The tumour was manually segmented in three dimensions, slice by slice, on the fat-suppressed gadolinium contrast-enhanced T1-weighted acquisition by a radiographer with 19 years of experience in MRI, and the segmentations were reviewed by a radiologist with 13 years of experience in MRI.
Data were labelled as malignant or benign based on histopathology using Murine Double Minute 2 (MDM2) gene amplification by fluorescence in situ hybridisation (FISH).
Radiomics feature extraction
Radiomics features included size, contour and region-based shape features, intensity distribution (or global low-order texture) features, image domain high-order texture features and spatial-frequency texture features. Size and shape features were directly extracted from the binary masks and included conventional region- and edge-based metrics. Intensity distribution features were extracted from masked MR images without normalisation or filtering of voxel intensities and from the histogram built with 256 bins. Before the extraction of texture features, voxels were resampled to be isotropic using an affine transformation with nearest-neighbour interpolation and discretised into a smaller number of grey levels. This operation was performed using an equal probability algorithm that defines decision thresholds in the volume such that the number of voxels is the same for every grey level in the quantised volume. Images were discretised into 8, 16, 24, 32, 40, 48 and 64 grey levels and, for each level, four matrices were built: grey-level co-occurrence matrix (n = 21); grey-level run length matrix (n = 13); grey-level size zone matrix (n = 13); and neighbourhood grey tone difference matrix (n = 5). Features were then extracted from these matrices.
Frequency domain-based texture features were extracted from the Gabor filter responses. Grey-level co-occurrence matrix and grey-level run length matrix were computed for four directions (0°, 45°, 90° and 135°) with an offset of 1 pixel. For grey-level size zone matrix and neighbourhood grey tone difference matrix, a 26-pixel connectivity was used. For Gabor filtering, 5 scales, 6 orientations and a minimal wavelength of 3 were used. Radiomic feature computation followed the image biomarker standardisation initiative (IBSI).
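The equal probability discretisation described above can be illustrated with a minimal NumPy sketch. It is not the in-house Matlab implementation: the function name, the use of quantiles as decision thresholds and the synthetic intensities are assumptions made for illustration.

```python
import numpy as np

def equal_probability_quantise(values, n_levels):
    """Map intensities to levels 1..n_levels holding roughly equal voxel counts."""
    values = np.asarray(values, dtype=float)
    # Interior quantiles act as the decision thresholds between grey levels.
    probs = np.linspace(0.0, 1.0, n_levels + 1)[1:-1]
    thresholds = np.quantile(values, probs)
    # side="right" sends a value equal to a threshold to the upper level.
    return np.searchsorted(thresholds, values, side="right") + 1

# Synthetic stand-in for the voxel intensities inside a tumour mask.
rng = np.random.default_rng(0)
roi = rng.normal(loc=300.0, scale=50.0, size=4096)

levels = equal_probability_quantise(roi, 8)
counts = np.bincount(levels, minlength=9)[1:]  # voxels per grey level
```

With continuous intensities the counts per level are nearly identical; heavily tied intensities would make them uneven, which a production implementation would need to handle.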
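As an illustration of the four-direction co-occurrence computation with an offset of 1 pixel, the sketch below builds a symmetric, normalised grey-level co-occurrence matrix on a small two-dimensional image. It is a simplified example under stated assumptions (two dimensions only, grey levels numbered from 0, symmetric counting, hypothetical function names) and is not the feature extraction code used in the study.

```python
import numpy as np

# (row, column) displacements for 0°, 45°, 90° and 135° at an offset of 1 pixel.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}

def glcm(image, n_levels, offset):
    """Symmetric, normalised co-occurrence matrix for grey levels 0..n_levels-1."""
    dr, dc = offset
    rows, cols = image.shape
    # Slice the image twice so that a[i, j] and b[i, j] are neighbours at `offset`.
    r0, r1 = max(0, -dr), rows - max(0, dr)
    c0, c1 = max(0, -dc), cols - max(0, dc)
    a = image[r0:r1, c0:c1]
    b = image[r0 + dr:r1 + dr, c0 + dc:c1 + dc]
    matrix = np.zeros((n_levels, n_levels), dtype=float)
    np.add.at(matrix, (a.ravel(), b.ravel()), 1.0)  # count each ordered pair
    matrix += matrix.T                              # make the counts symmetric
    return matrix / matrix.sum()

def contrast(p):
    """Contrast feature: sum of p(i, j) * (i - j)^2."""
    i, j = np.indices(p.shape)
    return float(np.sum(p * (i - j) ** 2))

image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 2, 2, 2],
                  [2, 2, 3, 3]])
matrices = {angle: glcm(image, 4, off) for angle, off in OFFSETS.items()}
contrast_0deg = contrast(matrices[0])
```

Features such as contrast can then be computed per direction and averaged, although whether the study averaged over directions is not stated in the text.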
Overall, 92 radiomics features were extracted.
Deep learning on images
MR images were preprocessed with the N4 Bias Field Correction algorithm to correct low-frequency intensity non-uniformity. Then, the intensities were normalised using the Z-score, Inew = (I − μ)/σ, where μ and σ are the mean and standard deviation of the intensities. Given the low number of samples and the complexity of images acquired on different body regions, we chose to focus on a classification model based only on the tumour. MRI slices were cropped around the tumours according to the masks and resized to a single matrix size (224 × 224 pixels).
We compared three CNN-based approaches: (i) a custom CNN learned from scratch (the global architecture is described in Additional file 1); (ii) the fine-tuning of a pretrained ResNet model; and finally (iii) an XGBoost classifier based on CNN feature extraction. The Python Keras API with a TensorFlow backend was used to implement the different CNNs. First, we created a CNN from scratch with a simple architecture containing three blocks, each including a two-dimensional convolution, a batch normalisation, a ReLU activation, a max pooling and a dropout. After the three blocks, the tensor was flattened and followed by a fully connected layer of 32 units, activated by ReLU. A final dropout was placed before the last layer, composed of a single neuron activated by the sigmoid function to output the probability of malignancy. To augment the size of the dataset, we applied small transformations to the images (flips, zooms, rotations and shifts).
Second, we used transfer learning starting from a ResNet50 model pretrained on ImageNet. The last layers specific to the classification on ImageNet were removed and replaced by a two-dimensional global average pooling yielding a flat vector of 2048 features, followed by one or more blocks composed of a fully connected layer, a batch normalisation, a ReLU activation and a dropout. Since the three-dimensional dataset contained more malignant slices than benign ones, we added a class weight when fitting the model, to give more importance to each benign observation in the loss function.
As before, the final layer was a single unit, activated by the sigmoid function. We fine-tuned the model by first freezing the pretrained part of the network so that only our new top layers could update their weights and biases. The network was trained this way for a few epochs. Then, the last block of the pretrained part was unfrozen and trained with a small learning rate. Importantly, images were preprocessed to fit the ResNet50 input requirements.
We tested this protocol in two settings: training on the training set only and testing on the external validation cohort, and merging all our data (training and external validation cohorts) under cross-validation (see the “Models evaluation and statistical analysis” section). Third, we used ResNet50 to extract features from images and used these features as inputs to train an XGBoost (eXtreme Gradient Boosting) classifier.
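The normalisation, cropping and resizing steps can be sketched as follows. This is a minimal NumPy illustration and not the study pipeline: it assumes the Z-score is computed over the whole slice, crops to the bounding box of the mask, and resizes with nearest-neighbour sampling, none of which is specified in the text. Bias field correction is omitted, and all function names are hypothetical.

```python
import numpy as np

def zscore(image):
    """Inew = (I - mu) / sigma over all pixels of the slice (assumed scope)."""
    return (image - image.mean()) / image.std()

def crop_to_mask(image, mask):
    """Crop the slice to the bounding box of the tumour mask."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

def resize_nearest(image, size=(224, 224)):
    """Nearest-neighbour resize (the interpolation choice is an assumption)."""
    r = np.arange(size[0]) * image.shape[0] // size[0]
    c = np.arange(size[1]) * image.shape[1] // size[1]
    return image[np.ix_(r, c)]

# Synthetic slice and rectangular mask standing in for real data.
rng = np.random.default_rng(0)
mri_slice = rng.normal(500.0, 120.0, size=(64, 80))
mask = np.zeros((64, 80), dtype=bool)
mask[10:30, 20:50] = True

normalised = zscore(mri_slice)
cropped = crop_to_mask(normalised, mask)
patch = resize_nearest(cropped)
```

Because every crop is stretched to the same 224 × 224 matrix, the absolute tumour size is no longer visible to the network after this step.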
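The class weighting can be illustrated with a short sketch. The text does not give the weighting formula, so the inverse-frequency ("balanced") heuristic below is an assumption, and the slice counts are hypothetical. A dictionary of this form is what the `class_weight` argument of Keras `Model.fit` accepts.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Hypothetical slice-level labels: 1 = malignant (ALT), 0 = benign (lipoma).
slice_labels = [1] * 600 + [0] * 400
weights = balanced_class_weights(slice_labels)
# The minority (benign) class receives the larger weight in the loss.
```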
Classifier on radiomic data
Since images were acquired at multiple sites with different MRI acquisition protocols, a harmonisation stage is necessary to remove the batch effect that technical heterogeneity introduces into radiomic data. We therefore applied the ComBat algorithm, a popular batch effect correction tool. Because the fat signal suppression technique (fat-water decomposition versus fat saturation) has a visible impact on images and is a common source of acquisition protocol differences in clinical routine, we chose this criterion to define the batches.
Four different classifiers from Python Scikit-learn were optimised and evaluated: logistic regression (LR), support vector machine (SVM), random forest (RF) and gradient boosting (GB). These classifiers were trained on radiomics with and without batch effect correction for comparison. The hyperparameters of each model were tuned for each dataset. For SVM and LR, a preprocessing step was learned on the training set to normalise the features to zero mean and unit variance. For the RF and GB classifiers, no standardisation was applied to the features, as it has no effect on decision trees.
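To give an intuition of what batch effect correction does, the sketch below aligns the per-batch mean and standard deviation of each feature to the pooled values. This is a deliberately simplified location and scale adjustment, not the ComBat algorithm itself: ComBat additionally uses empirical Bayes shrinkage of the batch parameters and can preserve covariates of interest. The function name, batch labels and synthetic data are assumptions.

```python
import numpy as np

def location_scale_harmonise(features, batch):
    """Align each batch's per-feature mean and std to the pooled values."""
    features = np.asarray(features, dtype=float)
    batch = np.asarray(batch)
    pooled_mean = features.mean(axis=0)
    pooled_std = features.std(axis=0)
    out = np.empty_like(features)
    for b in np.unique(batch):
        rows = batch == b
        mu = features[rows].mean(axis=0)
        sd = features[rows].std(axis=0)  # assumed non-zero for every feature
        out[rows] = (features[rows] - mu) / sd * pooled_std + pooled_mean
    return out

# Synthetic radiomic features from two fat suppression techniques.
rng = np.random.default_rng(0)
fat_sat = rng.normal(0.0, 1.0, size=(40, 3))
dixon = rng.normal(2.0, 3.0, size=(45, 3))
features = np.vstack([fat_sat, dixon])
batch = np.array(["fat_sat"] * 40 + ["dixon"] * 45)

harmonised = location_scale_harmonise(features, batch)
```

After the adjustment both batches share the same per-feature mean and spread, so a classifier can no longer separate patients by acquisition technique from these moments alone.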
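The point of learning the normalisation on the training set only is that the held-out fold never influences the scaling parameters. A minimal NumPy sketch of that idea follows; the function names, the fold sizes and the synthetic features are assumptions, and in Scikit-learn the same behaviour is typically obtained by fitting a scaler on the training data.

```python
import numpy as np

def fit_standardiser(train):
    """Learn per-feature mean and standard deviation on the training fold only."""
    train = np.asarray(train, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled
    return mean, std

def apply_standardiser(x, mean, std):
    """Apply previously learned parameters to any fold."""
    return (np.asarray(x, dtype=float) - mean) / std

# Synthetic two-feature example with a hypothetical 76/9 patient split.
rng = np.random.default_rng(0)
train = rng.normal(loc=[10.0, 200.0], scale=[2.0, 50.0], size=(76, 2))
test = rng.normal(loc=[10.0, 200.0], scale=[2.0, 50.0], size=(9, 2))

mean, std = fit_standardiser(train)
train_z = apply_standardiser(train, mean, std)
test_z = apply_standardiser(test, mean, std)  # uses training statistics only
```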
Models evaluation and statistical analysis
Classifier performances were compared using k-fold cross-validation (k = 10) on both radiomic and image data. The mean and standard deviation of the area under the curve (AUC) at receiver operating characteristic analysis, sensitivity and specificity were computed over the 10 folds on the test set (held out from the training data). Then, we applied the model to the external validation dataset. For the deep learning approach, multiple slices from the same patient were kept in the same fold so that the network could not be trained and tested on two different slices coming from the same patient.
Model diagnostic performances were compared using the AUCs from the validation set and DeLong’s test. Comparisons were made: (i) between the radiomics models (LR, SVM, RF and GB) trained from harmonised and non-harmonised data; and (ii) between ResNet50 and the radiomics models trained from harmonised data.
Sensitivity and specificity comparisons on the training cohort were performed using the χ2 and McNemar tests over the 10 cross-validation folds. A p value lower than 0.05 was considered significant.
The CNN learned from scratch failed to generalise to the test set and resulted in poor diagnostic performance (AUC 0.53 ± 0.09, mean ± standard deviation). We obtained an AUC of 0.80 ± 0.11 for ResNet50 and of 0.78 ± 0.13 for XGBoost trained with CNN features. The best performances were obtained with a GB model trained from batch-corrected radiomic data (AUC 0.99 ± 0.02), compared with non-corrected data (AUC 0.83 ± 0.12). Detailed results are provided in Table 2.
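Keeping all slices of a patient in the same fold can be sketched with the standard library alone. The function below assigns folds at the patient level and then propagates them to slices; the patient identifiers, slice counts and round-robin assignment are hypothetical, and the sketch does not stratify folds by class.

```python
import random
from collections import defaultdict

def patient_level_folds(patient_ids, k=10, seed=0):
    """Return one fold index per slice; slices of a patient share a fold."""
    patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(patients)
    # Deal shuffled patients into folds in round-robin order.
    fold_of_patient = {p: i % k for i, p in enumerate(patients)}
    return [fold_of_patient[p] for p in patient_ids]

# Hypothetical slice list: 85 patients with 1 to 5 slices each.
patient_ids = [f"P{p:03d}" for p in range(85) for _ in range(1 + p % 5)]
folds = patient_level_folds(patient_ids, k=10)

# Check that no patient is spread over several folds.
folds_per_patient = defaultdict(set)
for pid, fold in zip(patient_ids, folds):
    folds_per_patient[pid].add(fold)
```

Splitting at the slice level instead would let near-identical neighbouring slices appear on both sides of the split and inflate the cross-validated AUC.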
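For reference, the McNemar test compares two classifiers on the same patients using only the discordant pairs. The sketch below implements the exact (binomial) two-sided version; the text does not state whether the exact or the χ2 approximation was used, so this choice and the example counts are assumptions.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p value from discordant counts b and c.

    b: cases classified correctly by model A only.
    c: cases classified correctly by model B only.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of observing at most k successes under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts between two classifiers.
p_unbalanced = mcnemar_exact(1, 9)
p_balanced = mcnemar_exact(5, 5)
```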
External validation cohort
We tested all previously trained models on the external validation cohort.
We observed a decrease in performance of the ResNet model on the validation cohort compared with the test cohort used during training (from AUC = 0.80 ± 0.11 to AUC = 0.64). Adding patients from the validation cohort to the training set did not yield better performance (AUC 0.74 ± 0.12, sensitivity 80%, specificity 53%) (Table 3).
For the radiomic approach, models trained with batch-corrected data globally performed better on the external validation cohort than those trained with non-corrected data, but the differences were not statistically significant. Our best model obtained an AUC of 0.80 with a high sensitivity of 97% and a specificity of 61% (Fig. 1, Table 2). We then optimised the GB decision threshold to increase sensitivity at the cost of specificity: decreasing the decision threshold from the standard 0.5 to 0.1 gave a sensitivity of 100% and a specificity of 32% (Fig. 2 and Table 4).
Examples of MRI from patients classified as true negative, false positive and true positive by the gradient boosting classifier trained on ComBat-harmonised radiomics are shown in Fig. 3.
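The trade-off obtained by lowering the decision threshold can be reproduced on toy data. The labels and predicted probabilities below are invented for illustration and are not taken from the study; the convention that a probability greater than or equal to the threshold counts as malignant is also an assumption.

```python
def sensitivity_specificity(y_true, y_prob, threshold):
    """Sensitivity and specificity when prob >= threshold is called malignant."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Invented example: 1 = ALT (malignant), 0 = lipoma (benign).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.7, 0.6, 0.3, 0.8, 0.4, 0.2, 0.05, 0.02]

sens_standard, spec_standard = sensitivity_specificity(y_true, y_prob, 0.5)
sens_lowered, spec_lowered = sensitivity_specificity(y_true, y_prob, 0.1)
```

On this toy set, lowering the threshold from 0.5 to 0.1 raises sensitivity from 0.75 to 1.0 while specificity falls from 0.8 to 0.4, the same direction of change as reported for the GB model.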
AUC and metrics comparisons
No significant differences were found between radiomics models trained from harmonised and non-harmonised data: DeLong’s test p values were 1.0, 0.33, 0.96 and 0.22 for the LR, SVM, RF and GB models, respectively. However, the “harmonised” GB model had better sensitivity and specificity than its non-harmonised counterpart, so we selected the GB model trained on harmonised data for decision threshold optimisation. These differences in sensitivity and specificity between the harmonised and non-harmonised GB models were nevertheless not significant using χ2 (p = 0.550 and p = 0.414, respectively) over the 10 cross-validation folds on the test cohort during training. Significant differences were found between the ResNet50 model and the radiomics models trained from harmonised data (p < 0.001 for all).
In this study, we have shown that ML from MRI radiomics could be relevant for classifying patients with lipomas or ALTs and therefore for potentially reducing the number of biopsies. The results also demonstrate the need to correct radiomics data for the batch effect linked to heterogeneity in the MRI acquisition protocol. In our context of limited observations, batch-corrected radiomic-based models outperformed the CNN approaches.
Using radiomic features and traditional ML classifiers, we obtained a sensitivity of 100% and a specificity of 32% on an external validation cohort. This indicates that biopsy could be avoided for 32% of patients with benign tumours.
This work needs to be confirmed in a larger study cohort. As previously demonstrated [15, 16], our results suggest that batch correction of radiomic data using the ComBat method is useful when heterogeneous data are used, owing to variability in MRI acquisition protocols and hardware capabilities across imaging departments. Other works have reported similar performances for diagnosing well-differentiated lipomatous tumours: from radiomics derived from unenhanced T1- and T2-weighted MRI sequences, Vos et al. obtained an AUROC equal to 0.89; however, these results were not validated in an external cohort. From T1-weighted MRI radiomics, Malinauskaite et al. obtained higher performances (AUC 0.926), but the volume of data was small (n = 38) and no validation on an external set was proposed. Pressney et al. proposed a composite score (AUC 0.80) built from a multivariate analysis combining qualitative imaging features and texture features derived from T-weighted and proton density-weighted images. However, no cross-validation technique was used and the dataset was relatively small (n = 60). To the best of our knowledge, the present work is the first to report results from external validation data, which is mandatory to identify harmonisation problems linked to acquisition protocol heterogeneity.
Using the images directly with CNNs was challenging in this context of domain generalisation, which consists of training a model on multiple source domains and evaluating its performance on a distinct and unseen target domain. The high heterogeneity of images from various body regions made generalisation difficult for the CNN.
Unlike radiomics, CNNs did not use quantitative features such as tumour size, as images had different zoom levels. The CNN performance might have been higher if the MRI slices had been set to a common scale, but we wanted the CNN to find decision characteristics other than tumour size. In addition, it is more difficult to correct the batch effect with CNNs. Harmonisation of native images using generative adversarial networks could be investigated in further work [21,22,23,24].
However, it is important to note that manual tumour segmentation may introduce inherent variability in radiomic features and is a time-consuming task for the radiologist. Segmentation time depends on the number of slices and on the tumour volume. As an example, in this work, the average segmentation time was around five minutes, with a range of two to ten minutes. Therefore, further work to replace this task with automated approaches based on U-Net could be relevant. Another limitation of this study might be the choice of only fat-suppressed gadolinium contrast-enhanced T1-weighted MRI sequences to perform radiomic analysis. Since gadolinium injection increases costs and is not systematically performed in routine use, radiomic analysis from unenhanced sequences needs to be investigated.
To conclude, radiomic ML analysis outperformed the DL-based approach for predicting malignancy in lipomatous soft tissue tumours, owing to the possibility of correcting a posteriori for acquisition heterogeneity. Better DL performance could probably be obtained with more data, as DL approaches usually perform very well but need far more data than ML analysis of radiomics data. In addition, it is much harder to generalise classification for tumours located in various organs, as is the case with soft-tissue tumours, because of the high heterogeneity of the images. Manual segmentation is also time-consuming and may introduce variability in radiomics. In the future, DL for classification tasks after generative adversarial network-based image harmonisation, or radiomic analysis after U-Net automated segmentation, could help to overcome these issues.
Availability of data and materials
The datasets used and analysed during the current study are available from the corresponding author on reasonable request.
Abbreviations
ALTs: Atypical lipomatous tumours
AUC: Area under the curve
CNNs: Convolutional neural networks
MRI: Magnetic resonance imaging
SVM: Support vector machine
1. Jebastin JAS, Perry KD, Chitale DA et al (2020) Atypical lipomatous tumor/well-differentiated liposarcoma with features mimicking spindle cell lipoma. Int J Surg Pathol 28:336–340. https://doi.org/10.1177/1066896919884648
2. Knebel C, Lenze U, Pohlig F et al (2017) Prognostic factors and outcome of liposarcoma patients: a retrospective evaluation over 15 years. BMC Cancer 17:410. https://doi.org/10.1186/s12885-017-3398-y
3. Brisson M, Kashima T, Delaney D et al (2013) MRI characteristics of lipoma and atypical lipomatous tumor/well-differentiated liposarcoma: retrospective comparison with histology and MDM2 gene amplification. Skeletal Radiol 42:635–647. https://doi.org/10.1007/s00256-012-1517-z
4. Leporq B, Bouhamama A, Pilleul F et al (2020) MRI-based radiomics to predict lipomatous soft tissue tumors malignancy: a pilot study. Cancer Imaging 20:78. https://doi.org/10.1186/s40644-020-00354-7
5. Fletcher C, Unni K, Mertens F (2002) Pathology and genetics of tumours of soft tissue and bone. IARC
6. Gillies RJ, Kinahan PE, Hricak H (2016) Radiomics: images are more than pictures, they are data. Radiology 278:563–577. https://doi.org/10.1148/radiol.2015151169
7. Aerts HJWL, Velazquez ER, Leijenaar RTH et al (2014) Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nature Commun 5:1–9. https://doi.org/10.1038/ncomms5006
8. Lundervold AS, Lundervold A (2019) An overview of deep learning in medical imaging focusing on MRI. Z Med Phys 29:102–127
9. Zwanenburg A, Vallières M, Abdalah MA et al (2020) The image biomarker standardization initiative: standardized quantitative radiomics for high-throughput image-based phenotyping. Radiology 295:328–338. https://doi.org/10.1148/radiol.2020191145
10. Tustison NJ, Avants BB, Cook PA et al (2010) N4ITK: improved N3 bias correction. IEEE Trans Med Imaging 29:1310–1320. https://doi.org/10.1109/TMI.2010.2046908
11. Abadi M, Agarwal A, Barham P et al (2015) TensorFlow: large-scale machine learning on heterogeneous systems. arXiv preprint arXiv:1603.04467. https://doi.org/10.48550/arXiv.1603.04467
12. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 25:1097–1105
13. Johnson WE, Li C, Rabinovic A (2007) Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8:118–127. https://doi.org/10.1093/biostatistics/kxj037
14. DeLong ER, DeLong DM, Clarke-Pearson DL (1988) Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44:837. https://doi.org/10.2307/2531595
15. Orlhac F, Frouin F, Nioche C et al (2019) Validation of a method to compensate multicenter effects affecting CT radiomics. Radiology 291:53–59. https://doi.org/10.1148/radiol.2019182023
16. Orlhac F, Lecler A, Savatovski J et al (2020) How can we combat multicenter variability in MR radiomics? Validation of a correction procedure. Eur Radiol 31:2272–2280. https://doi.org/10.1007/s00330-020-07284-9
17. Vos M, Starmans MPA, Timbergen MJM et al (2019) Radiomics approach to distinguish between well differentiated liposarcomas and lipomas on MRI. BJS 106:1800–1809. https://doi.org/10.1002/bjs.11410
18. Malinauskaite I, Hofmeister J, Burgermeister S et al (2020) Radiomics and machine learning differentiate soft-tissue lipoma and liposarcoma better than musculoskeletal radiologists. Sarcoma 2020:1–9. https://doi.org/10.1155/2020/7163453
19. Pressney I, Khoo M, Endozo R et al (2020) Pilot study to differentiate lipoma from atypical lipomatous tumour/well-differentiated liposarcoma using MR radiomics-based texture analysis. Skeletal Radiol 49:1719–1729. https://doi.org/10.1007/s00256-020-03454-4
20. Wang J, Lan C, Liu C et al (2021) Generalizing to unseen domains: a survey on domain generalization. IEEE Trans Knowl Data Eng. https://doi.org/10.1109/TKDE.2022.3178128
21. Armanious K, Jiang C, Fischer M et al (2020) MedGAN: medical image translation using GANs. Comput Med Imaging Graph 79:101684. https://doi.org/10.1016/j.compmedimag.2019.101684
22. Bowles C, Chen L, Guerrero R et al (2018) GAN augmentation: augmenting training data using generative adversarial networks. arXiv preprint arXiv:1810.10863. https://doi.org/10.48550/arXiv.1810.10863
23. Karras T, Aila T, Laine S, Lehtinen J (2017) Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. https://doi.org/10.48550/arXiv.1710.10196
24. Nie D, Trullo R, Lian J et al (2018) Medical image synthesis with deep convolutional adversarial networks. IEEE Trans Biomed Eng 65:2720–2730. https://doi.org/10.1109/TBME.2018.2814538
25. Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Springer Verlag 9351:234–241
This work was performed within the framework of the SIRIC LyriCAN grant INCa_INSERM_DGOS_12563 and LABEX PRIMES (ANR-11-LABX-0063), program “Investissements d'Avenir” (ANR-11-IDEX-0007).
The authors state that this work has not received any funding.
Ethics approval and consent to participate
This retrospective study was approved by our Institutional Review Board (CPP Lyon-Sud-Est IV Centre Léon Bérard, N° IRB: IRB00010619).
Consent for publication
Guillaume Fradet and Reina Ayde were affiliated to Capgemini Engineering at the time of writing the study. They are currently no longer affiliated to this institution. Hugo Bottois and Mohamed El Harchaoui are affiliated to Capgemini Engineering. The remaining authors declare that they have no competing interests.
Cite this article
Fradet, G., Ayde, R., Bottois, H. et al. Prediction of lipomatous soft tissue malignancy on MRI: comparison between machine learning applied to radiomics and deep learning. Eur Radiol Exp 6, 41 (2022). https://doi.org/10.1186/s41747-022-00295-9