Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem

Background Automated segmentation of anatomical structures is a crucial step in image analysis. For lung segmentation in computed tomography, a variety of approaches exists, involving sophisticated pipelines trained and validated on different datasets. However, the clinical applicability of these approaches across diseases remains limited.
Methods We compared four generic deep learning approaches trained on various datasets and two readily available lung segmentation algorithms. We performed evaluation on routine imaging data with more than six different disease patterns and three published data sets.
Results Across the different deep learning approaches, mean Dice similarity coefficients (DSCs) on test datasets varied by no more than 0.02. When trained on a diverse routine dataset (n = 36), a standard approach (U-net) yielded a higher DSC (0.97 ± 0.05) than when trained on public datasets such as the Lung Tissue Research Consortium (0.94 ± 0.13, p = 0.024) or Anatomy 3 (0.92 ± 0.15, p = 0.001). Trained on routine data (n = 231) covering multiple diseases, U-net yielded a DSC of 0.98 ± 0.03 compared to 0.94 ± 0.12 for the reference methods (p = 0.024).
Conclusions The accuracy and reliability of lung segmentation algorithms on demanding cases rely primarily on the diversity of the training data, highlighting the importance of data diversity compared to model choice. Efforts in developing new datasets and providing trained models to the public are critical. By releasing the trained model under General Public License 3.0, we aim to foster research on lung diseases by providing a readily available tool for segmentation of pathological lungs.


Background
The translation of machine learning (ML) approaches developed on specific datasets to the variety of routine clinical data is of increasing importance. As methodology matures across different fields, means to render algorithms robust for the transition from bench to bedside become critical.
With more than 79 million examinations per year (United States, 2015) [1], computed tomography (CT) constitutes an essential imaging procedure for diagnosing, screening, and monitoring pulmonary diseases. The detection and accurate segmentation of organs, such as the lung, is a crucial step [2], especially in the context of ML, for discarding confounders outside the relevant organ (e.g. respiration gear, implants, or comorbidities) [3].
Automated lung segmentation algorithms are typically developed and tested on limited datasets that cover little variability, predominantly containing cases without severe pathology [4] or cases with a single class of disease [5]. Such specific cohort datasets are highly relevant in their respective domain but lead to specialised methods and ML models that struggle to generalise to unseen cohorts when utilised for the task of segmentation. As a consequence, image processing studies, especially when dealing with routine data, still rely on semiautomatic segmentations or human inspection of automated organ masks [6,7]. However, for large-scale data analysis based on thousands of cases, human inspection or any human interaction with single data items is not feasible at all. At the same time, disease-specific models are limited with respect to their applicability on undiagnosed cases such as in computer-aided diagnosis or diverse cross-sectional data.
A diverse range of lung segmentation techniques for CT images has been proposed. They can be categorised into rule-based [8][9][10][11], atlas-based [12][13][14], ML-based [15][16][17][18][19], and hybrid approaches [20][21][22][23][24]. The lung appears as a low-density but high-contrast region on an x-ray-based image, such as CT, so that thresholding and atlas segmentation methods lead to good results in cases with only mild or low-density pathologies such as emphysema [8][9][10]. However, disease-associated lung patterns, such as effusion, atelectasis, consolidation, fibrosis, or pneumonia, lead to dense areas in the lung field that impede such approaches. Multi-atlas registration and hybrid techniques aim to deal with these high-density abnormalities by incorporating additional atlases, shape models, and other post-processing steps [22,25]. However, such complex pipelines are not reproducible without extensive effort if the source code and the underlying set of atlases are not shared. Conversely, trained ML models have the advantage of being easily shared without giving access to the training data. In addition, they are fast at inference time and scale well when additional training data are available. Harrison et al. [19] showed that deep learning-based segmentation outperforms a specialised approach in cases with interstitial lung diseases, and they provide trained models. However, with some exceptions, trained models for lung segmentation are rarely shared publicly, hampering advances in research. At the same time, ML methods are limited by the training data available, their number, and the quality of the ground truth annotations.
Benchmark datasets for training and evaluation are paramount to establish comparability between different methods. However, publicly available datasets with manually annotated organs for the development and testing of lung segmentation algorithms are scarce. The VISCERAL Anatomy3 dataset [4], Lung CT Segmentation Challenge 2017 (LCTSC) [5], and the VESsel SEgmentation in the Lung 2012 Challenge (VESSEL12) [26] provide publicly available lung segmentation data. Yet, these datasets were not published for the purpose of lung segmentation and are strongly biased to either inconspicuous cases or specific diseases neglecting comorbidities and the wide spectrum of physiological and pathological phenotypes. The LObe and Lung Analysis 2011 (LOLA11) challenge published a diverse set of scans for which the ground truth labels are known only to the challenge organisers [27].
Here, we addressed the following questions: (1) what is the influence of training data diversity on lung segmentation performance; (2) how do inconsistencies in ground truth annotations across data contribute to the bias in automatic segmentation or its evaluation in severely diseased cases; and (3) can a generic deep learning algorithm perform competitively with readily available systems on a wide range of data, once diverse training data are available?

Methods
We trained four generic semantic segmentation models from scratch on three different public training sets and one training set collected from the clinical routine. We evaluated these models on public test sets and routine data, including cases showing severe pathologies. Furthermore, we performed a comparison of models trained on a diverse routine training set to two published automatic lung segmentation systems, which we did not train, but used as provided. An overview of training and testing performed is given in Fig. 1.

Routine data extraction
The local ethics committee of the Medical University of Vienna approved the retrospective analysis of the imaging data. We collected representative training and evaluation datasets from the picture archiving and communication system of a university hospital radiology department. We included inpatients and outpatients who underwent a chest CT examination during a period of 2.5 years, with no restriction on age, sex, or indication. However, we applied minimal inclusion criteria with regard to imaging parameters: a primary and original DICOM tag, at least 100 slices in the series, a sharp convolution kernel, and a series description that included one of the terms lung, chest, or thorax. If multiple series of a study fulfilled these criteria, the series with the highest number of slices was used, on the assumption that it had a lower inter-slice distance or a larger field of view. Scans that did not show the lung or showed it only partially, and scans with patients in lateral position, were disregarded. In total, we collected more than 5,300 patients (examined during the 2.5-year period), each represented by a single CT series.
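The inclusion criteria above can be read as a filter followed by a per-study tie-break. The following is a minimal sketch of that logic, not the extraction code used in the study. The `Series` record, its field names, and the `sharp_kernel` flag are hypothetical; in particular, the text does not state how kernel sharpness was determined, so it is modelled here as a precomputed boolean.

```python
from dataclasses import dataclass

@dataclass
class Series:
    # Hypothetical record for one CT series; field names are illustrative only.
    study_id: str
    image_type: tuple      # e.g. ("ORIGINAL", "PRIMARY", "AXIAL")
    n_slices: int
    sharp_kernel: bool     # assumption: sharpness decided elsewhere
    description: str

KEYWORDS = ("lung", "chest", "thorax")

def is_eligible(s: Series) -> bool:
    """Apply the minimal inclusion criteria to a single series."""
    return (
        "ORIGINAL" in s.image_type
        and "PRIMARY" in s.image_type
        and s.n_slices >= 100
        and s.sharp_kernel
        and any(k in s.description.lower() for k in KEYWORDS)
    )

def select_series(series):
    """Keep one eligible series per study: the one with the most slices."""
    best = {}
    for s in series:
        if not is_eligible(s):
            continue
        current = best.get(s.study_id)
        if current is None or s.n_slices > current.n_slices:
            best[s.study_id] = s
    return best
```

The manual exclusions mentioned in the text (lung not or only partially shown, lateral patient position) are not covered by this sketch.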
In addition, we carefully selected a large representative training dataset from the clinical routine using three sampling strategies: (1) random sampling of cases (n = 57), (2) sampling from image phenotypes [28] (n = 71) (the exact methodology for phenotype identification was not in the scope of this work), and (3) manual selection of edge cases with severe pathologies, such as fibrosis (n = 28), trauma (n = 20), and other cases showing extensive ground-glass opacity, consolidations, fibrotic patterns, tumours, and effusions (n = 55). In total, we selected 231 cases from routine data for training (hereafter referred to as R-231). Besides biology, technical acquisition parameters are an additional source of appearance variability. The R-231 dataset contains scans acquired with 22 different combinations of scanner manufacturer, convolution kernel, and slice thickness. While the dataset collected from the clinical routine showed a high variability in lung appearance, cases that depict the head or the abdominal area are scarce. To mitigate this bias toward slices that showed the lung, we augmented the number of non-lung slices in R-231 by including all slices which did not show the lung from the Anatomy3 dataset. Table 1 lists the training data collected.

Test datasets
For testing, we randomly sampled 20 cases from the routine database that were not part of the training set and 15 cases with specific anomalies: atelectasis (n = 2), emphysema (n = 2), fibrosis (n = 4), mass (n = 2), pneumothorax (n = 2), and trauma (n = 3). In addition, we tested on cases from the public LTRC, LCTSC, and VESSEL12 datasets, which were not used for training. Table 2 lists the test data collected. Further, we calculated results on a combined dataset composed of the individual test sets (All(L), n = 191). In addition, we report all test cases combined without the LTRC and LCTSC data considered (All, n = 62). The rationale behind this is that the LTRC test dataset contains 105 volumes and dominates the average scores, and the LCTSC dataset contains multiple cases with tumours and effusions that are not included in the ground truth masks (Fig. 3). Thus, an automated segmentation that includes these areas yields a lower score, distorting and misrepresenting the combined results.

Ground truth annotations
Ground truth labelling on the routine data was bootstrapped by training of a lung segmentation algorithm (U-net) on the Anatomy3 dataset. The preliminary masks were iteratively corrected by four readers: two radiologists with 4 and 5 years of experience in chest CT and two medical image analysis experts with 6 and 2 years of experience in processing chest CT scans. The model for the intermediate masks was iteratively retrained after 20-30 new manual corrections were performed using the ITK-Snap software [29].

Fig. 1 Schematic overview of the training and testing performed. We collected public datasets and two datasets from the routine. We used these datasets to train four generic semantic segmentation models and tested the trained models on public and routine data together with readily available lung segmentation systems

Segmentation methods
We refrained from developing specialised methodology and instead utilised generic, state-of-the-art deep learning architectures for semantic segmentation that were not specifically proposed for lung segmentation. We trained these "vanilla" models without modifications and without pre-training on other data. We considered the following four generic semantic segmentation models: U-net, ResU-net, Dilated Residual Network-D-22, and Deeplab v3+.

U-net
Ronneberger et al. [30] proposed the U-net for the segmentation of anatomic structures in microscopy images. Since then, it has been used for a wide range of segmentation tasks and various modified versions have been studied [31,32]. We utilised the U-net with the only adaptation being batch normalisation [33] after each layer.

ResU-net
Residual connections have been proposed to facilitate the learning of deeper networks [34,35]. The ResU-net model includes residual connections at every down- and up-sampling block as a second adaptation to the U-net, in addition to batch normalisation.

Dilated Residual Network-D-22
Yu and Koltun [36] proposed dilated convolutions for semantic image segmentation and adapted deep residual networks [35] with dilated convolutions to perform semantic segmentations on natural images. Here, we utilised the Dilated Residual Network-D-22 model, as proposed by Yu et al. [37].

Deeplab v3+
Deeplab v3 combines dilated convolutions, multi-scale image representations, and fully connected conditional random fields as a post-processing step. Deeplab v3+ includes an additional decoder module to refine the segmentation. Here, we utilised the Deeplab v3+ model as proposed by Chen et al. [38].

We compared the trained models to two readily available reference methods: the Progressive Holistically Nested Networks (P-HNN) and the Chest Imaging Platform (CIP). The P-HNN has been proposed by Harrison et al. [19] for lung segmentation. The model, which is available upon request, was trained on cases from the public LTRC dataset (618 cases) and other cases with interstitial lung diseases or infectious diseases (125 cases). The CIP provides an open-source lung segmentation tool based on thresholding and morphological operations [39].

Table 1 (training data). The number of volumes, the number of slices that showed the lung (slices-L), and the total number of slices are listed. Visceral, LTRC, and LCTSC are public datasets; R-36 and R-231 are images from the routine database of a radiology department

Table 2 (test data). The number of volumes, the number of slices that showed the lung (slices-L), and the total number of slices (#Slices) are listed. LTRC, LCTSC, and VESS12 are cases from the respective public dataset that were not used for training
*Two cases from the publicly available Lung1 dataset
**Four cases from the publicly available Visceral Anatomy 3 dataset

Experiments
We determined the influence of training data variability (especially public datasets versus routine) on the generalizability to other public test datasets, and, specifically, to cases with a variety of pathologies. To establish comparability, we limited the number of volumes and slices to match the smallest dataset from LCTSC, with 36 volumes and 3,393 slices. During this experiment, we considered only slices that showed the lung (during training and testing) to prevent a bias induced by the field of view. For example, images in VISCERAL Anatomy 3 showed either the whole body or the trunk, including the abdomen, while other datasets, such as LTRC, LCTSC, or VESSEL12, contained only images limited to the chest.
Further, we compared the generic models trained on the R-231 dataset to the publicly available systems CIP and P-HNN. For this comparison, we processed the full volumes. The CIP algorithm was shown to be sensitive to image noise. Thus, if the CIP algorithm failed, we pre-processed the volumes with a Gaussian filter kernel. If the algorithm still failed, the case was excluded from the comparison. The trained P-HNN model does not distinguish between the left and right lung. Thus, evaluation metrics were computed on the full lung for masks created by P-HNN. In addition to evaluation on publicly available datasets and methods, we performed an independent evaluation of our lung segmentation model by submitting solutions to the LOLA11 challenge, for which 55 CT scans are published but ground truth masks are available only to the challenge organisers. Prior research and earlier submissions suggest inconsistencies in the ground truth of the LOLA11 dataset, especially with respect to pleural effusions [24]. We specifically included effusions in our training datasets. To account for this discrepancy and improve comparability, we submitted two solutions: first, the masks as yielded by our model and, second, the same masks with dense areas subsequently removed. The automatic exclusion of dense areas was performed by simple thresholding of values in the range -50 < HU < 70 and morphological operations.
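The thresholding part of the dense-area removal can be illustrated as follows. This is a minimal sketch under stated assumptions, not the post-processing actually submitted to LOLA11: the function name `remove_dense_areas` is hypothetical, the bounds are treated as strict inequalities as written in the text, and the morphological operations are omitted because the text does not specify them.

```python
import numpy as np

def remove_dense_areas(lung_mask, hu_volume, low=-50.0, high=70.0):
    """Set lung-mask voxels to background where -50 < HU < 70.

    lung_mask: integer label array (0 = background, >0 = lung labels)
    hu_volume: array of Hounsfield units with the same shape
    Note: the morphological clean-up mentioned in the text is NOT
    reproduced here; only the thresholding step is sketched.
    """
    dense = (hu_volume > low) & (hu_volume < high)
    cleaned = lung_mask.copy()   # leave the input mask untouched
    cleaned[dense] = 0
    return cleaned
```

In practice, a thresholding step like this is usually followed by a clean-up that removes isolated voxels, which is presumably the role of the morphological operations.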
Studies on lung segmentation usually use overlap and surface metrics to assess the automatically generated lung mask against the ground truth. However, segmentation metrics on the full lung can only marginally quantify the capability of a method to cover pathological areas in the lung as pathologies may be relatively small compared to the lung volume. Carcinomas are an example of high-density areas that are at risk of being excluded by threshold- or registration-based methods when they are close to the lung border. We utilised the publicly available, previously published Lung1 dataset [38] to quantify the model's ability to cover tumour areas within the lung. The collection contains scans of 318 non-small cell lung cancer patients before treatment, with a manual delineation of the tumours. In this experiment, we evaluated the overlap proportion of tumour volume covered by the lung mask.
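The tumour overlap measure can be written as the fraction of tumour voxels that fall inside the predicted lung mask. The sketch below is one plausible reading of that measure; the function name `tumour_coverage` and the handling of empty tumour masks (returning NaN) are assumptions, not taken from the study.

```python
import numpy as np

def tumour_coverage(lung_mask, tumour_mask):
    """Proportion of tumour voxels covered by the lung mask (0 to 1).

    lung_mask:   label array, any value > 0 counts as lung
    tumour_mask: binary array of the manual tumour delineation
    Returns NaN when the tumour mask is empty (assumed convention).
    """
    tumour = np.asarray(tumour_mask).astype(bool)
    n_tumour = int(tumour.sum())
    if n_tumour == 0:
        return float("nan")
    covered = int(np.logical_and(tumour, np.asarray(lung_mask) > 0).sum())
    return covered / n_tumour
```

Unlike the DSC on the full lung, this ratio is insensitive to lung size, so a small tumour that is excluded at the lung border still registers as a clear failure.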

Implementation details
We aimed to achieve a maximum of flexibility with respect to the field of view (from partially visible organ to whole-body) and to enable lung segmentation without prior localisation of the organ. To this end, we performed segmentation on the slice level. That is, for volumetric scans, each slice was processed individually. We segmented the left and right lung (individually labelled), excluded the trachea, and specifically included high-density anomalies such as tumour and pleural effusions. During training and inference, the images were cropped to the body region using thresholding and morphological operations and rescaled to a resolution of 256 × 256 pixels. Prior to processing, Hounsfield units were mapped to the intensity window [-1,024; 600] and normalised to the 0-1 range. During training, the images were augmented by random rotation, non-linear deformation, and Gaussian noise. We used stratified mini-batches of size 14 holding 7 slices showing the lung and 7 slices which do not show the lung. For optimisation, we used stochastic gradient descent with momentum.
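Two of the steps above, intensity normalisation and stratified mini-batches, are simple enough to sketch. The code below is illustrative only and not the released training code. It assumes that "mapped to the intensity window" means clipping to [-1024, 600] HU before rescaling, and the function names and the use of slice identifiers are hypothetical.

```python
import random
import numpy as np

HU_MIN, HU_MAX = -1024.0, 600.0

def normalise_hu(slice_hu):
    """Clip to the [-1024, 600] HU window (assumed) and rescale to 0-1."""
    clipped = np.clip(np.asarray(slice_hu, dtype=np.float32), HU_MIN, HU_MAX)
    return (clipped - HU_MIN) / (HU_MAX - HU_MIN)

def stratified_batch(lung_ids, non_lung_ids, batch_size=14, rng=None):
    """Draw a batch with equal numbers of lung and non-lung slices.

    lung_ids / non_lung_ids: lists of hypothetical slice identifiers
    rng: optional random.Random instance for reproducible sampling
    """
    rng = rng or random.Random()
    half = batch_size // 2
    batch = rng.sample(lung_ids, half) + rng.sample(non_lung_ids, batch_size - half)
    rng.shuffle(batch)
    return batch
```

With the default batch size of 14, each batch holds 7 slices from each group, matching the stratification described in the text. Cropping to the body region, rescaling to 256 × 256 pixels, and the augmentations are not covered here.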

Statistical methods
Automatic segmentations were compared to the ground truth for all test datasets using the following evaluation metrics, as implemented by the DeepMind surface-distance Python module [40]. While segmentation was performed on two-dimensional slices, evaluation was performed on the three-dimensional volumes. If not reported differently, the metrics were calculated for the left and right lung separately.
Dice similarity coefficient (DSC). The DSC is a measure of overlap:
$$D(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}$$
where X and Y are two alternative labellings, such as predicted and ground truth lung masks.
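For two binary masks X and Y, the DSC is commonly computed as 2|X ∩ Y| / (|X| + |Y|). The following is a minimal NumPy sketch of that definition rather than the implementation used for the reported numbers; the function name `dice` and the convention of returning 1.0 for two empty masks are assumptions.

```python
import numpy as np

def dice(x, y):
    """Dice similarity coefficient of two binary masks of equal shape."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    denom = int(x.sum()) + int(y.sum())
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * int(np.logical_and(x, y).sum()) / denom
```

To evaluate a single lung from a label mask, one would call it on the comparison with that label, for example `dice(pred == 1, truth == 1)`, where the label value 1 is only an example.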
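Robust Hausdorff distance (HD95). The directed Hausdorff distance is the maximum distance over all distances from points in surface $X_s$ to their closest point in surface $Y_s$. In mathematical terms, the directed robust Hausdorff distance is given as:
$$\vec{d}_{H,95}(X_s, Y_s) = \underset{x \in X_s}{P_{95}} \left( \min_{y \in Y_s} d(x, y) \right)$$
where $P_{95}$ denotes the 95th percentile of the distances and $d(x, y)$ the distance between points $x$ and $y$. Here, we used the symmetric adaptation:
$$d_{H,95}(X_s, Y_s) = \max\left( \vec{d}_{H,95}(X_s, Y_s),\ \vec{d}_{H,95}(Y_s, X_s) \right)$$
Mean surface distance (MSD). The MSD is the average distance of all points in surface $X_s$ to their closest corresponding point in surface $Y_s$:
$$\vec{d}_{MSD}(X_s, Y_s) = \frac{1}{|X_s|} \sum_{x \in X_s} \min_{y \in Y_s} d(x, y)$$
Here, we used the symmetric adaptation:
$$d_{MSD}(X_s, Y_s) = \frac{\vec{d}_{MSD}(X_s, Y_s) + \vec{d}_{MSD}(Y_s, X_s)}{2}$$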
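The two surface metrics can be illustrated on small point sets with a brute-force computation. The sketch below is not the DeepMind module cited above, which operates on voxel masks and weights distances by surface element area. Here, surfaces are assumed to be given as arrays of point coordinates, distances are Euclidean and unweighted, and the symmetric MSD is taken as the average of the two directed values, which is one common choice.

```python
import numpy as np

def directed_distances(xs, ys):
    """Distance from each point in xs to its closest point in ys."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    diff = xs[:, None, :] - ys[None, :, :]        # pairwise differences
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

def hd95(xs, ys):
    """Symmetric robust Hausdorff distance: max of both directed P95 values."""
    return float(max(np.percentile(directed_distances(xs, ys), 95),
                     np.percentile(directed_distances(ys, xs), 95)))

def msd(xs, ys):
    """Symmetric mean surface distance: average of both directed means."""
    return float(0.5 * (directed_distances(xs, ys).mean()
                        + directed_distances(ys, xs).mean()))
```

The pairwise distance matrix grows with the product of the two point counts, so this form is only practical for small examples. Real lung surfaces call for distance transforms or spatial indexing.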

Results
Models trained on routine data achieve improved evaluation scores compared to models trained on publicly available study data. U-net, ResU-net, and Deeplab v3+ models, when trained on routine data (R-36), yielded the best evaluation scores on the merged test dataset (All, n = 62). The U-net yields mean DSC, HD95, and MSD scores of 0.96 ± 0.08, 9.19 ± 18.15, and 1.43 ± 2.26 when trained  0.174, 0.112). This advantage of routine data for training is also reflected in results using other combinations of model architecture and training data. Table 3 lists the evaluation results in detail. We determined that the influence of model architecture is marginal compared to the influence of training data. Specifically, the mean DSC did not vary by more than 0.02 when the same combination of training and test set was used for different architectures (Table 3).
We created segmentations for the 55 cases of the LOLA11 challenge with the U-net(R-231) model. The unaltered masks yielded a mean overlap score of 0.968, and the masks with dense areas removed yielded 0.977. Table 5 and Fig. 4 show results for tumour overlap on the 318 volumes of the Lung1 dataset. U-net(R-231) covered more tumour volume (mean/median) than P-HNN (60%/69% versus 50%/44%, p < 0.001) and CIP (34%/13%). Qualitative results for tumour cases for U-net(R-231) and P-HNN are shown in Fig. 5b, c. We found that 23 cases of the Lung1 dataset had corrupted ground truth annotation of the tumours (Fig. 4d). Figure 5e shows cases with little or no tumour overlap achieved by U-net(R-231).

Discussion
We showed that training data, sampled from the clinical routine, improves generalizability to a wide spectrum of pathologies compared to public datasets. We assume this lies in the fact that many publicly available datasets do not include dense pathologies such as severe fibrosis, tumour, or effusions as part of the lung segmentation. Further, they are often provided without guarantees about segmentation quality and consistency. While the Anatomy3 dataset underwent a thorough quality assessment, the organisers of the VESSEL12 dataset merely provided lung segmentations as a courtesy supplement for the task of vessel segmentation, and within the LCTSC dataset, "tumour is excluded in most data" and "collapsed lung may be excluded in some scans" [5].
Results indicate that both, size and diversity of the training data, are relevant. State-of-the-art results can be achieved with images from only 36 patients which is in line with previous works [41] achieving a mean DSC of 0.99 on LTRC test data using the U-net(R-36) model.

A comparison to the segmentation algorithm of the chest imaging platform (CIP) and the trained P-HNN model is given. The results are expressed in mean and mean ± standard deviation for the Dice similarity coefficient (DSC), Robust Hausdorff distance (HD95), and mean surface distance (MSD)
*The LCTSC ground truth masks do not include high-density diseases, and the high number of LTRC test cases dominates the averaged results. Thus, "All(L)" (n = 167) is the mean over all cases that included LCTSC and LTRC, while "All" (n = 62) does not include the LCTSC and LTRC cases
**For these rows, only cases on which the CIP algorithm did not fail, and where the DSC was larger than 0 were considered (#Cases). For abbreviations, see Tables 1 and 2

Hofmanninger et al. European Radiology Experimental (2020) 4:50
A large number of segmentation methods are proposed every year, often based on architectural modifications [32] of established models. Isensee et al. [32] showed that such modified design concepts do not improve, and occasionally even worsen, the performance of a well-designed baseline. They achieved state-of-the-art performance on multiple, publicly available segmentation challenges relying only on U-nets. This corresponds to our finding that architectural choice had a subordinate effect on performance.
At the time of submission, the U-net(R-231) achieved the second-highest score among all competitors in the LOLA11 challenge. In comparison, the first ranked method [22] achieved a score of 0.980 and a human reference segmentation achieved 0.984 [27]. Correspondingly, the U-net(R-231) model achieved improved evaluation measures (DSC, HD95, MSD, and tumour overlap) compared to two public algorithms.
There are limitations of our study that should be taken into account. Routine clinical data vary between sites. Thus, extraction of a diverse training dataset from clinical routine may only be an option for centres that are exposed to a wide range of patient variety. Evaluation results based on public datasets are not fully comparable. For example, the models trained on routine data compared to other datasets yielded lower performance in terms of DSC on the LCTSC test data. However, the lower scores for models trained on routine data in LCTSC can be attributed to the lack of very-dense pathologies in the ground truth masks. Figure 3 illustrates cases for which the R-231 model yielded low DSC. The inclusion or exclusion of pathologies such as effusions into lung segmentations is a matter of definition and application. While pleural effusions (and pneumothorax) are technically outside the lung, they are assessed as part of lung assessment and have a substantial impact on lung parenchyma appearance through compression artefacts. Neglecting such abnormalities would hamper automated lung assessment, as they are closely linked to lung function. In addition, lung masks that include pleural effusions greatly alleviate the task of effusion detection and quantification, thus making it possible to remove effusions from the lung segmentation as a postprocessing step. We proposed a general lung segmentation algorithm relevant for automated tasks in which the diagnosis is not known beforehand. However, specialised algorithms for specific diseases could be beneficial in scenarios of analysing cohorts, for which the disease is already known.
In conclusion, we showed that accurate lung segmentation does not require complex methodology and that a proven deep-learning-based segmentation architecture yields state-of-the-art results once diverse (but not necessarily larger) training data are available. By comparing various datasets for training of the models, we illustrated the importance of training data diversity and showed that data from clinical routine can generalise well to unseen cohorts, highlighting the need for public datasets specifically curated for the task of lung segmentation. We draw the following conclusions: (1) translating ML approaches from bench to bedside can require the collection of diverse training data rather than methodological modifications; (2) current, publicly available study datasets do not meet these diversity requirements; and (3) generic semantic segmentation algorithms are adequate for the task of lung segmentation. A reliable, universal tool for lung segmentation is fundamentally important to foster research on severe lung diseases and to study routine clinical datasets. Thus, the trained model and inference code are made publicly available under the GPL-3.0 license at https://github.com/JoHof/lungmask to serve as an open science tool for research and development and as a publicly available baseline for lung segmentation.