  • Original article
  • Open access

Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem



Automated segmentation of anatomical structures is a crucial step in image analysis. For lung segmentation in computed tomography, a variety of approaches exists, involving sophisticated pipelines trained and validated on different datasets. However, the clinical applicability of these approaches across diseases remains limited.


We compared four generic deep learning approaches trained on various datasets and two readily available lung segmentation algorithms. We performed evaluation on routine imaging data with more than six different disease patterns and three published data sets.


Using different deep learning approaches, mean Dice similarity coefficients (DSCs) on test datasets did not vary by more than 0.02. When trained on a diverse routine dataset (n = 36), a standard approach (U-net) yielded a higher DSC (0.97 ± 0.05) than when trained on public datasets such as the Lung Tissue Research Consortium (0.94 ± 0.13, p = 0.024) or Anatomy 3 (0.92 ± 0.15, p = 0.001). Trained on routine data (n = 231) covering multiple diseases, the U-net yielded a DSC of 0.98 ± 0.03 compared with 0.94 ± 0.12 for reference methods (p = 0.024).


The accuracy and reliability of lung segmentation algorithms on demanding cases rely primarily on the diversity of the training data, highlighting the importance of data diversity compared to model choice. Efforts in developing new datasets and providing trained models to the public are critical. By releasing the trained model under General Public License 3.0, we aim to foster research on lung diseases by providing a readily available tool for segmentation of pathological lungs.

Key points

  • Robust segmentation of pathological lungs can be achieved with standard methods.

  • Public datasets provide only limited diversity for training of lung segmentation algorithms on computed tomography scans.

  • Routine clinical imaging data can provide the required variability to train general models beyond disease-specific solutions.


The translation of machine learning (ML) approaches developed on specific datasets to the variety of routine clinical data is of increasing importance. As methodology matures across different fields, means to render algorithms robust for the transition from bench to bedside become critical.

With more than 79 million examinations per year (United States, 2015) [1], computed tomography (CT) constitutes an essential imaging procedure for diagnosing, screening, and monitoring pulmonary diseases. The detection and accurate segmentation of organs, such as the lung, is a crucial step [2], especially in the context of ML, for discarding confounders outside the relevant organ (e.g. respiration gear, implants, or comorbidities) [3].

Automated lung segmentation algorithms are typically developed and tested on limited datasets, covering a limited variability by predominantly containing cases without severe pathology [4] or cases with a single class of disease [5]. Such specific cohort datasets are highly relevant in their respective domain but lead to specialised methods and ML models that struggle to generalise to unseen cohorts when utilised for the task of segmentation. As a consequence, image processing studies, especially when dealing with routine data, still rely on semiautomatic segmentations or human inspection of automated organ masks [6, 7]. However, for large-scale data analysis based on thousands of cases, human inspection or any human interaction with single data items, at all, is not feasible. At the same time, disease-specific models are limited with respect to their applicability on undiagnosed cases such as in computer-aided diagnosis or diverse cross-sectional data.

A diverse range of lung segmentation techniques for CT images has been proposed. They can be categorised into rule-based [8,9,10,11], atlas-based [12,13,14], ML-based [15,16,17,18,19], and hybrid approaches [20,21,22,23,24]. The lung appears as a low-density but high-contrast region on an x-ray-based image, such as CT, so that thresholding and atlas segmentation methods lead to good results in cases with only mild or low-density pathologies such as emphysema [8,9,10]. However, disease-associated lung patterns, such as effusion, atelectasis, consolidation, fibrosis, or pneumonia, lead to dense areas in the lung field that impede such approaches. Multi-atlas registration and hybrid techniques aim to deal with these high-density abnormalities by incorporating additional atlases, shape models, and other post-processing steps [22, 25]. However, such complex pipelines are not reproducible without extensive effort if the source code and the underlying set of atlases are not shared. Conversely, trained ML models have the advantage of being easily shared without giving access to the training data. In addition, they are fast at inference time and scale well when additional training data are available. Harrison et al. [19] showed that deep learning-based segmentation outperforms a specialised approach in cases with interstitial lung diseases and provided trained models. However, with some exceptions, trained models for lung segmentation are rarely shared publicly, hampering advances in research. At the same time, ML methods are limited by the training data available, their number, and the quality of the ground truth annotations.

Benchmark datasets for training and evaluation are paramount to establish comparability between different methods. However, publicly available datasets with manually annotated organs for the development and testing of lung segmentation algorithms are scarce. The VISCERAL Anatomy3 dataset [4], Lung CT Segmentation Challenge 2017 (LCTSC) [5], and the VESsel SEgmentation in the Lung 2012 Challenge (VESSEL12) [26] provide publicly available lung segmentation data. Yet, these datasets were not published for the purpose of lung segmentation and are strongly biased to either inconspicuous cases or specific diseases neglecting comorbidities and the wide spectrum of physiological and pathological phenotypes. The LObe and Lung Analysis 2011 (LOLA11) challenge published a diverse set of scans for which the ground truth labels are known only to the challenge organisers [27].

Here, we addressed the following questions: (1) what is the influence of training data diversity on lung segmentation performance; (2) how do inconsistencies in ground truth annotations across data contribute to the bias in automatic segmentation or its evaluation in severely diseased cases; and (3) can a generic deep learning algorithm perform competitively with readily available systems on a wide range of data, once diverse training data are available?


We trained four generic semantic segmentation models from scratch on three different public training sets and one training set collected from the clinical routine. We evaluated these models on public test sets and routine data, including cases showing severe pathologies. Furthermore, we performed a comparison of models trained on a diverse routine training set to two published automatic lung segmentation systems, which we did not train, but used as provided. An overview of training and testing performed is given in Fig. 1.

Fig. 1 Schematic overview of the training and testing performed. We collected public datasets and two datasets from the routine. We used these datasets to train four generic semantic segmentation models and tested the trained models on public and routine data together with readily available lung segmentation systems

Routine data extraction

The local ethics committee of the Medical University of Vienna approved the retrospective analysis of the imaging data. We collected representative training and evaluation datasets from the picture archiving and communication system of a university hospital radiology department. We included inpatients and outpatients who underwent a chest CT examination during a period of 2.5 years, with no restriction on age, sex, or indication. However, we applied minimal inclusion criteria with regard to imaging parameters: primary and original DICOM tag, number of slices in a series ≥ 100, sharp convolution kernel, and a series description that included one of the terms lung, chest, or thorax. If multiple series of a study fulfilled these criteria, the series with the highest number of slices was used, assuming a lower inter-slice distance or a larger field of view. Scans that did not show the lung or showed it only partially, as well as scans with patients in lateral position, were disregarded. In total, we collected scans of more than 5,300 patients (examined during the 2.5-year period), each represented by a single CT series.
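The inclusion criteria above can be sketched as a simple filter over series metadata. This is an illustration only: the record layout, the field names (`description`, `n_slices`, `sharp_kernel`, `image_type`), and the example values are hypothetical, and reading the "primary and original" criterion from a list of image-type values is an assumption about how the metadata were accessed.

```python
from typing import Optional

# Hypothetical, simplified series records; the field names are illustrative
# and are not taken from the study's extraction code.
KEYWORDS = ("lung", "chest", "thorax")
MIN_SLICES = 100


def is_eligible(series: dict) -> bool:
    """Apply the minimal inclusion criteria described in the text."""
    image_type = {value.upper() for value in series["image_type"]}
    description = series["description"].lower()
    return (
        {"ORIGINAL", "PRIMARY"} <= image_type            # original and primary
        and series["n_slices"] >= MIN_SLICES             # at least 100 slices
        and series["sharp_kernel"]                       # sharp convolution kernel
        and any(word in description for word in KEYWORDS)
    )


def select_series(study: list) -> Optional[dict]:
    """Return the eligible series with the most slices, or None if none qualify."""
    eligible = [s for s in study if is_eligible(s)]
    if not eligible:
        return None
    return max(eligible, key=lambda s: s["n_slices"])


# Made-up study with three series.
study = [
    {"description": "Chest 1.0 sharp", "n_slices": 412, "sharp_kernel": True,
     "image_type": ["ORIGINAL", "PRIMARY", "AXIAL"]},
    {"description": "Thorax 3.0 sharp", "n_slices": 140, "sharp_kernel": True,
     "image_type": ["ORIGINAL", "PRIMARY", "AXIAL"]},
    {"description": "Chest MIP", "n_slices": 60, "sharp_kernel": False,
     "image_type": ["DERIVED", "SECONDARY"]},
]
chosen = select_series(study)
```

The manual exclusions mentioned in the text (lung not fully shown, lateral patient position) are not covered by such a metadata filter and would still require a separate check.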

Training datasets

To study training data diversity, we assembled four datasets with an equal number of patients (n = 36) and slices (n = 3,393). These individual datasets were randomly extracted from the public VISCERAL Anatomy3 (VISC-36), LTRC (LTRC-36), and LCTSC (LCTSC-36) datasets, and from the clinical routine (R-36).

In addition, we carefully selected a large representative training dataset from the clinical routine using three sampling strategies: (1) random sampling of cases (n = 57), (2) sampling from image phenotypes [28] (n = 71) (the exact methodology for phenotype identification was not in the scope of this work), and (3) manual selection of edge cases with severe pathologies, such as fibrosis (n = 28), trauma (n = 20), and other cases showing extensive ground-glass opacity, consolidations, fibrotic patterns, tumours, and effusions (n = 55). In total, we selected 231 cases from routine data for training (hereafter referred to as R-231). Besides biology, technical acquisition parameters are an additional source of appearance variability. The R-231 dataset contains scans acquired with 22 different combinations of scanner manufacturer, convolution kernel, and slice thickness. While the dataset collected from the clinical routine showed a high variability in lung appearance, cases that depict the head or the abdominal area are scarce. To mitigate this bias toward slices that showed the lung, we augmented the number of non-lung slices in R-231 by including all slices which did not show the lung from the Anatomy3 dataset. Table 1 lists the training data collected.

Table 1 Datasets used to train semantic segmentation models

Test datasets

For testing, we randomly sampled 20 cases from the routine database that were not part of the training set and 15 cases with specific anomalies: atelectasis (n = 2), emphysema (n = 2), fibrosis (n = 4), mass (n = 2), pneumothorax (n = 2), and trauma (n = 3). In addition, we tested on cases from the public LTRC, LCTSC, and VESSEL12 datasets, which were not used for training. Table 2 lists the test data collected. Further, we calculated results on a combined dataset composed of the individual test sets (All(L), n = 191). In addition, we report all test cases combined without the LTRC and LCTSC data considered (All, n = 62). The rationale behind this is that the LTRC test dataset contains 105 volumes and dominates the average scores, and the LCTSC dataset contains multiple cases with tumours and effusions that are not included in the ground truth masks (Fig. 3). Thus, an automated segmentation that includes these areas yields a lower score, distorting and misrepresenting the combined results.

Table 2 Test datasets used to evaluate the performance of lung segmentation algorithms

Ground truth annotations

Ground truth labelling on the routine data was bootstrapped by training of a lung segmentation algorithm (U-net) on the Anatomy3 dataset. The preliminary masks were iteratively corrected by four readers: two radiologists with 4 and 5 years of experience in chest CT and two medical image analysis experts with 6 and 2 years of experience in processing chest CT scans. The model for the intermediate masks was iteratively retrained after 20–30 new manual corrections were performed using the ITK-Snap software [29].

Segmentation methods

We refrained from developing specialised methodology but utilised generic state-of-the-art deep learning, semantic segmentation architectures that were not specifically proposed for lung segmentation. We trained these “vanilla” models without modifications and without pre-training on other data. We considered the following four generic semantic segmentation models: U-net, ResU-net, Dilated Residual Network-D-22, and Deeplab v3+.


U-net

Ronneberger et al. [30] proposed the U-net for the segmentation of anatomic structures in microscopy images. Since then, it has been used for a wide range of segmentation tasks, and various modified versions have been studied [31, 32]. We utilised the U-net with the only adaptation being batch normalisation [33] after each layer.


ResU-net

Residual connections have been proposed to facilitate the learning of deeper networks [34, 35]. The ResU-net model includes residual connections at every down- and up-sampling block as a second adaptation to the U-net, in addition to batch normalisation.

Dilated Residual Network-D-22

Yu and Koltun [36] proposed dilated convolutions for semantic image segmentation and adapted deep residual networks [35] with dilated convolutions to perform semantic segmentations on natural images. Here, we utilised the Dilated Residual Network-D-22 model, as proposed by Yu et al. [37].

Deeplab v3+

Deeplab v3 combines dilated convolutions, multi-scale image representations, and fully connected conditional random fields as a post-processing step. Deeplab v3+ includes an additional decoder module to refine the segmentation. Here, we utilised the Deeplab v3+ model as proposed by Chen et al. [38].

We compared the trained models to two readily available reference methods: the Progressive Holistically Nested Networks (P-HNN) and the Chest Imaging Platform (CIP). The P-HNN was proposed by Harrison et al. [19] for lung segmentation. The model, which is available upon request, was trained on cases from the public LTRC dataset (618 cases) and other cases with interstitial lung diseases or infectious diseases (125 cases). The CIP provides an open-source lung segmentation tool based on thresholding and morphological operations [39].


We determined the influence of training data variability (especially public datasets versus routine) on the generalizability to other public test datasets, and, specifically, to cases with a variety of pathologies. To establish comparability, we limited the number of volumes and slices to match the smallest dataset from LCTSC, with 36 volumes and 3,393 slices. During this experiment, we considered only slices that showed the lung (during training and testing) to prevent a bias induced by the field of view. For example, images in VISCERAL Anatomy 3 showed either the whole body or the trunk, including the abdomen, while other datasets, such as LTRC, LCTSC, or VESSEL12, contained only images limited to the chest.

Further, we compared the generic models trained on the R-231 dataset to the publicly available systems CIP and P-HNN. For this comparison, we processed the full volumes. The CIP algorithm was shown to be sensitive to image noise. Thus, if the CIP algorithm failed, we pre-processed the volumes with a Gaussian filter kernel. If the algorithm still failed, the case was excluded from the comparison. The trained P-HNN model does not distinguish between the left and right lung. Thus, evaluation metrics were computed on the full lung for masks created by P-HNN. In addition to evaluation on publicly available datasets and methods, we performed an independent evaluation of our lung segmentation model by submitting solutions to the LOLA11 challenge, for which 55 CT scans are published but ground truth masks are available only to the challenge organisers. Prior research and earlier submissions suggest inconsistencies in the ground truth of the LOLA11 dataset, especially with respect to pleural effusions [24]. We specifically included effusions in our training datasets. To account for this discrepancy and improve comparability, we submitted two solutions: first, the masks as produced by our model and, second, masks from which dense areas were subsequently removed. The automatic exclusion of dense areas was performed by simple thresholding (-50 < HU < 70) followed by morphological operations.
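The dense-area removal described above can be illustrated with a minimal NumPy sketch. The text does not state which morphological operations were applied, so the binary opening with a 4-connected (cross-shaped) structuring element used here is an assumption, as are the slice-wise 2D processing and the function names.

```python
import numpy as np


def _with_neighbours(mask: np.ndarray) -> np.ndarray:
    """Stack a 2D boolean mask with its four axis-aligned neighbours."""
    p = np.pad(mask, 1, mode="constant", constant_values=False)
    return np.stack([
        p[1:-1, 1:-1],  # centre
        p[:-2, 1:-1],   # neighbour above
        p[2:, 1:-1],    # neighbour below
        p[1:-1, :-2],   # neighbour to the left
        p[1:-1, 2:],    # neighbour to the right
    ])


def erode(mask: np.ndarray) -> np.ndarray:
    """Binary erosion with a cross-shaped structuring element."""
    return _with_neighbours(mask).all(axis=0)


def dilate(mask: np.ndarray) -> np.ndarray:
    """Binary dilation with a cross-shaped structuring element."""
    return _with_neighbours(mask).any(axis=0)


def remove_dense_areas(hu_slice, lung_mask, low=-50.0, high=70.0):
    """Remove voxels with low < HU < high from a lung mask (one 2D slice).

    The opening (erosion, then dilation) discards isolated dense pixels so
    that only connected dense regions are removed; this step is an assumed
    choice, not the documented post-processing.
    """
    hu_slice = np.asarray(hu_slice, dtype=float)
    lung_mask = np.asarray(lung_mask, dtype=bool)
    dense = (hu_slice > low) & (hu_slice < high) & lung_mask
    dense = dilate(erode(dense))
    return lung_mask & ~dense


# Toy slice: aerated lung (-800 HU) with a 3 x 3 dense blob and one
# isolated dense pixel.
hu = np.full((8, 8), -800.0)
hu[2:5, 2:5] = 20.0
hu[6, 6] = 30.0
lung = np.ones((8, 8), dtype=bool)
cleaned = remove_dense_areas(hu, lung)
```

In this toy case the isolated pixel stays in the mask while the core of the dense blob is removed; a larger structuring element or a different operation would change how much of a dense region is excluded.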

Studies on lung segmentation usually use overlap- and surface-metrics to assess the automatically generated lung mask against the ground truth. However, segmentation metrics on the full lung can only marginally quantify the capability of a method to cover pathological areas in the lung as pathologies may be relatively small compared to the lung volume. Carcinomas are an example of high-density areas that are at risk of being excluded by threshold- or registration-based methods when they are close to the lung border. We utilised the publicly available, previously published Lung1 dataset [38] to quantify the model’s ability to cover tumour areas within the lung. The collection contains scans of 318 non-small cell lung cancer patients before treatment, with a manual delineation of the tumours. In this experiment, we evaluated the overlap proportion of tumour volume covered by the lung mask.
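The tumour overlap measure can be written down in a few lines. The sketch below assumes binary masks on the same voxel grid; the function name and the toy volumes are illustrative.

```python
import numpy as np


def tumour_coverage(lung_mask, tumour_mask) -> float:
    """Fraction of tumour voxels that lie inside the lung mask."""
    lung = np.asarray(lung_mask, dtype=bool)
    tumour = np.asarray(tumour_mask, dtype=bool)
    n_tumour = int(tumour.sum())
    if n_tumour == 0:
        raise ValueError("tumour mask is empty")
    return float((lung & tumour).sum() / n_tumour)


# Toy volume: a 4-voxel tumour of which 3 voxels are covered by the lung mask.
tumour = np.zeros((4, 4, 4), dtype=bool)
tumour[1, 1, 0:4] = True
lung = np.zeros((4, 4, 4), dtype=bool)
lung[1, 1, 0:3] = True
lung[2, 2, :] = True  # lung voxels outside the tumour do not affect the score
coverage = tumour_coverage(lung, tumour)
```

Unlike the DSC, this proportion is not penalised by lung voxels outside the tumour, which is why it isolates the question of whether dense tumour tissue is included in the mask.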

Implementation details

We aimed to achieve a maximum of flexibility with respect to the field of view (from partially visible organ to whole-body) and to enable lung segmentation without prior localisation of the organ. To this end, we performed segmentation on the slice level. That is, for volumetric scans, each slice was processed individually. We segmented the left and right lung (individually labelled), excluded the trachea, and specifically included high-density anomalies such as tumour and pleural effusions. During training and inference, the images were cropped to the body region using thresholding and morphological operations and rescaled to a resolution of 256 × 256 pixels. Prior to processing, Hounsfield units were mapped to the intensity window [-1,024; 600] and normalised to the 0–1 range. During training, the images were augmented by random rotation, non-linear deformation, and Gaussian noise. We used stratified mini-batches of size 14 holding 7 slices showing the lung and 7 slices which do not show the lung. For optimisation, we used stochastic gradient descent with momentum.
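Two of the steps above, intensity windowing and stratified mini-batch sampling, can be sketched as follows. Interpreting "mapped to the intensity window" as clipping is an assumption, and the function names are illustrative; body cropping, rescaling to 256 × 256, and augmentation are omitted.

```python
import numpy as np

HU_MIN, HU_MAX = -1024.0, 600.0  # intensity window stated in the text


def window_and_normalise(hu) -> np.ndarray:
    """Clip Hounsfield units to the window and scale them to the 0–1 range."""
    hu = np.asarray(hu, dtype=np.float32)
    clipped = np.clip(hu, HU_MIN, HU_MAX)
    return (clipped - HU_MIN) / (HU_MAX - HU_MIN)


def stratified_batch(has_lung, rng, batch_size=14) -> np.ndarray:
    """Draw slice indices: half showing the lung, half not showing it."""
    has_lung = np.asarray(has_lung, dtype=bool)
    half = batch_size // 2
    lung_idx = np.flatnonzero(has_lung)
    other_idx = np.flatnonzero(~has_lung)
    batch = np.concatenate([
        rng.choice(lung_idx, size=half, replace=False),
        rng.choice(other_idx, size=half, replace=False),
    ])
    rng.shuffle(batch)  # mix the two strata within the batch
    return batch


normalised = window_and_normalise([-2000, -1024, -212, 600, 3000])

rng = np.random.default_rng(0)
slice_has_lung = np.array([True] * 20 + [False] * 20)
batch = stratified_batch(slice_has_lung, rng)
```

Sampling without replacement within a batch requires at least seven slices in each stratum; a real training loop would also need to decide how slices are revisited across batches.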

Statistical methods

Automatic segmentations were compared to the ground truth for all test datasets using the following evaluation metrics, as implemented by the DeepMind surface-distance Python module [40]. While segmentation was performed on two-dimensional slices, evaluation was performed on the three-dimensional volumes. If not reported differently, the metrics were calculated for the right and left lung separately and then averaged. For comparison between results, paired t tests were performed.
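For the paired comparison mentioned above, the test statistic can be computed from per-case score differences. The sketch uses only the Python standard library and returns the t statistic and degrees of freedom; obtaining a p value additionally requires the Student t distribution (for example via `scipy.stats.ttest_rel`, if SciPy is available). The scores below are made-up values for illustration, not results from the study.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for a paired t test on two score lists."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least two cases")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Made-up per-case DSC values for two methods evaluated on the same four cases.
dsc_method_a = [0.98, 0.97, 0.99, 0.96]
dsc_method_b = [0.95, 0.90, 0.97, 0.96]
t_value, dof = paired_t_statistic(dsc_method_a, dsc_method_b)
```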

Dice similarity coefficient (DSC). The DSC is a measure of overlap:

$$ D\left(X,Y\right)=\frac{2\left|X\cap Y\right|}{\left|X\right|+\left|Y\right|} $$

where X and Y are two alternative labellings, such as predicted and ground truth lung masks.
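A minimal NumPy sketch of the DSC, including the per-lung averaging described in the text, is given below. The label values 1 and 2 for the two lungs and the convention of returning 1.0 for two empty masks are assumptions made for the example.

```python
import numpy as np


def dice(x, y) -> float:
    """Dice similarity coefficient of two binary masks."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    denominator = int(x.sum()) + int(y.sum())
    if denominator == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(2.0 * (x & y).sum() / denominator)


def mean_lung_dice(pred_labels, gt_labels, labels=(1, 2)) -> float:
    """Average the DSC over the individually labelled lungs."""
    pred_labels = np.asarray(pred_labels)
    gt_labels = np.asarray(gt_labels)
    scores = [dice(pred_labels == lab, gt_labels == lab) for lab in labels]
    return float(np.mean(scores))


# Toy label maps: 0 = background, 1 and 2 = the two lungs (assumed coding).
gt = np.array([[1, 1, 0, 2, 2],
               [1, 1, 0, 2, 2]])
pred = np.array([[1, 1, 0, 2, 0],
                 [1, 0, 0, 2, 2]])
score = mean_lung_dice(pred, gt)
```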

Robust Hausdorff distance (HD95). The directed Hausdorff distance is the maximum distance over all distances from points in surface Xs to their closest point in surface Ys. In mathematical terms, the directed robust Hausdorff distance is given as:

$$ \overrightarrow{H}\left({X}_s,{Y}_s\right)=\underset{x\in {X}_s}{P_{95}}\left(\underset{y\in {Y}_s}{\min\ }d\left(x,y\right)\right) $$

where P95 denotes the 95th percentile of the distances. Here, we used the symmetric adaptation:

$$ H\left({X}_s,{Y}_s\right)=\max \left(\overrightarrow{H}\left({X}_s,{Y}_s\right),\overrightarrow{H}\left({Y}_s,{X}_s\right)\right) $$
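The directed and symmetric HD95 can be illustrated on surfaces given as point sets. This brute-force sketch treats every surface point equally and uses NumPy's default percentile interpolation; the module cited in the text may weight surface elements differently, so values need not match it exactly.

```python
import numpy as np


def directed_hd95(xs, ys) -> float:
    """95th percentile of the distances from each point in xs to its nearest point in ys."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    # Pairwise Euclidean distances, shape (len(xs), len(ys)).
    d = np.linalg.norm(xs[:, None, :] - ys[None, :, :], axis=-1)
    return float(np.percentile(d.min(axis=1), 95))


def hd95(xs, ys) -> float:
    """Symmetric robust Hausdorff distance: maximum of both directions."""
    return max(directed_hd95(xs, ys), directed_hd95(ys, xs))


# Two parallel rows of surface points, one unit apart.
surface_a = [[0, 0], [1, 0], [2, 0]]
surface_b = [[0, 1], [1, 1], [2, 1]]
parallel = hd95(surface_a, surface_b)

# Same points as surface_a plus one distant outlier.
surface_c = surface_a + [[2, 10]]
with_outlier = hd95(surface_a, surface_c)
```

The second case shows why the symmetric form is needed: the distance from `surface_a` to `surface_c` is zero in one direction, and only the reverse direction registers the outlier.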

Mean surface distance (MSD). The MSD is the average distance of all points in surface Xs to their closest corresponding point in surface Ys:

$$ \overrightarrow{\mathrm{MSD}}\left({X}_s,{Y}_s\right)=\frac{1}{\left|{X}_s\right|}\sum \limits_{x\in {X}_s}\underset{y\in {Y}_s}{\min\ }d\left(x,y\right) $$

Here, we used the symmetric adaptation:

$$ \mathrm{MSD}\left({X}_s,{Y}_s\right)=\max \left(\overrightarrow{\mathrm{MSD}}\left({X}_s,{Y}_s\right),\overrightarrow{\mathrm{MSD}}\left({Y}_s,{X}_s\right)\right) $$
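The MSD definition, with the symmetric form taken as the maximum of the two directed values as in the equation above, can be sketched for point-set surfaces as follows. As with any point-based approximation, all surface points are weighted equally here, which is an assumption of the example.

```python
import numpy as np


def directed_msd(xs, ys) -> float:
    """Mean distance from each point in xs to its nearest point in ys."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    # Pairwise Euclidean distances, shape (len(xs), len(ys)).
    d = np.linalg.norm(xs[:, None, :] - ys[None, :, :], axis=-1)
    return float(d.min(axis=1).mean())


def msd(xs, ys) -> float:
    """Symmetric mean surface distance: maximum of both directions."""
    return max(directed_msd(xs, ys), directed_msd(ys, xs))


# Three collinear surface points, and the same points plus one distant outlier.
surface_a = [[0, 0], [1, 0], [2, 0]]
surface_c = surface_a + [[2, 10]]
value = msd(surface_a, surface_c)
```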


Models trained on routine data achieve improved evaluation scores compared to models trained on publicly available study data. U-net, ResU-net, and Deeplab v3+ models, when trained on routine data (R-36), yielded the best evaluation scores on the merged test dataset (All, n = 62). The U-net yields mean DSC, HD95, and MSD scores of 0.96 ± 0.08, 9.19 ± 18.15, and 1.43 ± 2.26 when trained on R-36 [U-net(R-36)]; 0.92 ± 0.14, 13.04 ± 19.04, and 2.05 ± 3.08 when trained on VISC-36 (R-36 versus VISC-36, p = 0.001, 0.046, 0.007); or 0.94 ± 0.13, 11.09 ± 22.9, and 2.24 ± 5.99 when trained on LTRC-36 (R-36 versus LTRC-36, p = 0.024, 0.174, 0.112). This advantage of routine data for training is also reflected in results using other combinations of model architecture and training data. Table 3 lists the evaluation results in detail.

Table 3 Evaluation results after training segmentation architectures on different training sets

We determined that the influence of model architecture is marginal compared to the influence of training data. Specifically, the mean DSC did not vary by more than 0.02 when the same combination of training and test set was used for different architectures (Table 3).

Compared to the readily available trained P-HNN model, the U-net trained on the R-231 routine dataset [U-net(R-231)] yielded mean DSC, HD95, and MSD scores of 0.98 ± 0.03, 3.14 ± 7.4, and 0.62 ± 0.93 versus 0.94 ± 0.12, 16.8 ± 36.57, and 2.59 ± 5.96 (p = 0.024, 0.004, 0.011) on the merged test dataset (All, n = 62). For comparison with the CIP algorithm, only volumes for which the algorithm did not fail were considered. On the merged dataset (All, n = 62), the algorithms yielded mean DSC, HD95, and MSD scores of 0.98 ± 0.01, 1.44 ± 1.09, and 0.35 ± 0.19 for the U-net(R-231) compared to 0.96 ± 0.05, 4.65 ± 6.45, and 0.91 ± 1.09 for CIP (p = 0.001, < 0.001, < 0.001). Detailed results are given in Table 4. Figure 2 shows qualitative results for cases from the routine test sets, and Fig. 3 shows cases for which the masks generated by the U-net(R-231) model yielded low DSCs when compared to the ground truth.

Table 4 Comparison to public systems

Fig. 2 Segmentation results for selected cases from routine data. Each column shows a different case. Row 1 shows a slice without lung masks, row 2 shows the ground truth, and rows 3 to 5 show automatically generated lung masks. Effusion, chest tube, and consolidations (a); small effusions, ground-glass and consolidation (b); over-inflated (right) and poorly ventilated (left), atelectasis (c); irregular reticulation and traction bronchiectasis, fibrosis (d); pneumothorax (e); and effusions and compression atelectasis (trauma) (f)

Fig. 3 Ground truth annotations in public datasets lack coverage of pathologic areas. Segmentation results for cases in public datasets where the masks generated by our U-net(R-231) yielded low Dice similarity coefficients when compared to the ground truth. Note that public datasets often do not include high-density areas in the segmentations. Tumours in the lung area should be included in the segmentation while the liver should not

We created segmentations for the 55 cases of the LOLA11 challenge with the U-net(R-231) model. The unaltered masks yielded a mean overlap score of 0.968 and with dense areas removed 0.977.

Fig. 4 U-net trained on routine data covered more tumour area compared to reference methods. Box- and swarm plots showing the percentage of tumour volume covered by lung masks that were generated by different methods (318 cases)

Table 5 and Fig. 4 show results for tumour overlap on the 318 volumes of the Lung1 dataset. U-net(R-231) covered more tumour volume (mean/median) than P-HNN (60%/69% versus 50%/44%, p < 0.001) and CIP (34%/13%). Qualitative results for tumour cases for U-net(R-231) and P-HNN are shown in Fig. 5b, c. We found that 23 cases of the Lung1 dataset had corrupted ground truth annotation of the tumours (Fig. 5d). Figure 5e shows cases with little or no tumour overlap achieved by U-net(R-231).

Table 5 Overlap between lung masks and manually annotated tumour volume in the Lung1 dataset

Fig. 5 Qualitative results of automatically generated lung masks for tumour cases. Yellow: tumour area covered by the lung mask. Red: tumour area not covered by the lung mask. Original images (a), lung masks generated by our U-net(R-231) (b), lung masks generated by P-HNN (c), corrupted tumour segmentations in the Lung1 dataset (d), and cases with poor tumour overlap of lung masks generated by U-net(R-231) (e)


We showed that training data, sampled from the clinical routine, improves generalizability to a wide spectrum of pathologies compared to public datasets. We assume this lies in the fact that many publicly available datasets do not include dense pathologies such as severe fibrosis, tumour, or effusions as part of the lung segmentation. Further, they are often provided without guarantees about segmentation quality and consistency. While the Anatomy3 dataset underwent a thorough quality assessment, the organisers of the VESSEL12 dataset merely provided lung segmentations as a courtesy supplement for the task of vessel segmentation, and within the LCTSC dataset, “tumour is excluded in most data” and “collapsed lung may be excluded in some scans” [5].

Results indicate that both the size and the diversity of the training data are relevant. State-of-the-art results can be achieved with images from only 36 patients, which is in line with previous works [41]: the U-net(R-36) model achieved a mean DSC of 0.99 on LTRC test data.

A large number of segmentation methods are proposed every year, often based on architectural modifications [32] of established models. Isensee et al. [32] showed that such modified design concepts do not improve, and occasionally even worsen, the performance of a well-designed baseline. They achieved state-of-the-art performance on multiple, publicly available segmentation challenges relying only on U-nets. This corresponds to our finding that architectural choice had a subordinate effect on performance.

At the time of submission, the U-net(R-231) achieved the second-highest score among all competitors in the LOLA11 challenge. In comparison, the first ranked method [22] achieved a score of 0.980 and a human reference segmentation achieved 0.984 [27]. Correspondingly, the U-net(R-231) model achieved improved evaluation measures (DSC, HD95, MSD, and tumour overlap) compared to two public algorithms.

There are limitations of our study that should be taken into account. Routine clinical data vary between sites. Thus, extraction of a diverse training dataset from clinical routine may only be an option for centres that are exposed to a wide range of patient variety. Evaluation results based on public datasets are not fully comparable. For example, the models trained on routine data compared to other datasets yielded lower performance in terms of DSC on the LCTSC test data. However, the lower scores for models trained on routine data in LCTSC can be attributed to the lack of very-dense pathologies in the ground truth masks. Figure 3 illustrates cases for which the R-231 model yielded low DSC. The inclusion or exclusion of pathologies such as effusions into lung segmentations is a matter of definition and application. While pleural effusions (and pneumothorax) are technically outside the lung, they are assessed as part of lung assessment and have a substantial impact on lung parenchyma appearance through compression artefacts. Neglecting such abnormalities would hamper automated lung assessment, as they are closely linked to lung function. In addition, lung masks that include pleural effusions greatly alleviate the task of effusion detection and quantification, thus making it possible to remove effusions from the lung segmentation as a post-processing step.

We proposed a general lung segmentation algorithm relevant for automated tasks in which the diagnosis is not known beforehand. However, specialised algorithms for specific diseases could be beneficial in scenarios of analysing cohorts, for which the disease is already known.

In conclusion, we showed that accurate lung segmentation does not require complex methodology and that a proven deep-learning-based segmentation architecture yields state-of-the-art results once diverse (but not necessarily larger) training data are available. By comparing various datasets for training of the models, we illustrated the importance of training data diversity and showed that data from clinical routine can generalise well to unseen cohorts, highlighting the need for public datasets specifically curated for the task of lung segmentation. We draw the following conclusions: (1) translating ML approaches from bench to bedside can require the collection of diverse training data rather than methodological modifications; (2) current, publicly available study datasets do not meet these diversity requirements; and (3) generic, semantic, segmentation algorithms are adequate for the task of lung segmentation. A reliable, universal tool for lung segmentation is fundamentally important to foster research on severe lung diseases and to study routine clinical datasets. Thus, the trained model and inference code are made publicly available under the GPL-3.0 license to serve as an open science tool for research and development and as a publicly available baseline for lung segmentation under

Availability of data and materials

The trained model and inference code are available under […]. The routine data and ground truth annotations used to train the model cannot be shared at this time. However, releasing the data is intended.



Abbreviations

CIP: Chest Imaging Platform

CT: Computed tomography

DSC: Dice similarity coefficient

HD95: Robust Hausdorff distance

LCTSC: Lung CT Segmentation Challenge 2017

LCTSC-36: Dataset of 36 cases from LCTSC

LOLA11: Lobe and Lung Analysis 2011

LTRC: Lung Tissue Research Consortium

LTRC-36: Dataset of 36 cases from LTRC

ML: Machine learning

MSD: Mean surface distance

P-HNN: Progressive Holistically Nested Networks

R-231: Dataset of 231 cases from routine

R-36: Dataset of 36 random cases from routine

VESSEL12: Vessel Segmentation in the Lung 2012

VISC-36: Dataset of 36 cases from VISCERAL Anatomy3


  1. OECD (2017) Health at a Glance 2017: OECD indicators.

  2. Mansoor A, Bagci U, Foster B et al (2015) Segmentation and image analysis of abnormal lungs at CT: current approaches, challenges, and future trends. Radiographics 35:1056–1076.

  3. Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK (2018) Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med 15:e1002683.

  4. Göksel O, Jiménez-del Toro OA, Foncubierta-Rodríguez A, Muller H (2015) Overview of the VISCERAL Challenge at ISBI. In: Proceedings of the VISCERAL Challenge at ISBI 2015. New York, NY

  5. Yang J, Veeraraghavan H, Armato SG 3rd et al (2018) Autosegmentation for thoracic radiation treatment planning: a grand challenge at AAPM 2017. Med Phys 45:4568–4581.

  6. Oakden-Rayner L, Bessen T, Palmer LJ, Carneiro G, Nascimento JC, Bradley AP (2017) Precision radiology: predicting longevity using feature engineering and deep learning methods in a radiomics framework. Sci Rep 7.

  7. Stein JM, Walkup LL, Brody AS, Fleck RJ, Woods JC (2016) Quantitative CT characterization of pediatric lung development using routine clinical imaging. Pediatr Radiol 46:1804–1812.

  8. Korfiatis P, Skiadopoulos S, Sakellaropoulos P, Kalogeropoulou C, Costaridou L (2007) Combining 2D wavelet edge highlighting and 3D thresholding for lung segmentation in thin-slice CT. Br J Radiol 80:996–1004.

    Article  CAS  PubMed  Google Scholar 

  9. Hu S, Hoffman EA, Reinhardt JM (2001) Automatic lung segmentation for accurate quantitation of volumetric X-ray CT images. IEEE Trans Med Imaging 20:490–498.

    Article  CAS  PubMed  Google Scholar 

  10. Chen H, Mukundan R, Butler A (2011) Automatic lung segmentation in HRCT images. International Conference on Image and Vision Computing, In, pp 293–298

  11. Pulagam AR, Kande GB, Ede VKR, Inampudi RB (2016) Automated lung segmentation from HRCT scans with diffuse parenchymal lung diseases. J Digit Imaging 29:507–519.

    Article  PubMed  PubMed Central  Google Scholar 

  12. Sluimer I, Prokop M, van Ginneken B (2005) Toward automated segmentation of the pathological lung in CT. IEEE Trans Med Imaging 24:1025–1038.

    Article  PubMed  Google Scholar 

  13. Iglesias JE, Sabuncu MR (2015) Multi-atlas segmentation of biomedical images: a survey. Med Image Anal 24:205–219.

    Article  PubMed  PubMed Central  Google Scholar 

  14. Li Z, Hoffman EA, Reinhardt JM (2005) Atlas-driven lung lobe segmentation in volumetric X-ray CT images. IEEE Trans Med Imaging 25:1–16.

    Article  Google Scholar 

  15. Sun S, Bauer C, Beichel R (2012) Automated 3-D segmentation of lungs with lung cancer in CT data using a novel robust active shape model approach. IEEE Trans Med Imaging 31:449–460.

    Article  CAS  PubMed  Google Scholar 

  16. Agarwala S, Nandi D, Kumar A, Dhara AK, Sadhu SBTA, Bhadra AK (2017) Automated segmentation of lung field in HRCT images using active shape model. In: IEEE Region 10 Annual International Conference. Proceedings/TENCON, IEEE, pp 2516–2520.

    Chapter  Google Scholar 

  17. Chen G, Xiang D, Zhang B et al (2019) Automatic pathological lung segmentation in low-dose CT image using eigenspace sparse shape composition. IEEE Trans Med Imaging 38:1736–1749.

  18. Sofka M, Wetzl J, Birkbeck N et al (2011) Multi-stage learning for robust lung segmentation in challenging CT volumes. In: Fichtinger G, Martel A, Peters T (eds) International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Berlin, Heidelberg, pp 667–674.

  19. Harrison AP, Xu Z, George K, Lu L, Summers RM, Mollura DJ (2017) Progressive and multi-path holistically nested neural networks for pathological lung segmentation from CT images. In: Descoteaux M, Maier-Hein L, Franz A, Jannin P, Collins D, Duchesne S (eds) International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, pp 621–629.

  20. Korfiatis P, Kalogeropoulou C, Karahaliou A, Kazantzi A, Skiadopoulos S, Costaridou L (2008) Texture classification-based segmentation of lung affected by interstitial pneumonia in high-resolution CT. Med Phys 35:5290–5302.

    Article  PubMed  Google Scholar 

  21. Wang J, Li F, Li Q (2009) Automated segmentation of lungs with severe interstitial lung disease in CT. Med Phys 36:4592–4599.

    Article  PubMed  PubMed Central  Google Scholar 

  22. Soliman A, Khalifa F, Elnakib A et al (2017) Accurate lungs segmentation on CT chest images by adaptive appearance-guided shape modeling. IEEE Trans Med Imaging 36:263–276.

  23. van Rikxoort EM, de Hoop B, Viergever MA, Prokop M, van Ginneken B (2009) Automatic lung segmentation from thoracic computed tomography scans using a hybrid approach with error detection. Med Phys 36:2934–2947.

    Article  PubMed  Google Scholar 

  24. Mansoor A, Bagci U, Xu Z et al (2014) A generic approach to pathological lung segmentation. IEEE Trans Med Imaging 33:2293.

  25. Zhang Y, Brady M, Smith S (2001) Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Trans Med Imaging 20:45–57.

    Article  CAS  PubMed  Google Scholar 

  26. Rudyanto RD, Kerkstra S, van Rikxoort EM et al (2014) Comparing algorithms for automated vessel segmentation in computed tomography scans of the lung: the VESSEL12 study. Med Image Anal 18:1217–1232.

  27. van Rikxoort EM, van Ginneken B, Kerkstra S (2011) LObe and Lung Analysis 2011 (LOLA11).

  28. Hofmanninger J, Krenn M, Holzer M, Schlegl T, Prosch H, Langs G (2016) Unsupervised identification of clinically relevant clusters in routine imaging data. International Conference on Medical Image Computing and Computer-Assisted Intervention, In, pp 192–200.

    Book  Google Scholar 

  29. Yushkevich PA, Piven J, Hazlett HC et al (2006) User-guided 3D active contour segmentation of anatomical structures: significantly improved efficiency and reliability. Neuroimage 31:1116–1128.

  30. Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. International Conference on Medical image computing and computer-assisted intervention, In, pp 234–241.

    Book  Google Scholar 

  31. Zhou X, Takayama R, Wang S, Hara T, Fujita H (2017) Deep learning of the sectional appearances of 3D CT images for anatomical structure segmentation based on an FCN voting method. Med Phys 44:5221–5233.

    Article  PubMed  Google Scholar 

  32. Isensee F, Petersen J, Kohl SAA, Jäger PF, Maier-Hein KH (2019) nnU-Net: breaking the spell on successful medical image segmentation. arXiv Prepr arXiv:1809.10486

  33. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning, In, pp 448–456

    Google Scholar 

  34. Srivastava RK, Greff K, Schmidhuber J (2015) Training very deep networks. In: Cortes C, Lawrence ND, Lee DD et al (eds) Advances in neural information processing systems. Curran Associates, Red Hook, pp 2377–2385

    Google Scholar 

  35. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition, In, pp 770–778.

    Book  Google Scholar 

  36. Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv Prepr arXiv:1511.07122

  37. Yu F, Koltun V, Funkhouser T (2017) Dilated residual networks. Proc IEEE Proceedings of the IEEE conference on computer vision and pattern recognition. pp 472–480.

  38. Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell 40:834–848.

    Article  PubMed  Google Scholar 

  39. Chest Imaging Platform (CIP). Accessed Jun 8, 2020

  40. DeepMind (2018) Library to compute surface distance based performance metrics for segmentation tasks.

  41. Guo F, Ng M, Goubran M et al (2020) Improving cardiac MRI convolutional neural network segmentation on small training datasets and dataset shift: a continuous kernel cut approach. Med Image Anal 61:101636.

    Article  PubMed  Google Scholar 

Download references


We would like to thank Mary McAllister for thorough proofreading of the article.


Research support: Siemens, Novartis, IBM, NVIDIA

Author information




JH and GL developed the presented idea and designed the experiments. JH implemented the methods and carried out the experiments. FP, SR, JP, and JH performed and validated ground truth annotations. All authors discussed the results and contributed input to the final manuscript. The authors read and approved the final manuscript.

Corresponding authors

Correspondence to Johannes Hofmanninger or Georg Langs.

Ethics declarations

Ethics approval and consent to participate

The local ethics committee of the Medical University of Vienna approved the retrospective analysis of the imaging data for the study (approval number 1154/2014).

Consent for publication

Not applicable

Competing interests

JH speaker fees: Boehringer-Ingelheim. SR consulting activities for contextflow GmbH. HP speaker fees: Boehringer-Ingelheim, Roche, Novartis, MSD, BMS, GSK, Chiesi, AstraZeneca; research support: Boehringer-Ingelheim. GL shareholder/co-founder contextflow GmbH; speaker fees: Roche, Siemens.


Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.


About this article


Cite this article

Hofmanninger, J., Prayer, F., Pan, J. et al. Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem. Eur Radiol Exp 4, 50 (2020).
