Table 3 Evaluation results after training segmentation architectures on different training sets

From: Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem

Per-dataset columns give the Dice similarity coefficient (DSC) on lung slices only: LTRC, LCTSC, and VESS12 are public test datasets; RRT, Atel, Emph, Fibr, Mass, PnTh, Trau, and Norm are routine test datasets. The last four columns give mean ± SD over the aggregated cases: DSC for All(L)* and All, and HD95 and MSD (both in mm) for All.

| Architecture | Training set | LTRC | LCTSC | VESS12 | RRT | Atel | Emph | Fibr | Mass | PnTh | Trau | Norm | All(L)* DSC | All DSC | All HD95 (mm) | All MSD (mm) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| U-net | R-36 | 0.99 | 0.93 | 0.98 | 0.92 | 0.95 | 0.99 | 0.96 | 0.98 | 0.99 | 0.93 | 0.97 | 0.97 ± 0.05 | 0.96 ± 0.08 | 9.19 ± 18.15 | 1.43 ± 2.26 |
| U-net | LTRC-36 | 0.99 | 0.96 | 0.99 | 0.86 | 0.93 | 0.99 | 0.95 | 0.98 | 0.98 | 0.90 | 0.97 | 0.97 ± 0.08 | 0.94 ± 0.13 | 11.90 ± 22.90 | 2.42 ± 5.99 |
| U-net | LCTSC-36 | 0.98 | 0.97 | 0.98 | 0.85 | 0.91 | 0.98 | 0.92 | 0.98 | 0.98 | 0.89 | 0.97 | 0.96 ± 0.09 | 0.92 ± 0.14 | 10.96 ± 14.85 | 1.96 ± 2.87 |
| U-net | VISC-36 | 0.98 | 0.95 | 0.98 | 0.84 | 0.91 | 0.98 | 0.90 | 0.98 | 0.98 | 0.89 | 0.97 | 0.96 ± 0.09 | 0.92 ± 0.15 | 13.04 ± 19.04 | 2.05 ± 3.08 |
| ResU-net | R-36 | 0.99 | 0.93 | 0.98 | 0.91 | 0.95 | 0.99 | 0.96 | 0.98 | 0.98 | 0.93 | 0.97 | 0.97 ± 0.06 | 0.95 ± 0.09 | 8.66 ± 15.06 | 1.50 ± 2.34 |
| ResU-net | LTRC-36 | 0.99 | 0.96 | 0.99 | 0.86 | 0.94 | 0.99 | 0.95 | 0.98 | 0.98 | 0.89 | 0.97 | 0.97 ± 0.08 | 0.94 ± 0.13 | 11.58 ± 21.16 | 2.48 ± 6.24 |
| ResU-net | LCTSC-36 | 0.98 | 0.97 | 0.98 | 0.85 | 0.92 | 0.98 | 0.95 | 0.97 | 0.98 | 0.89 | 0.97 | 0.96 ± 0.09 | 0.93 ± 0.14 | 12.15 ± 19.42 | 2.36 ± 4.68 |
| ResU-net | VISC-36 | 0.97 | 0.96 | 0.98 | 0.84 | 0.91 | 0.98 | 0.89 | 0.98 | 0.98 | 0.89 | 0.97 | 0.95 ± 0.09 | 0.92 ± 0.15 | 9.41 ± 15.00 | 1.83 ± 2.92 |
| DRN | R-36 | 0.98 | 0.93 | 0.97 | 0.88 | 0.94 | 0.98 | 0.95 | 0.97 | 0.98 | 0.92 | 0.96 | 0.96 ± 0.07 | 0.94 ± 0.12 | 8.96 ± 17.67 | 1.96 ± 3.97 |
| DRN | LTRC-36 | 0.98 | 0.95 | 0.98 | 0.85 | 0.93 | 0.98 | 0.94 | 0.98 | 0.98 | 0.89 | 0.97 | 0.96 ± 0.08 | 0.93 ± 0.14 | 10.94 ± 20.93 | 2.66 ± 6.66 |
| DRN | LCTSC-36 | 0.97 | 0.96 | 0.97 | 0.83 | 0.90 | 0.98 | 0.90 | 0.97 | 0.97 | 0.89 | 0.96 | 0.95 ± 0.09 | 0.91 ± 0.15 | 8.98 ± 13.30 | 1.92 ± 2.73 |
| DRN | VISC-36 | 0.96 | 0.95 | 0.97 | 0.83 | 0.90 | 0.97 | 0.92 | 0.97 | 0.97 | 0.87 | 0.97 | 0.94 ± 0.10 | 0.91 ± 0.15 | 8.96 ± 13.62 | 1.92 ± 2.83 |
| Deeplab v3+ | R-36 | 0.98 | 0.92 | 0.98 | 0.90 | 0.93 | 0.99 | 0.95 | 0.98 | 0.98 | 0.92 | 0.97 | 0.96 ± 0.06 | 0.95 ± 0.09 | 8.99 ± 14.32 | 1.71 ± 2.68 |
| Deeplab v3+ | LTRC-36 | 0.99 | 0.94 | 0.99 | 0.85 | 0.93 | 0.98 | 0.94 | 0.98 | 0.98 | 0.89 | 0.97 | 0.96 ± 0.09 | 0.93 ± 0.14 | 11.90 ± 21.80 | 2.51 ± 6.07 |
| Deeplab v3+ | LCTSC-36 | 0.98 | 0.96 | 0.98 | 0.85 | 0.92 | 0.98 | 0.93 | 0.98 | 0.98 | 0.89 | 0.96 | 0.96 ± 0.08 | 0.93 ± 0.14 | 10.47 ± 19.14 | 2.21 ± 4.67 |
| Deeplab v3+ | VISC-36 | 0.98 | 0.96 | 0.98 | 0.85 | 0.93 | 0.98 | 0.95 | 0.98 | 0.98 | 0.89 | 0.97 | 0.96 ± 0.08 | 0.93 ± 0.14 | 10.16 ± 21.21 | 2.15 ± 4.99 |

  1. The training sets R-36, LTRC-36, LCTSC-36, and VISC-36 contained the same number of volumes and slices, allowing a direct comparison of models trained on them; higher is better for the Dice similarity coefficient (DSC), lower is better for the robust Hausdorff distance (HD95) and mean surface distance (MSD). Although the different architectures performed comparably, training on routine data outperformed training on public cohort datasets.
  2. *The LCTSC ground-truth masks do not include high-density areas, and the high number of LTRC test cases dominates the averaged results. Thus, “All(L)” (n = 167) is the mean over all cases including the LCTSC and LTRC cases, while “All” (n = 62) excludes them. For abbreviations, see Tables 1 and 2.
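The three metrics in this table are standard segmentation measures. As a rough illustration of how they are commonly computed (a minimal sketch, not the evaluation code used by the authors; the function names, the default voxel spacing, and the pooled symmetric HD95 convention below are assumptions, and HD95 conventions vary between implementations):

```python
# Illustrative sketch of DSC / HD95 / MSD -- not the paper's evaluation code.
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt


def dice(pred, gt):
    """Dice similarity coefficient (DSC) of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum())


def _surface_distances(a, b, spacing):
    """Distances (mm) from each surface voxel of `a` to the surface of `b`."""
    a, b = a.astype(bool), b.astype(bool)
    surf_a = a & ~binary_erosion(a)  # border voxels of a
    surf_b = b & ~binary_erosion(b)  # border voxels of b
    # EDT of the complement of surf_b gives, at every voxel, the distance to
    # the nearest surface voxel of b, scaled by the physical voxel spacing.
    dist_to_b = distance_transform_edt(~surf_b, sampling=spacing)
    return dist_to_b[surf_a]


def hd95_and_msd(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """Symmetric 95th-percentile Hausdorff distance (HD95) and mean surface
    distance (MSD), both in mm, from the pooled surface distances."""
    d = np.concatenate([_surface_distances(pred, gt, spacing),
                        _surface_distances(gt, pred, spacing)])
    return np.percentile(d, 95), d.mean()
```

Per-volume values computed this way would then be averaged over each test set to produce means and standard deviations like those reported above.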