Table 3 Evaluation results after training segmentation architectures on different training sets

From: Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem

Per-dataset columns give the Dice similarity coefficient (DSC) on lung slices only: LTRC, LCTSC, and VESS12 are public test datasets; RRT, Atel, Emph, Fibr, Mass, PnTh, Trau, and Norm are routine test datasets. The last four columns give mean ± SD over the aggregated cases: DSC for All(L)* and All, and HD95 and MSD (both in mm) for All.

| Architecture | Training set | LTRC | LCTSC | VESS12 | RRT | Atel | Emph | Fibr | Mass | PnTh | Trau | Norm | All(L)* DSC | All DSC | All HD95 (mm) | All MSD (mm) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| U-net | R-36 | 0.99 | 0.93 | 0.98 | 0.92 | 0.95 | 0.99 | 0.96 | 0.98 | 0.99 | 0.93 | 0.97 | 0.97 ± 0.05 | 0.96 ± 0.08 | 9.19 ± 18.15 | 1.43 ± 2.26 |
| U-net | LTRC-36 | 0.99 | 0.96 | 0.99 | 0.86 | 0.93 | 0.99 | 0.95 | 0.98 | 0.98 | 0.90 | 0.97 | 0.97 ± 0.08 | 0.94 ± 0.13 | 11.90 ± 22.90 | 2.42 ± 5.99 |
| U-net | LCTSC-36 | 0.98 | 0.97 | 0.98 | 0.85 | 0.91 | 0.98 | 0.92 | 0.98 | 0.98 | 0.89 | 0.97 | 0.96 ± 0.09 | 0.92 ± 0.14 | 10.96 ± 14.85 | 1.96 ± 2.87 |
| U-net | VISC-36 | 0.98 | 0.95 | 0.98 | 0.84 | 0.91 | 0.98 | 0.90 | 0.98 | 0.98 | 0.89 | 0.97 | 0.96 ± 0.09 | 0.92 ± 0.15 | 13.04 ± 19.04 | 2.05 ± 3.08 |
| ResU-net | R-36 | 0.99 | 0.93 | 0.98 | 0.91 | 0.95 | 0.99 | 0.96 | 0.98 | 0.98 | 0.93 | 0.97 | 0.97 ± 0.06 | 0.95 ± 0.09 | 8.66 ± 15.06 | 1.50 ± 2.34 |
| ResU-net | LTRC-36 | 0.99 | 0.96 | 0.99 | 0.86 | 0.94 | 0.99 | 0.95 | 0.98 | 0.98 | 0.89 | 0.97 | 0.97 ± 0.08 | 0.94 ± 0.13 | 11.58 ± 21.16 | 2.48 ± 6.24 |
| ResU-net | LCTSC-36 | 0.98 | 0.97 | 0.98 | 0.85 | 0.92 | 0.98 | 0.95 | 0.97 | 0.98 | 0.89 | 0.97 | 0.96 ± 0.09 | 0.93 ± 0.14 | 12.15 ± 19.42 | 2.36 ± 4.68 |
| ResU-net | VISC-36 | 0.97 | 0.96 | 0.98 | 0.84 | 0.91 | 0.98 | 0.89 | 0.98 | 0.98 | 0.89 | 0.97 | 0.95 ± 0.09 | 0.92 ± 0.15 | 9.41 ± 15.00 | 1.83 ± 2.92 |
| DRN | R-36 | 0.98 | 0.93 | 0.97 | 0.88 | 0.94 | 0.98 | 0.95 | 0.97 | 0.98 | 0.92 | 0.96 | 0.96 ± 0.07 | 0.94 ± 0.12 | 8.96 ± 17.67 | 1.96 ± 3.97 |
| DRN | LTRC-36 | 0.98 | 0.95 | 0.98 | 0.85 | 0.93 | 0.98 | 0.94 | 0.98 | 0.98 | 0.89 | 0.97 | 0.96 ± 0.08 | 0.93 ± 0.14 | 10.94 ± 20.93 | 2.66 ± 6.66 |
| DRN | LCTSC-36 | 0.97 | 0.96 | 0.97 | 0.83 | 0.90 | 0.98 | 0.90 | 0.97 | 0.97 | 0.89 | 0.96 | 0.95 ± 0.09 | 0.91 ± 0.15 | 8.98 ± 13.30 | 1.92 ± 2.73 |
| DRN | VISC-36 | 0.96 | 0.95 | 0.97 | 0.83 | 0.90 | 0.97 | 0.92 | 0.97 | 0.97 | 0.87 | 0.97 | 0.94 ± 0.10 | 0.91 ± 0.15 | 8.96 ± 13.62 | 1.92 ± 2.83 |
| Deeplab v3+ | R-36 | 0.98 | 0.92 | 0.98 | 0.90 | 0.93 | 0.99 | 0.95 | 0.98 | 0.98 | 0.92 | 0.97 | 0.96 ± 0.06 | 0.95 ± 0.09 | 8.99 ± 14.32 | 1.71 ± 2.68 |
| Deeplab v3+ | LTRC-36 | 0.99 | 0.94 | 0.99 | 0.85 | 0.93 | 0.98 | 0.94 | 0.98 | 0.98 | 0.89 | 0.97 | 0.96 ± 0.09 | 0.93 ± 0.14 | 11.90 ± 21.80 | 2.51 ± 6.07 |
| Deeplab v3+ | LCTSC-36 | 0.98 | 0.96 | 0.98 | 0.85 | 0.92 | 0.98 | 0.93 | 0.98 | 0.98 | 0.89 | 0.96 | 0.96 ± 0.08 | 0.93 ± 0.14 | 10.47 ± 19.14 | 2.21 ± 4.67 |
| Deeplab v3+ | VISC-36 | 0.98 | 0.96 | 0.98 | 0.85 | 0.93 | 0.98 | 0.95 | 0.98 | 0.98 | 0.89 | 0.97 | 0.96 ± 0.08 | 0.93 ± 0.14 | 10.16 ± 21.21 | 2.15 ± 4.99 |

  1. The training sets R-36, LTRC-36, LCTSC-36, and VISC-36 contained the same number of volumes and slices, allowing a direct comparison of models trained on them; higher is better for the Dice similarity coefficient (DSC), lower is better for the robust Hausdorff distance (HD95) and mean surface distance (MSD). Although the different architectures performed comparably, training on routine data outperformed training on public cohort datasets.
  2. *The LCTSC ground-truth masks do not include high-density areas, and the high number of LTRC test cases dominates the averaged results. Thus, “All(L)” (n = 167) is the mean over all cases including the LCTSC and LTRC cases, while “All” (n = 62) excludes them. For abbreviations, see Tables 1 and 2.
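The three metrics in this table are standard segmentation measures. As a rough illustration of how they are commonly computed (a minimal sketch, not the evaluation code used by the authors; the function names, the default voxel spacing, and the pooled symmetric HD95 convention below are assumptions, and HD95 conventions vary between implementations):

```python
# Illustrative sketch of DSC / HD95 / MSD -- not the paper's evaluation code.
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt


def dice(pred, gt):
    """Dice similarity coefficient (DSC) of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum())


def _surface_distances(a, b, spacing):
    """Distances (mm) from each surface voxel of `a` to the surface of `b`."""
    a, b = a.astype(bool), b.astype(bool)
    surf_a = a & ~binary_erosion(a)  # border voxels of a
    surf_b = b & ~binary_erosion(b)  # border voxels of b
    # EDT of the complement of surf_b gives, at every voxel, the distance to
    # the nearest surface voxel of b, scaled by the physical voxel spacing.
    dist_to_b = distance_transform_edt(~surf_b, sampling=spacing)
    return dist_to_b[surf_a]


def hd95_and_msd(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """Symmetric 95th-percentile Hausdorff distance (HD95) and mean surface
    distance (MSD), both in mm, from the pooled surface distances."""
    d = np.concatenate([_surface_distances(pred, gt, spacing),
                        _surface_distances(gt, pred, spacing)])
    return np.percentile(d, 95), d.mean()
```

Per-volume values computed this way would then be averaged over each test set to produce means and standard deviations like those reported above.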