Enhancing diagnostic deep learning via self-supervised pretraining on large-scale, unlabeled non-medical images

European Radiology Experimental

Table 1 Characteristics of the datasets utilized in this study

	VinDr-CXR	ChestX-ray14	CheXpert	MIMIC-CXR	UKA-CXR	PadChest
Number of radiographs (total)	18,000	112,120	157,878	213,921	193,361	110,525
Number of radiographs (training set)	15,000	86,524	128,356	170,153	153,537	88,480
Number of radiographs (test set)	3,000	25,596	29,320	43,768	39,824	22,045
Number of patients	N/A	30,805	65,240	65,379	54,176	67,213
Patient age (years), median	42	49	61	N/A	68	63
Patient age (years), mean ± standard deviation	54 ± 18	47 ± 17	60 ± 18	N/A	66 ± 15	59 ± 20
Patient age (years), range (minimum, maximum)	(2, 91)	(1, 96)	(18, 91)	N/A	(1, 111)	(1, 105)
Patient’s sex, females/males [%], training set	47.8/52.2	42.4/57.6	41.4/58.6	N/A	34.4/65.6	50.0/50.0
Patient’s sex, females/males [%], test set	44.1/55.9	41.9/58.1	39.0/61.0	N/A	36.3/63.7	48.2/51.8
Projections [%], anteroposterior	0.0	40.0	84.5	58.2	100.0	17.1
Projections [%], posteroanterior	100.0	60.0	15.5	41.8	0.0	82.9
Location	Hanoi, Vietnam	Maryland, USA	California, USA	Massachusetts, USA	Aachen, Germany	Alicante, Spain
Number of contributing hospitals	2	1	1	1	1	1
Labeling method	Manual	NLP (ChestX-ray14 labeler)	NLP (CheXpert labeler)	NLP (CheXpert labeler)	Manual	Manual & NLP (PadChest labeler)
Original labeling system	Binary	Binary	Certainty	Certainty	Severity	Binary
Accessibility of the dataset for research	Public	Public	Public	Public	Internal	Public

The table shows the statistics of the datasets used: VinDr-CXR [21], ChestX-ray14 [22], CheXpert [23], MIMIC-CXR [24], UKA-CXR [3, 25–28], and PadChest [29]. All values refer to frontal chest radiographs only, and percentages are given relative to the total number of radiographs. “Binary” refers to labeling whether a finding is present or not. “Severity” refers to classification of the severity of a finding. “Certainty” indicates that a certainty level was assigned to each finding during labeling, performed either by experienced radiologists (manual) or by an automatic natural language processing (NLP) labeler. Note that some datasets may include multiple radiographs per patient.
N/A: not available
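
The training and test counts in Table 1 can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the original study: the counts are copied from the table, while the names `COUNTS` and `split_summary` are hypothetical and introduced only for illustration. It reports the test-set share of each dataset and the difference between the listed total and the sum of the two subsets. With the values as listed, that difference is zero for every dataset except CheXpert, where the subsets sum to 157,676 against a listed total of 157,878; the table does not explain this difference.

```python
# Counts copied from Table 1 (frontal radiographs): (total, training, test).
COUNTS = {
    "VinDr-CXR": (18_000, 15_000, 3_000),
    "ChestX-ray14": (112_120, 86_524, 25_596),
    "CheXpert": (157_878, 128_356, 29_320),
    "MIMIC-CXR": (213_921, 170_153, 43_768),
    "UKA-CXR": (193_361, 153_537, 39_824),
    "PadChest": (110_525, 88_480, 22_045),
}


def split_summary(counts):
    """Return {dataset: (test share in percent, total - (training + test))}."""
    summary = {}
    for name, (total, train, test) in counts.items():
        # Share of the test set among the radiographs assigned to either subset.
        test_share = 100.0 * test / (train + test)
        # Radiographs in the listed total that appear in neither subset count.
        gap = total - (train + test)
        summary[name] = (round(test_share, 1), gap)
    return summary


if __name__ == "__main__":
    for name, (share, gap) in split_summary(COUNTS).items():
        print(f"{name}: test share {share}%, total minus subsets {gap}")
```

Keeping the counts in a single dictionary makes it easy to rerun the check if the table values are updated.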