Table 1 Characteristics of the datasets utilized in this study

From: Enhancing diagnostic deep learning via self-supervised pretraining on large-scale, unlabeled non-medical images

 

|  | VinDr-CXR | ChestX-ray14 | CheXpert | MIMIC-CXR | UKA-CXR | PadChest |
|---|---|---|---|---|---|---|
| Number of radiographs (total) | 18,000 | 112,120 | 157,878 | 213,921 | 193,361 | 110,525 |
| Number of radiographs (training set) | 15,000 | 86,524 | 128,356 | 170,153 | 153,537 | 88,480 |
| Number of radiographs (test set) | 3,000 | 25,596 | 29,320 | 43,768 | 39,824 | 22,045 |
| Number of patients | N/A | 30,805 | 65,240 | 65,379 | 54,176 | 67,213 |
| Patient age (years): median | 42 | 49 | 61 | N/A | 68 | 63 |
| Patient age (years): mean ± standard deviation | 54 ± 18 | 47 ± 17 | 60 ± 18 | N/A | 66 ± 15 | 59 ± 20 |
| Patient age (years): range (minimum, maximum) | (2, 91) | (1, 96) | (18, 91) | N/A | (1, 111) | (1, 105) |
| Patient’s sex, females/males [%]: training set | 47.8/52.2 | 42.4/57.6 | 41.4/58.6 | N/A | 34.4/65.6 | 50.0/50.0 |
| Patient’s sex, females/males [%]: test set | 44.1/55.9 | 41.9/58.1 | 39.0/61.0 | N/A | 36.3/63.7 | 48.2/51.8 |
| Projections [%]: anteroposterior | 0.0 | 40.0 | 84.5 | 58.2 | 100.0 | 17.1 |
| Projections [%]: posteroanterior | 100.0 | 60.0 | 15.5 | 41.8 | 0.0 | 82.9 |
| Location | Hanoi, Vietnam | Maryland, USA | California, USA | Massachusetts, USA | Aachen, Germany | Alicante, Spain |
| Number of contributing hospitals | 2 | 1 | 1 | 1 | 1 | 1 |
| Labeling method | Manual | NLP (ChestX-ray14 labeler) | NLP (CheXpert labeler) | NLP (CheXpert labeler) | Manual | Manual & NLP (PadChest labeler) |
| Original labeling system | Binary | Binary | Certainty | Certainty | Severity | Binary |
| Accessibility of the dataset for research | Public | Public | Public | Public | Internal | Public |

  1. The table shows the statistics of the datasets used: VinDr-CXR [21], ChestX-ray14 [22], CheXpert [23], MIMIC-CXR [24], UKA-CXR [3, 25, 26, 27, 28], and PadChest [29]. The values refer to frontal chest radiographs only, and percentages are given as shares of the total radiographs. “Binary” refers to labeling whether a finding is present or not. “Severity” refers to classification of the severity of a finding. “Certainty” indicates that a certainty level was assigned to each finding during labeling, either by experienced radiologists (manual) or by an automatic natural language processing (NLP) labeler. Note that some datasets may include multiple radiographs per patient.
  2. N/A: not available.
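
The radiograph counts in the table can be cross-checked with a few lines of Python. This is a minimal sketch, not part of the original study: the counts are copied from the table as printed, and the dictionary layout and the helper names `held_out_share` and `split_gap` are illustrative assumptions.

```python
# Radiograph counts copied from Table 1 as (total, training set, test set).
COUNTS = {
    "VinDr-CXR": (18_000, 15_000, 3_000),
    "ChestX-ray14": (112_120, 86_524, 25_596),
    "CheXpert": (157_878, 128_356, 29_320),
    "MIMIC-CXR": (213_921, 170_153, 43_768),
    "UKA-CXR": (193_361, 153_537, 39_824),
    "PadChest": (110_525, 88_480, 22_045),
}


def held_out_share(train: int, test: int) -> float:
    """Percentage of the combined train + test pool used for testing."""
    return 100.0 * test / (train + test)


def split_gap(total: int, train: int, test: int) -> int:
    """Listed total minus (train + test); 0 means the two splits add up."""
    return total - (train + test)


# Print one summary line per dataset.
for name, (total, train, test) in COUNTS.items():
    share = held_out_share(train, test)
    gap = split_gap(total, train, test)
    print(f"{name:13s} test share {share:5.1f}%  gap {gap:d}")
```

With the values as printed, the training and test counts add up to the listed total for every dataset except CheXpert, where they sum to 157,676, which is 202 fewer than the listed 157,878. The table does not explain this difference.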