COVID-19 is a viral infectious disease caused by SARS-CoV-2 and transmitted through respiratory droplets and close contact. The SARS-CoV-2 outbreak has resulted in a global health emergency, more widespread than the severe acute respiratory syndrome (SARS) outbreak in 2003; both are caused by viruses belonging to the Coronaviridae family [3]. On 11 March 2020, the WHO declared the SARS-CoV-2 outbreak a pandemic [19].
Diagnosing the disease quickly and accurately is a clinical need, and CXR is a vital diagnostic tool for COVID-19 in the emergency setting. However, its performance in the diagnosis of COVID-19 cases has not yet been reported by large studies. To train and test a CNN-based deep learning classifier, this study included 250 COVID-19 patients who had a CXR and a positive RT-PCR test, together with 135 patients with a CXR and a negative RT-PCR test and a further 115 non-COVID-19 patients with a CXR acquired in an equivalent period preceding the epidemic.
The main finding of our study is that our deep learning system showed promising performance both at 10-fold cross-validation and when challenged on an independent new dataset.
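As a minimal sketch of how a stratified 10-fold partition can be built (not the actual pipeline used in this study), the Python snippet below assigns case indices to folds so that each fold preserves the class balance. The function name `stratified_folds`, the random seed, and the assumption that 250 COVID-19 and 250 non-COVID-19 cases enter cross-validation are illustrative only.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Assumed composition: 1 = COVID-19, 0 = non-COVID-19 (illustrative counts).
labels = [1] * 250 + [0] * 250
folds = stratified_folds(labels, k=10)

for fold_number, test_indices in enumerate(folds, start=1):
    held_out = set(test_indices)
    train_indices = [i for i in range(len(labels)) if i not in held_out]
    # A classifier would be trained on train_indices and scored on test_indices.
    print(fold_number, len(train_indices), len(test_indices))
```

In each round, one fold is held out for evaluation and the remaining nine are used for training, so every case is scored exactly once by a model that never saw it.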
It is highly likely that a human reader fully informed of the patient's history and clinical data, or reading during the peak of the epidemic with increasing prevalence, would have achieved a considerably higher sensitivity, although possibly at the cost of specificity. This trade-off is clearly visible for CT in the recent report by Ai et al. [20], where a 97% sensitivity is counterbalanced by a 25% specificity. The performance of our deep learning system is of interest for the good balance between the two measures, with 0.80 sensitivity and 0.81 specificity. No a priori selected lung pattern was used to train the deep learning system, in order to avoid human bias or limitations. Such a deep learning system includes many convolutional filters that learned a rich feature representation from millions of images of different classes (from low to high level of feature complexity) and used this variety of feature representations for the COVID-19 versus non-COVID-19 image classification task.
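The balance between sensitivity and specificity can be summarised with likelihood ratios and the Youden index. The short Python sketch below applies the standard definitions to the two pairs of values quoted above; the function and variable names are illustrative and not part of the study's code.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a binary diagnostic test."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg


def youden_index(sensitivity, specificity):
    """Youden's J = sensitivity + specificity - 1 (0 = uninformative, 1 = perfect)."""
    return sensitivity + specificity - 1.0


# Deep learning system on CXR (values reported in this study).
dl_lr_pos, dl_lr_neg = likelihood_ratios(0.80, 0.81)
dl_j = youden_index(0.80, 0.81)

# CT as reported by Ai et al. [20].
ct_lr_pos, ct_lr_neg = likelihood_ratios(0.97, 0.25)
ct_j = youden_index(0.97, 0.25)

print(f"DL on CXR: LR+ = {dl_lr_pos:.2f}, LR- = {dl_lr_neg:.2f}, J = {dl_j:.2f}")
print(f"CT [20]:   LR+ = {ct_lr_pos:.2f}, LR- = {ct_lr_neg:.2f}, J = {ct_j:.2f}")
```

With these inputs, the CXR classifier has the higher positive likelihood ratio (about 4.2 versus about 1.3), whereas CT has the lower negative likelihood ratio (about 0.12 versus about 0.25), which is the numerical form of the trade-off described above.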
This constitutes a promising starting point, especially considering the technical limitation of the bedside CXRs evaluated by the deep learning system: only one anteroposterior projection in the supine position. This means that there is room for improving CXR performance in these patients. On the one hand, the deep learning classifier can be trained on thousands of cases, applying the general principle of deep learning: the more data used for training, the higher the performance [21]. On the other hand, CXR using the standard approach, i.e., both posteroanterior and lateral projections with the patient standing upright, could substantially increase the quality of the radiographs and the three-dimensional information provided. However, this “state-of-the-art” approach is not always easy to carry out in the epidemic context, taking into consideration the possible contemporary use. Therefore, while all suspected COVID-19 patients ought to be isolated, this deep learning tool may help guide their clinical workflow, for instance by sending patients to thoracic CT when the human reading is negative and the deep learning classifier reading is positive.
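The second-reader rule mentioned above can be written as a one-line check. This is only a sketch of the single discordant case named in the text (human reading negative, deep learning reading positive); the function name is hypothetical, and what should happen in the other three combinations is deliberately left unspecified.

```python
def suggest_thoracic_ct(human_reading_positive: bool, dl_reading_positive: bool) -> bool:
    """Flag the discordant case: human reading negative, deep learning positive."""
    return (not human_reading_positive) and dl_reading_positive


# Enumerate all four reading combinations to show which one is flagged.
for human in (False, True):
    for dl in (False, True):
        print(f"human={human!s:5} dl={dl!s:5} -> CT suggested: {suggest_thoracic_ct(human, dl)}")
```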
It is important to recognise that the role of CXR in patient evaluation depends on the severity of infection in the individual patient, as well as on the COVID-19 prevalence in the community. In individuals who are asymptomatic or have mild disease, CXR may lack sensitivity if performed within the first 48 h from the onset of symptoms. Individuals with very mild disease may eventually have positive RT-PCR results but would have been missed by early CXR. Conversely, CXR should be most useful in patients who are acutely ill and symptomatic in areas with relatively high prevalence, such as Lombardy, Italy, in spring 2020. In this scenario, patients with a clinical condition and CXR findings attributable to COVID-19 could be considered as possibly infected by the virus when the first RT-PCR test result is still unavailable or negative.
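The dependence on prevalence can be made concrete with Bayes' rule. The Python sketch below computes positive and negative predictive values for a test with 0.80 sensitivity and 0.81 specificity (the values reported for the deep learning system); the three prevalence levels are hypothetical and chosen only for illustration.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a binary test at a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical prevalence levels (not taken from the study).
results = {}
for prevalence in (0.05, 0.20, 0.50):
    results[prevalence] = predictive_values(0.80, 0.81, prevalence)
    ppv, npv = results[prevalence]
    print(f"prevalence = {prevalence:.2f}: PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

At low prevalence, a positive result is much less likely to be a true positive, while a negative result is highly reassuring; as prevalence rises, the positive predictive value increases and the negative predictive value decreases.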
Since the beginning of the pandemic, numerous published studies have used machine learning or CNNs for diagnosing COVID-19 from CXR. Among these, the majority [10,11,12,13,14] used transfer-learning techniques to classify COVID-19 automatically, based on different pretrained CNNs (e.g., VGG-19, SqueezeNet, DenseNet); some of them also applied further optimisations (e.g., Bayesian optimisation by Ucar et al. [13] or hierarchical classification by Pereira et al. [14]).
As a first point of comparison, none of the considered published papers [10,11,12,13,14] used an independent testing set (either temporally or spatially independent) to obtain an unbiased evaluation of the performance of their machine learning classifiers, as was done in the present paper. Thus, the performance reported in the referenced literature may suffer from overfitting.
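A temporally independent testing set can be obtained by splitting cases on acquisition date and verifying that no patient appears on both sides. The Python sketch below illustrates the idea; the record fields, the cutoff date, and the example records are hypothetical and do not reproduce the study's data.

```python
from datetime import date


def temporal_split(records, cutoff):
    """Split records into (development, independent_test) by acquisition date."""
    development = [r for r in records if r["acquired"] < cutoff]
    independent_test = [r for r in records if r["acquired"] >= cutoff]

    # Guard against leakage: the same patient must not appear in both sets.
    dev_ids = {r["patient_id"] for r in development}
    test_ids = {r["patient_id"] for r in independent_test}
    overlap = dev_ids & test_ids
    if overlap:
        raise ValueError(f"patients present in both sets: {sorted(overlap)}")
    return development, independent_test


# Hypothetical records; label 1 = COVID-19, 0 = non-COVID-19.
records = [
    {"patient_id": "P001", "acquired": date(2020, 3, 2), "label": 1},
    {"patient_id": "P002", "acquired": date(2020, 3, 15), "label": 0},
    {"patient_id": "P003", "acquired": date(2020, 4, 3), "label": 1},
    {"patient_id": "P004", "acquired": date(2020, 4, 20), "label": 0},
]

development, independent_test = temporal_split(records, cutoff=date(2020, 4, 1))
print(len(development), "development cases;", len(independent_test), "independent test cases")
```

Cross-validation is then run only on the development set, and the independent set is scored once with the final model, so that it plays no part in model selection.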
Furthermore, the referenced works [10,11,12,13,14] did not compare the performance of the machine learning classifiers with that of expert radiologists. Our study compares the performance of a deep learning classifier with the radiologists’ reading for COVID-19 diagnosis, thus providing information about the potential adoption of the proposed classifier as a second reader to support decision-making in clinical practice.
As a last point of comparison, these works [10,11,12,13,14] used publicly available anonymised image sets for normal or non-COVID-19 CXRs collected by a group of imaging centres before the COVID-19 pandemic. For COVID-19 CXRs, instead, these studies used publicly available anonymised image sets collected by a different group of imaging centres during the COVID-19 pandemic.
Thus, in these published papers, the intrinsic systematic differences between the image sets of normal or non-COVID-19 CXRs and those of COVID-19 CXRs (e.g., distinct acquisition protocols, imaging systems, subject origin) may have inflated the final classification performance of the deep learning models.
For example, most of the published papers used non-COVID-19 CXRs from the well-known Kaggle database “Chest X-Ray Images (Pneumonia)” [22]. However, this database is composed of CXRs of normal subjects and of patients with non-COVID-19 pneumonia (other community-acquired pneumonia) obtained from retrospective cohorts of paediatric patients aged 1 to 5 years (from Guangzhou Women and Children’s Medical Center, Guangzhou). If these CXRs are classified against those of non-paediatric COVID-19 patients, the classification performance of the deep learning models may be heavily affected.
This study has some limitations. First, we trained our model on a limited number of cases, all from the same geographical area. The performance and generalisability of our model could be improved by adding new images, in particular from geographical regions other than Lombardy. Second, the independent testing set was only temporally, not geographically, separate from the training set, and was also relatively small. This may lead to an algorithm well fitted on a local scale, with unknown performance on distant cohorts. In this regard, future studies should focus on testing the algorithm on CXR image sets originating from other populations and geographical areas, and eventually on reducing overfitting by including such datasets in training. For worldwide generalisation, the algorithm should probably be retrained and tuned on CXR images that also include non-Caucasian populations, such as Asian and African ones. Third, we did not include other data, such as clinical information (e.g., symptoms and pulse oximetry data), as complementary information to be given to the deep learning model and the human readers, a perspective to be explored in future studies. Moreover, the dataset used to train the algorithm was designed to give a binary decision (COVID-19 versus non-COVID-19). However, this decision may depend on disease severity. The dataset was enriched with chest radiographs of non-consecutive sex- and age-matched patients, but it was not matched for the severity of lung abnormalities; thus, the algorithm is currently able to classify only the presence or absence of lung abnormalities associated with COVID-19-positive patients, not their severity (Fig. 3). A further limitation of the algorithm may be posed by the inclusion of both anteroposterior (bedside) and posteroanterior (standard) CXR projections in the training dataset, whereas only bedside anteroposterior CXRs were present in the independent testing dataset.
Indeed, concerning patient position during CXR exams, as reported in the “Methods” section, patients from Centre 1 only had bedside anteroposterior (AP) CXR exams, whereas patients from Centre 2 had either AP or posteroanterior (PA) projections, the latter belonging mostly to healthy controls or to healthier patients. Thus, the deep learning system was trained on a dataset composed of frontal PA or AP images. This could have introduced a bias, with AP projections being linked to patients with more severe disease, and therefore to COVID-19 cases, and PA images being linked to healthy subjects. However, the good negative likelihood ratio (LR-) suggests that this was not the case, as the algorithm was able to correctly identify a substantial number of cases of the independent testing set as negative. In fact, the algorithm seemed to err towards false negative rather than false positive interpretations, further suggesting that the presence of different projections did not hinder performance on the testing set.
In conclusion, we preliminarily showed that a CNN-based deep learning system applied to bedside CXR in patients suspected of having COVID-19, even though trained on a limited number of cases, achieved a 0.80 sensitivity and a 0.81 specificity in an independent, temporally separate patient group. The system could be used as a second-opinion tool in studies aimed at assessing its usefulness for improving the final sensitivity and specificity in different geographical and temporal settings. Its performance could be improved by training on larger multi-institutional and multi-geographical datasets, and the role of the algorithm as a second reader of CXR images could be assessed in different scenarios in patients suspected of SARS-CoV-2 infection, especially as several countries are facing repeated waves of the COVID-19 pandemic. This deep learning tool may help guide the clinical workflow, for instance by sending patients to thoracic CT when the human reading is negative and the result from the deep learning classifier is positive.