In theory, massive amounts of training data for the development of complex AI algorithms should be available in healthcare and radiology. For example, assuming that an average-sized radiological department performs 200 CT scans per year for the detection of pulmonary embolism, of which around 25 to 50 show a visible thrombus in the pulmonary arteries, over the course of 10 years this would amount to a total of at least 2000 scans, with at least 250 showing pulmonary embolism. These imaging exams could be made accessible through the department’s Picture Archiving and Communication System (PACS) and should be accompanied by the corresponding radiological reports, each with at least one clear statement regarding the presence or absence of a pulmonary embolism.
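For illustration, this back-of-envelope arithmetic can be written out as a short script; the figures below are the assumed values from the example above, not measured departmental statistics:

```python
# Back-of-envelope estimate of the training data a single department could
# accumulate. All figures are illustrative assumptions from the text.
scans_per_year = 200          # CT pulmonary angiograms per year
positives_per_year = (25, 50) # scans showing a visible thrombus
years = 10

total_scans = scans_per_year * years
total_positives = tuple(n * years for n in positives_per_year)

print(f"Total scans over {years} years: {total_scans}")                # 2000
print(f"Positive scans: {total_positives[0]}-{total_positives[1]}")    # 250-500
```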
One would expect the most difficult part to be developing an algorithm that automatically detects emboli within the pulmonary arteries. Given recent technological advances, however, this appears relatively straightforward to solve, provided sufficient accurately labelled data are available. It is precisely this access to accurately labelled data that is problematic: the vast majority of radiological reports are currently written as unstructured narrative text, and extracting the information contained in such reports is challenging.
Natural language processing (NLP) has made substantial advances in recent decades and could, in theory, help to mine unstructured reports for relevant information [3,4,5]. However, one crucial challenge remains for NLP algorithms: conventional radiological reports often vary considerably, not only in language but also in the findings reported. In some clinical settings, examinations are highly specific and the reported findings relatively consistent, allowing accurate classification of report content. One such case is CT for pulmonary embolism, for which excellent classification accuracy has been demonstrated [6]. In other examinations, however, such as magnetic resonance imaging (MRI) of the lumbar spine, the interpretative findings reported vary so markedly that accurate classification of report content is highly unlikely [7]. The problem is further aggravated when the radiologist’s impression of a particular examination also draws on clinical information that is not included in the radiological report.
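To illustrate why report mining is harder than simple keyword search, the following minimal sketch labels free-text reports for pulmonary embolism using a NegEx-style negation check. The trigger terms and negation cues are illustrative assumptions, not a validated rule set, and the marked variability of, for example, lumbar spine MRI reports would quickly defeat such hand-written rules:

```python
import re

# Illustrative trigger terms and negation cues; a real rule set would be
# far larger and clinically validated.
PE_TERMS = re.compile(r"\b(pulmonary embol(us|i|ism)|filling defect)\b", re.I)
NEGATION_CUES = re.compile(r"\b(no|without|negative for|rules? out)\b", re.I)

def label_report(text: str) -> str:
    """Classify a free-text report as PE-positive, PE-negative, or unknown."""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if PE_TERMS.search(sentence):
            # A negation cue in the same sentence flips the label,
            # e.g. "No evidence of pulmonary embolism."
            if NEGATION_CUES.search(sentence):
                return "negative"
            return "positive"
    return "unknown"

print(label_report("Filling defect in the right lower lobe artery."))  # positive
print(label_report("No evidence of pulmonary embolism."))              # negative
print(label_report("Mild cardiomegaly. The lungs are clear."))         # unknown
```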
This was most notably demonstrated by a study from a Stanford University group led by Andrew Ng, which, in its first version [8], claimed that an algorithm showed superhuman performance in recognising pneumonia on plain chest radiography. This claim was quickly picked up and echoed through various media sources [9]. However, there was a crucial issue with the original training dataset, whose labels had been extracted using text-mining techniques [10]. Among other issues, a certain proportion of images carried wrong labels, and there was overlap between different labels such as consolidation, atelectasis, infiltration, and pneumonia (Fig. 1). Moreover, while all of these findings may have a similar visual appearance, diagnosing pneumonia usually also requires clinical information and laboratory results, which may not be reflected in the final report. The initial claim of superhuman performance in the detection of pneumonia has since been put into perspective in light of these issues [11].
Nevertheless, this prominent example clearly shows that the most critical step in developing AI algorithms is to extract meaningful and reliable labels that establish a valid ground truth. It seems evident that similar problems will arise whenever unstructured radiological reports are used as the basis for extracting labels. In this context, structured radiological reporting could offer a solution through standardised report content and more consistent language. Apart from unstructured reports being difficult to analyse for automated systems seeking to extract knowledge, referrers have also been shown to favour more structured radiological reports [12,13,14,15,16]. Moreover, various studies found that structured reports provide greater clarity and completeness with regard to relevant information than unstructured reports [17, 18].
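As a sketch of why structured reporting simplifies label extraction, the snippet below reads a hypothetical structured report with a fixed schema. The field names are assumptions for illustration rather than an established template standard; real templates, such as those on RSNA’s RadReport portal, define their own schemas:

```python
import json

# Hypothetical structured report with a fixed, machine-readable schema.
structured_report = json.loads("""
{
  "examination": "CT pulmonary angiography",
  "findings": {
    "pulmonary_embolism": {"present": true, "location": "right lower lobe"}
  },
  "impression": "Acute pulmonary embolism."
}
""")

# With a fixed schema, the ground-truth label is a single dictionary lookup
# instead of free-text mining.
label = structured_report["findings"]["pulmonary_embolism"]["present"]
print("PE label:", label)  # PE label: True
```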