Table 1 Workflow to create a dataset of annotated images to be used for machine learning applications

From: Image annotation and curation in radiology: an overview for machine learning practitioners

Step

Description

Definition of the data of interest

Specifying what kind of data should be collected and annotated for the project, in terms of imaging modality, protocol, anatomy, pathology, and clinical question that are relevant for the application.

Data collection and de-identification

Acquiring the imaging data from the source, such as a PACS, a DICOM server, or a public repository. The data should be representative of the target population and environment. Data must be de-identified by removing any personal or sensitive information that could identify patients or institutions. Compliance with applicable ethical and legal regulations, such as HIPAA or GDPR, must be ensured.

Annotation

Labelling the data with the information that is needed for the machine learning task, such as bounding boxes, polygons, masks, or tags. A standard protocol or guideline for annotation should be followed. The annotation must be accurate, consistent, and complete. Either custom computer programs or existing software, free or proprietary, may be used to facilitate this process.

Curation

Reviewing and validating the annotated data and resolving any errors or discrepancies. Multiple experts or consensus methods may be employed to check the quality and reliability of the annotation. Software tools can be used to manage and monitor the annotation process.

Storage

Storing and organising the annotated data in a format that is suitable for machine learning, such as DICOM, NIfTI, or PNG. Data must be stored securely and remain accessible to the machine learning framework and model. Specific software tools can be employed to track and version the data.
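
As an illustration of the curation step, the following minimal sketch shows one possible way to quantify agreement between two annotators and to list the images that need consensus review. It assumes image-level categorical labels and uses Cohen's kappa as the agreement measure; the table itself does not prescribe any particular metric. The function names, image identifiers, and labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def disagreements(image_ids, labels_a, labels_b):
    """Return the ids of images on which the two annotators disagree."""
    return [i for i, a, b in zip(image_ids, labels_a, labels_b) if a != b]


# Hypothetical example data: four images read by two annotators.
image_ids = ["img_001", "img_002", "img_003", "img_004"]
reader_a = ["nodule", "normal", "nodule", "normal"]
reader_b = ["nodule", "normal", "normal", "normal"]

kappa = cohen_kappa(reader_a, reader_b)
to_review = disagreements(image_ids, reader_a, reader_b)
print(f"kappa = {kappa:.2f}; images needing consensus review: {to_review}")
```

In this sketch, images returned by disagreements would be passed to a further expert or a consensus session. For segmentation masks or bounding boxes, an overlap measure such as the Dice coefficient would be a more natural choice than kappa.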