IMAGE SELECTION:

Nodule and non-nodule images were selected from four public datasets (JSRT, OpenI, ChestXRay14, PadChest). The aim was to select images without other significant pathologies.

The annotation protocol was defined as follows:

  • The expert is asked to annotate solitary nodules, solid or subsolid, located in a region of otherwise normal appearance.

  • In cases with at most three nodules, all nodules are annotated. Cases with more than three nodules are excluded.

  • The annotated nodules should fall (according to the expert's visual judgment) within the size range of 6 mm to 30 mm.

  • The annotator should exclude nodular structures that are completely calcified or that are part of an abnormal pattern (e.g. consolidation, atelectasis, honeycombing).

  • A nodule can be annotated in an image where such a pattern is present, provided that the nodule is in an area of normal background and well separated from the abnormal pattern.

  • Images with a predominant pattern of abnormal tissue, nodular structures, or clusters of nodules are excluded.

These public datasets were chosen because their licenses permit us to re-share the images, or because we had explicit permission from the owners to do so.

Since the JSRT labels were established by radiological examination with CT as the reference standard, and nodule location information was available, we did not re-label or re-annotate any of these cases.  For all other datasets, we selected candidate images based on the dataset labels and reviewed the images before selection/annotation for the node21 challenge.

NODULE IMAGE SELECTION:

OpenI:

We selected 82 PA frontal CXR images based on the XML tags in the associated reports.  A radiologist reviewed them according to the annotation protocol, reducing the set to 54 cases with annotated nodules.

JSRT:

All JSRT cases are already indicated as nodule or non-nodule in the metadata provided with the set. The original set contains 154 images with nodules. 

The metadata of the nodule set was traversed to check the diameter of each indicated nodule. Images were excluded if the metadata indicated that the nodule diameter was smaller than 6 mm or larger than 30 mm (the excluded images are JPCLN102.IMG, JPCLN084.IMG, JPCLN012.IMG, JPCLN005.IMG and JPCLN042.IMG).

This resulted in 149 images with nodule locations indicated in the metadata. Bounding boxes were created using the nodule locations and diameters.
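The size filter and bounding-box construction described above can be sketched as follows. This is only an illustration under stated assumptions, not the code used for node21: the record fields (`filename`, `x`, `y`, `diameter_mm`), the use of a square box centred on the nodule location, and the `pixel_spacing_mm` parameter are all hypothetical, since the text does not specify how the boxes were derived.

```python
from dataclasses import dataclass

# Size range taken from the annotation protocol (inclusive at both ends,
# since only diameters smaller than 6 mm or larger than 30 mm are excluded).
MIN_DIAMETER_MM = 6.0
MAX_DIAMETER_MM = 30.0


@dataclass
class NoduleRecord:
    # Hypothetical field names; the real JSRT metadata layout may differ.
    filename: str
    x: float            # nodule centre, in pixels (assumed)
    y: float            # nodule centre, in pixels (assumed)
    diameter_mm: float  # nodule diameter, in millimetres


def in_size_range(rec: NoduleRecord) -> bool:
    """Return True if the nodule diameter lies within 6 mm to 30 mm."""
    return MIN_DIAMETER_MM <= rec.diameter_mm <= MAX_DIAMETER_MM


def bounding_box(rec: NoduleRecord, pixel_spacing_mm: float):
    """Square box (x_min, y_min, x_max, y_max) centred on the nodule.

    Assumes the box side equals the nodule diameter converted to pixels.
    """
    half_side = rec.diameter_mm / pixel_spacing_mm / 2.0
    return (rec.x - half_side, rec.y - half_side,
            rec.x + half_side, rec.y + half_side)
```

The pixel spacing is passed in explicitly so that the sketch does not hard-code a value for any particular dataset.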

ChestXRay14: 

The metadata file provided with the dataset was traversed to find images of interest as follows:  

  • Only images with ‘View Position’ of ‘PA’ were included

  • Images with the following finding combinations were included:

    • ‘Nodule’

    • ‘Cardiomegaly|Nodule’

    • ‘Nodule|Pneumothorax’

    • ‘Cardiomegaly|Nodule|Pneumothorax’

  • Where a patient had multiple images matching these criteria the one with the earliest Follow-up number was selected.

This resulted in 1586 candidate nodule images which were reviewed by a radiologist using the annotation protocol.

Following radiology review, 617 images were retained with annotated nodules.
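The candidate selection for ChestXRay14 (before radiologist review) can be sketched as below. The column names `View Position`, `Finding Labels`, `Patient ID` and `Follow-up #` are assumptions about the metadata file, and each row is assumed to be a dictionary such as `csv.DictReader` would produce; this is an illustration of the stated criteria, not the original selection script.

```python
# Finding combinations accepted for nodule candidates, as listed above.
ALLOWED_FINDINGS = {
    "Nodule",
    "Cardiomegaly|Nodule",
    "Nodule|Pneumothorax",
    "Cardiomegaly|Nodule|Pneumothorax",
}


def select_nodule_candidates(rows):
    """Keep PA images with an allowed finding combination,
    one image per patient (the earliest follow-up number)."""
    earliest = {}
    for row in rows:
        if row["View Position"] != "PA":
            continue
        if row["Finding Labels"] not in ALLOWED_FINDINGS:
            continue
        patient = row["Patient ID"]
        best = earliest.get(patient)
        # Follow-up numbers are compared as integers (assumed numeric).
        if best is None or int(row["Follow-up #"]) < int(best["Follow-up #"]):
            earliest[patient] = row
    return list(earliest.values())
```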

PadChest:

The metadata file provided with the dataset was traversed to find images of interest as follows:

  • Only images with ‘Projection’ of ‘PA’ were included

  • The ‘Labels’ field included ‘nodule’ but not ‘pseudonodule’

  • The ‘Labels’ field excluded mentions of the following text terms [‘alveolar’ ‘tuberculosis’ ‘interstitial’ ‘cavitation’ ‘pneumonia’ ‘infiltrates’ ‘fibro’ ‘effusion’ ‘trapping’ ‘bronchiectasis’ ‘scoliosis’ ‘hernia’ ‘COPD’ ‘atelactasis’ ‘mass’ ‘asbestosis’ ‘consolidation’ ‘emphysema’ ‘metast’]

  • Where a patient had multiple images matching these criteria the one with the earliest ‘StudyDate_DICOM’ was selected

This resulted in 908 candidate nodule images which were reviewed by a radiologist using the annotation protocol.

Following radiology review, 314 images were retained with annotated nodules.
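The PadChest text-based filter can be sketched as follows. The exclusion terms are copied verbatim from the list above; treating the ‘Labels’ field as a single string, matching case-insensitively by substring, the `PatientID` column name, and comparing ‘StudyDate_DICOM’ values as fixed-width `YYYYMMDD` strings are all assumptions made for illustration.

```python
# Exclusion terms copied verbatim from the selection criteria above.
EXCLUDED_TERMS = [
    "alveolar", "tuberculosis", "interstitial", "cavitation", "pneumonia",
    "infiltrates", "fibro", "effusion", "trapping", "bronchiectasis",
    "scoliosis", "hernia", "COPD", "atelactasis", "mass", "asbestosis",
    "consolidation", "emphysema", "metast",
]


def is_nodule_candidate(row):
    """Apply the projection, nodule and exclusion-term criteria to one row."""
    if row["Projection"] != "PA":
        return False
    labels = row["Labels"].lower()  # case-insensitive matching (assumed)
    # 'nodule' is a substring of 'pseudonodule', so check both explicitly.
    if "nodule" not in labels or "pseudonodule" in labels:
        return False
    return not any(term.lower() in labels for term in EXCLUDED_TERMS)


def earliest_per_patient(rows):
    """Keep one row per patient: the one with the earliest study date."""
    earliest = {}
    for row in rows:
        patient = row["PatientID"]  # hypothetical column name
        best = earliest.get(patient)
        # String comparison is valid only for fixed-width YYYYMMDD dates.
        if best is None or row["StudyDate_DICOM"] < best["StudyDate_DICOM"]:
            earliest[patient] = row
    return list(earliest.values())
```

A typical use would be `earliest_per_patient([r for r in rows if is_nodule_candidate(r)])`, which yields the candidates that then go to radiologist review.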

NON-NODULE IMAGE SELECTION:

JSRT:

All JSRT cases are already indicated as nodule or non-nodule in the metadata provided with the set. The original set contains 93 images without nodules.  All of these were retained. 

OpenI:

Since reliable orientation information was not available for this dataset, a neural network trained to differentiate lateral, antero-posterior and postero-anterior views was first used to label the view type of each image in the set.

Next, the patient metadata provided with the dataset was traversed and patients with a ‘finding’ of ‘normal’ were included.  For each of these patients, the ‘PA’ orientation image was selected. (A small number of cases where no PA image or more than one PA image was found were skipped.)

This resulted in a selection of 1164 images.

These were reviewed by a member of the node21 team to check for any severe abnormalities, obvious nodules, apparent images of children (age is not provided by OpenI) or rotated images.  After removing any such cases, 1157 non-nodule images remained.  A nodule detection algorithm was run on these, and any detected nodules with probability >0.5 were reviewed first by a member of the node21 team and then, if the decision was uncertain, by a radiologist.  Any case where a nodule could not be excluded was removed, leaving 1102 non-nodule images.
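The detector-based triage step can be sketched as below. The input structure (a mapping from image identifier to a list of detection probabilities) is hypothetical, since the text does not describe the detection algorithm or its output format; only the 0.5 threshold comes from the description above.

```python
REVIEW_THRESHOLD = 0.5  # detections with probability above this are reviewed


def images_needing_review(detections, threshold=REVIEW_THRESHOLD):
    """Return the sorted IDs of images with at least one detection whose
    probability is strictly greater than the threshold.

    `detections` maps an image ID to a list of detection probabilities
    (an assumed format, for illustration only).
    """
    return sorted(
        image_id
        for image_id, probabilities in detections.items()
        if any(p > threshold for p in probabilities)
    )
```

Images returned by this function would go to manual review; images without any detection above the threshold would be kept as non-nodule images without further review.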

ChestXRay14:

The metadata file provided with the dataset was traversed to find images of interest as follows:  

  • Only images with ‘View Position’ of ‘PA’ were included

  • Images with ‘Finding Label’ of ‘No Finding’ were included

  • Where a patient had multiple images matching these criteria the one with the earliest Follow-up number was selected

This resulted in 22452 candidate non-nodule images, of which 1500 were randomly selected for review by a node21 team member.  During review, images with severe abnormalities, rotation, obvious nodules or incorrect orientation were removed. Images of subjects below 16 years of age were also removed using the metadata.  This left a selection of 1239 images. A nodule detection algorithm was run on these, and any detected nodules with probability >0.5 were reviewed first by a member of the node21 team and then, if the decision was uncertain, by a radiologist.  Any case where a nodule could not be excluded was removed, leaving 1187 non-nodule images.
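The metadata-driven part of this selection (filtering, random sampling and the age cut-off) can be sketched as follows. The column names `View Position`, `Finding Labels`, `Patient ID`, `Follow-up #` and `Patient Age`, the fixed random seed, and the assumption that ages are stored in whole years are all illustrative; the manual review steps are not modelled.

```python
import random

REVIEW_SAMPLE_SIZE = 1500  # number of candidates randomly drawn for review
MIN_AGE_YEARS = 16         # subjects below this age are removed


def earliest_pa_no_finding(rows):
    """Keep PA images labelled 'No Finding', one per patient
    (the earliest follow-up number)."""
    earliest = {}
    for row in rows:
        if row["View Position"] != "PA" or row["Finding Labels"] != "No Finding":
            continue
        patient = row["Patient ID"]
        best = earliest.get(patient)
        if best is None or int(row["Follow-up #"]) < int(best["Follow-up #"]):
            earliest[patient] = row
    return list(earliest.values())


def sample_for_review(candidates, k=REVIEW_SAMPLE_SIZE, seed=0):
    """Draw up to k candidates without replacement.

    The seed is an assumption, added only to make the sketch reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(candidates, min(k, len(candidates)))


def drop_minors(rows, min_age=MIN_AGE_YEARS):
    """Remove subjects younger than min_age (age assumed to be in years)."""
    return [row for row in rows if int(row["Patient Age"]) >= min_age]
```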

PadChest:

The metadata file provided with the dataset was traversed to find images of interest as follows:

  • Only images with ‘Projection’ of ‘PA’ were included

  • The ‘Labels’ field was (exactly) ‘normal’

  • Where a patient had multiple images matching these criteria the one with the earliest ‘StudyDate_DICOM’ was selected

This resulted in 28688 candidate non-nodule images, of which 1500 were randomly selected for review by a node21 team member.  During review, images with severe abnormalities, rotation, obvious nodules or incorrect orientation were removed. Images of subjects below 16 years of age were also removed using the metadata.

This left a selection of 1406 images. A nodule detection algorithm was run on these, and any detected nodules with probability >0.5 were reviewed first by a member of the node21 team and then, if the decision was uncertain, by a radiologist.  Any case where a nodule could not be excluded was removed, leaving 1366 non-nodule images.