Data

Several datasets are associated with NODE21. The training dataset is public and is hosted on Zenodo. The experimental and final test sets are private and will not be released. The final test set was used only in the phases prior to January 2022; its results are reserved for a publication currently in preparation.

Preprocessing

All images are provided in their original format as well as in a preprocessed .mha format. The private test data is also preprocessed, so we recommend using the preprocessed set. The preprocessing used code from the OpenCXR library to standardize image appearance through the following steps:

  1. Removal of homogeneous border regions
  2. Energy-based normalization of image intensity values, implementation following this paper
  3. Segmentation of the lung fields and cropping the image to that region
  4. Resizing the image to 1024x1024 pixels preserving aspect ratio and using padding on the shorter side
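
Step 4 above can be illustrated with a minimal NumPy sketch. This is not the OpenCXR implementation: the function name, the nearest-neighbour resampling, the zero padding value, and the centred placement are all assumptions made for illustration.

```python
import numpy as np


def resize_with_padding(img, target=1024, pad_value=0):
    """Scale a 2D image so its longer side equals `target`, then pad the shorter side."""
    h, w = img.shape
    scale = target / max(h, w)
    new_h = min(target, max(1, round(h * scale)))
    new_w = min(target, max(1, round(w * scale)))

    # Nearest-neighbour resampling: map each output pixel back to a source pixel.
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = img[rows[:, None], cols[None, :]]

    # Place the resized image in the middle of a square canvas
    # (centring and zero padding are assumptions, not taken from OpenCXR).
    canvas = np.full((target, target), pad_value, dtype=img.dtype)
    top = (target - new_h) // 2
    left = (target - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas


# Synthetic portrait image: 2000 rows by 1500 columns.
example = np.random.default_rng(0).integers(0, 4096, size=(2000, 1500)).astype(np.uint16)
out = resize_with_padding(example)
print(out.shape)  # (1024, 1024)
```

For the 2000 x 1500 example, the image is scaled to 1024 x 768 and 128 columns of padding are added on each side.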

Participants who wish to preprocess other images in the same way can use code similar to the following:

    import os

    import opencxr
    from opencxr.utils.file_io import read_file, write_file

    # Placeholder paths: replace these with your own folders and file name
    input_folder_cxr = 'path/to/input_folder'
    input_cxr_file = 'example_cxr.mha'
    output_folder_cxr = 'path/to/output_folder'

    # Load the CXR standardization algorithm
    cxr_std_algorithm = opencxr.load(opencxr.algorithms.cxr_standardize)
    full_cxr_file_path = os.path.join(input_folder_cxr, input_cxr_file)

    # Read a file (supports dcm, mha, mhd, png)
    img_np, spacing, _ = read_file(full_cxr_file_path)
    # Standardize intensities, crop to the lung bounding box, and resize to 1024
    std_img, new_spacing, size_changes = cxr_std_algorithm.run(img_np, spacing)
    # Write the standardized file to disk
    output_cxr_loc = os.path.join(output_folder_cxr, input_cxr_file)
    write_file(output_cxr_loc, std_img, new_spacing)

Training dataset

We provide a public NODE21 CXR training dataset of 4882 frontal chest radiographs. Of these, 1134 CXR images (containing 1476 nodules) are annotated with bounding boxes around the nodules; the remaining 3748 images are free of nodules and represent the negative class. The images in this set are from public datasets that permit remixing and redistribution. They come from the following sources:

  • JSRT [1]
  • PadChest [2]
  • ChestX-ray14 [3]
  • Open-I [4]

This dataset can be used to train systems for both the detection and generation tracks. The annotations were provided by our chest radiologists. The dataset is located in the folder dataset_node21/cxr_images. If you would like to read more about the selection and annotation process, please refer to this page. We provide both original and preprocessed versions of the dataset. Each version (preprocessed or original) contains a label file called 'metadata.csv', which gives the location of the nodule bounding boxes (x, y, width, height, label). The label is 1 if an image contains any nodule and 0 otherwise.
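
As a rough illustration of how such a label file might be consumed, the sketch below parses a few made-up rows with Python's standard csv module. The `img_name` column, the file names, the zero-filled row for the negative image, and the reading of (x, y) as the top-left corner of the box are assumptions; check the header and conventions of the actual 'metadata.csv' before relying on them.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; the `img_name` column name is an assumption.
sample_csv = """img_name,x,y,width,height,label
n0001.mha,412,530,48,52,1
n0001.mha,700,310,35,30,1
c0002.mha,0,0,0,0,0
"""

boxes_per_image = defaultdict(list)
labels = {}
for row in csv.DictReader(io.StringIO(sample_csv)):
    name = row["img_name"]
    labels[name] = int(row["label"])
    if labels[name] == 1:
        x, y = int(row["x"]), int(row["y"])
        w, h = int(row["width"]), int(row["height"])
        # Convert (x, y, width, height) to corners (x_min, y_min, x_max, y_max),
        # assuming (x, y) is the top-left corner of the box.
        boxes_per_image[name].append((x, y, x + w, y + h))

positives = [n for n, lab in labels.items() if lab == 1]
negatives = [n for n, lab in labels.items() if lab == 0]
print(len(positives), len(negatives))  # prints "1 1"
```

Grouping boxes by image keeps multiple nodules on the same radiograph together, which is the form most detection training loops expect.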

For participants working on the generation track, we additionally provide an example label file 'simulated_metadata.csv' (supplied with the data on Zenodo), which gives the locations of the nodules that a generation algorithm needs to generate for each non-nodule CXR image (CXR image with label==0).

Further, for the generation track, we provide a public set of NODE21 CT patches (see node21_dataset/ct_patches). These are patches of nodules from CT scans. Each patch covers 50 x 50 x 50 mm and is resampled to voxels of 1 x 1 x 1 mm. The patches originate from the LUNA16 dataset [5][6] and can be used to create artificial nodules in given chest radiographs, as was done in the baseline generation algorithm following the method of Litjens et al.
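
To make the patch geometry concrete: a 50 x 50 x 50 mm patch at 1 mm voxels is a 50 x 50 x 50 array. The sketch below builds a synthetic patch (a sphere, not real data) and collapses it to 2D by averaging along one axis. This is only a simplified illustration of turning a 3D patch into a 2D footprint; it is not the baseline generation algorithm or the method of Litjens et al., and the choice of projection axis and the nodule radius are assumptions.

```python
import numpy as np

# Synthetic stand-in for a 50 x 50 x 50 mm patch at 1 mm voxels.
size = 50
z, y, x = np.mgrid[:size, :size, :size]
centre = (size - 1) / 2
radius_mm = 8  # hypothetical nodule radius
inside = (z - centre) ** 2 + (y - centre) ** 2 + (x - centre) ** 2 <= radius_mm ** 2
patch = np.where(inside, 1.0, 0.0)

# Parallel-ray projection: average the voxels along one axis to get a 2D footprint.
# Which axis corresponds to the X-ray direction depends on the patch orientation.
projection = patch.mean(axis=1)
print(patch.shape, projection.shape)  # (50, 50, 50) (50, 50)
```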

Private test datasets

There are also two private test sets: the experimental test set and the final test set (the final test set has not been available for testing since January 2022). Both sets contain frontal radiographs with or without nodules. The reference standard for all of these images was established using a CT scan of the same subject taken within 60 days of the radiograph.

Experimental test set

The first private test set will be used to rank and evaluate submitted Algorithms throughout the challenge. This set contains 281 frontal chest X-rays, 166 of which are positive (with nodules).
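
The counts given on this page imply that the proportion of positive images differs considerably between the public training set and the experimental test set. The short calculation below uses only those counts; the variable names are illustrative.

```python
# Counts taken from the dataset descriptions on this page.
train_total, train_positive, train_nodules = 4882, 1134, 1476
test_total, test_positive = 281, 166

train_negative = train_total - train_positive
test_negative = test_total - test_positive

print(train_negative, test_negative)             # 3748 115
print(f"{train_positive / train_total:.1%}")     # 23.2%
print(f"{test_positive / test_total:.1%}")       # 59.1%
print(f"{train_nodules / train_positive:.2f}")   # 1.30 nodules per positive training image
```

This difference in prevalence may be worth keeping in mind when choosing operating points on the training data.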

Final test set

The final test set was used only in the initial phases of the challenge, prior to January 2022. The results on this set were collected and reserved for use in a publication. This set contains at least 298 frontal radiographs with or without nodules; they originate from multiple medical centers and were acquired with multiple different X-ray machines.

We have also asked twelve radiologists to read the images in these test sets. This allows us to compare the performance of the computer systems with that of human expert readers.

References

[1] Shiraishi, J., Katsuragawa, S., Ikezoe, J., Matsumoto, T., Kobayashi, T., Komatsu, K.i., Matsui, M., Fujita, H., Kodera, Y., Doi, K., 2000. Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists’ detection of pulmonary nodules. American Journal of Roentgenology 174, 71–74. 10.2214/ajr.174.1.1740071.

[2] Bustos, A., Pertusa, A., Salinas, J.M., de la Iglesia-Vaya, M., 2020. PadChest: A large chest x-ray image dataset with multi-label annotated reports. Medical Image Analysis 66, 101797. 10.1016/j.media.2020.101797.

[3] Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M., 2017. ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases, in: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2097–2106. 10.1109/cvpr.2017.369.

[4] Demner-Fushman, D., Antani, S., Simpson, M., Thoma, G.R., 2012. Design and Development of a Multimodal Biomedical Information Retrieval System. Journal of Computing Science and Engineering 6, 168–177. 10.5626/JCSE.2012.6.2.168.

[5] Fedorov, A., Hancock, M., Clunie, D., Brochhausen, M., Bona, J., Kirby, J., Freymann, J., Pieper, S., Aerts, H., Kikinis, R., Prior, F., 2019. Standardized representation of the LIDC annotations using DICOM. The Cancer Imaging Archive. 10.7937/TCIA.2018.H7UMFURQ.

[6] Setio et al., 2017. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: The LUNA16 challenge. Medical Image Analysis 42. 10.1016/j.media.2017.06.015.