Multicenter Validation Study of an Artificial Intelligence Tool for Automatic Classification of Chest X-rays

Sponsor

Hospital Italiano de Buenos Aires (Other)

Overall Status

Enrolling by invitation

CT.gov ID

NCT04991987

Collaborator

(none)

385

Enrollment

Location

Anticipated Duration (Months)

29.7

Patients Per Site Per Month

Study Details

Study Description

Brief Summary

A current problem in Radiology Departments is the constant increase in the number of studies performed. Currently the largest volume of studies belongs to plain x-rays. This problem is intensified by the shortage of specialists with dedication and experience in their interpretation. In the field of computer science, an area of study called Artificial Intelligence (AI) has emerged, which consists of a computer system that learns to perform specific routine tasks, and can complement or imitate human work. Since 2018, Hospital Italiano de Buenos Aires has been running the TRx program, which consists of the development of an AI-based tool to detect pathological findings in chest x-rays. The intended use of this tool is to assist non-imaging physicians in the diagnosis of chest x-rays by automatically detecting radiological findings. The present multicenter study seeks to externally validate the performance of an AI tool (TRx v1) as a diagnostic assistance tool for chest x-rays.

Condition or Disease	Intervention/Treatment	Phase
Pneumothorax; Pleural Effusion; Bone Fracture; Consolidation; Opacity

Detailed Description

A current problem in Radiology Departments is the constant increase in the number of studies performed. This ever-increasing volume of information implies an increase in the time that medical specialists must dedicate to report these studies. The methodology carried out for reporting varies according to the imaging modality, which in high complexity centers includes radiology, computed tomography, magnetic resonance imaging and ultrasound, among others. Currently the largest volume of studies belongs to plain x-rays. At Hospital Italiano de Buenos Aires (HIBA) more than 220,000 x-rays were performed during 2019, and within this group more than 50% of the practices are chest x-rays, which are performed as a method of initial detection of potentially serious pathologies (pulmonary nodule, pneumonia, pneumothorax).

This imaging modality holds little appeal for new generations of imaging specialists, who prefer to move toward more modern and complex methods such as computed tomography or magnetic resonance imaging. The problem of the increasing volume of plain x-rays to be analyzed is therefore intensified by the shortage of specialists with dedication and experience in their interpretation.

In the field of computer science, an area of study called Artificial Intelligence (AI) has emerged, which consists of a computer system that learns to perform specific routine tasks, and can complement or imitate human work. The developer must tell the AI system what response is desired from a given stimulus. An example of this is the spell checker in a word processor.

The field of AI encompasses a wide variety of sub-fields and specific techniques, such as Machine Learning (ML) or Deep Learning (DL). ML encompasses any tool in which computerized data is used to fit a model that draws conclusions from this input data. Algorithms are trained to learn given tasks based on a set of previously classified information. This also includes traditional techniques for creating predictive models or classification models. E-mail spam filtering is an example of ML. Neural networks are one of the tools included in ML.

Finally, DL is a type of ML that rose to prominence in the 2010s. It consists of adding layers to a traditional neural network, creating a nonlinear model of higher complexity because the number of parameters to be adjusted increases. The network is exposed to a training dataset of already labeled information and "learns" to label new information by mimicking the labeling criteria of that dataset. This learning is an iterative adjustment of the model parameters according to the error between the original labeling and the labeling suggested by the network. Once the model is trained, its parameters are fixed and it can be used to infer labels for new information whose labeling is unknown. In many data analysis tasks, DL methods have been found to perform much better than traditional methods. DL already has applications in everyday life, such as voice assistants in smartphones and automatic face recognition and labeling in social networks.
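The iterative parameter adjustment described above can be illustrated with a deliberately small sketch. It is not the TRx training procedure: it assumes a single-layer logistic model instead of a deep network, toy two-feature data, and hypothetical names (`train_logistic`, `predict`) and hyperparameters (`lr`, `epochs`) chosen only for illustration.

```python
import math

def train_logistic(samples, labels, lr=0.5, epochs=200):
    """Fit weights and bias by repeatedly correcting the prediction error."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # label suggested by the model
            err = p - y                      # error against the original label
            # Move each parameter against the gradient of the log-loss.
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """With parameters fixed, infer the label of a new sample."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z >= 0 else 0

# Toy labeled dataset (assumed values, not study data).
samples = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
labels = [0, 0, 1, 1]
w, b = train_logistic(samples, labels)
predictions = [predict(w, b, x) for x in samples]
```

A deep network follows the same loop, with many more parameters and the error propagated back through every layer.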

DL applied to image processing is based on a method called convolutional neural networks. Its application has been investigated in the field of medical imaging, finding improvements in performance, from object detection (anatomical or pathological structures in radiological images) to segmentation tasks.

Since 2018, Hospital Italiano de Buenos Aires has been running the TRx program, which consists of the development of an AI-based tool to detect pathological findings in chest x-rays. The project is part of the Artificial Intelligence in Healthcare program of Hospital Italiano de Buenos Aires, and is carried out by a multidisciplinary team of professionals, including biomedical engineers, data scientists, radiologists, clinical informaticians, methodologists, and software engineers. TRx is a DL model, developed and validated at HIBA, which detects four types of radiological findings on chest x-rays: pulmonary opacities (nodules, masses, pneumonia, consolidations, ground glass, or atelectasis), pneumothorax, pleural effusions, and rib fractures. This detection is performed through four independent modules that are integrated into a single system. When processing an x-ray, TRx reports different types of results. First, the unified TRx system indicates dichotomously whether the image is suspicious for a pathological finding, or if it is possibly a normal chest x-ray. Secondly, each of the four modules indicates in particular whether a finding of pulmonary opacity, pneumothorax, pleural effusion, or rib fracture was detected, respectively. Finally, TRx enables the visualization of a heat map over the image indicating in color the region of the thorax where a suspected finding was detected.
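As an illustration of how four per-finding results could be combined into one dichotomous output, the sketch below flags an image as suspicious when at least one module is positive, which matches the abnormality definition used in the primary outcome measure. The record does not describe the actual integration logic of TRx, so the rule, the score scale, the 0.5 thresholds, and all names (`FINDINGS`, `TrxResult`, `combine_modules`) are assumptions. The heat map output is not modeled.

```python
from dataclasses import dataclass

# Hypothetical identifiers for the four findings named in the record.
FINDINGS = ("pulmonary_opacity", "pneumothorax", "pleural_effusion", "rib_fracture")

@dataclass
class TrxResult:
    findings: dict    # finding name -> True if that module flags the image
    suspicious: bool  # unified dichotomous result

def combine_modules(scores, thresholds):
    """Threshold each module score, then flag the image if any module is positive."""
    findings = {name: scores[name] >= thresholds[name] for name in FINDINGS}
    return TrxResult(findings=findings, suspicious=any(findings.values()))

# Assumed example scores in [0, 1] and assumed thresholds.
scores = {"pulmonary_opacity": 0.82, "pneumothorax": 0.05,
          "pleural_effusion": 0.10, "rib_fracture": 0.02}
thresholds = {name: 0.5 for name in FINDINGS}
result = combine_modules(scores, thresholds)
```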

The intended use of this tool is to assist non-imaging physicians in the diagnosis of chest x-rays by automatically detecting radiological findings. TRx version 1.0 (TRx v1) evaluates frontal chest x-rays of patients older than 14 years of age for four types of findings: pulmonary opacities, pleural effusion, fractures, and pneumothorax. The objective of this tool is to enhance the diagnostic performance of non-imaging physicians by providing assistance or a "preliminary report".

One fact that is stressed in AI is that models must be replicable; the model must give the same or better results if given the same input. Although this seems obvious, it is in contrast to humans, who commonly exhibit both inter and intra-observer variability. The standard of an AI model should at least match the human performance it will assist. Replicability depends on the problem, and the amount of variability depends on the specific task at hand.

Some authors report that an AI model may have difficulty providing accurate predictions when applied to new situations or populations (i.e., those to which it was not exposed during training). Whereas radiologists are able to adapt successfully to differences in images (whether due to slice thickness, scanner manufacturer, field strength, gradient intensity or contrast timing) without affecting their interpretation of the images, AI generally lacks that ability. For example, if an AI agent was trained only with images from a 3 Tesla MRI scanner, it cannot be guaranteed a priori that it will have the same results on scans performed at 1.5 Tesla. One solution is to develop mathematical processes to recognize, normalize and transform the data to minimize drift. Another approach to mitigate this phenomenon is to perform training and validation with "full" data sets, representing each type of image data acquisition and reconstruction.

To evaluate the diagnostic performance of an AI tool comprehensively and thus ensure its intended use, multicenter studies are recommended, since they allow this performance to be measured in different patient populations and under different image acquisition protocols. The present multicenter study seeks to externally validate the performance of an AI tool (TRx v1) as a diagnostic assistance tool for chest x-rays.

Study Design

Study Type:

Observational

Anticipated Enrollment :

385 participants

Observational Model:

Other

Time Perspective:

Prospective

Official Title:

Multicenter Validation Study of an Artificial Intelligence Tool for Automatic Classification of Chest X-rays

Actual Study Start Date :

Jul 1, 2021

Anticipated Primary Completion Date :

Feb 28, 2022

Anticipated Study Completion Date :

Jul 31, 2022

Outcome Measures

Primary Outcome Measures

Concordance between AI tool and reference standard [5 months]
The concordance between the category assigned by the professionals and that assigned by the algorithm will be analyzed. For this purpose, a diagnostic test will be evaluated for the detection of abnormality (i.e., the test is positive when at least one of the four types of findings is observed). Considering the specialists' diagnosis as a reference standard, the confusion matrix will be constructed and the diagnostic metrics of the AI tool (sensitivity, specificity and predictive values) will be calculated. The 95% confidence intervals will be calculated using exact binomial distribution.
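The calculation described above can be sketched as follows, assuming binary labels (1 = abnormal, 0 = normal) and reading "exact binomial" as the Clopper-Pearson interval. The function names and the ten example cases are illustrative assumptions, not study data or the study's analysis code.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); adequate for a few hundred cases."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def exact_ci(k, n, alpha=0.05):
    """Clopper-Pearson (exact binomial) interval for k successes out of n."""
    def bisect(f):  # f is increasing on [0, 1] and changes sign there
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) < 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    # Lower bound: P(X >= k | p) = alpha / 2. Upper bound: P(X <= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else bisect(lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2)
    upper = 1.0 if k == n else bisect(lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper

def diagnostic_metrics(reference, predicted, alpha=0.05):
    """Confusion matrix and diagnostic metrics, each as (estimate, (lower, upper))."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)
    def rate(k, n):  # assumes n > 0, i.e. both classes and both outputs occur
        return k / n, exact_ci(k, n, alpha)
    return {
        "confusion_matrix": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "sensitivity": rate(tp, tp + fn),
        "specificity": rate(tn, tn + fp),
        "ppv": rate(tp, tp + fp),
        "npv": rate(tn, tn + fn),
    }

# Assumed example: specialists' reference standard vs. AI tool output.
reference = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = diagnostic_metrics(reference, predicted)
```

Bisection on the binomial distribution keeps the sketch free of third-party dependencies; a statistics library would normally be used instead.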

Secondary Outcome Measures

Receiver Operating Characteristic curves [5 months]
Receiver Operating Characteristic curves will be constructed for the global category of abnormality and for each of the individual radiological findings, calculating in each case the Area Under the Curve (value between 0 and 1). A model whose predictions are 100% incorrect has an area under the curve of 0.0; another whose predictions are 100% correct has an area under the curve of 1.0. The categorization made by the expert radiologists will be taken as the reference standard. It will be evaluated whether there is a significant difference between the area under the curve of the AI tool and the reference value estimated for non-imaging physicians (i.e., emergency room physicians or residents). The DeLong test with a significance level of 0.01 will be used.
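One possible reading of this analysis is sketched below. It computes the area under the curve as the fraction of abnormal/normal pairs ranked correctly, estimates its variance with the DeLong structural components, and compares it with a reference value through a two-sided z-test. Treating the reference value as a fixed constant is an assumption, as are the function names, the example scores, and the reference value of 0.80; the record does not specify these details.

```python
import math

def _psi(pos_score, neg_score):
    """1 if the abnormal case scores higher, 0.5 for a tie, 0 otherwise."""
    if pos_score > neg_score:
        return 1.0
    return 0.5 if pos_score == neg_score else 0.0

def auc_with_delong_variance(labels, scores):
    """AUC and its DeLong variance estimate (needs >= 2 cases per class)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    m, n = len(pos), len(neg)
    v10 = [sum(_psi(p, q) for q in neg) / n for p in pos]  # per abnormal case
    v01 = [sum(_psi(p, q) for p in pos) / m for q in neg]  # per normal case
    auc = sum(v10) / m
    s10 = sum((v - auc) ** 2 for v in v10) / (m - 1)
    s01 = sum((v - auc) ** 2 for v in v01) / (n - 1)
    return auc, s10 / m + s01 / n

def compare_to_reference(auc, variance, reference_auc):
    """Two-sided z-test of the AUC against a fixed reference value."""
    if variance == 0:
        raise ValueError("variance is zero (perfect separation); test undefined")
    z = (auc - reference_auc) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Assumed example: reference standard labels and AI tool scores.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
auc, variance = auc_with_delong_variance(labels, scores)
z, p_value = compare_to_reference(auc, variance, reference_auc=0.80)
significant = p_value < 0.01  # significance level stated in the record
```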
Qualitative analysis [5 months]
The images with erroneous diagnoses (false negatives and false positives) and the corresponding heat maps generated by the algorithm will be studied individually.
Inter-observer concordance index [5 months]
The inter-observer concordance between the participating specialists will be analyzed. In cases where the image in question is categorized differently by each of the observers, they will be asked to review the images together to define a category.
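The record does not name the concordance index to be used. As one common choice for two observers, the sketch below computes Cohen's kappa and lists the discordant images that would go to joint review. The choice of kappa, the function names, and the example readings are assumptions.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two observers.

    Undefined (division by zero) if both observers use one identical category.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)

def discordant_cases(rater_a, rater_b):
    """Indices of images the observers categorized differently."""
    return [i for i, (a, b) in enumerate(zip(rater_a, rater_b)) if a != b]

# Assumed example readings (1 = abnormal, 0 = normal) for ten images.
reader_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
reader_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]
kappa = cohen_kappa(reader_1, reader_2)
to_review = discordant_cases(reader_1, reader_2)
```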
Analysis by institution [5 months]
The variables of items 1 and 2 will be calculated separately for the images of each participating institution. We will evaluate whether there is a significant difference in the area under the curve values across institutions using the DeLong test. A significance level of 0.01 will be used.

Eligibility Criteria

Criteria

Ages Eligible for Study:

18 Years and Older

Sexes Eligible for Study:

All

Inclusion Criteria:

X-rays that meet the following requirements will be included:

Chest X-ray
Belong to patients over 18 years of age.
Frontal projection and digital acquisition
Study conducted in the aforementioned institutions and stored in their respective Picture Archiving and Communication System