“Is AI going to replace our job as physicians?” “I have many retinal images. Can I use them to train a robust AI system?” For the past 12 months, I have been fielding these questions, and others like them, from many of my colleagues. Deep learning, the branch of artificial intelligence now being used to interpret ophthalmic images, has sparked tremendous interest in the machine learning community over the last few years. It has revolutionized computer vision and produced substantial jumps in performance for image recognition, speech recognition, and natural language processing. The technique has also been investigated in different medical specialties, including ophthalmology, for the detection of diabetic retinopathy, glaucoma, and age-related macular degeneration, and for the prediction of cardiovascular risk factors from retinal images.

Now deep learning is extending to the interpretation of images from the youngest patients. It is recommended that babies born at less than 31 weeks' gestation or with a birth weight of less than 1.25 kg be screened for retinopathy of prematurity (ROP). Some centers extend the screening criteria to babies born at less than 32 weeks or weighing less than 1.5 kg, or to those who require substantial oxygen support owing to cardiorespiratory issues. Typically, the first screening is performed either at 32 weeks' postmenstrual age or 4 to 6 weeks after birth, whichever is later, as there is limited value in screening earlier.
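
To make the screening rule concrete, here is a minimal Python sketch that encodes the criteria above as a decision function. It is purely illustrative: the function names are my own, the cutoffs are the ones quoted in this post, and the timing helper uses the earlier 4-week bound. Local protocols should always take precedence.

    from datetime import date, timedelta

    def needs_rop_screening(gestation_weeks: float, birth_weight_kg: float,
                            extended_criteria: bool = False) -> bool:
        # Standard criteria: <31 weeks' gestation or <1.25 kg at birth.
        # Extended criteria (some centers): <32 weeks or <1.5 kg.
        ga_cutoff = 32.0 if extended_criteria else 31.0
        bw_cutoff = 1.5 if extended_criteria else 1.25
        return gestation_weeks < ga_cutoff or birth_weight_kg < bw_cutoff

    def first_screening_date(birth_date: date, gestation_weeks: float) -> date:
        # 32 weeks' postmenstrual age or 4 weeks after birth, whichever is later.
        pma_32_weeks = birth_date + timedelta(weeks=32 - gestation_weeks)
        four_weeks_after_birth = birth_date + timedelta(weeks=4)
        return max(pma_32_weeks, four_weeks_after_birth)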

In Singapore, where I work, ROP screenings often take place in neonatal intensive care units, high-dependency units, or special care nurseries. They usually are performed by a pediatric ophthalmologist or retinal specialist. Tele-ROP screening using retinal photography has been proposed as an effective alternative screening method for ROP. The Imaging and Informatics in ROP (i-ROP) consortium and the Stanford University Network for Diagnosis of ROP (SUNDROP) are 2 major tele-ROP screening networks that were started in 2011. Outside the United States, countries with large rural populations, including China and India, have also set up telemedicine networks to serve rural patients who otherwise have limited access to tertiary eye care services, in an effort to prevent ROP-related blindness.

Brown et al recently described the use of deep learning to detect plus disease in ROP. This is an important clinical question, as plus disease largely determines whether an infant requires intervention for ROP. To me, this is an exciting piece of work, as machine learning may have the potential to help screen millions of premature babies worldwide and prevent childhood blindness. Specifically, the study used a total of 5511 RetCam retinal photographs, collected over a 5-year period from 8 academic institutions, for training with 5-fold cross-validation, and an independent set of 100 images for testing. The architecture combined a U-Net (for vessel segmentation) with a pretrained Inception-V1 classifier. For the training set, the reference standard combined an image-level diagnosis from 3 experts with a patient-level diagnosis from 1 expert; the prevalence of normal, pre-plus, and plus disease was 82%, 17%, and 3%, respectively. The trained algorithm was compared against this reference standard, as well as against 8 independent, well-regarded experts with vast clinical experience in managing ROP. Encouragingly, on the independent sample of 100 retinal images, the algorithm showed robust sensitivity and specificity for the detection of plus and pre-plus disease (plus: 93% sensitivity and 94% specificity; pre-plus or worse: 100% sensitivity and 94% specificity).
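
For readers curious about how such a two-stage pipeline fits together, below is a minimal PyTorch-style sketch: a segmentation network produces a vessel map, and an ImageNet-pretrained Inception-V1 (GoogLeNet) is fine-tuned on that map to classify normal vs pre-plus vs plus. This is my own illustrative reconstruction, not the authors' code; `vessel_unet` stands in for any U-Net implementation.

    import torch
    import torchvision.models as models

    def segment_vessels(vessel_unet: torch.nn.Module,
                        image: torch.Tensor) -> torch.Tensor:
        # Stage 1: a U-Net (any implementation; torchvision does not ship one)
        # maps the raw fundus photo to a 1-channel vessel probability map.
        with torch.no_grad():
            return torch.sigmoid(vessel_unet(image))

    def build_plus_classifier(num_classes: int = 3) -> torch.nn.Module:
        # Stage 2: ImageNet-pretrained Inception-V1, with its final layer
        # replaced to output normal / pre-plus / plus.
        net = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
        net.fc = torch.nn.Linear(net.fc.in_features, num_classes)
        return net

    # GoogLeNet expects 3-channel input, so the 1-channel vessel map is tiled
    # across channels before classification:
    #   logits = classifier(vessel_map.repeat(1, 3, 1, 1))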

Does this mean that, moving forward, we can start deploying AI widely to screen for ROP? My personal feeling is not yet, for a few reasons. First, Brown et al trained and tested their algorithm largely within the i-ROP network; whether it can be generalized to other pediatric cohorts, within and outside the United States, remains uncertain. Second, AI training and testing on retinal images is subject to numerous sources of variability, including camera field of view, image magnification, image quality, and participant ethnicity.

Generally speaking, the reporting of AI performance for medical imaging can be divided into 3 phases: training, validation, and testing. As a common practice, positive and negative cases are often matched 1:1 within the training and validation phases, frequently across several cross-validation folds, and many AI papers report areas under the curve generated within these phases. Although overfitting due to data leakage (the same image appearing in both training and validation sets) can be avoided at this stage, the safest way to test the generalizability of an AI algorithm is to evaluate it on separate, independent test sets comprising a diverse range of retinal images from different populations. To ensure that the algorithm is sufficiently powered to distinguish abnormal from normal images in a real-world setting, a power calculation should take into consideration the prevalence of the disease, type I and II errors, confidence intervals, and the desired precision. It is also important to pre-set the desired operating threshold on the training set, and only afterward to analyze performance metrics such as sensitivity and specificity on the test set.
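
As a back-of-the-envelope illustration of why such power calculations matter, the sketch below uses the standard precision-of-a-proportion (Buderer-style) formula to estimate how many test images are needed to pin down sensitivity. The parameter values are illustrative assumptions, not figures from the study.

    import math

    def test_set_size(expected_sens: float, prevalence: float,
                      precision: float = 0.05) -> int:
        # Number of diseased cases needed so the 95% CI around the expected
        # sensitivity has half-width `precision` (normal approximation).
        z = 1.96  # two-sided critical value for alpha = 0.05
        n_diseased = (z ** 2) * expected_sens * (1 - expected_sens) / precision ** 2
        # Only a `prevalence` fraction of screened images are diseased,
        # so scale up to the total test-set size.
        return math.ceil(n_diseased / prevalence)

    # E.g., 93% expected sensitivity at a 3% prevalence of plus disease, with
    # +/- 5 percentage-point precision, calls for thousands of test images:
    print(test_set_size(0.93, 0.03))  # -> 3335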

AI can be perceived as a threat or a friend to mankind. To many, it is indeed part of the fourth industrial revolution, one that has already changed how we think and live. It is exciting, yet scary to some. Many studies have reported AI performance to be at least comparable to, if not better and more sustainable than, that of a human workforce. As clinicians, we should embrace AI early to help tackle the manpower and financial constraints surrounding many clinical problems, in particular in underprivileged countries with limited health care resources.

ACKNOWLEDGEMENT

I thank Sonal K. Farzavandi, FRCS (Ed), Singapore National Eye Center, and T. Y. Alvin Liu, MD, Wilmer Eye Institute, for their contributions to this blog post.

Disclosure: Dr Ting is a co-inventor of a deep learning system to screen for retinal diseases.
