The Emerging Role of AI in Primary Eye Care

We can use age-related macular degeneration (AMD) as an example of AI application. Using fundus photography, we can ask a classification question: “Does this patient have AMD?” For the regression/prediction problem, we can ask: “What is the patient’s expected visual acuity?” (Figure 1). This represents a relatively simple clinical problem. However, with its ability to efficiently integrate large volumes of clinical data, the role of AI may be expanded to tackle more complex presentations that may be cognitively burdensome for the human clinician.

Glaucoma represents an interesting clinical problem as it uniquely suffers from both underdiagnosis and overdiagnosis.4 This diagnostic dilemma persists despite the wealth of clinical data available for each patient. In brief, data collected as part of a comprehensive, routine glaucoma examination include a clinical history, intraocular pressure (IOP), corneal thickness, gonioscopy/anterior chamber angle parameters, perimetry data, fundus examination and photography, and retinal imaging.5 Within each of these clinical techniques, there is also an abundance of qualitative and quantitative data used by clinicians for clinical decision making.

There has been a considerable range in outcomes, such as sensitivity, specificity, and accuracy, for glaucoma diagnosis when using individual techniques. In general, these outcomes range from 85–95%.4,6-8 Of note, these studies would often compare the AI system against ophthalmologists diagnosing glaucoma from the same images. Often, ophthalmologists would perform worse than the AI system.9 However, there are several caveats here. First, the apparent superior performance of AI relative to a group of ophthalmologists may reflect better internal consistency rather than better diagnostic ability. Second, diagnostic outcomes such as sensitivity and specificity are wholly dependent on the underlying ground truth data set. Such data sets are often carefully curated for internal consistency, with stringent requirements on agreement. Third, using techniques such as visual field results or a single fundus photograph in isolation is not a true representation of routine glaucoma care. Specifically, visual field testing is subjective, and fundus photography is vulnerable to subjective interpretation. The poorer relative performance of ophthalmologists under these conditions is, therefore, not surprising.
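To make the second caveat concrete, the following is a minimal sketch of how sensitivity, specificity, and accuracy are calculated, and how the same AI outputs can score differently when the ground truth labels change. The function name and all data are hypothetical and are not taken from any of the cited studies.

```python
def diagnostic_metrics(truth, predicted):
    """Sensitivity, specificity, and accuracy for binary labels (1 = glaucoma, 0 = normal).

    Assumes both classes are present in `truth`.
    """
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


# Hypothetical AI outputs for ten eyes.
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# Reference standard A: the fourth eye is labelled as glaucoma.
truth_a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Reference standard B: a different expert panel labels the same eye as normal.
truth_b = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

metrics_a = diagnostic_metrics(truth_a, predicted)
metrics_b = diagnostic_metrics(truth_b, predicted)

print(metrics_a)
print(metrics_b)
```

In this toy example, relabelling a single borderline eye raises sensitivity from 0.75 to 1.0 without any change to the AI outputs, which is why the choice and curation of the reference standard matters when comparing studies.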

Although the studies referenced above tend to report high diagnostic accuracy using AI approaches, there is also an effect of disease severity, whereby classifiers have higher accuracy with more severe stages of glaucoma compared to early disease.14 This is also consistent with diagnostic challenges in clinical practice. Since moderate or severe glaucoma demonstrates overt signs of disease, the value of applying AI is questionable if it cannot clearly distinguish between early disease and glaucoma suspects.
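The severity effect can be hidden by a single pooled figure. The sketch below, which uses invented data and a hypothetical helper function, reports sensitivity separately for each disease stage alongside the pooled value.

```python
from collections import defaultdict


def sensitivity_by_stage(cases):
    """Per-stage sensitivity from (stage, detected) pairs.

    Every case is assumed to be an eye with confirmed glaucoma, so the
    proportion detected within a stage is the sensitivity for that stage.
    """
    detected = defaultdict(int)
    total = defaultdict(int)
    for stage, was_detected in cases:
        total[stage] += 1
        detected[stage] += int(was_detected)
    return {stage: detected[stage] / total[stage] for stage in total}


# Hypothetical results: four glaucomatous eyes per stage.
cases = [
    ("early", True), ("early", False), ("early", False), ("early", True),
    ("moderate", True), ("moderate", True), ("moderate", False), ("moderate", True),
    ("severe", True), ("severe", True), ("severe", True), ("severe", True),
]

by_stage = sensitivity_by_stage(cases)
pooled = sum(int(d) for _, d in cases) / len(cases)

print(by_stage)
print(pooled)
```

Here the pooled sensitivity of 0.75 masks a sensitivity of only 0.5 in early disease, the stage at which a diagnostic aid would arguably be most useful.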

In that vein, several studies have investigated the role of AI in differential diagnosis. Again, reported results tend to demonstrate a high ability of AI to differentiate between glaucoma and non-glaucomatous optic neuropathy. Another advantage of AI in this context is improved internal consistency compared to human clinicians. However, these studies have similar limitations to those discussed above: they are often pre-seeded with ‘obvious’ cases that are not diagnostic challenges. There are also limited data sets overall, and non-glaucomatous optic neuropathies are often grouped together, rather than being regarded distinctly.

Despite the high levels of diagnostic accuracy reported internally within studies, AI studies in general suffer from issues of external validity. External validity refers to how well the model performs on data sets outside of the experimental condition, and is a measure of real-world applicability. Unfortunately, there is a tendency for studies to report high internal validity, but much poorer external validity. Several reasons for this have been described, including differences in the demographics of the data sets, differences in the rigour with which the data were obtained, and the reliability of external data sets. A reduction in diagnostic accuracy of 10–20 percentage points significantly limits the deployment of AI models in clinical settings.
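As an illustration of how such a gap is expressed, the sketch below compares accuracy on an internal test set with accuracy on an external one and reports the drop in percentage points. The function names and all labels are hypothetical and serve only to show the arithmetic.

```python
def accuracy(truth, predicted):
    """Proportion of cases where the prediction matches the reference label."""
    correct = sum(1 for t, p in zip(truth, predicted) if t == p)
    return correct / len(truth)


def validity_gap_points(internal, external):
    """Drop in accuracy from internal to external testing, in percentage points.

    Each argument is a (truth, predicted) pair of equal-length label lists.
    """
    return round(100 * (accuracy(*internal) - accuracy(*external)), 1)


# Hypothetical internal test set: 19 of 20 eyes classified correctly.
internal = ([1] * 10 + [0] * 10, [1] * 10 + [0] * 9 + [1])

# Hypothetical external data set from a different clinic: 16 of 20 correct.
external = ([1] * 10 + [0] * 10, [1] * 8 + [0] * 2 + [0] * 8 + [1] * 2)

gap = validity_gap_points(internal, external)
print(accuracy(*internal), accuracy(*external), gap)
```

In this example, accuracy falls from 0.95 to 0.80, a drop of 15 percentage points. Percentage points describe the absolute difference between the two accuracies, not the relative change.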

Although AI can integrate multimodal clinical data, to date, the results for predicting progression and risk have been more modest compared to diagnostic questions. In general terms, short-term progression and risk, e.g. months to a few years, may be assessable using AI, but after this short time period, the precision and accuracy of predictions worsen dramatically. Studies have used a prediction approach to stratify or personalise care by making predictions on glaucoma severity given different levels of intraocular pressure (IOP) control. Again, estimates tend to be more robust in the short term, but are likely to have limited clinical utility in the long term.

One potential role of large language models (LLMs) is to answer patient questions related to disease. This has been explored in a variety of ophthalmic conditions. Our research group has recently examined the role of LLMs in glaucoma and AMD.15,16 We used a panel of experts to evaluate the LLM responses to frequently asked questions related to these diseases and judged them on cohesion, accuracy, comprehensiveness, and safety. Although most responses were adequate, we raised several major concerns, including out-of-date information, irrelevant information, and notable omissions in the responses.

We and others have flagged key limitations in the use of LLMs in a patient-facing capacity.15-17 First, depending on the LLM, some of its source material may be out of date. At best, this may make the information irrelevant, but it may also be dangerous, depending on the context. Second, the responses often lack comprehensiveness. Key phenotypes and details of disease subtypes, such as secondary open-angle glaucoma, are often neglected. The issue is that patient-users are unlikely to be experts in the field and will likely fail to recognise important omissions in information. Third, the responses are relatively impersonal. Short of providing actual clinical data as inputs to the LLM, the responses are not tailored to individual patients, which limits their utility.

The recognition of these limitations has led to advancements in the field. Explainable AI is an emerging approach to reconciling the ‘black box’ effect of AI. For example, AI models in glaucoma may use approaches to flag features of interest to explain the outputs, such as the inferotemporal vulnerability zone, the optic cup, and the trajectory of the retinal nerve fibre layer. Regulatory bodies, such as the United States Food and Drug Administration and Australia’s Therapeutic Goods Administration, are releasing recommendations and position statements on the use of AI in clinical practice.

However, several critical issues remain unresolved. These include liability/culpability, accountability, and cost of deployment. For example, if a clinician relies on an AI system for diagnosis and management, where does the responsibility lie? There are also issues with patient confidentiality and whether informed consent procedures adequately cover the potential unauthorised dissemination and use of sensitive patient information when uploaded into AI systems. We have recently raised this issue as well.20