Project 3 Example: Human-AI Collaboration Tester (HAICT) Exp. 7

Sponsor

Brigham and Women's Hospital (Other)

Overall Status

Recruiting

CT.gov ID

NCT05272189

Collaborator

(none)

Enrollment

Location

Arm

Anticipated Duration (Months)

0.2

Patients Per Site Per Month

Study Details

Study Description

Brief Summary

The study is one part of a "bundle" of experiments that constitute Project Three of a National Eye Institute grant. Project Three includes a series of experiments that investigate how changing the input from a simulated AI can affect the decisions made by human observers in a two-alternative forced choice task (like the decision to recall a woman for further examination in mammography). HAICT 7, the experiment described here, investigates how changing prevalence affects human performance when AI is used as a Second Reader.

Condition or Disease	Intervention/Treatment	Phase
Decision Making; Computer Aided Diagnosis	Behavioral: Simulated Second Reader AI; Behavioral: Target Prevalence	N/A

Detailed Description

This is the text of the pre-registration for the HAICT 7 experiment as described on the Open Science Framework: https://osf.io/hngu4/

NOTE: This study is representative of studies conducted in Project 3 of this grant. There are multiple experiments in the bundle of experiments represented by Project 3 but it is not possible to register a bundle of studies on CT.gov.

Human-AI Collaboration Tester (HAICT) Exp. 7 (lightly edited from OSF)

Data collection. Have any data been collected for this study already? (Yes/No)

Yes

Hypothesis. What's the main question being asked or hypothesis being tested in this study?

Background: In a variety of search experiments, both basic and clinical, the data have been consistent with a situation in which the variability of the signal (or target) is greater than the variability of the noise (distractors). The classic sign of this is a zROC function with a slope < 1, typically around 0.6. A slope of 1.0 is indicative of an equal variance 2AFC task. For the HAICT task that we have been testing, we would expect equal variance, but we think it is worth checking, so we will systematically vary prevalence, which will shift the criterion. That will sweep out an ROC curve that we can examine.

We will also test the Second Reader faux-AI in order to determine if low prevalence makes Second Reader worse.

(H1): We expect to replicate the finding that human criteria become more conservative as prevalence declines.
(H2): We predict that the slope of the resulting zROC will be 1.0.
(H3): We hypothesize that low prevalence will make Second Reader AI less effective because the positive predictive value of its comments will be low.
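
The link between H2 and the equal variance assumption can be illustrated with a minimal sketch. It assumes Gaussian noise with unit variance, a Gaussian signal with mean mu and standard deviation sigma, and an arbitrary set of criteria; the function names and criterion values are hypothetical and are not part of the registered analysis. Under these assumptions zHit = mu/sigma + zFA/sigma, so the zROC slope equals the ratio of the noise standard deviation to the signal standard deviation.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # noise distribution, assumed N(0, 1)


def zroc_points(mu_signal, sigma_signal, criteria):
    """Return (zFA, zHit) pairs for each criterion placement."""
    signal = NormalDist(mu_signal, sigma_signal)
    points = []
    for k in criteria:
        p_fa = 1.0 - STD_NORMAL.cdf(k)   # noise evidence above the criterion
        p_hit = 1.0 - signal.cdf(k)      # signal evidence above the criterion
        points.append((STD_NORMAL.inv_cdf(p_fa), STD_NORMAL.inv_cdf(p_hit)))
    return points


def fitted_slope(points):
    """Least-squares slope of zHit regressed on zFA."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Illustrative criteria only; in the experiment, criterion shifts come from prevalence.
CRITERIA = [0.0, 0.5, 1.0, 1.5]
equal_slope = fitted_slope(zroc_points(2.2, 1.0, CRITERIA))          # equal variance
unequal_slope = fitted_slope(zroc_points(2.2, 1.0 / 0.6, CRITERIA))  # signal SD = 1/0.6
print(round(equal_slope, 3), round(unequal_slope, 3))
```

With a signal standard deviation of 1/0.6, the fitted slope comes out close to 0.6, the value described in the Background as typical of unequal variance data.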

Dependent variable. Describe the key dependent variable(s) specifying how they will be measured.

The main dependent variables of interest are accuracy (and the signal detection derivatives of accuracy, d' and c), reaction time, and subjective ratings on the survey following each block.

Conditions. How many and which conditions will participants be assigned to?

This series of experiments investigates how changing the input from a simulated AI can affect the decisions made by human observers in a two-alternative forced choice task (like the decision to recall a woman for further examination in mammography). We have developed a paradigm called the Human-AI Collaboration Tester (HAICT) that allows for efficient testing of interactions between a human and a simulated AI.

The observers' task in all conditions is to give a 2AFC decision about whether a stimulus is "bad" or "not bad." To use language roughly mimicking a medical diagnosis, each stimulus is referred to as a "case." Observers are asked to make a 2AFC decision about arrays of colored shapes. The decision is made based on the predominant color of the case. The number of elements of each color is drawn from one of two normal distributions, one for positive (bad) stimuli and the other for negative (not bad) stimuli.
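
The stimulus construction described above might be simulated roughly as follows. This is only a sketch: the array size, the two means, the standard deviation, and the choice to fix the exact number of "bad" cases per block are all assumptions made for illustration and are not taken from the pre-registration.

```python
import random

# Hypothetical parameters, chosen only for illustration.
N_ELEMENTS = 100       # colored shapes per case
MEAN_BAD = 55.0        # mean count of the "bad" color in positive cases
MEAN_NOT_BAD = 45.0    # mean count of the "bad" color in negative cases
SD = 5.0               # shared standard deviation (equal variance assumption)


def make_case(is_bad, rng):
    """Draw one case: the color counts come from one of two normal distributions."""
    mean = MEAN_BAD if is_bad else MEAN_NOT_BAD
    n_target = round(rng.gauss(mean, SD))
    n_target = max(0, min(N_ELEMENTS, n_target))  # keep the count inside the array
    return {
        "is_bad": is_bad,
        "n_target_color": n_target,
        "n_other_color": N_ELEMENTS - n_target,
    }


def make_block(n_trials, prevalence, rng):
    """Build a shuffled block with a fixed proportion of "bad" cases (an assumption)."""
    n_bad = round(n_trials * prevalence)
    labels = [True] * n_bad + [False] * (n_trials - n_bad)
    rng.shuffle(labels)
    return [make_case(label, rng) for label in labels]


block = make_block(200, 0.10, random.Random(0))
print(sum(case["is_bad"] for case in block), "bad cases out of", len(block))
```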

The results from previous HAICT experiments (3 and 4) showed that human performance in the Second Reader condition drops off significantly at low prevalence. Performance in the Second Reader condition was better than Baseline when the prevalence of bad cases was 50% but was significantly worse than Baseline when prevalence was only 10%. In this experiment, we manipulate the prevalence of "bad" cases in the Second Reader and Baseline conditions. Four different prevalence rates will be tested: 10%, 33%, 67%, and 90%. Observers will complete 8 blocks (2 AI rules x 4 prevalence rates), and block order is random.
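
The 2 x 4 block structure with random ordering could be generated along these lines. The function name is hypothetical, and the sketch assumes a simple uniform shuffle per observer, which the pre-registration does not specify beyond "random."

```python
import itertools
import random

AI_RULES = ("Baseline", "Second Reader")
PREVALENCES = (0.10, 0.33, 0.67, 0.90)


def block_order(rng):
    """Return all 8 (AI rule, prevalence) blocks in a random order."""
    blocks = list(itertools.product(AI_RULES, PREVALENCES))
    rng.shuffle(blocks)
    return blocks


# One observer's schedule; seeding the generator makes the order reproducible.
schedule = block_order(random.Random(1))
for rule, prevalence in schedule:
    print(f"{rule}: {prevalence:.0%} bad cases")
```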

AI rules to be tested:

Baseline - No AI input. Observer classifies each case as "bad" or "not bad" on their own.
Second Reader - The observer makes an initial decision about every case. The AI silently classifies stimuli using a conservative criterion (c = 0.5). The logic for the conservative criterion is that the second reader is being used to cut down on false positive responses and so it is intended to question positive human responses that might be marginal. If the observer and AI disagree, then the AI informs the human observer. The observer is then given a chance to either change their response or go with their first opinion.

As in Experiments 1-5, the AI d-prime is fixed at 2.2. Feedback is known to increase the prevalence effect, so feedback will be given in both the practice and the test trials. Observers will complete 20 practice trials and 200 test trials in each block. Immediately after each block is completed, observers will be shown a summary of their performance. After the Second Reader blocks, they will also be asked to answer three subjective questions about the usefulness of the AI (see "Files" for more details).
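
To make the reasoning behind H3 concrete, the sketch below works out what the stated AI parameters (d-prime = 2.2, c = 0.5) would imply at each prevalence rate. It assumes the standard equal variance Gaussian model, in which hit rate = Phi(d'/2 - c) and false alarm rate = Phi(-d'/2 - c), and it computes the predictive values of the AI's own classification, which may not match exactly how the positive predictive value of the AI's "comments" is defined in the study. The function names are hypothetical.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution function

AI_D_PRIME = 2.2
AI_CRITERION = 0.5
PREVALENCES = (0.10, 0.33, 0.67, 0.90)


def ai_rates(d_prime=AI_D_PRIME, criterion=AI_CRITERION):
    """Hit and false alarm rates under the equal variance model (an assumption)."""
    hit_rate = PHI(d_prime / 2 - criterion)
    fa_rate = PHI(-d_prime / 2 - criterion)
    return hit_rate, fa_rate


def predictive_values(prevalence, hit_rate, fa_rate):
    """Positive and negative predictive values from Bayes' rule."""
    true_pos = prevalence * hit_rate
    false_pos = (1 - prevalence) * fa_rate
    true_neg = (1 - prevalence) * (1 - fa_rate)
    false_neg = prevalence * (1 - hit_rate)
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


hit_rate, fa_rate = ai_rates()
for p in PREVALENCES:
    ppv, npv = predictive_values(p, hit_rate, fa_rate)
    print(f"prevalence={p:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

Under these assumptions the AI's hit and false alarm rates do not change with prevalence, but the positive predictive value of a "bad" call falls as prevalence declines.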

Analyses. Specify exactly which analyses you will conduct to examine the main question/hypothesis.

First, we summarize the number of hits, true negatives, misses, and false alarms in each block. From this, we can calculate the accuracy, the positive predictive value, sensitivity (d-prime), and the criterion for each observer under each of the different conditions. Given measures of performance at 4 levels of prevalence, we can estimate the ROC curve (pHit x pFA) and the zROC function (zHit x zFA). We will test the hypothesis that the slope of the zROC is equal to 1 (the consequence of an equal variance 2AFC task).
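
A minimal sketch of these computations follows. It is not the authors' analysis code: the function names are hypothetical, the log-linear correction (adding 0.5 to each count) is an assumed way of avoiding infinite z-scores when a rate is 0 or 1, and the ordinary least-squares fit ignores measurement error in zFA. The example counts are made up for illustration only.

```python
from statistics import NormalDist

Z = NormalDist().inv_cdf  # z-transform (inverse standard normal CDF)


def block_summary(hits, misses, false_alarms, true_negatives):
    """Accuracy, PPV, d-prime, and criterion for one block of one observer."""
    n_signal = hits + misses
    n_noise = false_alarms + true_negatives
    # Log-linear correction (an assumption, not stated in the pre-registration).
    p_hit = (hits + 0.5) / (n_signal + 1)
    p_fa = (false_alarms + 0.5) / (n_noise + 1)
    z_hit, z_fa = Z(p_hit), Z(p_fa)
    n_positive_calls = hits + false_alarms
    return {
        "accuracy": (hits + true_negatives) / (n_signal + n_noise),
        "ppv": hits / n_positive_calls if n_positive_calls else float("nan"),
        "d_prime": z_hit - z_fa,
        "criterion": -(z_hit + z_fa) / 2,  # positive values are conservative
        "z_hit": z_hit,
        "z_fa": z_fa,
    }


def zroc_slope(summaries):
    """Least-squares slope of zHit on zFA across the prevalence blocks."""
    xs = [s["z_fa"] for s in summaries]
    ys = [s["z_hit"] for s in summaries]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up counts for a single 200-trial block at 10% prevalence (illustration only).
example = block_summary(hits=14, misses=6, false_alarms=9, true_negatives=171)
print({key: round(value, 3) for key, value in example.items()})
```

Passing the four per-prevalence summaries for one observer and one AI rule to `zroc_slope` would give the slope estimate that is then compared against 1.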

More analyses. Any secondary analyses?

We will examine whether the observers' subjective opinions about the AI are correlated with variables such as the empirical d-prime or the positive predictive value.

Sample size. How many observations will be collected or what will determine sample size? No need to justify decision, but be precise about exactly how the number will be determined.

We will test 12 observers. This is consistent with the sample sizes of previous experiments.

Other. Is there anything else that you would like to pre-register? (e.g., data exclusions, variables collected for exploratory purposes, unusual analyses planned?)

N/A

Study Design

Study Type:

Interventional

Anticipated Enrollment:

15 participants

Allocation:

N/A

Intervention Model:

Single Group Assignment

Masking:

None (Open Label)

Masking Description:

Participants are naive to the purposes of the study but they are not blinded to the conditions.

Primary Purpose:

Basic Science

Official Title:

Project 3 Example: Human-AI Collaboration Tester (HAICT) Exp. 7

Actual Study Start Date:

Jan 1, 2020

Anticipated Primary Completion Date:

Jan 1, 2024

Anticipated Study Completion Date:

Jan 1, 2025

Arms and Interventions

Arm

Intervention/Treatment

Experimental: Experiment

All participants are tested in all conditions of this experiment.

Behavioral: Simulated Second Reader AI

In this experiment, in some conditions, the participant makes their decision in the presence of information about a simulated artificial intelligence decision.

Behavioral: Target Prevalence

The frequency with which targets are presented varies from 10% to 90%.

Other Names:

Base Rate

Outcome Measures

Primary Outcome Measures

D' [Up to one week]
D' (d-prime) is the signal detection theory measure of the level of performance on a task.
Criterion [Up to one week]
Criterion is the signal detection theory measure of the bias ("liberal" or "conservative") of observers' decisions.

Secondary Outcome Measures

Reaction Time [Up to one week]
This is the measure of how long it takes to make a response.

Eligibility Criteria

Criteria

Ages Eligible for Study:

18 Years and Older

Sexes Eligible for Study:

All

Accepts Healthy Volunteers:

Yes

Inclusion Criteria:

- All are welcome to enroll online

Exclusion Criteria:

- Failure to pass the Ishihara color vision screening test
- Visual acuity worse than 20/25 (with correction)

Contacts and Locations

Locations

	Site	City	State	Country	Postal Code
1	Visual Attention Lab / Brigham and Women's Hospital	Boston	Massachusetts	United States	02215

Sponsors and Collaborators

Brigham and Women's Hospital

Investigators

Principal Investigator: Jeremy M Wolfe, PhD, Brigham and Women's Hospital

Study Documents (Full-Text)

None provided.

More Information

Publications

None provided.

Responsible Party:

Jeremy M Wolfe, PhD, Professor, Brigham and Women's Hospital

ClinicalTrials.gov Identifier:

NCT05272189

Other Study ID Numbers:

2007P000646-B

First Posted:

Mar 9, 2022

Last Update Posted:

Mar 9, 2022

Last Verified:

Feb 1, 2022

Individual Participant Data (IPD) Sharing Statement:

Yes

Plan to Share IPD:

Yes

Studies a U.S. FDA-regulated Drug Product:

Studies a U.S. FDA-regulated Device Product:

Keywords provided by Jeremy M Wolfe, PhD, Professor, Brigham and Women's Hospital

Signal detection

Visual Perception

BESH

Additional relevant MeSH terms:

Disease

Study Results

No Results Posted as of Mar 9, 2022