2025-09-11
I gave an AI agent full control over a clinical research data science task in a development environment. After accepting all edits for about an hour, it produced a model with AUC 0.67. When I asked why it chose certain features over others, it hallucinated evidence.
As AI agents become more autonomous (via tools like Weco, Cursor, or GitHub Copilot in "agent mode"), we're drifting into a space of untraceable decision-making. An AI agent's ability to reason, iterate, and act is, in many ways, also what makes it opaque. I am excited that coding barriers (Python, R, FHIR, etc.) are being broken down with the help of LLMs so that more clinicians can be involved in developing the technology that improves care, but I am also concerned that decision-making may be offloaded to AI during development.
In healthcare, who is responsible when a model developed this way fails? The user? The AI? The AI provider?
The danger isn't that AI agents are bad at reasoning. It's that they are too good at simulating it, merely reflecting the rational thinking seen in their training data.
We need a new standard: decision logs.
Just as we require clinical trials to document every step, every hypothesis, and every deviation, we must demand the same from AI agents. Every choice, whether feature selection, model architecture, or hyperparameter tuning, should be logged with the following (a sketch of one possible log entry follows the list):
- The rationale (even if speculative): logging of all "thinking steps"
- The evidence (or lack thereof): an agent can do this too, but sources need to be validated
- The model's confidence level: probabilities, internal assessments, LLM committee-as-judge frameworks
- The source of the data or insight: references that are true, valid, reliable, and represent the best of science
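Here is a minimal sketch of what a structured decision-log entry could look like in Python. The `DecisionLogEntry` class, its field names, and the example values are illustrative assumptions, not an existing standard or library API; the point is simply that every agent decision can be captured as a small, auditable record.

```python
# A minimal, illustrative sketch of a decision-log entry for an AI agent.
# Class name, fields, and example values are assumptions for illustration,
# not an existing standard or library.
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class DecisionLogEntry:
    decision: str                  # what the agent chose (e.g., a feature set)
    rationale: str                 # the agent's stated reasoning, even if speculative
    evidence: list[str] = field(default_factory=list)  # cited sources; must be validated by a human
    confidence: float = 0.0        # self-reported or judge-assessed probability (0 to 1)
    data_source: str = ""          # provenance of the data or insight
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def append_to(self, path: str) -> None:
        """Append this entry as one JSON line so the log stays append-only and replayable."""
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(self)) + "\n")


# Example: the agent records why it kept one feature and dropped another
# (feature names and file paths below are made up for the example).
entry = DecisionLogEntry(
    decision="Keep 'hba1c_baseline'; drop 'visit_count' from the feature set",
    rationale="hba1c_baseline showed a stable univariate association; visit_count risked leakage",
    evidence=["internal EDA notebook, cell 14", "guideline citation to be verified by a reviewer"],
    confidence=0.62,
    data_source="site_a_extract_2025-08.parquet",
)
entry.append_to("decision_log.jsonl")
```

Writing each entry as a JSON line keeps the log append-only and easy to audit alongside the model artifacts, whoever (or whatever) made the decision.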
This isn't about slowing down AI. It's about making it responsible.
If we develop a model that impacts a patient, we should be able to show that patient how it works, the steps taken to eliminate bias and harm, and the paths not chosen, all backed by clearly defined evidence.