Accurate, cited answers from 40 years of protocols
The hospital network had accumulated 40 years of clinical protocols, drug interaction databases, treatment guidelines, care pathway documents, and formulary updates. The corpus comprised 2.4 million documents totaling over 8 terabytes, housed across 14 separate systems with inconsistent metadata and no unified search. 12,000 clinicians — physicians, nurses, pharmacists, and clinical specialists — needed to access this information at the point of care. The existing solution was a SharePoint-based search that returned keyword matches: functionally unusable for clinical queries that used natural language and required semantic understanding.
The system could not hallucinate. In the hospital's threat model, a response that confidently cited an incorrect drug interaction or a superseded dosing protocol was categorically unacceptable. In a consumer AI application a wrong answer is an inconvenience; in a clinical setting it is a patient safety event. This constraint shaped every architectural decision. The system needed to retrieve and cite, not to generate and infer. Every response needed source attribution specific enough (document name, section, page) that a clinician who chose to check the source could do so in under 30 seconds.
"A response that confidently cites an incorrect drug interaction is not an inconvenience — it is a patient safety event. The system could not hallucinate."
We built a hybrid retrieval system that combines BM25 keyword search with dense semantic retrieval. The two result lists are merged with reciprocal rank fusion and then re-ranked by a cross-encoder trained on clinical query-document pairs labeled by clinical informaticists. The dual retrieval approach handles two distinct query types that clinical staff issue: keyword-anchored factual queries ("metformin contraindications CKD stage 3") where BM25 excels, and natural language clinical reasoning queries ("what's the protocol for chest pain presentation in a patient with prior CABG on anticoagulation") where dense retrieval is required. Fusing the two rankings lets each retriever contribute where it is strongest, so neither query type degrades the other.
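As a rough illustration of the fusion step, here is a minimal sketch of reciprocal rank fusion over two ranked lists. The document IDs are made up, and the constant k=60 is the value commonly used for this technique, not a parameter reported for this system. The cross-encoder re-ranking stage is not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default (assumption).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrievers for one query.
bm25_hits = ["doc_metformin_ckd", "doc_renal_dosing", "doc_formulary_2019"]
dense_hits = ["doc_renal_dosing", "doc_metformin_ckd", "doc_chest_pain_cabg"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Because the fused score depends only on rank positions, the BM25 and dense scores never need to be put on a common scale, and a document found by both retrievers rises above one found by only a single retriever at a similar rank.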
Every response includes structured citations: document title, version date, section heading, page number, and a direct link to the source document in the document management system. The citation is generated alongside the response — not as an afterthought, but as a first-class output of the retrieval pipeline. Clinicians can verify any response in under 30 seconds by clicking through to the source. The system's behavior in uncertain cases is to surface multiple relevant documents and let the clinician synthesize, rather than to generate a synthesized answer that obscures the source uncertainty.
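One way to picture a citation as a first-class output is a record that travels with every retrieved passage. The sketch below is illustrative only: the class names, field names, example title, and example URL are assumptions, and only the five citation fields themselves come from the description above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Citation:
    document_title: str
    version_date: date
    section_heading: str
    page_number: int
    source_url: str  # direct link into the document management system

    def render(self) -> str:
        return (
            f"{self.document_title} (version {self.version_date.isoformat()}), "
            f"section '{self.section_heading}', p. {self.page_number}, "
            f"{self.source_url}"
        )


@dataclass(frozen=True)
class CitedPassage:
    text: str
    citation: Citation  # required field: a passage cannot exist without one


def format_response(passages):
    """Render each retrieved passage followed by its citation."""
    if not passages:
        raise ValueError("a response must contain at least one cited passage")
    return "\n\n".join(
        f"{p.text}\n  Source: {p.citation.render()}" for p in passages
    )


# Hypothetical example; the title and URL are placeholders.
example = CitedPassage(
    text="(retrieved passage text)",
    citation=Citation(
        document_title="Example Renal Dosing Guideline",
        version_date=date(2020, 1, 15),
        section_heading="Dose adjustment",
        page_number=12,
        source_url="https://dms.example.org/doc/123#page=12",
    ),
)
rendered = format_response([example])
print(rendered)
```

Making the citation a required field of the passage type, rather than something attached after the answer is written, means text cannot reach the response without a source to check it against.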
Queries that fall below a confidence threshold — measured by retrieval score distribution and cross-encoder confidence — are routed to a human clinical informatics team rather than returned with a low-confidence automated response. The threshold was calibrated by reviewing 400 queries where the system's confidence was borderline and having clinical informaticists label the correct handling. This escalation path is not a failure mode — it's a designed feature. In the system's first 6 months, 2.3% of queries were escalated. Clinician feedback on escalated queries was uniformly positive: they preferred a transparent "I'm not confident, here's a human" to a confidently wrong automated answer.
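The routing decision can be sketched as a simple gate over the two signals named above. Everything specific here is an assumption for illustration: the function name, the use of "top score minus mean score" as the retrieval-distribution signal, and both threshold values. The real thresholds were calibrated against the 400 labeled borderline queries, and those values are not given.

```python
from statistics import mean


def route_query(retrieval_scores, cross_encoder_conf,
                min_spread=0.15, min_conf=0.80):
    """Return "answer" or "escalate" for one query.

    retrieval_scores: scores of the top retrieved documents (any order).
    cross_encoder_conf: re-ranker confidence for the best document, in [0, 1].
    min_spread, min_conf: placeholder thresholds (assumptions).
    """
    if not retrieval_scores:
        return "escalate"  # nothing retrieved, so nothing to cite
    # A flat score distribution means no document stands out from the rest.
    spread = max(retrieval_scores) - mean(retrieval_scores)
    if spread < min_spread or cross_encoder_conf < min_conf:
        return "escalate"  # hand off to the clinical informatics team
    return "answer"


print(route_query([0.9, 0.3, 0.2], 0.95))  # clear winner, confident re-ranker
print(route_query([0.5, 0.5, 0.5], 0.95))  # flat distribution
print(route_query([0.9, 0.3, 0.2], 0.50))  # re-ranker not confident
```

Requiring both signals to clear their thresholds errs toward escalation, which matches the stated preference for transparent uncertainty over a confident but wrong answer.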
"2.3% of queries were escalated to human review in the first 6 months. Clinicians preferred transparent uncertainty to confident errors — every time."
Precision at k=5 (did the retrieved context contain the information needed to answer the query?) improved from 12% with the previous SharePoint system to 91% with the hybrid retrieval system, a roughly 7.6× improvement in the metric that most directly predicts clinical utility. Average query response time is 1.8 seconds end-to-end, including retrieval, re-ranking, and response generation. Clinicians report this is fast enough to use at the point of care without disrupting workflow; user research conducted before the engagement put that threshold at under 3 seconds.
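For readers who want to reproduce this kind of measurement, the sketch below computes the metric as the parenthetical defines it: the share of queries whose top five results contain at least one document with the needed information. This per-query reading is sometimes called hit rate or success at k, and the sketch assumes it is the one intended. The function name and toy data are hypothetical.

```python
def hit_rate_at_k(retrieved, relevant, k=5):
    """Fraction of queries with at least one relevant document in the top k.

    retrieved: dict mapping query ID to a ranked list of document IDs.
    relevant: dict mapping query ID to the set of documents that answer it.
    """
    if not retrieved:
        return 0.0
    hits = 0
    for query_id, ranked_docs in retrieved.items():
        needed = relevant.get(query_id, set())
        if any(doc_id in needed for doc_id in ranked_docs[:k]):
            hits += 1
    return hits / len(retrieved)


# Toy evaluation set: three queries, two of which are answered in the top 5.
retrieved = {
    "q1": ["d1", "d2", "d3", "d4", "d5"],
    "q2": ["d9", "d8", "d7", "d6", "d5", "d4"],
    "q3": ["d2", "d3"],
}
relevant = {"q1": {"d3"}, "q2": {"d4"}, "q3": {"d2", "d9"}}

score = hit_rate_at_k(retrieved, relevant, k=5)
print(f"{score:.2f}")
```

In the toy data, q2 counts as a miss because its only relevant document sits at rank 6, just outside the top five.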
The hospital's clinical informatics team conducted a structured audit over 6 months: 3,200 system responses were reviewed by clinical staff against source documents. Zero fabricated facts were identified. In every case, the system's response was either directly grounded in a retrieved document or was an escalation to human review. This outcome reflects the architectural decision to retrieve-and-cite rather than generate-and-synthesize. The system does not produce clinical conclusions — it surfaces relevant clinical documentation and attributes it accurately. Clinical judgment remains with the clinician.