How We Evaluate Medical AI

Independent physician-led evaluations across real-world medical use cases.

Raters

Three independent, board-certified physicians score every response.

Raters are blinded to which model produced the output.

Evaluation Criteria

Responses are assessed across 11 criteria, grouped into:

Technical Competence
Communication Quality
Patient-Centeredness

Scoring Rubric (1 to 5 Scale)

Each response is scored on a 1 to 5 scale:

1 = Clinically harmful or misleading
2 = Low clinical value; incomplete or unsafe
3 = Neutral to moderately useful
4 = Clinically appropriate and useful
5 = Clinically strong, complete, and appropriate
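To illustrate how three raters' rubric scores for a single response might be combined, here is a minimal Python sketch. The page does not describe the actual aggregation method, so the helper name `aggregate_scores`, the use of a simple mean, and the rule that flags any score of 1 are all assumptions made for illustration.

```python
from statistics import mean

# Rubric labels as listed above.
RUBRIC = {
    1: "Clinically harmful or misleading",
    2: "Low clinical value; incomplete or unsafe",
    3: "Neutral to moderately useful",
    4: "Clinically appropriate and useful",
    5: "Clinically strong, complete, and appropriate",
}


def aggregate_scores(scores):
    """Summarize one response's rater scores (hypothetical helper)."""
    if not scores:
        raise ValueError("at least one rater score is required")
    for s in scores:
        if s not in RUBRIC:
            raise ValueError(f"score {s!r} is outside the 1 to 5 rubric")
    return {
        "mean": mean(scores),
        "min": min(scores),
        "max": max(scores),
        # Assumption: a single score of 1 (harmful or misleading) is
        # surfaced for review even when the mean looks acceptable.
        "flag_harmful": min(scores) == 1,
    }


# Example: three physicians score the same response.
summary = aggregate_scores([4, 5, 4])
print(summary["mean"], summary["flag_harmful"])
```

Reporting the minimum alongside the mean keeps one rater's safety concern visible instead of letting it be averaged away; a real pipeline might prefer a median or a formal inter-rater agreement statistic.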

Case Studies

Top Performer: Claude-3

Internal Medicine

AI evaluation for general internal medicine diagnostics and treatment.

5 Samples, 11 Parameters

Top Performer: MedGemma

Multimodal Benchmark Study

AI evaluation across text and image cases to measure diagnostic accuracy and safety.

6 Samples, 11 Parameters

Top Performer: MedGemma

Dentistry

Error analysis of AI model performance in dentistry.

5 Samples, 11 Parameters