How We Evaluate Medical AI

Independent physician-led evaluations across real-world medical use cases.

Three independent, board-certified physicians score every response.

Raters are blinded to which model produced the output.

Responses are assessed across 11 criteria, grouped into:

Technical Competence

Communication Quality

Patient-Centeredness

Evaluation uses a 1 to 5 scale:

1 = Clinically harmful or misleading

2 = Low clinical value; incomplete or unsafe

3 = Neutral to moderately useful

4 = Clinically appropriate and useful

5 = Clinically strong, complete, and appropriate
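The page does not say how the three physicians' scores are combined, so the following is only a minimal sketch of one plausible approach: average each criterion across raters, then average the criteria within each group. The function name `aggregate`, the criterion labels, and the sample ratings are all hypothetical placeholders, because the source names the three groups but not the 11 individual criteria.

```python
from statistics import mean


def aggregate(ratings, groups):
    """Average 1-5 scores per group (assumed aggregation, not from the source).

    ratings: list of {criterion: score} dicts, one per physician rater.
    groups:  {group_name: [criterion, ...]} mapping criteria to groups.
    Returns  {group_name: mean score across raters and criteria}.
    """
    # Reject anything outside the 1-to-5 scale described above.
    for rater in ratings:
        for criterion, score in rater.items():
            if score not in (1, 2, 3, 4, 5):
                raise ValueError(
                    f"score for {criterion!r} must be an integer from 1 to 5"
                )

    result = {}
    for group, criteria in groups.items():
        # Mean across raters for each criterion, then mean across criteria.
        per_criterion = [mean(r[c] for r in ratings) for c in criteria]
        result[group] = mean(per_criterion)
    return result


# Hypothetical criterion names and made-up ratings, for illustration only.
example_groups = {
    "Technical Competence": ["accuracy", "safety"],
    "Communication Quality": ["clarity"],
    "Patient-Centeredness": ["empathy"],
}
example_ratings = [
    {"accuracy": 4, "safety": 5, "clarity": 3, "empathy": 4},
    {"accuracy": 4, "safety": 4, "clarity": 3, "empathy": 5},
    {"accuracy": 5, "safety": 4, "clarity": 3, "empathy": 3},
]

if __name__ == "__main__":
    for group, score in aggregate(example_ratings, example_groups).items():
        print(f"{group}: {score:.2f}")
```

A plain mean is the simplest choice here; a real evaluation might instead report medians or inter-rater agreement, which this sketch does not attempt.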

Case Studies

Top Performer: Claude-3

AI evaluation for general internal medicine diagnostics and treatment

5 Samples, 11 Parameters

Top Performer: MedGemma

AI evaluation across text and image cases to measure diagnostic accuracy and safety.

6 Samples, 11 Parameters

Top Performer: MedGemma

Error analysis of AI model performance in dentistry

5 Samples, 11 Parameters