The MedPerturb Dataset

What Gender, Stylistic, and Viewpoint Perturbations Reveal About Human and Clinical LLM Reasoning

Abinitha Gourabathina1, Yuexing Hao1,2, Walter Gerych1, Marzyeh Ghassemi1
Contact: abinitha@mit.edu

1Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
2Department of Information Science, Cornell University, Ithaca, NY, USA


Abstract

Clinical robustness is critical to the safe deployment of medical Large Language Models (LLMs), but key questions remain about how LLMs and humans may differ in response to the real-world variability typified by clinical settings. To address this, we introduce MedPerturb, a dataset designed to systematically evaluate medical LLMs under controlled perturbations of clinical input. MedPerturb consists of clinical vignettes spanning a range of pathologies, each transformed along three axes: (1) gender modifications (e.g., gender-swapping or gender-removal); (2) stylistic variation (e.g., uncertain phrasing or colloquial tone); and (3) viewpoint reformulations (e.g., multi-turn conversations or summaries). With MedPerturb, we release a dataset of 800 clinical contexts grounded in realistic input variability, outputs from four LLMs, and three human expert reads per clinical context. We use MedPerturb in two case studies to reveal how shifts in gender identity cues, language style, or viewpoint reflect diverging reasoning capabilities between humans and LLMs. Our results highlight the need for evaluation frameworks that go beyond static benchmarks to assess alignment in clinical reasoning between humans and AI systems.

Figure 1 placeholder

Dataset Overview

 #   Original Data Source   Perturbation      Clinical Contexts
 1   OncQA                  Baseline           50
 2   OncQA                  Gender-Swapped     50
 3   OncQA                  Gender-Removed     50
 4   OncQA                  Uncertain          50
 5   OncQA                  Colorful           50
 6   r/AskaDocs             Baseline           50
 7   r/AskaDocs             Gender-Swapped     50
 8   r/AskaDocs             Gender-Removed     50
 9   r/AskaDocs             Uncertain          50
10   r/AskaDocs             Colorful           50
11   USMLE and Derm         Vignette          100
12   USMLE and Derm         Multiturn         100
13   USMLE and Derm         Conversational    100

Total clinical contexts: 800
Treatment questions (3 per context): 800 × 3 = 2,400
Total human reads (3 per question): 2,400 × 3 = 7,200
Total LLM reads (3 per question × 4 models): 2,400 × 3 × 4 = 28,800
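
The totals above follow directly from the per-row counts. The short Python sketch below is only an illustrative check of that arithmetic (it is not part of the MedPerturb release):

```python
# Illustrative check of the MedPerturb count arithmetic (not release code).
# 10 rows of 50 contexts (OncQA, r/AskaDocs) + 3 rows of 100 (USMLE and Derm).
context_counts = [50] * 10 + [100] * 3

total_contexts = sum(context_counts)         # 800 clinical contexts
total_questions = total_contexts * 3         # 3 treatment questions per context -> 2,400
total_human_reads = total_questions * 3      # 3 human expert reads per question -> 7,200
total_llm_reads = total_questions * 3 * 4    # 3 reads per question x 4 models   -> 28,800

print(total_contexts, total_questions, total_human_reads, total_llm_reads)
# 800 2400 7200 28800
```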

Key Findings of Case Studies

  • LLMs tend to over-allocate resources and under-recommend self-management: They often favor more intensive medical interventions than necessary, potentially straining healthcare systems and misaligning with patient-centered care.
  • LLMs are more sensitive to gender and language style than humans: Minor wording or gender cues can shift model outputs, raising concerns about model reasoning and the relevance of non-clinical features in medical decision-making.
  • AI-generated clinical content can shift human decision-making: Clinicians shown AI-generated text tend to recommend more self-management and fewer resources, showing that AI-generated content can influence treatment planning. Specifically, we examine LLM-generated summaries and multiturn conversations, which are key LLM tasks in clinical integration.

Hugging Face Dataset Card

MedPerturb on Hugging Face

The MedPerturb dataset contains 800 clinical vignettes systematically perturbed along gender, style, and viewpoint axes. Each context has 3 treatment questions and is annotated by humans and LLMs.

Use cases: robustness testing, fairness auditing, hallucination detection, and clinical evaluation.

Format: JSON, with metadata on perturbation type, model responses, and clinician judgments.
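
As a sketch of how the release might be consumed, the snippet below loads the dataset with the Hugging Face `datasets` library and tallies clinical contexts per perturbation type. The repository ID and the `perturbation` field name are placeholders assumed here for illustration; see the dataset card for the actual identifiers and schema.

```python
from collections import Counter

from datasets import load_dataset

# Placeholder repository ID; substitute the actual MedPerturb ID from the
# Hugging Face dataset card.
ds = load_dataset("ORG/MedPerturb", split="train")

# Tally clinical contexts per perturbation type
# (the "perturbation" field name is assumed here for illustration).
counts = Counter(example["perturbation"] for example in ds)
for perturbation, n in counts.most_common():
    print(f"{perturbation}: {n} contexts")
```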