Variables and attributes
The ARMD dataset [26] is available at Dryad and encompasses a wide range of variables that are organized into multiple linked tables, each offering a unique perspective on a patient’s microbiological, demographic, and clinical characteristics. To facilitate downstream analyses, the dataset includes tables on implied antibiotic susceptibility relationships and rules applied for inferring susceptibility where direct testing was not available. Researchers can also leverage longitudinal data, capturing the timing of infections, prior medical procedures, and medication exposures relative to culture orders, enabling temporal analyses.
At the core of ARMD is the microbiological cultures cohort, which includes details about culture types—urine, respiratory, and blood cultures—along with the identified organisms and their antibiotic susceptibilities. Antibiotic susceptibility results were included for 55 antibiotics and categorized into five groups: susceptible, resistant, intermediate, inconclusive, and synergism. Synergism refers to cases where the interaction between two antibiotics results in an enhanced effect, meaning the combined treatment is more effective than either antibiotic alone. This category captures instances labeled as “Synergy” or “No Synergy” in the dataset. Additional features include the culture’s ordering mode (inpatient or outpatient) and the order’s timing.
The dataset situates each culture event within its clinical context. The ward information provides insights into the care environment where cultures were collected, distinguishing between inpatient wards, intensive care units (ICU), emergency departments (ED), and outpatient clinics.
To capture potential influences on culture outcomes, ARMD includes records of prior antibiotic exposures. This component details the antibiotic name, class, and subtype, together with the timing of each exposure relative to culture collection, enabling analyses of how previous treatments may affect organism susceptibility and resistance development. Additionally, the dataset tracks microbial resistance trends at both the individual and population levels over time, recording the evolution of resistance relative to culture events for specific organisms and antibiotics. Historical infection data are captured in a prior infecting organism table, which documents the organisms identified in each patient’s previous cultures and the timing of each prior infection relative to each collected culture. This enables longitudinal analyses of infection recurrence and its potential influence on current antimicrobial resistance.
Patient demographics offer essential context for stratifying analyses by age (binned into predefined ranges) and sex (binary-coded). In addition, the dataset incorporates socio-environmental factors through Area Deprivation Index (ADI) scores from the Neighborhood Atlas [27], which capture neighborhood-level socioeconomic characteristics and are linked through patient ZIP codes. ADI scores are defined for 9-digit ZIP codes and account for factors such as income, education, employment, and housing quality, providing a broader context for understanding disparities in AMR risk. For records with only 5-digit ZIP codes, missing ADI scores were replaced with the average ADI score calculated from 9-digit ZIP codes sharing the same first 5 digits. For other cases with invalid or unavailable ADI scores (e.g., marked as P, U, or NA), no imputation was performed, and these entries were left as null values in the dataset.
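The 5-digit ZIP fallback described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names `build_zip5_means` and `impute_adi`, the dictionary layout, and the toy ZIP codes and scores are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Codes treated as unavailable, following the text (P, U, NA); None marks a null.
INVALID_ADI = {"P", "U", "NA", None}


def build_zip5_means(adi_by_zip9):
    """Average the valid ADI scores of all 9-digit ZIPs sharing a 5-digit prefix."""
    groups = defaultdict(list)
    for zip9, adi in adi_by_zip9.items():
        if adi not in INVALID_ADI:
            groups[zip9[:5]].append(float(adi))
    return {zip5: mean(scores) for zip5, scores in groups.items()}


def impute_adi(patient_zip, adi_by_zip9, zip5_means):
    """Return an ADI score for a patient ZIP, or None when none can be assigned."""
    digits = patient_zip.replace("-", "")
    if len(digits) == 9:
        adi = adi_by_zip9.get(digits)
        # Invalid or suppressed scores stay null; no imputation is attempted.
        return None if adi in INVALID_ADI else float(adi)
    if len(digits) == 5:
        # Only the 5-digit ZIP is known: fall back to the prefix average.
        return zip5_means.get(digits)
    return None


# Hypothetical lookup table keyed by 9-digit ZIP (values are made up).
adi_table = {"943051234": 10, "943055678": 20, "943059999": "P", "100011111": 55}
zip5_means = build_zip5_means(adi_table)

print(impute_adi("94305", adi_table, zip5_means))       # prefix average
print(impute_adi("94305-9999", adi_table, zip5_means))  # suppressed score stays null
```

Precomputing the prefix averages once keeps the per-patient lookup to a dictionary access, and excluding the P/U/NA codes from the average avoids mixing suppressed values into imputed scores.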
Recognizing the role of long-term care facilities in AMR dynamics, nursing home visits are also documented, specifying the number of days between visits and culture orders, up to 90 days, to highlight potential risk factors for resistant infections.
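As a rough sketch of how such a look-back window could be computed, the snippet below derives the number of days between prior nursing home visits and a culture order. The function name `nursing_home_gaps` and the inclusive treatment of day 0 and day 90 are assumptions for illustration; the source does not specify how the boundaries are handled.

```python
from datetime import date


def nursing_home_gaps(culture_date, visit_dates, window_days=90):
    """Days from each prior nursing home visit to the culture order, within the window."""
    gaps = [(culture_date - visit).days for visit in visit_dates]
    # Keep visits on or before the culture date and at most `window_days` earlier.
    return sorted(gap for gap in gaps if 0 <= gap <= window_days)


culture = date(2023, 6, 30)
visits = [date(2023, 6, 20), date(2023, 3, 1), date(2023, 7, 5)]
print(nursing_home_gaps(culture, visits))  # only the visit 10 days earlier qualifies
```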
Comprehensive laboratory data are integrated into the dataset, capturing key clinical measurements taken around the time of each culture order. Variables include white blood cell count, hemoglobin, creatinine, lactate, and procalcitonin, among other routinely collected studies. Each metric is summarized using statistical descriptors such as medians, quartiles (Q25, Q75), and first and last recorded values. Furthermore, vital sign data—including heart rate, blood pressure, temperature, and respiratory rate—provide additional clinical context, enabling analyses of physiological responses to infection.
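The descriptors listed above (median, Q25, Q75, first, and last value) can be illustrated with a short sketch. The function name `summarize_lab`, the `(time, value)` input layout, and the choice of the inclusive (linear interpolation) quantile method are assumptions; the source does not state which quantile definition was used.

```python
from statistics import median, quantiles


def summarize_lab(measurements):
    """Summarize (time, value) pairs for one lab test around one culture order."""
    if not measurements:
        return None
    ordered = sorted(measurements, key=lambda m: m[0])  # chronological order
    values = [value for _, value in ordered]
    if len(values) >= 2:
        q25, _, q75 = quantiles(values, n=4, method="inclusive")
    else:
        q25 = q75 = values[0]  # a single value is its own quartile
    return {
        "median": median(values),
        "q25": q25,
        "q75": q75,
        "first": values[0],   # earliest recorded value
        "last": values[-1],   # most recent recorded value
    }


# Hypothetical white blood cell counts as (hours relative to order, value).
wbc = [(3, 16.0), (1, 11.0), (5, 10.0), (2, 8.0), (4, 14.0)]
print(summarize_lab(wbc))
```

Keeping the first and last values alongside the quartiles preserves the direction of change over the window, which the order-independent descriptors cannot show.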
Comorbid conditions are mapped using standardized indices such as the Elixhauser Comorbidity Index [28] and the Agency for Healthcare Research and Quality (AHRQ) Clinical Classifications Software Refined (CCSR) [29]. Each comorbidity is timestamped relative to the culture. Notably, ongoing comorbidities are flagged using NULL values in the end date field, indicating that the condition was active at the time of culture collection. These NULL values do not represent missing data or the absence of the condition. Procedural history is also provided, with records of medical procedures (e.g., central venous catheter placements, mechanical ventilation) performed prior to culture orders, derived from Current Procedural Terminology (CPT) codes.
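Because a NULL end date carries meaning here, analyses should avoid dropping such rows as missing data. The sketch below shows one way to separate ongoing from resolved conditions; the record keys `name` and `end_days_before_culture` are hypothetical and do not reflect the actual ARMD column names.

```python
def is_active_at_culture(record):
    """A NULL (None) end date means the condition was ongoing at culture time."""
    return record["end_days_before_culture"] is None


def split_comorbidities(records):
    """Separate ongoing conditions from those that ended before the culture."""
    active = [r["name"] for r in records if is_active_at_culture(r)]
    resolved = [r["name"] for r in records if not is_active_at_culture(r)]
    return active, resolved


# Hypothetical comorbidity rows for one culture.
rows = [
    {"name": "Diabetes", "end_days_before_culture": None},     # ongoing
    {"name": "Pneumonia", "end_days_before_culture": 120},     # resolved
    {"name": "Renal failure", "end_days_before_culture": None},
]
active, resolved = split_comorbidities(rows)
print(active, resolved)
```

A generic "drop rows with nulls" cleaning step would silently remove exactly the conditions that were active at culture time, so the NULL check is kept explicit.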
Lastly, the implied susceptibility table uses an extensive set of predefined rules to infer susceptibility for antibiotics that were not directly tested. It captures cases where susceptibility to one antibiotic can imply susceptibility or resistance to another, based on established microbiological and pharmacological principles. Incorporating these implied relationships enhances the interpretability of the susceptibility data, which can be critical for guiding clinical decision-making and understanding resistance patterns. We also share the rules applied to derive these relationships, providing transparency and enabling researchers to understand and reproduce the logic behind the inferred data.
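The general shape of rule-based inference can be sketched as a lookup keyed on organism, tested antibiotic, and result. The two rules below are toy examples based on the widely used convention that oxacillin results in Staphylococcus aureus predict cefazolin activity; they are not taken from the ARMD rule set, and the names `RULES` and `add_implied` are hypothetical.

```python
# (organism, tested drug, tested result) -> list of (implied drug, implied result).
# Illustrative rules only; the published ARMD rules should be used in practice.
RULES = {
    ("Staphylococcus aureus", "oxacillin", "Susceptible"): [("cefazolin", "Susceptible")],
    ("Staphylococcus aureus", "oxacillin", "Resistant"): [("cefazolin", "Resistant")],
}


def add_implied(organism, tested):
    """Return implied results for drugs that were not directly tested."""
    implied = {}
    for drug, result in tested.items():
        for implied_drug, implied_result in RULES.get((organism, drug, result), []):
            # Direct test results always take precedence over inferred ones.
            if implied_drug not in tested:
                implied.setdefault(implied_drug, implied_result)
    return implied


print(add_implied("Staphylococcus aureus", {"oxacillin": "Resistant"}))
```

Returning the implied results separately from the tested ones keeps inferred values distinguishable from laboratory measurements in downstream analyses.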
Demographics and microbiological culture data
ARMD comprises 751,075 microbiological culture records collected from 283,715 unique patients. Urine cultures constitute the majority of samples (50.0%), blood cultures represent 38.8%, and respiratory cultures account for 11.3%. The dataset spans from December 1999 to February 2024; however, there is a noticeable increase in recorded culture orders starting in 2008. This shift aligns with Stanford’s adoption of Epic as the EHR system, which significantly improved data collection and documentation.
The patient population has a broad age distribution, as illustrated in Fig. 2, with an average age of 56.7 years. Female patients predominate, accounting for 66.9% (189,864 patients) of the cohort; male patients account for 33.0% (93,763 patients), and a minimal fraction (0.03%, n = 82) have an unknown sex designation.

Fig. 2. A histogram showing the age distribution of patients within the ARMD dataset.
Figure 3 presents the annual distribution of the top five most common organisms identified in urine, blood, and respiratory cultures from 2013 to 2023. In urine cultures (Fig. 3a), Escherichia coli (E. coli) is the predominant pathogen, consistently accounting for more than 60% of isolates. Klebsiella pneumoniae and Proteus mirabilis are the next most frequently detected organisms, with little variation over time. This stable distribution indicates a consistent microbiological profile for UTIs within the cohort, in line with established epidemiological trends nationwide [30,31,32].
Fig. 3. Distribution of the top five most common bacterial organisms identified in urine, blood, and respiratory cultures over time (2013–2023). The stacked bar charts show the relative percentage of each organism by year, with an additional “Other” category aggregating all less frequent isolates. The organisms show different prevalence patterns across culture types, reflecting variations in infection sources and microbial ecology.
In blood cultures (Fig. 3b), a more diverse range of pathogens is observed than in urine cultures. While E. coli remains the most common pathogen, Staphylococcus aureus and coagulase-negative staphylococci are more prevalent than in urine cultures, reflecting the tendency of Gram-positive cocci to cause bloodstream infections.
In respiratory cultures (Fig. 3c), Pseudomonas aeruginosa is the most frequently isolated pathogen, possibly reflecting selection bias among patients who underwent respiratory culture testing by either non-invasive (e.g., induced sputum) or invasive (e.g., bronchoalveolar lavage) methods. A distinction between mucoid and non-mucoid Pseudomonas aeruginosa is observed, likely reflecting changes in microbiology reporting standards. Mucoid strains are clinically significant, particularly in chronic respiratory infections. Other notable organisms include Klebsiella pneumoniae and Staphylococcus aureus, both of which remain stable contributors to respiratory infections throughout the study period.

