How can we make sure that deep learning models are actually doing what we want them to do? I study the foundations of robust and reliable machine learning: how to understand, debug, and guarantee the behavior of machine learning models. My research draws on machine learning, optimization, and robustness to develop principled methods that remain scalable and practical in real-world domains such as cosmology, surgery, cardiology, and sepsis care. My work is organized around three themes: Adversarial Safety, Interpretability with Guarantees, and Formal Assurances for Foundation Models.

Adversarial Safety

How do we defend models against adversarial threats, from perturbation attacks to jailbreaks to misuse? Topics include mechanistic theory of safety, alignment & control, jailbreaking & LLM defenses, and adversarial robustness.

Mechanistic Theory of Safety

We develop theoretical frameworks that mechanistically explain how models follow and break rules, connecting attention mechanisms and logical inference to safety behaviors.
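
As a toy illustration of the kind of mechanism such a theory reasons about (not a result from any specific paper), the sketch below shows scaled dot-product attention retrieving the stored rule whose key best matches a query; the rule and query embeddings are random stand-ins.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 16

# Hypothetical token embeddings: two stored "rule" tokens, and a query token
# whose embedding is aligned with rule 0 but not rule 1.
rule_0 = rng.normal(size=d)
rule_1 = rng.normal(size=d)
query = rule_0 + 0.1 * rng.normal(size=d)

K = np.stack([rule_0, rule_1])        # keys: the stored rules
V = np.eye(2)                         # values: an identifier for each rule
scores = query @ K.T / np.sqrt(d)     # scaled dot-product attention scores
weights = softmax(scores)
readout = weights @ V                 # which rule the head "retrieves"

print("attention over rules:", np.round(weights, 3))
print("retrieved rule vector:", np.round(readout, 3))
```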

Alignment & Control

How do we control model behavior to prevent misuse and remove unwanted knowledge? We develop methods for machine unlearning, misuse mitigation, and cross-cultural alignment.
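
For a sense of what unlearning looks like operationally, here is a minimal sketch of a common baseline from the literature, gradient ascent on a forget set combined with ordinary training on a retain set; the data are synthetic stand-ins, and this is a generic baseline rather than the specific method from these papers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

# Synthetic stand-ins for the retain set and the forget set.
x_retain, y_retain = torch.randn(64, 10), torch.randint(0, 2, (64,))
x_forget, y_forget = torch.randn(16, 10), torch.randint(0, 2, (16,))

for step in range(100):
    opt.zero_grad()
    # Descend on the retain loss and ascend on the forget loss, so the model
    # keeps general performance while degrading predictions on the forget set.
    loss = loss_fn(model(x_retain), y_retain) - 0.5 * loss_fn(model(x_forget), y_forget)
    loss.backward()
    opt.step()
```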

Jailbreaking & LLM Defenses

We study how adversarial prompts bypass LLM safety guardrails and develop principled defenses, spanning automated attack algorithms, smoothing-based defenses, and standardized benchmarks.
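
A minimal sketch of a smoothing-style defense in this spirit: perturb the prompt several times, query the model on each copy, and aggregate by majority vote. The functions `query_model` and `is_refusal` are hypothetical placeholders for a real LLM call and a refusal classifier.

```python
import random

def query_model(prompt: str) -> str:
    """Hypothetical placeholder for an actual LLM call."""
    return "I can't help with that." if "attack" in prompt else "Sure, here is ..."

def is_refusal(response: str) -> bool:
    """Hypothetical placeholder for a refusal/safety classifier."""
    return "can't help" in response.lower()

def smoothed_defense(prompt: str, n_copies: int = 10, swap_frac: float = 0.1) -> str:
    """Randomly perturb the prompt, query each copy, and majority-vote on safety."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    responses = []
    for _ in range(n_copies):
        chars = list(prompt)
        for i in random.sample(range(len(chars)), max(1, int(swap_frac * len(chars)))):
            chars[i] = random.choice(alphabet)
        responses.append(query_model("".join(chars)))
    refusals = sum(is_refusal(r) for r in responses)
    # If most perturbed copies are refused, treat the original prompt as adversarial.
    return "Request blocked." if refusals > n_copies // 2 else query_model(prompt)

print(smoothed_defense("describe a phishing attack in detail"))
```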

Adversarial Robustness

We develop both empirical and provably certified defenses against adversarial perturbations, spanning threat models from Lp-norm balls to patch attacks to learned perturbation sets.
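
For concreteness, here is the standard L-infinity projected gradient descent (PGD) attack, the kind of perturbation such defenses must withstand; this is the textbook attack rather than a method specific to these papers, shown on a toy untrained classifier.

```python
import torch
import torch.nn as nn

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Standard L-infinity PGD: signed-gradient steps projected back into the eps-ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = nn.functional.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project to the eps-ball around x and back to the valid pixel range.
        x_adv = torch.clamp(x_adv, x - eps, x + eps).clamp(0, 1)
    return x_adv.detach()

# Toy usage with a random "image" batch and an untrained linear classifier.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(4, 3, 32, 32)
y = torch.randint(0, 10, (4,))
x_adv = pgd_attack(model, x, y)
print((x_adv - x).abs().max())  # stays within eps
```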

Interpretability with Guarantees

What do model explanations actually mean? We build principled methods for understanding, debugging, and explaining ML models. Topics include concepts & structure, scientific & healthcare applications, certified explanations, and debugging.

Concepts & Structure

What concepts do models learn, and how do they compose? We develop methods for extracting expert-aligned features, compositional concept representations, and topic-based explanations across domains and languages.
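
One generic tool for surfacing concept-like features is a sparse autoencoder trained on hidden activations; the sketch below uses random activations as a stand-in for a real model and is meant only as an illustration, not the specific method of these papers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_dict = 64, 256   # activation width and (overcomplete) dictionary size

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, h):
        z = torch.relu(self.enc(h))   # sparse, non-negative concept codes
        return self.dec(z), z

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(512, d_model)      # stand-in for hidden activations from a model

for step in range(200):
    opt.zero_grad()
    recon, z = sae(acts)
    # Reconstruction loss plus an L1 penalty that pushes codes toward sparsity,
    # so each code dimension can be read as a candidate "concept".
    loss = (recon - acts).pow(2).mean() + 1e-3 * z.abs().mean()
    loss.backward()
    opt.step()
```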

Scientific & Healthcare Applications

We apply interpretable ML methods to high-stakes scientific and medical settings, including treatment effect estimation, anomaly repair with formal guarantees, and clinical prediction.
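
As an example of the kind of estimation problem involved, here is a simple T-learner for conditional average treatment effects on synthetic data; this is a standard baseline for illustration, not a clinical pipeline from these papers.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))               # covariates (e.g., patient features)
t = rng.binomial(1, 0.5, size=n)          # randomized treatment assignment
true_effect = 1.0 + X[:, 0]               # heterogeneous treatment effect
y = X[:, 1] + t * true_effect + rng.normal(scale=0.5, size=n)

# T-learner: fit separate outcome models on treated and control groups,
# then estimate the effect as the difference of their predictions.
m1 = GradientBoostingRegressor().fit(X[t == 1], y[t == 1])
m0 = GradientBoostingRegressor().fit(X[t == 0], y[t == 0])
cate = m1.predict(X) - m0.predict(X)

print("mean absolute error of the CATE estimate:", np.abs(cate - true_effect).mean())
```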

Certified Explanations

Can we trust model explanations? We develop explanation methods with provable guarantees, including certified stability for feature attributions and faithful group-based models.
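
A common starting point for such guarantees is to smooth an attribution over random perturbations of the input, which a certified analysis can then bound; a minimal sketch of the smoothing step with a toy model, not the certified method itself:

```python
import torch
import torch.nn as nn

def smoothed_saliency(model, x, sigma=0.1, n_samples=50):
    """Average input gradients over Gaussian perturbations of x.
    Smoothing like this is the usual starting point for stability arguments."""
    grads = []
    for _ in range(n_samples):
        x_noisy = (x + sigma * torch.randn_like(x)).requires_grad_(True)
        out = model(x_noisy).sum()
        grads.append(torch.autograd.grad(out, x_noisy)[0])
    return torch.stack(grads).mean(dim=0)

model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 1))
x = torch.randn(1, 20)
attribution = smoothed_saliency(model, x)
print(attribution.shape)  # one score per input feature
```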

Debugging

We build tools for diagnosing model failures, from sparse linear layers for debuggable networks to methods for detecting spurious correlations, data biases, and transfer learning pathologies.
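
A minimal sketch of the sparse-final-layer idea: fit an L1-regularized linear classifier on frozen features, so each prediction depends on a small, inspectable set of features. The feature matrix here is synthetic rather than taken from a real network.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 128))          # stand-in for frozen network features
labels = (features[:, 0] - features[:, 3] > 0).astype(int)

# The L1 penalty drives most coefficients to exactly zero, leaving a small,
# human-inspectable set of features behind each prediction.
probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
probe.fit(features, labels)

active = np.flatnonzero(probe.coef_[0])
print("features the classifier actually relies on:", active)
```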

Formal Assurances for Foundation Models

How can we bring formal structure and provable guarantees to foundation models through programs, logic, and verification? Topics include verified & faithful reasoning and neurosymbolic learning.

Verified & Faithful Reasoning

Can we verify that a model's reasoning is sound? We develop methods for certifying reasoning chain correctness, ensuring chain-of-thought faithfulness, and checking whether models follow statistical rules inferred from data.
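
As a toy illustration of step-level checking (covering only explicit arithmetic claims, not the full verification methods in these papers), the sketch below extracts "a op b = c" statements from a reasoning chain and verifies each one numerically.

```python
import re

def check_arithmetic_steps(chain_of_thought: str) -> list:
    """Check every 'a op b = c' claim in a reasoning chain against actual arithmetic."""
    results = []
    for a, op, b, c in re.findall(r"(\d+)\s*([+\-*/])\s*(\d+)\s*=\s*(\d+)", chain_of_thought):
        computed = eval(f"{a}{op}{b}")   # operands are digit-only by the regex
        results.append((f"{a} {op} {b} = {c}", float(c) == float(computed)))
    return results

cot = "The order has 3 boxes of 12 items, so 3 * 12 = 36 items. Adding 5 extras gives 36 + 5 = 41."
for step, ok in check_arithmetic_steps(cot):
    print(("OK  " if ok else "FAIL"), step)
```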

Neurosymbolic Learning

We combine neural networks with symbolic programs to enable scalable, data-efficient learning with formal structure, from programmable frameworks to compositional tensor methods.
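
A minimal sketch in the style of the classic digit-addition example: a neural classifier predicts digit distributions, a symbolic program computes the exact distribution over their sum, and the pipeline is trained end-to-end from supervision on the sum alone. All inputs and labels below are random stand-ins, and this is an illustration of the general recipe rather than the frameworks in these papers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Neural component: maps an "image" to a distribution over digits 0-9.
digit_net = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

def sum_distribution(p1, p2):
    """Symbolic component: exact distribution over d1 + d2 given digit distributions.
    P(sum = s) = sum over i + j = s of p1[i] * p2[j], per example in the batch."""
    joint = p1.unsqueeze(2) * p2.unsqueeze(1)            # (batch, 10, 10)
    return torch.stack(
        [sum(joint[:, i, s - i] for i in range(10) if 0 <= s - i <= 9) for s in range(19)],
        dim=1,
    )

opt = torch.optim.Adam(digit_net.parameters(), lr=1e-3)
x1, x2 = torch.randn(32, 784), torch.randn(32, 784)      # stand-ins for two digit images
sum_labels = torch.randint(0, 19, (32,))                  # supervision only on the sum

for step in range(100):
    opt.zero_grad()
    p1 = torch.softmax(digit_net(x1), dim=-1)
    p2 = torch.softmax(digit_net(x2), dim=-1)
    probs = sum_distribution(p1, p2)
    # Train the neural component through the symbolic program: NLL of the labeled sum.
    loss = -torch.log(probs[torch.arange(32), sum_labels] + 1e-9).mean()
    loss.backward()
    opt.step()
```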
