How can we make sure that deep learning models actually do what we want them to do? I study the foundations of robust and reliable machine learning: how to understand, debug, and guarantee the behavior of machine learning models. My research draws on machine learning, optimization, and robustness to develop principled methods that remain scalable and practical in real-world settings such as cosmology, surgery, cardiology, and sepsis care. My work is organized around three themes: Adversarial Safety, Interpretability with Guarantees, and Formal Assurances for Foundation Models.
Adversarial Safety
How do we defend models against adversarial threats, from perturbation attacks to jailbreaks to misuse? Topics include mechanistic theory of safety, alignment & control, jailbreaking & LLM defenses, and adversarial robustness.
We develop theoretical frameworks that mechanistically explain how models follow and break rules, connecting attention mechanisms and logical inference to safety behaviors.
How do we control model behavior to prevent misuse and remove unwanted knowledge? We develop methods for machine unlearning, misuse mitigation, and cross-cultural alignment.
- Artifact or Flaw? Rethinking Prompt Sensitivity in Evaluating LLMs (EMNLP Findings 2025)
- Avoiding Copyright Infringement via Machine Unlearning (NAACL Findings 2025)
- Comparing Styles across Languages (EMNLP 2023)
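To make the unlearning theme concrete, here is a minimal sketch of one common recipe: gradient ascent on a forget set paired with gradient descent on a retain set. It is illustrative only, not the method from the papers above, and the model, optimizer, and data batches are assumed to be supplied by the caller.

```python
import torch.nn.functional as F

def unlearning_step(model, optimizer, forget_batch, retain_batch, alpha=1.0):
    """One step of a simple gradient-difference unlearning recipe:
    ascend the loss on data to forget, descend on data to retain.
    (Illustrative sketch only, not the method from the papers above.)"""
    fx, fy = forget_batch
    rx, ry = retain_batch
    forget_loss = F.cross_entropy(model(fx), fy)
    retain_loss = F.cross_entropy(model(rx), ry)
    # The negative sign pushes the model away from the behavior to be
    # removed, while the retain term preserves overall utility.
    loss = -alpha * forget_loss + retain_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()
```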
We study how adversarial prompts bypass LLM safety guardrails and develop principled defenses, spanning automated attack algorithms, smoothing-based defenses, and standardized benchmarks.
- Adversarial Prompting for Black Box Foundation Models (DLSP 2023 Keynote)
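As a rough illustration of a smoothing-based defense, the sketch below queries a model on several randomly perturbed copies of a prompt and answers only if most copies look safe. The `generate` and `is_refusal` callables are hypothetical placeholders for an LLM call and a refusal detector.

```python
import random

def smoothed_defense(prompt, generate, is_refusal, n_copies=10, swap_frac=0.1):
    """Smoothing-style jailbreak defense: query the model on randomly
    perturbed copies of the prompt and answer only if most copies are safe.
    `generate` and `is_refusal` are caller-supplied placeholders."""
    alphabet = "abcdefghijklmnopqrstuvwxyz "
    outputs, refusals = [], 0
    for _ in range(n_copies):
        chars = list(prompt)
        n_swaps = max(1, int(swap_frac * len(chars)))
        for i in random.sample(range(len(chars)), k=n_swaps):
            chars[i] = random.choice(alphabet)  # random character swaps
        out = generate("".join(chars))
        outputs.append(out)
        refusals += int(is_refusal(out))
    # Adversarial suffixes are brittle: most perturbed copies of a jailbreak
    # prompt trigger a refusal, so a majority vote blocks the attack.
    if refusals > n_copies // 2:
        return "Request declined."
    return random.choice([o for o in outputs if not is_refusal(o)])
```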
We develop both empirical and provably certified defenses against adversarial perturbations, spanning threat models from Lp norms to patches to learned perturbation sets.
- Wasserstein adversarial examples (ICML 2019)
- Scaling provable adversarial defenses (NeurIPS 2018)
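For context, the sketch below shows a standard L-infinity PGD attack, the usual building block for empirical robustness evaluation and adversarial training; it is not one of the certified defenses from the papers above.

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8/255, step=2/255, iters=10):
    """Standard L-infinity projected gradient descent attack.
    Returns an adversarial example within an eps-ball around x."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(iters):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        # Gradient-sign step, then project back into the eps-ball
        # and clip to the valid pixel range.
        delta.data = (delta + step * delta.grad.sign()).clamp(-eps, eps)
        delta.data = (x + delta.data).clamp(0, 1) - x
        delta.grad.zero_()
    return (x + delta).detach()
```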
Interpretability with Guarantees
What do model explanations actually mean? We build principled methods for understanding, debugging, and explaining ML models. Topics include concepts & structure, scientific & healthcare applications, certified explanations, and debugging.
What concepts do models learn, and how do they compose? We develop methods for extracting expert-aligned features, compositional concept representations, and topic-based explanations across domains and languages.
- Comparing Styles across Languages (EMNLP 2023)
- TopEx: Topic-based Explanations for Model Comparison (ICLR 2023, Tiny Papers)
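A minimal sketch of a topic-level lens on model outputs is below: factor the TF-IDF matrix of a model's generations into topics and compare models topic by topic. This is an illustration under simplifying assumptions, not the TopEx method itself.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

def topic_summary(model_outputs, n_topics=5, n_words=6):
    """Summarize one model's outputs as topics (top words per NMF component),
    so two models can be compared topic by topic rather than output by output.
    Assumes at least n_topics output strings. Illustrative sketch only."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(model_outputs)
    nmf = NMF(n_components=n_topics, init="nndsvda", random_state=0)
    nmf.fit(X)
    vocab = vec.get_feature_names_out()
    # Top words per topic, highest weight first.
    return [[vocab[i] for i in comp.argsort()[-n_words:][::-1]]
            for comp in nmf.components_]
```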
We apply interpretable ML methods to high-stakes scientific and medical settings, including treatment effect estimation, anomaly repair with formal guarantees, and clinical prediction.
Can we trust model explanations? We develop explanation methods with provable guarantees, including certified stability for feature attributions and faithful group-based models.
- Evaluating Groups of Features via Consistency, Contiguity, and Stability (ICLR 2024, Tiny Papers, Oral)
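The sketch below gives an empirical (not certified) version of the stability idea: measure how often the top-k attributed features survive small input noise. The `attribute` callable is a hypothetical placeholder for any feature-attribution method.

```python
import torch

def attribution_stability(attribute, x, k=10, eps=0.05, n_samples=20):
    """Empirical (not certified) stability check: average overlap of the
    top-k attributed features under small Gaussian input perturbations.
    `attribute` is any caller-supplied attribution method (e.g. saliency)."""
    base_topk = set(attribute(x).flatten().topk(k).indices.tolist())
    agree = 0.0
    for _ in range(n_samples):
        x_pert = x + eps * torch.randn_like(x)
        topk = set(attribute(x_pert).flatten().topk(k).indices.tolist())
        agree += len(base_topk & topk) / k  # overlap of top-k feature sets
    return agree / n_samples  # 1.0 means perfectly stable under this noise
```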
We build tools for diagnosing model failures, from sparse linear layers for debuggable networks to methods for detecting spurious correlations, data biases, and transfer learning pathologies.
- Missingness bias in model debugging (ICLR 2022)
- Leveraging Sparse Linear Layers for Debuggable Deep Networks (ICML 2021, Oral)
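As a small illustration of the sparse-linear-layer idea, the sketch below fits an elastic-net-regularized linear head on frozen deep features so that each class depends on only a few features. Feature extraction is assumed to happen upstream; this is not the exact procedure from the paper above.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def fit_sparse_head(features, labels, l1_ratio=0.99, alpha=1e-3):
    """Fit a sparse (elastic-net regularized) linear head on frozen deep
    features, so each class depends on a small, inspectable set of features.
    Illustrative sketch; `features` is an (n_samples, n_features) array."""
    clf = SGDClassifier(loss="log_loss", penalty="elasticnet",
                        l1_ratio=l1_ratio, alpha=alpha, max_iter=1000)
    clf.fit(features, labels)
    nonzero = np.mean(clf.coef_ != 0)  # sparsity makes the head debuggable
    print(f"fraction of nonzero weights: {nonzero:.3f}")
    return clf
```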
Formal Assurances for Foundation Models
How can we bring formal structure and provable guarantees to foundation models through programs, logic, and verification? Topics include verified & faithful reasoning and neurosymbolic learning.
Can we verify that a model's reasoning is sound? We develop methods for certifying reasoning chain correctness, ensuring chain-of-thought faithfulness, and checking whether models follow statistical rules inferred from data.
- Faithful Chain-of-Thought Reasoning (IJCNLP-AACL 2023)
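To illustrate the faithfulness-by-construction idea, the sketch below has the model emit an executable reasoning chain and obtains the answer by running that chain deterministically, so the answer follows from the chain by construction. The `translate_to_program` callable is a hypothetical stand-in for an LLM translation step.

```python
def faithful_answer(question, translate_to_program):
    """Faithfulness by construction: the model translates the question into
    an executable reasoning chain (Python source defining `answer`), and a
    deterministic interpreter runs it, so the final answer is exactly what
    the chain computes. `translate_to_program` is a placeholder LLM wrapper."""
    program = translate_to_program(question)   # e.g. "answer = 3 * (5 + 2)"
    scope = {}
    exec(program, {}, scope)                   # deterministic execution step
    return program, scope["answer"]            # chain and answer, coupled

# Hypothetical usage with a hard-coded "translation" for illustration:
chain, ans = faithful_answer(
    "What is three times the sum of five and two?",
    lambda q: "answer = 3 * (5 + 2)")
print(chain, "->", ans)   # answer = 3 * (5 + 2) -> 21
```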
We combine neural networks with symbolic programs to enable scalable, data-efficient learning with formal structure, from programmable frameworks to compositional tensor methods.
- DOLPHIN: A Programmable Framework for Scalable Neurosymbolic Learning (ICML 2025)
- Data-Efficient Learning with Neural Programs (NeurIPS 2024)
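A toy example of neurosymbolic composition is sketched below: a neural classifier's digit probabilities are combined by the symbolic program `a + b == target`, and the resulting probability is differentiable, so sum-level supervision can train the classifier. This is a textbook-style illustration, not the DOLPHIN framework.

```python
import torch

def symbolic_sum_probability(p_a, p_b, target):
    """p_a and p_b are softmax outputs of a neural digit classifier over 0-9.
    Returns the probability that the symbolic program `a + b == target` holds,
    marginalizing over all digit pairs. The expression is differentiable, so
    a loss on it trains the classifier with only weak (sum-level) labels."""
    prob = torch.tensor(0.0)
    for a in range(10):
        for b in range(10):
            if a + b == target:
                prob = prob + p_a[a] * p_b[b]
    return prob

# Hypothetical usage with random classifier outputs:
p_a = torch.softmax(torch.randn(10), dim=0)
p_b = torch.softmax(torch.randn(10), dim=0)
loss = -torch.log(symbolic_sum_probability(p_a, p_b, target=7))
```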