I run large-scale empirical experiments on multi-agent AI safety. My work focuses on how AI systems interact with each other in ways that degrade safety properties - and what that means for the alignment of deployed agent pipelines. I use controlled experimental design, multi-model evaluation, and mechanistic interpretability to study these questions.
A controlled empirical study testing whether AI agents corrupt each other's safety alignment through persona-based social interaction. Designed multi-agent interaction protocols across multiple model families and safety conditions. Evaluated using LLM-as-judge methodology with robustness checks across judge models and prompt variations.
Theoretical and empirical framework for forecasting unintended capability gains from safety-focused research programs. Addresses whether safety research inadvertently advances the capabilities it aims to constrain.
Controlled fine-tuning experiments demonstrating that naive safety fine-tuning causes catastrophic reasoning collapse. Validates the alignment tax hypothesis using mechanistic interpretability techniques to trace feature degradation across model layers.
Open to research collaborations, paper feedback, and conversations about empirical AI safety methodology.