Papers
Submitted · ACL 2026 Student Research Workshop
Does Inter-Agent Persona Exposure Erode Safety Alignment?
Aditya Raj
52,000+ controlled LLM API calls

A controlled empirical study testing whether AI agents corrupt each other's safety alignment through persona-based social interaction. Designed multi-agent interaction protocols across multiple model families and safety conditions. Evaluated using LLM-as-judge methodology with robustness checks across judge models and prompt variations.

multi-agent interaction persona exposure alignment robustness LLM-as-judge controlled experiment multi-model evaluation
SPAR Fellowship · Ongoing
The Double-Edged Sword: A Framework for Analyzing and Forecasting Capability Spillovers from AI Safety Research
Aditya Raj

Theoretical and empirical framework for forecasting unintended capability gains from safety-focused research programs. Addresses whether safety research inadvertently advances the capabilities it aims to constrain.

capability spillover safety research forecasting framework
Independent Research · Ongoing
Project Spillover: Quantifying the Alignment Tax via Mechanistic Interpretability
Aditya Raj

Controlled fine-tuning experiments demonstrating that naive safety fine-tuning causes catastrophic reasoning collapse. Validates the alignment tax hypothesis using mechanistic interpretability techniques to trace feature degradation across model layers.

mechanistic interpretability alignment tax safety fine-tuning feature degradation
Research Interests
Multi-Agent Safety
How agent-to-agent interaction degrades alignment. Emergent unsafe behavior in agent pipelines.
Alignment Robustness
Whether safety training survives deployment contexts it was not trained on.
Mechanistic Interpretability
Using circuit-level analysis to validate alignment hypotheses empirically.
Safety-Capability Tradeoffs
Quantifying the alignment tax. When does safety training hurt capability and why.
Technical Training
Selective / Technical
AI Safety Fundamentals: Technical  — BlueDot Impact
Cooperative AI Fellowship  — AI Safety Asia · multiagent systems and coordination
Policy & Governance
AI Governance Intensive  — BlueDot Impact
EA Ecosystem
Precipice Cohort
Technical AI Safety  — EA Hungary
Biosecurity Fundamentals  — BlueDot Impact
Writing Intensive  — BlueDot Impact
Contact

Open to research collaborations, paper feedback, and conversations about empirical AI safety methodology.