Research — Aditya Raj

Papers

Submitted · ACL 2026 Student Research Workshop

Does Inter-Agent Persona Exposure Erode Safety Alignment?

Aditya Raj

52,000+ controlled LLM API calls

A controlled empirical study testing whether AI agents corrupt each other's safety alignment through persona-based social interaction. Designed multi-agent interaction protocols across multiple model families and safety conditions. Evaluated using LLM-as-judge methodology with robustness checks across judge models and prompt variations.

multi-agent interaction persona exposure alignment robustness LLM-as-judge controlled experiment multi-model evaluation

OpenReview ↗ Code ↗

SPAR Fellowship · Ongoing

The Double-Edged Sword: A Framework for Analyzing and Forecasting Capability Spillovers from AI Safety Research

Aditya Raj

Theoretical and empirical framework for forecasting unintended capability gains from safety-focused research programs. Addresses whether safety research inadvertently advances the capabilities it aims to constrain.

capability spillover safety research forecasting framework

Independent Research · Ongoing

Project Spillover: Quantifying the Alignment Tax via Mechanistic Interpretability

Aditya Raj

Controlled fine-tuning experiments demonstrating that naive safety fine-tuning causes catastrophic reasoning collapse. Validates the alignment tax hypothesis using mechanistic interpretability techniques to trace feature degradation across model layers.

mechanistic interpretability alignment tax safety fine-tuning feature degradation

Research Interests

Multi-Agent Safety

How agent-to-agent interaction degrades alignment. Emergent unsafe behavior in agent pipelines.

Alignment Robustness

Whether safety training survives deployment contexts it was not trained on.

Mechanistic Interpretability

Using circuit-level analysis to validate alignment hypotheses empirically.

Safety-Capability Tradeoffs

Quantifying the alignment tax. When does safety training hurt capability and why.

Technical Training

Selective / Technical

AI Safety Fundamentals: Technical — BlueDot Impact
Cooperative AI Fellowship — AI Safety Asia · multiagent systems and coordination

Policy & Governance

AI Governance Intensive — BlueDot Impact

EA Ecosystem

Precipice Cohort
Technical AI Safety — EA Hungary
Biosecurity Fundamentals — BlueDot Impact
Writing Intensive — BlueDot Impact

Contact

Open to research collaborations, paper feedback, and conversations about empirical AI safety methodology.

rajsecrets03@gmail.com OpenReview ↗ GitHub ↗ LinkedIn ↗