Research

A selection of my recent publications is listed below. See Google Scholar for the full list. Note: Asterisk (*) indicates equal contribution.

Reliable Weak-to-Strong Monitoring of LLM Agents

Neil Kale, Chen Bo Calvin Zhang, Kevin Zhu, Ankit Aich, Paula Rodriguez, Christina Q Knight, Zifan Wang

Under review at ICLR 2026

Toggle abstract

We stress test monitoring systems for detecting covert misbehavior in LLM agents (e.g., secretly exfiltrating data). We propose a monitor red teaming (MRT) workflow that varies agent and monitor awareness, adversarial evasion strategies, and evaluation across tool-calling (SHADE-Arena) and computer-use (CUA-SHADE-Arena) environments. We benchmark standard monitor scaffoldings and introduce a hybrid hierarchical--sequential design. Our work positions MRT as a standard workflow for stress-testing oversight, revealing robustness gaps in both LLM- and human-based monitoring. We release code, data, and logs to support further research.

Membership Inference Attacks for Unseen Classes

Pratiksha Thaker*, Neil Kale*, Zhiwei Steven Wu, Virginia Smith

Under review at ICLR 2026

Toggle abstract

Current membership inference attacks rely on shadow models that imitate the target model, but these attacks break down when the attacker lacks in-distribution data, as in many real AI safety settings. We show that in these restricted-data regimes, shadow-model attacks can fail catastrophically, while quantile regression attacks—which learn features of members that generalize to unseen classes—remain effective. Empirically and theoretically, quantile regression attacks achieve up to 11× higher true positive rates than shadow-model methods, revealing a critical failure mode in existing tools and a caution for practitioners deploying MIAs in practice.

Mitigating Modal Imbalance in Multimodal Reasoning

Chen Henry Wu*, Neil Kale*, Aditi Raghunathan

Conference on Language Models 2025

Toggle abstract

We investigate the reliability of foundation models in performing joint reasoning over images and text. While models excel at detecting conflicts within a single modality, performance drops drastically in cross-modal scenarios due to an "attention imbalance" where the model disproportionately prioritizes one input type over another. We find that simply scaling post-training data fails to solve this issue, but explicitly training on combined modalities effectively restores attention balance and improves downstream performance on several vision-language benchmarks.

Position: LLM Unlearning Benchmarks are Weak Measures of Progress

Pratiksha Thaker, Shengyuan Hu, Neil Kale, Yash Maurya, Zhiwei Steven Wu, Virginia Smith

SaTML 2025

Toggle abstract

Unlearning aims to make LLMs safer by removing harmful or sensitive information, but current benchmarks often give an overly optimistic picture of how well this works. With simple changes to popular benchmarks, we show that “forgotten” information can still be recovered and that retained knowledge can be harmed, especially when forget/retain sets are loosely dependent or targets are ambiguous. We urge caution when interpreting these benchmarks and offer recommendations for more reliable unlearning evaluation.