Berkeley, California, United States
➤ Conducted research that doubled a language model’s robustness to attacks through adversarial training, resulting in a NeurIPS 2022 publication.
➤ Fine-tuned DeBERTa-based large language models using a custom EC2 experimentation platform, and built language-model-assisted adversarial attack tools with React, Flask, Tailwind CSS, DVC, Lambda Labs, and Hugging Face.
➤ Built a PyTorch-based framework for rapidly prototyping adversarial attacks and adversarial training on transformer language models, and used it to discover “relaxed” adversarial attacks that made toy models robust to all known adversaries without degrading performance.
➤ Used cutting-edge interpretability tools to find new circuits in GPT-2 that extrapolate patterns and explained more than 90% of that behavior.
➤ Mentored two teams of researchers that studied a language model and identified new compositions of attention heads, akin to induction heads.