Post-training data infrastructure (2025–Present)
Led data infrastructure across SFT, reward modeling, and RLHF for Llama 4 and multimodal foundation models. Built a data management system across model checkpoints, powering team-wide data quality dashboards used by researchers.
Drove end-to-end privacy compliance enabling a public checkpoint launch. Schematized and risk-mitigated 500+ datasets across text, image, and video.
Built offline inference tooling with vLLM for multiple GPU platforms (such as H100 and GB200). Used for LLM-based data quality judges and synthetic data generation via rejection sampling.
Operationalized data pipelines at scale (decontamination, diversity sampling, taxonomy tagging, and sample validation) that gate all new training data proposals.
Earlier Meta work (2020–2024)
Architected a low-latency, GPU-backed realistic avatar generation service with high E2E reliability.
Launched a server-side streaming audio effects service used for noise suppression and dubbing of Facebook videos (Dec 2021). Built bidirectional audio inference from scratch in modern C++ with coroutines.
Led an external ASR vendor deprecation that saved $1M+/year and eliminated dual-stack operational cost.
Launched the Instagram Content Publishing API publicly (Jan 2021). Drove service reliability improvements, cross-functional launch coordination, and integrity reviews.