Redmond, Washington, United States
Designed, built, and operated large-scale distributed systems to monitor and improve the health of Azure’s global physical network infrastructure supporting cloud and AI workloads.
Developed intelligent, agent-based systems that ingest network telemetry, validate network state, detect anomalies, and automate mitigation of reliability issues at hyperscale.
Partnered with cross-functional stakeholders to define requirements, identify technical dependencies, and produce design documents for production services and platform capabilities.
Built and optimized backend services and data pipelines using C# and Python, with a strong emphasis on performance, scalability, and maintainability.
Integrated AI-driven and data-informed capabilities into infrastructure operations to enhance fault detection, diagnosis, and automated recovery.
Improved observability, telemetry, and monitoring systems to accelerate issue detection, root-cause analysis, and resolution of production incidents.
Mentored engineers and contributed to engineering best practices, code quality, and operational excellence across the team.