Experience
2024 — Now
San Jose, California, United States
• Dedicated to developing, optimizing, and overseeing one of the industry's most extensive cloud
infrastructures, with a focus on site reliability and cloud-managed platforms, covering big data
computing, orchestration, storage, AI/ML infra, NoSQL, and relational databases.
• Participate in and enhance the complete service lifecycle, from inception and design, through
development, capacity planning, launch reviews, deployment, operation, and refinement.
• Design and implement software platforms and monitoring frameworks to govern service-oriented
architecture (SOA) efficiently, automatically, and intelligently.
• Develop and manage components of cloud-managed data infrastructure, encompassing technologies
such as Kubernetes, Redis, MySQL, Flink, and more.
2020 — 2024
Santa Clara, California, United States
✧ Capacity Planning: Manually led capacity planning for multiple microservices across multiple production stacks
• Built a capacity reporting tool to identify VMs/hosts with overprovisioned CPU and memory on the underlying KVM. The automation helped fix major performance bottlenecks and cut infrastructure costs by 30%
• Led cross-team efforts on a tenant onboarding project, employing T-shirt sizing methodology, leading to streamlined capacity planning
✧ Monitoring & Alerting: Collaborated with cross-functional teams to understand complex application architectures and implement effective top-down monitoring strategies, resulting in improved service visibility, reduced MTTD, and proactive issue resolution
✧ Infrastructure & Automation: Developed IaC libraries for provisioning and operating infrastructure at a massive scale using Terraform
• Implemented Noname WAAF across Netskope to increase visibility into our web access firewall
✧ CI/CD: Enhanced existing Jenkins deployment pipelines to reduce overall deployment time from 12 hours to 3 hours across multiple stacks
• Implemented Spinnaker as the CI/CD solution for faster releases, rollbacks, and canary deployments on Kubernetes-native infrastructure
✧ Onboarding: Led system designs and features to improve availability, scalability, latency, and efficiency of multiple microservices
• Embedded with product teams to ensure that applications are production-ready, scalable, and reliable
• Mentored newly onboarded team members on design principles, documentation efforts, troubleshooting production application services, and SRE best practices
• Led incident postmortems to identify root causes, ensure remediation, and define measures to prevent recurrence
• Introduced and streamlined processes for on-call and incident management
2019 — 2020
Santa Clara, California
✧ Monitoring & Alerting: Created service monitoring dashboards, actionable incident alerts, and comprehensive runbooks
✧ CI/CD: Developed an Ansible CD pipeline to deploy packages across multiple microservices, reducing deployment time from 20 hours to 12 hours
✧ On-call: Worked on 12/7 production on-call for a large fleet of hosts, monitoring host/app health, triaging/resolving errors on the application and host level, identifying and disabling faulty applications/features, leveraging SRE tools and automation, mitigating outages
• Reviewed and approved PRDs for new services and managed new services as they were onboarded for SRE support
• Built an automated system to poll data from different SaaS apps and inject it into the production environment
Education
University of Southern California
Master of Science - MS
Delhi University