New York City Metropolitan Area
• Traced the critical request path across 18 microservices in a distributed system, created Grafana dashboards covering the performance of 50+ gRPC methods, and drove a company-wide initiative for CUJ-based SLO adaptation
• Developed availability and latency SLOs for the Reddit site using in-house SLO tooling, adding metrics, Grafana dashboards, and alerting rules based on Google SRE best practices to help engineering teams improve incident response
• Identified key SLIs and SLOs and instrumented Reddit’s service catalog with Prometheus to enhance its observability, then analyzed performance and increased its data freshness by 33%
• Analyzed Kubernetes instance lifecycles to uncover the root causes of GraphQL 5xx error anomalies within a distributed system