Rishabh Jain

Software Engineer at ByteDance | TikTok

Experienced Software Developer with 6+ years of building products in the API and cloud security domains.

LinkedIn GitHub

Experience

TikTok, Software Engineer - Data Infrastructure SRE

2024 — Now

San Jose, California, United States

Dedicated to developing, optimizing, and overseeing one of the industry's most extensive cloud infrastructures, with a focus on site reliability and cloud-managed platforms covering big data computing, orchestration, storage, AI/ML infrastructure, NoSQL, and relational databases.

Participate in and enhance the complete service lifecycle, from inception and design, through development, capacity planning, launch reviews, deployment, operation, and refinement.

Design and implement software platforms and monitoring frameworks to govern service-oriented architecture (SOA) efficiently, automatically, and intelligently.

Develop and manage components of cloud-managed data infrastructure, encompassing technologies such as Kubernetes, Redis, MySQL, Flink, and more.

ByteDance, Software Engineer - Data Infrastructure SRE

2024 — Now

San Jose, California, United States

Dedicated to developing, optimizing, and overseeing one of the industry's most extensive cloud infrastructures, with a focus on site reliability and cloud-managed platforms covering big data computing, orchestration, storage, AI/ML infrastructure, NoSQL, and relational databases.

Participate in and enhance the complete service lifecycle, from inception and design, through development, capacity planning, launch reviews, deployment, operation, and refinement.

Design and implement software platforms and monitoring frameworks to govern service-oriented architecture (SOA) efficiently, automatically, and intelligently.

Develop and manage components of cloud-managed data infrastructure, encompassing technologies such as Kubernetes, Redis, MySQL, Flink, and more.

Netskope, Senior Software Engineer (Site Reliability)

2020 — 2024

Santa Clara, California, United States

✧ Capacity Planning: Manually led capacity planning for multiple micro-services across multiple production stacks

Built a capacity reporting tool to identify VMs/hosts with overprovisioned CPU and memory on the underlying KVM. The automation helped fix major performance bottlenecks and saved 30% in infrastructure costs

Led cross-team efforts on a tenant onboarding project, employing a T-shirt sizing methodology that streamlined capacity planning

✧ Monitoring & Alerting: Collaborated with cross-functional teams to understand complex application architectures and implement effective top-down monitoring strategies, resulting in improved service visibility, reduced MTTD, and proactive issue resolution

✧ Infrastructure & Automation: Developed IaC libraries for provisioning and operating infrastructure at a massive scale using Terraform

Implemented Noname WAAF across Netskope to increase visibility into our web access firewall

✧ CI/CD: Enhanced existing Jenkins deployment pipelines to reduce overall deployment time from 12 to 3 hrs across multiple stacks

Implemented Spinnaker as the CI/CD solution for faster releases, rollbacks, and canary deployments on Kubernetes-native infrastructure

✧ Onboarding: Led system designs and features to improve availability, scalability, latency, and efficiency of multiple microservices

Embedded with product teams to ensure that applications are production-ready, scalable, and reliable

Mentored newly onboarded team members on design principles, documentation efforts, troubleshooting production application services, and SRE best practices

Led incident post-mortems to identify root causes, ensure remediation, and define measures to prevent the issues from recurring

Introduced and streamlined processes for on-call and incident management

Netskope, Software Engineer

2019 — 2020

Santa Clara, California

✧ Monitoring & Alerting: Created service monitoring dashboards, actionable incident alerts, and comprehensive runbooks

✧ CI/CD: Developed an Ansible CD pipeline to deploy packages across multiple microservices, reducing deployment time from 20 to 12 hrs

✧ On-call: Worked 12/7 production on-call for a large fleet of hosts: monitoring host/app health, triaging and resolving errors at the application and host level, identifying and disabling faulty applications/features, leveraging SRE tools and automation, and mitigating outages

Reviewed and approved PRDs for new services and managed new services as they were onboarded for SRE support

Built an automated system to poll data from different SaaS apps and inject data into the production environment

Education

University of Southern California

Master of Science - MS

Delhi University

Bachelor of Technology (B.Tech.)