New York City Metropolitan Area
* Worked with a large multi-tenant Kubernetes on bare metal cluster
* Monitored Kubernetes cluster and infrastructure for issues and acted quickly to stem any negative impact, working with Development, Infrastructure, and Network Operations teams
* Helped maintain fleet health by performing routine OS updates on nodes, configuring server settings, and remote debugging with ssh to troubleshoot Kubernetes/docker related issues, Nvidia GPU issues with nvidia-smi and Xid error troubleshooting, and looking through kernel logs. Used bash scripting where applicable to automate processes
* Worked with contractors in remote sites to install, configure, and troubleshoot servers, network equipment, and data center infrastructure
* Liaised closely with Client Support Engineers to help clients get their AI/ML training and inference, VFX, and pixel streaming workloads running on the cluster. Debugged issues related to Kubernetes deployments, services, storage volumes, RBAC, virtual servers and more
* Gained exposure to ArgoCD, Argo workflows, Slurm, knative serverless, DHCP, Kubernetes on bare metal, Grafana dashboards and Loki and more
* Helped maintain high customer satisfaction by acting with empathy, understanding the business impact and priority of customer issues, and following our best practices
* Documented processes, best practices and expanded the KB and runbook
* Assisted with the hiring, training and development of new hires