Senior Software Engineer in AI/ML infrastructure. Proven skills in distributed systems, including building and maintaining a container orchestration platform for the platform team.
Experience
2025 — Now
• Leading resource management for PayPal's AI/ML training infrastructure at scale.
• Implemented Kueue scheduler for efficient workload queuing and resource allocation across Kubernetes clusters.
• Built and maintained Argo Workflows-based pipeline system for orchestrating distributed training jobs.
• Developed reusable pipeline templates enabling developers to easily schedule and manage training jobs.
• Integrated KubeRay for scalable distributed training on Kubernetes using Ray clusters.
• Core development in Go for infrastructure tooling and platform services.
• Large-scale compute and GPU optimization using Nsight and PyTorch profiling.
2022 — 2024
San Jose, CA
I am part of PayPal's Platform Engineering team, which focuses on model development and lifecycle.
• Identified developers' pain points within PayPal and built platform solutions that reduce redundancy and save time for the business. This lets data scientists, machine learning engineers, and data engineers focus on building logic and worry less about the underlying distributed-systems complications.
• Built and managed the infrastructure for a Jupyter Notebooks application used by 15,000+ developers and running on multiple clusters, with reliable compute and storage.
• Managed an HPC cluster of 850+ nodes, including high-memory and GPU machines (A100, H100, B200), spread across on-prem and cloud infrastructure.
• Built an Airflow environment to support scheduling of Jupyter notebooks, handling 25,000+ DAGs daily across workloads ranging from ETL to deep learning.
• Leveraged agentic AI workflows to automate infrastructure management and catch instability sooner, complementing classic observability techniques.
• Migrated users from legacy VM infrastructure to a modern hybrid distributed system with minimal downtime.
• Worked horizontally with leaders and stakeholders to define project requirements for our data access platform and integrate it smoothly with other platforms.
• Assisted developers with environments and pipelines for multimodal data storage and for training multimodal models on distributed systems such as Ray.
• Used the Kueue scheduler for resource management.
2020 — 2022
California, United States
• Founded the Machine Learning Platform team at Adtalem. Developed a Lead Inquiry Scoring System that uses advanced machine learning to segment customers for marketing activity.
• Used bootstrap aggregation (bagging) to combine several weak learners into a strong ML model that classifies real-time, unlabeled input data.
• Used Python, AWS (Lambda, CloudWatch, SageMaker, API Gateway), Docker, and Snowflake.
• Presented performance metrics and business impact to leadership. Benchmarked against third-party software, saving costs and generating revenue in the millions of USD annually.
• Worked alongside the DevOps team to develop and maintain ML containers for various ML projects.
2019 — 2020
Charlotte, North Carolina Area
• Defended my thesis, titled 'PREPARATION OF UNITED STATES’ STATE COURT CASES DATASET AND DERIVING AN EFFICIENT EMBEDDING FOR CASES'.
• Automated text mining on Supreme Court case data (1954 to 2014) and created a graph network database that can be used by researchers in the United States judicial department.
• Performed trend analysis on docket size, dissent rates, and legal issues heard in the United States Supreme Court.
• Transformed case text into vectors and studied vector similarity across 2.7 million case documents.
2019 — 2019
Princeton, New Jersey
Architecture and Verification of Intelligent Systems at Siemens - Corporate Research and Technology
• Used anomaly detection to invent a new dataset space exploration (DSE) algorithm that helps identify bias in the training model.
• Identified the potential data points in the training dataset using DSE, which reduced model training time by 40%.
• Reduced the number of real-time simulation iterations performed on the locomotive, which saved approximately $1.35M in expenses.
• Worked on digital twins and simulations. Predicted the optimum speed for a locomotive with 90% accuracy using a deep neural network.
Education
University of North Carolina at Charlotte
Master's degree
2018 — 2019
Indian Institute of Science (IISc)
CCE
2017 — 2018
Kumaraguru College of Technology
Bachelor of Engineering (B.E.)
2013 — 2017