Santa Clara, California, United States
AI Platforms/Deep Engine Engineering - ML Infra
• Restructured training jobs by upgrading libraries, increasing large-image download speed by 23%
• Led M6i/C6i CPU instance expansion for ML Training to 30 regions, increasing availability by 40%, cutting customer
support inquiries by 25%, and raising user satisfaction by 18%
• Conducted in-depth testing of training jobs for performance, scalability, and reliability, reducing downtime by 20%
• Developed a GPU Errors Filtering Tool to analyze and report ML job failures by GPU instance type, eliminating manual log
checking and providing actionable insights for product improvement
• Reduced manual build effort for SageMaker ML Training in Canada West (Calgary) by over 95% through Region Build
Automation, integrating the region into internal dashboards and boosting build efficiency
• Facilitated GPU/CPU instance expansion into new Availability Zones and regions, increasing capacity for ML training jobs by 30%
across commercial, private, government, and high-security regions
• Upgraded checkpointing for Deep Learning containers and Docker error handling, addressing S3 storage issues, increasing
sizes by 20%, and optimizing error messages to ensure reliable job resumption and issue resolution