Seattle, Washington, United States
Initiated and led a project to build an LLM development platform covering (1) LLM SFT/DPO/PPO training automation, (2) an LLM evaluation framework including new-metric onboarding and evaluation orchestration, and (3) underlying GPU management.
Built a GPU utilization monitoring and alerting system, improving the org-level GPU utilization rate to >90%.
Improved LLM training success rate from <40% to >80% by collaborating with data science teams and an external infrastructure team.
Initiated a CLI framework to support LLM workflows, e.g., LLM release to runtime, GPU resource management, and automated LLM experimentation.
Built an automated LLM leaderboard, enabling automated comparison-table generation across arbitrary LLM artifacts on arbitrary metrics.
Building an LLM fine-tuning engineering system to accelerate LLM experiment execution and release.
Led the development of an extensible offline model inference solution hosting 10+ models (100M+ parameters each), running inference on TB-scale data daily. Through cross-org collaboration with infrastructure teams, achieved >90% data SLA availability. Drove cross-team collaboration to propose a unified offline inference infrastructure, preparing the team for LLMs.