• Designed and deployed a stand-alone Machine Learning framework in Python for automated root-cause categorization of workflow failures for the ‘Jupyter Notebooks as a Service’ team in the Amazon Cloud Machine Learning Platform (AWS SageMaker)
• Developed a dataset management system on AWS S3, integrated with the classifier, with version control for continuous updates of training data, making it scalable, fault-tolerant, cost-effective, and easy to manage and use
• Led the design and development workflow (Agile Scrum) of the entire framework, the first cross-disciplinary effort to bring Machine Learning solutions into the internal operations repo, in a distributed computing environment (AWS EC2)
• Built CLI scripting tools for data retrieval and updating, classifier training, and region-based statistics report generation; added 5 different classification models to the classifier, which returns the top 2 root causes with probability scores at 91% accuracy
• Performed end-to-end unit and integration testing using pytest, and conducted code reviews using the Amazon CRUX tool