# Data Processing Framework based on Apache Spark, Kafka, ZooKeeper, YARN, AWS EMR and S3
• Developed and tested a highly configurable, Apache Spark-based ETL framework for data processing
• Automated complex data processing workflows & operationalized pipelines using Kafka, ZooKeeper, Jenkins and EMR
• Re-architected and optimized the release process for offline data processing pipelines, reducing software and configuration deployment time by 75%
• Prototyped an Apache Spark-based streaming pipeline for real-time incremental updates
# Web Service for EMR Cluster Management and YARN Job Submission
• Developed a RESTful web service for centralized management of all deployed AWS EMR clusters
• Implemented the frontend and activity dashboard for this web service using Flintjs & JavaScript
• Implemented single-click YARN job submission & tracking using predefined job templates and job submission history
• Deployed the distributed software stack using Docker, Jetty, AWS RDS, AWS EC2 and AWS EMR
# Content Discovery and Publishing Tool
• Architected & developed scraper for AWS S3/Aliyun OSS that discovers content using topological folder access
• Prototyped content injection tool that leverages AWS SNS and SQS for reliable and seamless content discovery
• Operationalized the software using Jenkins, ZooKeeper and Kafka
Technologies used: Apache Spark, Apache Kafka, Apache ZooKeeper, AWS EMR, AWS S3, AWS RDS, Flintjs, Bash, Jenkins, Scrum, Git/Gitflow