Implemented a data ingestion framework using Java Spark on EMR Serverless. The system manages ingestion from over 800 sources, including multi-billion-row relational tables and Kafka topics, and is critical in providing data to the Data Lake and Redshift for our ML, platform, and analytics teams.
Led a migration project that created data pipelines for more than seven core AncestryDNA and PetDNA databases. The approach used Airflow, Spark, EMR, and Glue to streamline data loading into the data warehouse, improving support for our analytics teams.
Delivered various data products and scalable APIs.
Played a key role in designing and developing essential components of the Ancestry Data Lake that facilitated ML and data science workflows and analytics for the company.