Works on big-data and backend services on the Data Engineering team, part of the Optum Analytics data factory.
Helped migrate our data warehouse from an Oracle database to a reactive, near-real-time, Spark-based data factory. Designed and implemented a highly concurrent ingestion pipeline in Scala using Akka on top of Kafka, processing over 90 million patient records per month at scale.
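The shape of such an Akka Streams + Kafka (Alpakka Kafka) consumer could be sketched as follows; the broker address, topic name, consumer group, and `processRecord` step are illustrative placeholders, not the production values:

```scala
// Sketch only: assumes Alpakka Kafka on Akka 2.6+, where the implicit
// ActorSystem also serves as the stream materializer.
import akka.Done
import akka.actor.ActorSystem
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.scaladsl.Sink
import org.apache.kafka.common.serialization.StringDeserializer

import scala.concurrent.Future

object IngestionPipeline extends App {
  implicit val system: ActorSystem = ActorSystem("ingestion")

  // Placeholder for the real per-record processing step.
  def processRecord(value: String): Future[Done] = Future.successful(Done)

  val settings =
    ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("kafka:9092")   // assumed broker address
      .withGroupId("patient-ingestion")     // assumed consumer group

  // Consume records from Kafka and process up to 16 in flight,
  // which is where the pipeline's concurrency comes from.
  Consumer
    .plainSource(settings, Subscriptions.topics("patient-records"))
    .mapAsync(parallelism = 16)(record => processRecord(record.value()))
    .runWith(Sink.ignore)
}
```

`mapAsync` preserves record order while allowing bounded concurrency, a common choice when downstream writes must not be unbounded.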
Wrote massively parallel Spark jobs to extract and transform raw CSV data, persisting it as Parquet for downstream processing.
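A minimal sketch of such a CSV-to-Parquet Spark job, assuming hypothetical HDFS paths and read options (the real jobs' schemas and locations are not shown here):

```scala
// Sketch only: paths, options, and app name are illustrative assumptions.
import org.apache.spark.sql.SparkSession

object CsvToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-to-parquet")
      .getOrCreate()

    // Read the raw CSV files; header/inferSchema are assumptions
    // about the input format.
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///landing/records/*.csv")

    // Persist as columnar Parquet for efficient downstream scans.
    raw.write
      .mode("overwrite")
      .parquet("hdfs:///staging/records.parquet")

    spark.stop()
  }
}
```

Parquet's columnar layout and embedded schema make it a natural intermediate format between extraction and later Spark stages.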
Built a caching layer in Scala on top of HDFS that gives large Spark jobs fast access to intermediate ETL results (stored as Parquet), avoiding their expensive regeneration. The retention policy was parameterized by time, and entries were invalidated when the underlying data was modified.
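The invalidation rule described above can be sketched as pure Scala; `CacheEntry`, `CachePolicy`, and the millisecond timestamp fields are hypothetical names, not the production API:

```scala
// Sketch of the cache-validity rule: an entry is reusable only while it
// is inside the retention window AND the source data has not changed
// since the entry was written. Timestamps are epoch milliseconds.
final case class CacheEntry(writtenAtMs: Long, sourceModifiedAtMs: Long)

object CachePolicy {
  def isValid(entry: CacheEntry, nowMs: Long, retentionMs: Long): Boolean = {
    val withinRetention = nowMs - entry.writtenAtMs <= retentionMs
    val sourceUnchanged = entry.sourceModifiedAtMs <= entry.writtenAtMs
    withinRetention && sourceUnchanged
  }
}
```

In the real system the source-modification time would come from HDFS file metadata for the inputs backing each cached Parquet result.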
Regularly responsible for extracting data from an Oracle database into the new data warehouse pipeline using Sqoop. Also implemented various internal tools and utilities, including a command-line tool that lets other teams run data workflows without writing SQL or MapReduce jobs.
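A representative Sqoop invocation for such an Oracle extraction might look like the fragment below; the connection string, credentials, table name, and target directory are placeholders, not the production values:

```shell
# Sketch only: all identifiers are illustrative placeholders.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username etl_user -P \
  --table PATIENT_RECORDS \
  --target-dir /landing/patient_records \
  --as-parquetfile \
  --num-mappers 8
```

Writing directly to Parquet (`--as-parquetfile`) and tuning `--num-mappers` keeps the extraction parallel and hands downstream Spark jobs a columnar format.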