Redwood City, California, United States
• Developed robust end-to-end batch and real-time streaming pipelines in Python, SQL, Scala, Spark, Databricks, and Snowflake, orchestrated with Airflow and processing tens of terabytes of data daily
• Optimized Docker images for lean, efficient layers, orchestrated the containers with Kubernetes, and integrated builds with CircleCI
• Fine-tuned SQL queries, Spark jobs, and cluster configurations to improve performance and reduce overall costs
• Implemented reliable metadata synchronization and data transfer between Snowflake and Databricks
• Migrated from an AWS-only stack to Databricks on AWS, from Maven to Gradle, from Ansible to Terraform, and from Zeppelin to Databricks notebooks; integrated Delta Lake
• Developed Python and SQL tools for tracking table-level data lineage and schema evolution, which were key to simplifying pipelines and the data model