Experience
2020 — Now
Redwood City, California, United States
Build a Streaming API that allows third-party financial institutions to consume real-time banking activity data through the Apigee API service using Pub/Sub.
Migrate the existing MapReduce-based on-prem data pipelines and platform to streaming data pipelines on GCP using Apache Beam.
• Build and deploy streaming Dataflow pipelines that process ~2k syslog messages per second. The pipelines consume data from Pub/Sub and Firestore, transform the syslog data (filtering, validation, data replay, de-duplication, and grouping) into banking activity Avro data, and ingest it into BigQuery and Bigtable, allowing third-party financial institutions to query the real-time data through the Apigee API service.
• Build and deploy batch Dataflow pipelines that read data from BigQuery, transform it (filtering, validation, and grouping), and generate daily reports for third-party financial institutions.
• Introduce and deploy Apache Airflow on Google Cloud Composer as the workflow scheduling tool, together with Cloud Functions, to run the daily and hourly batch Dataflow pipelines that generate reports.
2017 — 2019
San Mateo, CA
• Migrated and evolved a manual on-premises data processing tool into a cloud-based automated data pipeline.
Designed a generic ETL pipeline that accepts raw healthcare data in different schemas and converts it to the standard format using Spark 2.4 and Scala.
Introduced Apache Airflow as the workflow scheduling tool and brought it to production on AWS EMR.
Migrated the Lumiata ETL pipeline, with ~40 million patient healthcare records, from on-premises infrastructure to AWS EMR and GCP Dataproc.
Transformed ~40 million raw patient healthcare records from CSV format to the standard healthcare data format (HAPI FHIR) and imported the standardized output into BigQuery.
• Built a generic PySpark application to generate summary reports for ~40 million raw patient healthcare records. Validated the raw healthcare data against expected data types and generated statistics reports for all values.
• Built integration tests for the Lumiata ETL pipeline. Generated statistics reports for both the raw data and the standardized healthcare data to test the ETL.
• Verified and debugged data using Jupyter Notebook.
2017 — 2017
San Francisco Bay Area
Developed the search infrastructure on Microsoft Azure with the MEAN stack.
2009 — 2010
Successfully synthesized multilayer composite films of gold nanoparticles and semiconductor nanosheets via in-situ and layer-by-layer assembly methods.
Education
New York University
Master's degree
Xiamen University
Doctor of Philosophy - PhD
Inner Mongolia University