Mountain View, California, United States
🔹 Logs Pipeline: Developed a Scala Spark job that processes over 3 TB of data per hour, transforming raw logs from S3 to HDFS with partitioning and converting data from CSV to Parquet for optimized storage and querying. Designed aggregated pipelines tailored to grouped use cases for targeted analysis. Utilized Airflow for scheduling, setting up alerts, enforcing SLAs, and ensuring optimal resource utilization.
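A minimal sketch of the hour-partitioned layout such a job might produce; the path scheme, base URI, and file names here are hypothetical, not the production layout:

```python
from datetime import datetime

def partition_path(base: str, ts: datetime) -> str:
    """Build an hour-partitioned HDFS path (hypothetical dt=/hr= layout)."""
    return f"{base}/dt={ts:%Y-%m-%d}/hr={ts:%H}"

def parquet_name(csv_key: str) -> str:
    """Map a raw CSV object name to its Parquet counterpart."""
    stem = csv_key.rsplit(".", 1)[0]
    return f"{stem}.parquet"

if __name__ == "__main__":
    ts = datetime(2023, 5, 1, 14)
    print(partition_path("hdfs:///logs", ts))  # hdfs:///logs/dt=2023-05-01/hr=14
    print(parquet_name("raw/events.csv"))      # raw/events.parquet
```

In Spark itself this layout corresponds to writing with `partitionBy("dt", "hr")` and the `parquet` format, which lets downstream queries prune partitions by date and hour.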
🔹 Deduplication Theory and Factor Generation: Developed a data-intensive algorithm using PySpark to identify and categorize duplicate requests, surfacing available inventory. Integrated ElasticSearch for efficient indexing and retrieval of duplicate data and used Athena for fast querying of large datasets. Automated the entire deduplication process through Airflow, ensuring efficient handling of large-scale data and improving data quality for better inventory management.
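The core of such a deduplication pass can be sketched as grouping requests by a normalized key and splitting each group into one canonical request plus its duplicates; the field names (`url`, `params`) and normalization rules are illustrative assumptions, not the actual algorithm:

```python
from collections import defaultdict

def categorize_duplicates(requests):
    """Group requests by a normalized (url, params) key; return
    (canonical, duplicates). Field names are hypothetical."""
    groups = defaultdict(list)
    for req in requests:
        key = (
            req["url"].lower().rstrip("/"),               # case/trailing-slash normalization
            tuple(sorted(req.get("params", {}).items())),  # order-independent params
        )
        groups[key].append(req)
    canonical, duplicates = [], []
    for reqs in groups.values():
        canonical.append(reqs[0])   # keep the first occurrence
        duplicates.extend(reqs[1:])  # the rest are duplicates
    return canonical, duplicates
```

In PySpark the same idea maps to a `groupBy` on the normalized key; the duplicate groups can then be indexed into ElasticSearch for retrieval.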
🔹 Alerting Mechanism: Developed a scalable alerting system in Node.js to manage 500+ supply and demand tags, triggering real-time alerts when spending exceeded thresholds or anomalous values were detected. The solution leverages Node.js, Redis, and AWS, ensuring low-latency monitoring and automated alerts via Slack and email. This system streamlined anomaly detection, reducing manual oversight and enabling faster decision-making.