I'm interested in distributed databases and machine learning, with around 3 years of experience building high-volume data systems that serve ML and data science applications.
Experience
2024 — Now
New York, United States
2022 — 2023
New York, New York, United States
Working under Dr. Eugene Wu and Zachary Huang in the Wu Lab on improving databases and ML.
• Helped build a Python library to train tree-based ML models on SQL databases.
• Paper Published: https://dl.acm.org/doi/10.1145/3592980.3595318
• Helped build a visualization library to display many-to-many joins for Wide Table Analytics.
• Paper Published: https://dl.acm.org/doi/10.1145/3597465.3605224
• Researched and helped evaluate Text-to-SQL performance using GPT-4.
• Paper Published: https://arxiv.org/abs/2310.18742
2023 — 2023
New York, United States
• Reduced API latency by 40% by optimizing SQL queries, and identified pagination changes that resulted in a further 95% improvement.
• Profiled production Kubernetes pod usage patterns based on CPU and memory metrics and identified a 15% cost-reduction opportunity.
• Designed and implemented an error framework that differentiates between user and system errors, reducing alert noise and ultimately improving developer productivity.
2021 — 2022
Bangalore Urban, Karnataka, India
Architected and implemented a Dynamic Error Classification System.
• Dynamically categorizes any error in the system with a readable error message, improving UX and controlling retry behaviour per error type.
• Decreased the time to deploy an error classification change from multiple hours to one minute.
• Removed the dependency on engineers and code changes for classifying errors, enabling Product Managers and Support staff to handle errors directly.
• Reduced errors displayed to users by 50% for specific sources.
Designed and implemented a feature in our job scheduler (Handyman) to automatically schedule jobs based on resource needs across machines with different hardware resources (RAM, disk storage).
• Implemented mainly to support ingestion jobs that download multi-GB files: these jobs are automatically scheduled on nodes with large disk storage and rerun on the same node, so ingestion of a file resumes without re-downloading it.
Built a Destination Cost Recommendation Framework.
• Automatically collects metadata statistics about data warehouses used in Hevo and stores them in a data lake.
• Analyzes these statistics and makes recommendations to help users reduce the cost of using their warehouse with Hevo.
Improved ingestion rate by 8x for Google Analytics Connector by sampling data volume and intelligently distributing workload across parallel jobs.
Mentored multiple interns.
2020 — 2021
Bengaluru, Karnataka, India
Designed and implemented an autonomous and robust integration with Kafka as a source.
• Scales out when it detects high data volume at the source and scales in to save costs, using linear regression and source data-retention thresholds to decide when to expand.
Integrated Firebolt as a Destination.
• Tackled ambiguous requirements and early-stage documentation, delivering the Firebolt integration on time by grepping library source and using debugging tools.
• Delivered the first Firebolt integration in the market, giving Hevo a competitive advantage and exclusive partnership deals.
• Added new features such as Parquet support and new key types in our Mapping component.
Optimized the sidelined-events flow.
• Reduced the time for each sidelined event to become visible to users from 5+ minutes to 1 minute.
• Added visibility so users can understand the state of their events as soon as possible.
Education
Columbia University
Master of Science - MS
2022 — 2023
Manipal Institute of Technology
Bachelor of Technology
2016 — 2020