Web Entity Extraction and Validation at Large Scale Using Deep Learning and LLMs
• Designed and developed H1’s Entity Validator service using FastAPI, Flask, and the Transformers framework, and deployed it in the Web Extraction Pipeline on AWS, improving data validation efficiency at large scale and reducing turnaround time by 95%.
• Designed and developed the multi-service PySpark Web Extraction pipeline for entity extraction, which replaced the legacy extraction framework and increased throughput by 600%.
• Engineered reverse-mapping and voting algorithms to identify common XPaths, increasing extraction efficiency by 30%.
• Developed a Cleanup module that improved production data quality and reduced data issues by 40%.
• Worked on prompt engineering and fine-tuning for OpenAI GPT-3.5-turbo-based entity extraction, increasing recall by 16% and precision by 11%.
Skills: RDBMS · MySQL · Back-End Web Development · Continuous Integration and Continuous Delivery (CI/CD) · PySpark · SQL · Data Structures · BERT · Apache Spark · Machine Learning Algorithms · Data Engineering · Mathematics · Pandas · NumPy · Data Science · Computer Science · Natural Language Processing (NLP)