• Architected a high-throughput RAG pipeline to ingest and process SEC 10-K filings, implementing a custom parser that normalizes heterogeneous HTML and PDF content into structured formats.
• Eliminated $10K/month in AWS Textract spend by engineering a custom OCR solution to replace it, maintaining accuracy while improving processing speed and reliability.
• Designed a custom chunking library for financial documents that optimizes token usage and preserves tabular data integrity better than off-the-shelf solutions such as LangChain.
• Implemented a hybrid search algorithm combining semantic similarity (SentenceTransformers) with fuzzy string matching (RapidFuzz) to align section headers across disparate documents, improving retrieval accuracy.
• Built a document classification model achieving 98.5% accuracy by optimizing embedding strategies and scoring functions for financial data categorization.
• Established a comprehensive CI test suite with pytest covering the full data ingestion lifecycle, accelerating the release cycle by 20%.