Key Responsibilities:
• Crafted, refined, and tested high-complexity and research-style prompts for LLM performance evaluation (Gemini, Google AI models).
• Created structured fan-out questions and scenario-based test cases to expose model behavior and limitations.
• Reviewed and validated content for factual accuracy and clarity, identifying potential ambiguities; analyzed prompts and responses for consistency.
• Rated AI responses against the project's evaluation framework and refined prompts to increase clarity, difficulty, and contextual relevance.
• Contributed to Google’s Magi Project, focusing on prompt design for AI-powered search functionality.
• Demonstrated strong attention to detail by consistently verifying information across multiple domains and sources and flagging inconsistencies.
• Contributed to the Human Raters: Conversation Semantic Consistency Scoring project, rating LLM responses for consistency, factuality, and avoidance of contradictions.
• Executed Cultural Relevance Task evaluations, ensuring AI-generated content demonstrated cultural sensitivity, contextual appropriateness, and inclusivity across global user demographics.
• Worked on the Video Boundary Labeling project using Google Data Compute, identifying and annotating scene transitions accurately and consistently.
• Provided structured feedback to improve data quality and consistency.