Burlingame, California, United States
Build and operate the data infrastructure behind Reality Labs' multimodal AI models — the systems that acquire, process, and serve training and evaluation datasets at scale.
– Designed multimodal data pipelines processing millions of media assets for model training and evaluation, with emphasis on data quality, provenance, and end-to-end traceability from source through training runs.
– Built semantic search and discovery tooling (embedding generation, k-NN indexing) that reduced dataset curation time for ML researchers, enabling faster experiment iteration across Reality Labs.
– Created an org-wide data resource usage attribution system covering the collection, generation, and storage of training and evaluation datasets, enabling stakeholders to make informed decisions about data acquisition and retention.
– Partnered with research teams to define data requirements for new model capabilities, translating researcher needs into scalable infrastructure.