Real-time video captioning powered by Vision Language Models, running entirely in your browser with WebGPU acceleration.
FastVLM WebGPU is a privacy-focused web application that brings AI-powered video captioning directly to your browser. Using the lightweight FastVLM-0.5B model, it analyzes video frames from your webcam or uploaded videos and generates descriptive captions in real time—all without sending any data to external servers.
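The frame-analysis step described above boils down to grabbing the current video frame so it can be handed to the model. A minimal sketch of that step is below; it uses standard browser APIs (`canvas`, `drawImage`), and `captureFrame` is an illustrative name rather than a function from this project's code:

```typescript
// Illustrative sketch (not the project's actual code): grab the current
// frame of a <video> element as a compressed JPEG data URL, suitable for
// passing to a captioning model. Browser-only — relies on `document`.
function captureFrame(video: HTMLVideoElement): string {
  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext("2d");
  if (!ctx) throw new Error("2D canvas context unavailable");
  // Draw the video's current frame onto the canvas at full size
  ctx.drawImage(video, 0, 0);
  // Encode as JPEG at 80% quality to keep the payload small
  return canvas.toDataURL("image/jpeg", 0.8);
}
```

Capturing to a canvas like this works for both live webcam streams and uploaded video files, since both play through a `<video>` element.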
```bash
# Clone the repository
git clone <your-repo-url>
cd fastvlm-webgpu

# Install dependencies (using Bun)
bun install

# Or using npm
npm install
```

```bash
# Start the development server
bun dev

# Or using npm
npm run dev
```

Open http://localhost:5173 in your browser.
```bash
# Build the project
bun run build

# Preview the production build
bun run preview
```

FastVLM WebGPU uses a lightweight Vision Language Model (FastVLM-0.5B) optimized for browser inference:
The entire pipeline runs client-side using WebGPU for hardware acceleration, ensuring privacy and enabling offline usage.
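Because WebGPU support still varies across browsers, an app like this typically feature-detects the API before attempting to load the model. A minimal sketch of such a check (the function name is illustrative, not taken from this project):

```typescript
// Illustrative WebGPU feature check (not the project's actual code).
// `navigator.gpu` is only present in WebGPU-capable browsers, and
// requestAdapter() resolves to null when no suitable GPU is available.
async function supportsWebGPU(): Promise<boolean> {
  if (typeof navigator === "undefined" || !("gpu" in navigator)) {
    return false; // API not exposed at all (older browsers, Node, etc.)
  }
  try {
    const adapter = await (navigator as any).gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false; // treat any adapter-request failure as "unsupported"
  }
}
```

Running such a check up front lets the UI show a clear "WebGPU not supported" message instead of failing mid-inference.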
```
fastvlm-webgpu/
├── src/
│   ├── components/       # React UI components
│   │   ├── CaptioningView.tsx
│   │   ├── PromptInput.tsx
│   │   ├── LiveCaption.tsx
│   │   └── ...
│   ├── context/          # VLM model context and hooks
│   │   └── VLMContext.tsx
│   ├── types/            # TypeScript type definitions
│   ├── constants/        # App constants and configurations
│   └── App.tsx           # Root application component
├── package.json
├── vite.config.ts
└── tsconfig.json
```

This project is an improved version of the original FastVLM WebGPU demo by Apple, with enhancements including: