FastVLM WebGPU
Overview
FastVLM WebGPU is a privacy-focused web application that brings AI-powered video captioning directly to your browser. Using the lightweight FastVLM-0.5B model, it analyzes video frames from your webcam or uploaded videos and generates descriptive captions in real-time—all without sending any data to external servers.
Features
- Dual Video Input: Support for both live webcam feeds and uploaded video files
- Real-time Captioning: Continuous frame analysis with streaming text output
- 100% Private: All processing happens locally in your browser—no data leaves your device
- Modern UI: Elegant glass-morphism design with draggable components
- Custom Prompts: Personalize captions with your own instructions
- Full Video Controls: Play, pause, seek, and volume control for uploaded videos
- Offline Capable: Works completely offline after initial model download
- GPU Accelerated: Leverages WebGPU for fast, efficient inference
- Accessible: Built with ARIA labels and semantic HTML
Technologies
- React 19 with TypeScript for type-safe UI development
- WebGPU for GPU-accelerated model inference
- @huggingface/transformers for running ML models in the browser
- Tailwind CSS 4 for modern, responsive styling
- Vite 7 for lightning-fast development and optimized builds
- ONNX Runtime with quantized model weights (Q4) for efficient inference
Getting Started
Prerequisites
- Node.js 18+ or Bun runtime
- A modern browser with WebGPU support (Chrome 113+, Edge 113+, or Opera 99+)
Installation
```bash
# Clone the repository
git clone <your-repo-url>
cd fastvlm-webgpu

# Install dependencies (using Bun)
bun install

# Or using npm
npm install
```

Development
```bash
# Start the development server
bun dev

# Or using npm
npm run dev
```

Open http://localhost:5173 in your browser.
Building for Production
```bash
# Build the project
bun run build

# Preview the production build
bun run preview
```

Usage
- Choose Your Video Source: Select either webcam or upload a video file
- Grant Permissions: Allow webcam access if using live video
- Wait for Model Loading: The FastVLM model will download and initialize (first time only)
- Start Captioning: Click "Start Captioning" to begin real-time analysis
- Customize Prompts: Enter custom instructions in the prompt box to guide the AI
- View Live Captions: Watch as captions stream in real-time based on the video content
How It Works
FastVLM WebGPU uses a lightweight Vision Language Model (FastVLM-0.5B) optimized for browser inference:
- Frame Capture: Extracts frames from video at 50ms intervals using HTML5 Canvas API
- Vision Encoding: Processes frames through the vision encoder (Q4 quantized)
- Language Generation: Generates captions using the decoder with custom prompts
- Streaming Output: Displays tokens as they are generated for real-time feedback
The entire pipeline runs client-side using WebGPU for hardware acceleration, ensuring privacy and enabling offline usage.
Model Details
- Model: FastVLM-0.5B (ONNX format)
- Parameters: 500 million
- Quantization: 4-bit (Q4) for decoder and vision encoder, fp16 for embeddings
- Max Tokens: 512 tokens per inference
- Format: ONNX Runtime Web with WebGPU backend
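In transformers.js, a mixed-precision setup like the one above is typically expressed as a per-module `dtype` map in the model-loading options. The module names below (`embed_tokens`, `vision_encoder`, `decoder_model_merged`) are illustrative and may not match this repo's exact ONNX file names:

```typescript
// Hypothetical options object mirroring the quantization settings above,
// in the shape transformers.js accepts for model loading.
const modelOptions = {
  dtype: {
    embed_tokens: "fp16",          // embeddings kept at half precision
    vision_encoder: "q4",          // 4-bit quantized vision encoder
    decoder_model_merged: "q4",    // 4-bit quantized decoder
  },
  device: "webgpu",                // run inference on the WebGPU backend
};
```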
Browser Compatibility
- Chrome 113+ - Supported
- Edge 113+ - Supported
- Opera 99+ - Supported
- Firefox - WebGPU in development
- Safari - WebGPU in development
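Since support varies by browser, an app like this can feature-detect WebGPU before attempting the model download. A minimal sketch (the function name is hypothetical, not from this repo):

```typescript
// Returns true when the environment exposes a usable WebGPU adapter.
// Safe to call outside the browser: resolves to false instead of throwing.
async function hasWebGPU(): Promise<boolean> {
  if (typeof navigator === "undefined" || !("gpu" in navigator)) {
    return false;
  }
  try {
    const adapter = await (navigator as any).gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}
```

On an unsupported browser this resolves to `false`, letting the UI show a fallback message instead of failing mid-load.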
Project Structure
```text
fastvlm-webgpu/
├── src/
│   ├── components/     # React UI components
│   │   ├── CaptioningView.tsx
│   │   ├── PromptInput.tsx
│   │   ├── LiveCaption.tsx
│   │   └── ...
│   ├── context/        # VLM model context and hooks
│   │   └── VLMContext.tsx
│   ├── types/          # TypeScript type definitions
│   ├── constants/      # App constants and configurations
│   └── App.tsx         # Root application component
├── package.json
├── vite.config.ts
└── tsconfig.json
```

Credits
This project is an improved version of the original FastVLM WebGPU demo by Apple, with enhancements including:
- Enhanced UI/UX with glass-morphism design
- Draggable interface components
- Support for both webcam and file upload
- Improved prompt management with suggestions
- Better error handling and loading states
- Full video playback controls
- Accessibility improvements
Acknowledgments
- Original FastVLM WebGPU implementation by Apple
- Hugging Face Transformers.js team for the browser ML framework
- FastVLM model researchers and contributors