FastVLM WebGPU
Overview
FastVLM WebGPU is a privacy-focused web application that brings AI-powered video captioning directly to your browser. Using the lightweight FastVLM-0.5B model, it analyzes video frames from your webcam or uploaded videos and generates descriptive captions in real-time—all without sending any data to external servers.
Features
- Dual Video Input: Support for both live webcam feeds and uploaded video files
- Real-time Captioning: Continuous frame analysis with streaming text output
- 100% Private: All processing happens locally in your browser—no data leaves your device
- Modern UI: Elegant glass-morphism design with draggable components
- Custom Prompts: Personalize captions with your own instructions
- Full Video Controls: Play, pause, seek, and volume control for uploaded videos
- Offline Capable: Works completely offline after initial model download
- GPU Accelerated: Leverages WebGPU for fast, efficient inference
- Accessible: Built with ARIA labels and semantic HTML
Technologies
- React 19 with TypeScript for type-safe UI development
- WebGPU for GPU-accelerated model inference
- @huggingface/transformers for running ML models in the browser
- Tailwind CSS 4 for modern, responsive styling
- Vite 7 for lightning-fast development and optimized builds
- ONNX Runtime with quantized model weights (Q4) for efficient inference
Getting Started
Prerequisites
- Node.js 18+ or Bun runtime
- A modern browser with WebGPU support (Chrome 113+, Edge 113+, or Opera 99+)
Installation
```bash
# Clone the repository
git clone <your-repo-url>
cd fastvlm-webgpu

# Install dependencies (using Bun)
bun install

# Or using npm
npm install
```

Development
```bash
# Start the development server
bun dev

# Or using npm
npm run dev
```

Open http://localhost:5173 in your browser.
Building for Production
```bash
# Build the project
bun run build

# Preview the production build
bun run preview
```

Usage
- Choose Your Video Source: Select either webcam or upload a video file
- Grant Permissions: Allow webcam access if using live video
- Wait for Model Loading: The FastVLM model will download and initialize (first time only)
- Start Captioning: Click "Start Captioning" to begin real-time analysis
- Customize Prompts: Enter custom instructions in the prompt box to guide the AI
- View Live Captions: Watch as captions stream in real-time based on the video content
How It Works
FastVLM WebGPU uses a lightweight Vision Language Model (FastVLM-0.5B) optimized for browser inference:
- Frame Capture: Extracts frames from video at 50ms intervals using HTML5 Canvas API
- Vision Encoding: Processes frames through the vision encoder (Q4 quantized)
- Language Generation: Generates captions using the decoder with custom prompts
- Streaming Output: Displays tokens as they are generated for real-time feedback
The entire pipeline runs client-side using WebGPU for hardware acceleration, ensuring privacy and enabling offline usage.
Model Details
- Model: FastVLM-0.5B (ONNX format)
- Parameters: 500 million
- Quantization: 4-bit (Q4) for decoder and vision encoder, fp16 for embeddings
- Max Tokens: 512 tokens per inference
- Format: ONNX Runtime Web with WebGPU backend
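In transformers.js, a mixed-precision setup like the one above is typically expressed as a per-module `dtype` map in the model-loading options. The module names below (`embed_tokens`, `vision_encoder`, `decoder_model_merged`) are illustrative and may not match this repo's exact ONNX file names:

```typescript
// Hypothetical options object mirroring the quantization settings above,
// in the shape transformers.js accepts for model loading.
const modelOptions = {
  dtype: {
    embed_tokens: "fp16",          // embeddings kept at half precision
    vision_encoder: "q4",          // 4-bit quantized vision encoder
    decoder_model_merged: "q4",    // 4-bit quantized decoder
  },
  device: "webgpu",                // run inference on the WebGPU backend
};
```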
Browser Compatibility
- Chrome 113+ - Supported
- Edge 113+ - Supported
- Opera 99+ - Supported
- Firefox - WebGPU in development
- Safari - WebGPU in development
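Since support varies by browser, an app like this can feature-detect WebGPU before attempting the model download. A minimal sketch (the function name is hypothetical, not from this repo):

```typescript
// Returns true when the environment exposes a usable WebGPU adapter.
// Safe to call outside the browser: resolves to false instead of throwing.
async function hasWebGPU(): Promise<boolean> {
  if (typeof navigator === "undefined" || !("gpu" in navigator)) {
    return false;
  }
  try {
    const adapter = await (navigator as any).gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}
```

On an unsupported browser this resolves to `false`, letting the UI show a fallback message instead of failing mid-load.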
Project Structure
```text
fastvlm-webgpu/
├── src/
│   ├── components/     # React UI components
│   │   ├── CaptioningView.tsx
│   │   ├── PromptInput.tsx
│   │   ├── LiveCaption.tsx
│   │   └── ...
│   ├── context/        # VLM model context and hooks
│   │   └── VLMContext.tsx
│   ├── types/          # TypeScript type definitions
│   ├── constants/      # App constants and configurations
│   └── App.tsx         # Root application component
├── package.json
├── vite.config.ts
└── tsconfig.json
```

Credits
This project is an improved version of the original FastVLM WebGPU demo by Apple, with enhancements including:
- Enhanced UI/UX with glass-morphism design
- Draggable interface components
- Support for both webcam and file upload
- Improved prompt management with suggestions
- Better error handling and loading states
- Full video playback controls
- Accessibility improvements
Acknowledgments
- Original FastVLM WebGPU implementation by Apple
- Hugging Face Transformers.js team for the browser ML framework
- FastVLM model researchers and contributors