The Future of WebGPU and Web LLM: What You Need to Know


The Web LLM project is an exciting initiative that makes large language models (LLMs) and LLM-based chatbots easily accessible by bringing them directly to web browsers. By running these AI assistants entirely inside the browser, with no server-side inference, and harnessing WebGPU for acceleration, the project keeps user data on the device while still delivering efficient performance.
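
In practice, everything hinges on the browser actually exposing WebGPU. Below is a minimal sketch of the prerequisite check a page would run before loading any model; navigator.gpu, requestAdapter(), and requestDevice() are the standard WebGPU entry points, while the function name, error messages, and the use of the @webgpu/types package for type declarations are illustrative choices rather than anything taken from the Web LLM codebase.

```typescript
// Minimal sketch: confirm the browser exposes WebGPU before attempting any
// in-browser inference. Type declarations for navigator.gpu come from the
// @webgpu/types package; the function name and messages are illustrative.
async function getWebGPUDevice(): Promise<GPUDevice> {
  if (!("gpu" in navigator)) {
    throw new Error("WebGPU is not available in this browser.");
  }
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    throw new Error("No suitable GPU adapter was found.");
  }
  // The GPUDevice is what buffers, shaders, and compute pipelines are created from.
  return adapter.requestDevice();
}
```

A page that gets a GPUDevice back from this check can then hand it to whatever runtime actually loads and runs the model.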

Some of the open-source language models and personal AI assistants supported by the Web LLM project include LLaMA, Alpaca, Vicuna, and Dolly. The project builds on a range of technologies, such as Apache TVM Unity, Hugging Face, LLaMA, Vicuna, WebAssembly (Wasm), and WebGPU. It relies on machine learning compilation (MLC): TensorIR for generating optimized programs, int4 quantization for compressing model weights, static memory planning optimizations, and Emscripten and TypeScript for a TVM web runtime that can deploy the generated modules.
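
To make the int4 weight-compression idea concrete, here is a toy sketch: each weight is rounded to a 4-bit signed integer against a shared scale, and two 4-bit values are packed into every byte. This is purely illustrative; Web LLM's real quantization is implemented inside the MLC/TVM compilation pipeline rather than in application-level TypeScript like this.

```typescript
// Toy illustration of int4 weight compression: map each float weight to a
// 4-bit signed integer with one shared scale and pack two values per byte.
// This is NOT Web LLM's actual scheme, just the basic idea behind it.
function quantizeInt4(weights: Float32Array): { packed: Uint8Array; scale: number } {
  // Choose the scale so the largest magnitude maps to the int4 range [-8, 7].
  const maxAbs = weights.reduce((m, w) => Math.max(m, Math.abs(w)), 0) || 1;
  const scale = maxAbs / 7;
  const packed = new Uint8Array(Math.ceil(weights.length / 2));
  for (let i = 0; i < weights.length; i++) {
    // Clamp to [-8, 7], then store as an unsigned nibble by adding 8.
    const q = Math.max(-8, Math.min(7, Math.round(weights[i] / scale))) + 8;
    if (i % 2 === 0) packed[i >> 1] = q;
    else packed[i >> 1] |= q << 4;
  }
  return { packed, scale };
}

function dequantizeInt4(packed: Uint8Array, scale: number, length: number): Float32Array {
  const out = new Float32Array(length);
  for (let i = 0; i < length; i++) {
    const nibble = i % 2 === 0 ? packed[i >> 1] & 0x0f : packed[i >> 1] >> 4;
    out[i] = (nibble - 8) * scale;
  }
  return out;
}
```

Shrinking each weight to four bits is what makes a 7-billion-parameter model plausible in a browser tab: roughly 3.5 GB of weights instead of the roughly 14 GB the same model would need in fp16.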

Web LLM is compatible with various GPU backends, including WebGPU, CUDA, OpenCL, and Vulkan, which makes LLMs easy to deploy. To set up a local deployment, users need to install TVM Unity, Emscripten, Rust, wasm-pack, Jekyll, jekyll-remote-theme, and Chrome Canary. Native deployment options with a local GPU runtime are also available, allowing performance comparisons between native GPU drivers and WebGPU.

WebGPU is a promising development for a wide range of platforms, including the web, Windows, macOS, Linux, ChromeOS, iOS, and Android. Although its current matrix-multiplication performance does not yet match native GPU drivers, it is expected to improve as remaining engineering challenges are addressed and extensions such as cooperative matrix multiply and bindless rendering land. WebGPU 1.0 is just the beginning, and its future looks bright as the standard continues to evolve.
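
To see where that gap shows up, the sketch below dispatches a deliberately naive single-precision matrix multiply as a WGSL compute shader through the WebGPU API, assuming a GPUDevice obtained as in the earlier snippet. The matrix size, buffer names, and the kernel itself are illustrative; Web LLM's real kernels are generated and optimized by TensorIR rather than hand-written like this.

```typescript
// Naive single-precision matrix multiply dispatched through WebGPU.
// Illustrative only: N, the buffer handling, and the WGSL kernel are
// placeholders, not the tuned TensorIR-generated kernels Web LLM uses.
const N = 1024; // square, row-major matrices

const matmulWGSL = /* wgsl */ `
  @group(0) @binding(0) var<storage, read> a : array<f32>;
  @group(0) @binding(1) var<storage, read> b : array<f32>;
  @group(0) @binding(2) var<storage, read_write> c : array<f32>;

  const N : u32 = ${N}u;

  @compute @workgroup_size(8, 8)
  fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
    let row = gid.y;
    let col = gid.x;
    if (row >= N || col >= N) { return; }
    var acc = 0.0;
    for (var k = 0u; k < N; k = k + 1u) {
      acc = acc + a[row * N + k] * b[k * N + col];
    }
    c[row * N + col] = acc;
  }`;

function makeStorageBuffer(device: GPUDevice, data?: Float32Array): GPUBuffer {
  const buffer = device.createBuffer({
    size: N * N * 4, // one f32 per element
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST | GPUBufferUsage.COPY_SRC,
  });
  if (data) device.queue.writeBuffer(buffer, 0, data);
  return buffer;
}

async function naiveMatmul(device: GPUDevice, a: Float32Array, b: Float32Array): Promise<void> {
  const bufA = makeStorageBuffer(device, a);
  const bufB = makeStorageBuffer(device, b);
  const bufC = makeStorageBuffer(device); // result buffer, left on the GPU
  const pipeline = device.createComputePipeline({
    layout: "auto",
    compute: { module: device.createShaderModule({ code: matmulWGSL }), entryPoint: "main" },
  });
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: bufA } },
      { binding: 1, resource: { buffer: bufB } },
      { binding: 2, resource: { buffer: bufC } },
    ],
  });
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(N / 8, N / 8); // one invocation per output element
  pass.end();
  device.queue.submit([encoder.finish()]);
  await device.queue.onSubmittedWorkDone(); // crude boundary for timing comparisons
}
```

Timing a dispatch like this against the same multiply on a native backend (for example CUDA or Vulkan via the local GPU runtime) is roughly the comparison that the project's native deployment option enables.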

Recently, WebGPU shipped in Chrome and is now in beta. The Web LLM project has been tested on Windows and Mac with a GPU with about 6.4 GB of memory, using the vicuna-7b-v1.1 model for the chat demo. More model support is in the pipeline, and interested users can find the project on GitHub, along with a related project called Web Stable Diffusion. This innovative project is set to revolutionize the way we interact with AI assistants, making them more accessible and private for users everywhere.
