This PR adds an OpenAI-like API to `ChatModule`, specifically the `chatCompletion` API. See `examples/openai-api` for example usage, and the [OpenAI reference](https://platform.openai.com/docs/api-reference/chat) for the original API. Changes include:

- Implement `chatCompletion()` in `ChatModule`
- Expose conversation manipulation methods in `llm_chat.ts` so that `request.messages` can override the existing chat history or system prompt in `chat_module.ts`
- Implement `src/openai_api_protocols`, which represents all OpenAI-related data structures; largely based on [openai-node](https://github.com/openai/openai-node/blob/master/src/resources/chat/completions.ts)
- Add `examples/openai-api`, which demonstrates `chatCompletion()` for both streaming and non-streaming usage without a web worker (see the first sketch after this list)
- Support both streaming and non-streaming `chatCompletion()` in `web_worker.ts`, with example usage added to `examples/web-worker`
- For streaming with a web worker, users get an async generator whose `next()` sends/receives messages with the worker, which runs an underlying async generator that does the actual decoding (see the second sketch below)

Existing gaps from full OpenAI compatibility are listed in https://github.com/mlc-ai/web-llm/issues/276; some may be unavoidable (e.g. selecting `model` in the request) while others are WIP.

Benchmarked performance across `{WebWorker, SingleThread} X {OAI-Stream, OAI-NonStream, Generate}` shows virtually no degradation, with ±1 tok/s variation. Specifically, on an M3 Max with Llama 2 7B q4f32_1, decoding 128 tokens from a 12-token prompt yields:

- Prefill: 182 tok/s
- Decode: 48.3 tok/s
- End-to-end: 38.5 tok/s, where end-to-end is measured from creating the request to finishing everything, i.e. recorded at the highest level
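As a rough illustration of the new API surface (a minimal sketch rather than a copy of `examples/openai-api`; the import path, model id, and exact field names below are assumptions):

```typescript
// Sketch of non-streaming and streaming chatCompletion() usage.
// Import path and model id are illustrative assumptions.
import { ChatModule } from "@mlc-ai/web-llm";

async function main() {
  const chat = new ChatModule();
  await chat.reload("Llama-2-7b-chat-hf-q4f32_1"); // hypothetical model id

  // Non-streaming: resolves to a single OpenAI-style ChatCompletion object.
  const reply = await chat.chatCompletion({
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "Tell me about Llama 2." },
    ],
  });
  console.log(reply.choices[0].message.content);

  // Streaming: stream: true yields an async generator of chunks,
  // each carrying a delta, mirroring OpenAI's chunk format.
  const chunks = await chat.chatCompletion({
    stream: true,
    messages: [{ role: "user", content: "Write a haiku about browsers." }],
  });
  let text = "";
  for await (const chunk of chunks) {
    text += chunk.choices[0].delta.content ?? "";
  }
  console.log(text);
}

main();
```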
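For the web worker streaming path, the idea is that the page-side generator is only a proxy: each `next()` round-trips a message to the worker, which advances the real generator and posts the chunk back. The sketch below illustrates that bridge under assumed message shapes and names; it is not the actual `web_worker.ts` protocol.

```typescript
// --- page side: a proxy async generator whose next() talks to the worker ---
function makeProxyGenerator(worker: Worker): AsyncGenerator<string> {
  return (async function* () {
    while (true) {
      // Each next() posts a request and waits for the worker's reply.
      const reply: { value?: string; done: boolean } = await new Promise((resolve) => {
        worker.onmessage = (ev) => resolve(ev.data);
        worker.postMessage({ kind: "streamNextChunk" }); // illustrative message kind
      });
      if (reply.done) return;
      yield reply.value!;
    }
  })();
}

// --- worker side (conceptually what web_worker.ts does; compiled with the
// "webworker" lib so the bare postMessage(message) overload is available) ---
// The worker owns the real generator produced by the streaming chatCompletion()
// and advances it one step per request from the page.
let underlying: AsyncGenerator<string> | undefined;

onmessage = async (ev: MessageEvent) => {
  if (ev.data.kind === "streamNextChunk" && underlying) {
    const { value, done } = await underlying.next(); // actual decoding happens here
    postMessage({ value, done });
  }
};
```

The page then consumes `makeProxyGenerator(worker)` with a plain `for await` loop, so streaming code looks identical with and without the web worker.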