    [API] Support OpenAI-like API chatCompletion for ChatModule (#300) · b386ef2e
    Charlie Ruan authored
    This PR adds an OpenAI-like API to `ChatModule`, specifically the
    `chatCompletion` API. See `examples/openai-api` for example usage, and
    the [OpenAI reference](https://platform.openai.com/docs/api-reference/chat)
    for the original API.
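
    To make the request/response shape concrete, here is a minimal,
    self-contained sketch. The interfaces below are simplified stand-ins for
    the real ones in `src/openai_api_protocols`, and `mockChatCompletion` is a
    toy stand-in for `ChatModule.chatCompletion()` so the shapes can be
    exercised without loading a model; none of these names are web-llm's
    actual exports.

    ```typescript
    // Simplified, illustrative shapes mirroring the OpenAI chat-completion
    // request/response (the actual types live in src/openai_api_protocols).
    interface ChatMessage {
      role: "system" | "user" | "assistant";
      content: string;
    }

    interface ChatCompletionRequest {
      messages: ChatMessage[];
      stream?: boolean;
      temperature?: number;
    }

    interface ChatCompletionResponse {
      choices: { message: ChatMessage }[];
    }

    // Toy stand-in for ChatModule.chatCompletion(): echoes the last user
    // message instead of running a model.
    async function mockChatCompletion(
      req: ChatCompletionRequest
    ): Promise<ChatCompletionResponse> {
      const lastUser = req.messages.filter((m) => m.role === "user").pop();
      return {
        choices: [
          {
            message: {
              role: "assistant",
              content: `echo: ${lastUser?.content}`,
            },
          },
        ],
      };
    }

    async function main() {
      const reply = await mockChatCompletion({
        messages: [
          { role: "system", content: "You are a helpful assistant." },
          { role: "user", content: "Hello!" },
        ],
      });
      console.log(reply.choices[0].message.content); // "echo: Hello!"
    }
    main();
    ```

    Note that passing `messages` in the request is what lets a caller override
    the existing chat history or system prompt, as described below.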
    
    Changes include:
    - Implement `chatCompletion()` in `ChatModule`
    - Expose conversation manipulation methods in `llm_chat.ts` so that
    `request.messages` can override existing chat history or system prompt
    in `chat_module.ts`
    - Implement `src/openai_api_protocols`, which contains all
    OpenAI-related data structures; largely based on
    [openai-node](https://github.com/openai/openai-node/blob/master/src/resources/chat/completions.ts)
    - Add `examples/openai-api` that demonstrates `chatCompletion()` for
    both streaming and non-streaming usages, without web worker
    - Support both streaming and non-streaming `chatCompletion()` in
    `web_worker.ts` with example usage added to `examples/web-worker`
    - For streaming with a web worker, users have access to an async generator
    whose `next()` sends/receives messages with the worker; the worker has an
    underlying async generator that does the actual decoding
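
    The last bullet can be sketched in a self-contained way. In web-llm the
    transport between the two sides is `postMessage`; here a plain async
    function stands in for it, and all names (`decodeChunks`, `workerHandle`,
    `clientStream`) are illustrative, not the actual `web_worker.ts` internals.

    ```typescript
    // Worker side: the generator that does the actual decoding.
    // Here it just yields canned chunks instead of decoded tokens.
    async function* decodeChunks(prompt: string): AsyncGenerator<string> {
      for (const chunk of ["Hello", ", ", "world"]) {
        yield chunk;
      }
    }

    // The worker keeps one generator per request id and steps it on demand.
    const workerGenerators = new Map<number, AsyncGenerator<string>>();

    // Stand-in for the worker's onmessage handler plus the reply transport.
    async function workerHandle(msg: {
      id: number;
      kind: "start" | "next";
      prompt?: string;
    }): Promise<{ value?: string; done: boolean }> {
      if (msg.kind === "start") {
        workerGenerators.set(msg.id, decodeChunks(msg.prompt ?? ""));
        return { done: false };
      }
      const gen = workerGenerators.get(msg.id)!;
      const step = await gen.next();
      if (step.done) workerGenerators.delete(msg.id);
      return { value: step.value, done: step.done ?? false };
    }

    let nextRequestId = 0;

    // Client side: an async generator whose next() performs one
    // request/response round-trip with the worker.
    async function* clientStream(prompt: string): AsyncGenerator<string> {
      const id = nextRequestId++;
      await workerHandle({ id, kind: "start", prompt });
      while (true) {
        const reply = await workerHandle({ id, kind: "next" });
        if (reply.done) return;
        yield reply.value!;
      }
    }

    async function main() {
      let text = "";
      for await (const chunk of clientStream("hi")) text += chunk;
      console.log(text); // "Hello, world"
    }
    main();
    ```

    The point of the design is that the client-facing object is still an
    ordinary async generator, so `for await` works identically with and
    without the web worker.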
    
    Existing gaps from full OpenAI compatibility are listed in
    https://github.com/mlc-ai/web-llm/issues/276; some may be unavoidable
    (e.g. selecting `model` in the request) while others are WIP.
    
    Benchmarked performance via `{WebWorker, SingleThread} X {OAI-Stream,
    OAI-NonStream, Generate}`: virtually no degradation, with ±1 tok/s
    variation. Specifically, on an M3 Max with Llama 2 7B q4f32_1, decoding
    128 tokens with a 12-token prompt yields:
    - Prefill: 182 tok/s
    - Decode: 48.3 tok/s
    - End-to-end: 38.5 tok/s
    - Where end-to-end is measured from creating the request until everything
    finishes, i.e. the time recorded at the highest level.