Explore projects
A high-throughput and memory-efficient inference and serving engine for LLMs
cortex.llamacpp is a high-efficiency C++ inference engine for edge computing. It is a dynamic library that can be loaded by any server at runtime.
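Since the key design point above is a dynamic library loaded by a host server at runtime, here is a minimal Python ctypes sketch of that pattern; the library path and the initialize symbol are hypothetical illustrations, not cortex.llamacpp's actual ABI.

```python
import ctypes

# Hypothetical library name and exported symbol, for illustration only;
# the real exported interface is defined by cortex.llamacpp itself.
engine = ctypes.CDLL("./libengine.so")   # load the engine at runtime
engine.initialize.restype = ctypes.c_int
status = engine.initialize()
print("engine init status:", status)
```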
MLX: An array framework for Apple silicon
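As a quick taste of the array API, a minimal sketch using MLX's Python package (mlx.core); MLX builds computations lazily, and mx.eval forces them to run.

```python
import mlx.core as mx

a = mx.array([1.0, 2.0, 3.0])
b = mx.exp(a) + a   # builds a lazy computation graph
mx.eval(b)          # forces evaluation (GPU by default on Apple silicon)
print(b)
```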
Swift API for MLX
gpu.cpp (https://github.com/AnswerDotAI/gpu.cpp)
A lightweight library for portable low-level GPU computation using WebGPU.
FFmpeg Builds for yt-dlp (forked from BtbN/FFmpeg-Builds)
cortex (https://github.com/janhq/cortex)
Drop-in, local AI alternative to the OpenAI stack. Multi-engine (llama.cpp, TensorRT-LLM, ONNX). Powers 👋 Jan.
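Since Cortex positions itself as a drop-in OpenAI alternative, a chat request against its OpenAI-compatible endpoint might look like the sketch below; the port and model id are assumptions for illustration, so check the Cortex docs for the actual defaults.

```python
import requests

# Assumed local endpoint and model id, not verified defaults.
resp = requests.post(
    "http://localhost:39281/v1/chat/completions",
    json={
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "Hello from a local stack"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```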
cortex.tensorrt-llm (https://github.com/janhq/cortex.tensorrt-llm)
Cortex.Tensorrt-LLM is a C++ inference library that can be loaded by any server at runtime. It includes NVIDIA's TensorRT-LLM as a submodule for GPU-accelerated inference on NVIDIA GPUs.
MicroPython - a lean and efficient Python implementation for microcontrollers and constrained systems
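A minimal MicroPython sketch of the classic blink loop, assuming a board whose onboard LED sits on pin 25 (e.g. a Raspberry Pi Pico); pin numbers and Pin.toggle() availability are port-specific.

```python
from machine import Pin
import time

led = Pin(25, Pin.OUT)   # pin 25 assumed; board-specific
while True:
    led.toggle()         # available on the rp2 port; use led.value(...) elsewhere
    time.sleep(0.5)
```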
Code for QuaRot, end-to-end 4-bit inference for large language models (arxiv.org/abs/2404.00456).
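For intuition only, a generic symmetric 4-bit quantize/dequantize sketch in NumPy; QuaRot's actual contribution is rotating weights and activations to suppress outliers before quantizing (see the paper), which this toy example does not implement.

```python
import numpy as np

def quantize_int4(x):
    # Symmetric int4: representable range is [-8, 7].
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(8).astype(np.float32)
q, s = quantize_int4(x)
print(np.abs(x - dequantize(q, s)).max())   # worst-case quantization error
```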
User-Mode Driver for Tenstorrent hardware
web-utils (https://github.com/Picovoice/web-utils)
Web utility functions for Picovoice web bindings and SDKs.
inference-engine (https://github.com/coqui-ai/inference-engine)
Coqui Inference Engine