Ruihang Lai authored
This PR introduces CUDA IPC memory support in the TVM runtime. IPC memory allows multiple distributed workers to directly access each other's GPU memory. This functionality is helpful for implementing customized communication primitives across distributed workers. In this PR, we bring the customized all-reduce implementation from TensorRT-LLM into 3rdparty. This all-reduce implementation makes use of CUDA IPC memory. We expose the all-reduce function as a global function under the namespace `tvm::runtime::disco::cuda_ipc`. A unit test for the customized all-reduce kernel over two workers is added. --- Co-authored-by: Hongyi Jin <hongyij@andrew.cmu.edu>
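For reference, a minimal sketch of how the exposed all-reduce might be looked up from Python; the registered function name `runtime.disco.cuda_ipc.custom_allreduce` is an assumption for illustration, not confirmed by this PR:

```python
import tvm

# Look up the custom all-reduce registered by the runtime. The name below is a
# guess for illustration; check the C++ sources under the
# tvm::runtime::disco::cuda_ipc namespace for the exact registered name.
allreduce = tvm.get_global_func(
    "runtime.disco.cuda_ipc.custom_allreduce", allow_missing=True
)
if allreduce is None:
    print("Custom all-reduce not registered (TVM built without CUDA IPC support?)")
```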