diff --git a/recipes/experimental/long-context/H2O/README.md b/recipes/experimental/long-context/H2O/README.md
index aace1049e5e9cfd05e671cacfcaa63ae41b20295..675e1ef68138e6014e03bccc017aa4254c6a4599 100644
--- a/recipes/experimental/long-context/H2O/README.md
+++ b/recipes/experimental/long-context/H2O/README.md
@@ -26,7 +26,7 @@ python run_summarization.py \
 
 ##### **Results**
 
-Expected results on XSUM (Rouge-2 score, ther higher the better) from the above scripts on Llama-2/3 models. The sequence length of inputs are ~2k. Here we constrains the size of KV cache, allowing only n KVs to be write/read after the prefilling stage. n ranges from **64** to **full** where we maintain all the KV pairs. With 128 KVs, the performance can be matched as the full baseline (~2k KVs) while performance degradation is observed with 64 KVs. Also, maintaining a smaller KV cache reduces the I/O cost of KVs, thus we can achieve better throughput.
+Expected results on XSUM (Rouge-2 score, the higher the better) from the above scripts on Llama-2/3 models. The sequence length of inputs are ~2k. Here we constrains the size of KV cache, allowing only n KVs to be write/read after the prefilling stage. n ranges from **64** to **full** where we maintain all the KV pairs. With 128 KVs, the performance can be matched as the full baseline (~2k KVs) while performance degradation is observed with 64 KVs. Also, maintaining a smaller KV cache reduces the I/O cost of KVs, thus we can achieve better throughput.
 
 | KV Cache Size | 64 | 128 | 256 | 512 | 1024 | Full |
 | ------------- | ------ | ------ | ------ | ------ | ------ | ------ |