- Optimization and Tuning - vLLM
By increasing this utilization, you can provide more KV cache space. Decrease max_num_seqs or max_num_batched_tokens: this reduces the number of concurrent requests in a batch, thereby requiring less KV cache space. Increase tensor_parallel_size: this shards model weights across GPUs, allowing each GPU to have more memory available for the KV cache.
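A minimal sketch of how these knobs map onto vLLM's offline LLM API, assuming a recent release where the constructor forwards these engine arguments; the model name and values are placeholders for illustration, not recommendations:

```python
from vllm import LLM

# Illustrative values only; tune for your hardware and workload.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.95,    # larger fraction of GPU memory for weights, activations, and KV cache
    max_num_seqs=64,                # fewer concurrent sequences per batch -> less KV cache required
    max_num_batched_tokens=8192,    # cap on tokens scheduled per engine step
    tensor_parallel_size=2,         # shard weights across 2 GPUs, freeing memory for KV cache on each
)
```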
- Default vLLM needs to allocate for the full KVCache #9525
Default vLLM needs to allocate for the full KV cache. I thought vLLM's dynamic management of memory would allow the GPU memory to handle as much KV cache as it could, i.e. growing the KV cache one page at a time as the KV cache grows and constantly deleting fini…
- Which arguments affect GPU memory - General - vLLM Forums
The memory usage on each GPU can be influenced by several factors beyond just the model weights. According to vLLM documentation, the gpu_memory_utilization parameter controls the fraction of GPU memory used for the model executor, which includes model weights, activations, and KV cache. Additionally, enabling CUDA graphs or using certain quantization methods can increase memory usage.
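If CUDA graph capture is what pushes a GPU over the limit, one commonly used lever is eager mode; a sketch with a placeholder model, assuming the standard enforce_eager engine argument:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.90,
    enforce_eager=True,  # skip CUDA graph capture, trading some throughput for lower memory use
)
```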
- About monitor the usage of KV cache memory - General - vLLM Forums
For real-time monitoring, vLLM does not provide a built-in API to directly report live KV cache usage, but you can observe overall GPU memory usage with tools like nvidia-smi and infer KV cache utilization from the difference after model loading and during inference. The logs will show lines like: “model weights take X GiB; … the rest of …”
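As a programmatic stand-in for nvidia-smi, the NVML bindings (nvidia-ml-py) can take the before/after snapshots described above; this diffing approach is my own sketch, not a vLLM API:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

def used_gib() -> float:
    """Currently used memory on GPU 0, in GiB."""
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    return info.used / 1024**3

baseline = used_gib()  # take this snapshot right after the model has loaded
# ... run inference here ...
print(f"Memory delta vs. post-load baseline: {used_gib() - baseline:.2f} GiB")
```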
- Conserving Memory - vLLM
Conserving Memory: Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem. Tensor Parallelism (TP): tensor parallelism (the tensor_parallel_size option) can be used to split the model across multiple GPUs. The following code splits the model across 2 GPUs.
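The code referenced by that excerpt did not survive extraction; a minimal sketch of the pattern it describes, with a placeholder model name, is:

```python
from vllm import LLM

# tensor_parallel_size=2 splits the model weights across two GPUs.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder; substitute your own model
    tensor_parallel_size=2,
)
```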
- Why does GPU memory increase when setting num-gpu-blocks? · vllm . . .
I've set --num-gpu-blocks-override directly so that vLLM preallocates 90% of my GPU on startup (model weights and KV cache). In addition, I've set --gpu-memory-utilization to 0.9. However, under load the system keeps allocating more and more memory and eventually hits an OOM.
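For context, those two flags correspond to the num_gpu_blocks_override and gpu_memory_utilization engine arguments, which can also be set from Python; the sketch below mirrors the reporter's setup with a hypothetical block count and placeholder model, and is illustrative only:

```python
from vllm import LLM

# Mirrors the configuration described in the issue; values are illustrative, not recommendations.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.9,      # --gpu-memory-utilization 0.9
    num_gpu_blocks_override=8192,    # --num-gpu-blocks-override; hypothetical block count
)
```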
- Multimodal inference guideline? - General - vLLM Forums
If you set max_model_len=30000 but still get cut-off output, it’s likely your GPU does not have enough memory to allocate a KV cache for such a long context, so vLLM silently reduces the effective context length or fails to allocate enough cache. You can check the logs for lines like “The model’s max seq len (30000) is larger than the …”
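One way to make a 30k-token context fit, sketched here with placeholder values and model, is to set max_model_len explicitly and give the executor a larger memory fraction so the KV cache allocation succeeds:

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",  # placeholder multimodal model
    max_model_len=30000,                # the context length you actually need
    gpu_memory_utilization=0.95,        # leave more room for the KV cache at this length
)
```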