inline::vllm
Description
vLLM inference provider for high-performance model serving with PagedAttention and continuous batching.
Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `tensor_parallel_size` | `int` | No | 1 | Number of tensor parallel replicas (number of GPUs to use). |
| `max_tokens` | `int` | No | 4096 | Maximum number of tokens to generate. |
| `max_model_len` | `int` | No | 4096 | Maximum context length to use during serving. |
| `max_num_seqs` | `int` | No | 4 | Maximum parallel batch size for generation. |
| `enforce_eager` | `bool` | No | False | Whether to use eager mode for inference (otherwise CUDA graphs are used). |
| `gpu_memory_utilization` | `float` | No | 0.3 | Fraction of GPU memory to allocate once this provider has finished loading, including memory that was already allocated before loading. |
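These settings correspond closely to vLLM's own engine arguments. The sketch below shows one plausible mapping, assuming the standard `vllm` Python API; the model name is a placeholder and the provider's actual wiring may differ.

```python
# Illustrative mapping of the configuration fields onto vLLM's Python API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    tensor_parallel_size=1,       # tensor_parallel_size: GPUs to shard across
    max_model_len=4096,           # max_model_len: context length used during serving
    max_num_seqs=4,               # max_num_seqs: parallel batch size
    enforce_eager=False,          # enforce_eager: skip CUDA graph capture when True
    gpu_memory_utilization=0.3,   # gpu_memory_utilization: fraction of GPU memory to use
)

# max_tokens is applied per request, via sampling parameters.
params = SamplingParams(max_tokens=4096)
outputs = llm.generate(["Hello, world"], params)
print(outputs[0].outputs[0].text)
```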
Sample Configuration
```yaml
tensor_parallel_size: ${env.TENSOR_PARALLEL_SIZE:=1}
max_tokens: ${env.MAX_TOKENS:=4096}
max_model_len: ${env.MAX_MODEL_LEN:=4096}
max_num_seqs: ${env.MAX_NUM_SEQS:=4}
enforce_eager: ${env.ENFORCE_EAGER:=False}
gpu_memory_utilization: ${env.GPU_MEMORY_UTILIZATION:=0.3}
```
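Each `${env.VAR:=default}` placeholder reads as "use the environment variable if it is set, otherwise fall back to the default". A minimal Python sketch of that resolution, assuming this conventional reading of the syntax:

```python
import os

# Illustrative resolution of the ${env.VAR:=default} placeholders above,
# assuming "env var if set, else default" semantics.
def resolve(var: str, default: str) -> str:
    return os.environ.get(var, default)

config = {
    "tensor_parallel_size": int(resolve("TENSOR_PARALLEL_SIZE", "1")),
    "max_tokens": int(resolve("MAX_TOKENS", "4096")),
    "max_model_len": int(resolve("MAX_MODEL_LEN", "4096")),
    "max_num_seqs": int(resolve("MAX_NUM_SEQS", "4")),
    "enforce_eager": resolve("ENFORCE_EAGER", "False").lower() in ("1", "true"),
    "gpu_memory_utilization": float(resolve("GPU_MEMORY_UTILIZATION", "0.3")),
}
print(config)
```

For example, exporting `TENSOR_PARALLEL_SIZE=2` before starting the stack would shard the model across two GPUs while the remaining fields keep their defaults.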