vLLM – virtual Large Language Model (LLM). vLLM was developed at UC Berkeley as “an open source library for fast LLM inference and serving” and is now a community-maintained open source project. According to Red Hat, it “is an inference server that speeds up the output of generative AI applications by making better use of the GPU memory.”
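To make “fast LLM inference and serving” concrete, here is a minimal sketch of vLLM’s offline Python API; the model name and sampling settings are placeholders chosen for illustration, and any Hugging Face model identifier supported by vLLM could be substituted.

```python
from vllm import LLM, SamplingParams

# Load a model for offline batched inference.
# "facebook/opt-125m" is just a small example model, not a recommendation.
llm = LLM(model="facebook/opt-125m")

sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

prompts = [
    "Explain what a KV cache is in one sentence.",
    "What does an inference server do?",
]

# vLLM batches these prompts internally and schedules them on the GPU.
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```

The same engine can also be exposed as an OpenAI-compatible HTTP server, which is the “serving” half of the description above.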
Red Hat says: “Essentially, vLLM works as a set of instructions that encourage the KV (key-value) cache to create shortcuts by continuously ‘batching’ user responses.” The KV cache is the “short-term memory of an LLM [which] shrinks and grows during throughput.”
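The continuous “batching” Red Hat describes can be pictured with a toy scheduler; the sketch below is a conceptual illustration only, not vLLM’s actual scheduler or memory manager, and all class and variable names are invented for this example.

```python
from collections import deque


class ToyContinuousBatcher:
    """Toy model of continuous batching: requests join and leave the running
    batch as they arrive and finish, instead of waiting for a fixed batch,
    and each request's KV-cache footprint grows while it generates and is
    freed when it completes."""

    def __init__(self, max_batch_size: int = 4):
        self.waiting = deque()        # (request_id, tokens_to_generate) not yet scheduled
        self.running = {}             # request_id -> tokens still to generate
        self.kv_cache_blocks = {}     # request_id -> cache blocks held (grows/shrinks)
        self.max_batch_size = max_batch_size

    def submit(self, request_id: int, tokens_to_generate: int) -> None:
        self.waiting.append((request_id, tokens_to_generate))

    def step(self) -> list:
        """Run one decode step and return the ids of requests that finished."""
        # Admit new requests whenever there is room (the "continuous" part).
        while self.waiting and len(self.running) < self.max_batch_size:
            rid, remaining = self.waiting.popleft()
            self.running[rid] = remaining
            self.kv_cache_blocks[rid] = 1      # cache starts small

        finished = []
        for rid in list(self.running):
            self.running[rid] -= 1             # pretend one token was generated
            self.kv_cache_blocks[rid] += 1     # KV cache grows during generation
            if self.running[rid] == 0:
                finished.append(rid)
                del self.running[rid]
                del self.kv_cache_blocks[rid]  # cache shrinks when the request ends
        return finished
```

In this picture, the KV cache “shrinks and grows during throughput” because memory is allocated token by token while a request is active and released as soon as it completes, letting new requests slot into the batch immediately.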