psmarter/mini-infer

by psmarter · Quantization & Formats · updated 1mo ago

An LLM inference engine written from scratch, with paged KV cache, continuous batching, chunked prefill, prefix caching, speculative decoding, CUDA graphs, tensor parallelism, and OpenAI-compatible serving.

270 stars · rank #131

continuous-batching · cuda · inference · inference-engine · kv-cache · language-model · llm · machine-learning · moe · pagedattention · pytorch · quantization
