LLM Inference Technology Stack
January 2, 2024
Continuously updated……
Optimization Techniques
- Model Compression
  - Pruning
  - Quantization
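To make the compression entry concrete: weight quantization maps floating-point weights onto a low-bit integer grid plus a scale factor. Below is a minimal NumPy sketch of symmetric per-tensor int8 quantization; it illustrates the general idea only, not the specific scheme used by any library listed here.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ≈ scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
# Rounding to the nearest grid point bounds the error by half a step.
assert np.max(np.abs(w - w_hat)) <= scale / 2 + 1e-6
```

Production schemes (e.g. per-channel scales, GPTQ, AWQ) refine this basic recipe to reduce accuracy loss at 4-bit and below.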
- GPU Memory Optimization
  - [PagedAttention](https://doi.org/10.48550/arXiv.2309.06180)
  - Quantized K/V Cache
  - [Multi-Query Attention (MQA)](https://doi.org/10.48550/arXiv.1911.02150)
  - [Grouped-Query Attention (GQA)](https://doi.org/10.48550/arXiv.2305.13245)
  - [FlashAttention](https://doi.org/10.48550/arXiv.2205.14135)
  - [FlashAttention-2](https://doi.org/10.48550/arXiv.2307.08691)
  - [Flash-Decoding](https://crfm.stanford.edu/2023/10/12/flashdecoding.html)
  - [FlashDecoding++](https://doi.org/10.48550/arXiv.2311.01282)
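MQA and GQA shrink the K/V cache by letting several query heads share one K/V head (MQA is the extreme case of a single K/V head). A minimal NumPy sketch of the head-sharing idea, with assumed toy shapes rather than any model's real configuration:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each K/V head serves n_q_heads // n_kv_heads query heads, so the
    K/V cache shrinks by that factor versus standard multi-head attention."""
    n_q_heads, seq, d = q.shape
    group = n_q_heads // n_kv_heads
    # Broadcast each K/V head across its group of query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))
k = rng.standard_normal((2, 4, 16))  # 4x smaller K/V cache than 8 KV heads
v = rng.standard_normal((2, 4, 16))
out = grouped_query_attention(q, k, v, n_kv_heads=2)
assert out.shape == (8, 4, 16)
```

With `n_kv_heads=1` this reduces to MQA; with `n_kv_heads == n_q_heads` it is ordinary multi-head attention.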
- Scheduling Optimization
  - [Dynamic Batching](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html#dynamic-batcher)
  - [Async Serving](https://towardsdatascience.com/async-calls-for-chains-with-langchain-3818c16062ed)
  - [Iteration Batching (a.k.a. continuous batching)](https://friendli.ai/blog/llm-iteration-batching/)
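The key idea of iteration (continuous) batching is that scheduling happens per decode step rather than per request: as soon as one sequence finishes, a waiting request takes its batch slot. A toy Python simulation of the scheduler loop, where token budgets stand in for hitting EOS (names and shapes are illustrative, not any serving framework's API):

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    """Iteration-level scheduling: after every decode step, finished
    sequences leave the batch and waiting requests join immediately,
    instead of waiting for the whole static batch to drain.
    `requests` maps request id -> number of tokens left to generate."""
    waiting = deque(requests.items())
    running = {}   # id -> tokens still to generate
    trace = []     # batch composition at each iteration
    while waiting or running:
        # Admit new requests up to the batch-size limit.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        trace.append(sorted(running))
        # One decode step for every running sequence.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # slot frees up for the next iteration
    return trace

trace = continuous_batching({"A": 3, "B": 1, "C": 2}, max_batch=2)
# "C" replaces "B" as soon as "B" finishes, without waiting for "A":
# trace == [["A", "B"], ["A", "C"], ["A", "C"]]
```

Under static batching the same workload would run "C" only after both "A" and "B" completed, leaving a batch slot idle for two iterations.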
- High-Performance Kernels
  - Operator Fusion
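Operator fusion combines adjacent element-wise kernels into one, so the tensor is read and written once instead of once per operator. Real fusion happens inside a compiler or hand-written kernel (e.g. in XLA or TensorRT-LLM); the NumPy sketch below only emulates the single-pass structure with in-place ufuncs to show the memory-traffic intuition:

```python
import numpy as np

def unfused(x, bias):
    # Two "kernels": each materializes a full intermediate tensor.
    y = x + bias                 # kernel 1: read x, write y
    return np.maximum(y, 0.0)    # kernel 2: read y, write output

def fused_bias_relu(x, bias):
    # One "fused kernel": a single buffer, no separate intermediate.
    out = np.add(x, bias)
    return np.maximum(out, 0.0, out=out)  # in-place, reusing the buffer

x = np.random.randn(1024, 1024).astype(np.float32)
b = np.random.randn(1024).astype(np.float32)
# Fusion changes memory traffic, not the numerical result.
assert np.array_equal(unfused(x, b), fused_bias_relu(x, b))
```

On a GPU the fused version saves one full read and one full write of the intermediate tensor, which is why bias+activation (and similar epilogues) are standard fusion targets.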
- Model Compilation
  - [XLA](https://www.tensorflow.org/xla?hl=zh-cn)
  - [MLC LLM](https://llm.mlc.ai/)
Frameworks
- Parallel Training Frameworks
  - [DeepSpeed](https://github.com/microsoft/DeepSpeed)
  - [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
  - [Colossal-AI](https://github.com/hpcaitech/ColossalAI)
  - [Alpa](https://www.usenix.org/conference/osdi22/presentation/zheng-lianmin)
  - [GShard](https://doi.org/10.48550/arXiv.2006.16668)
  - [GSPMD](https://doi.org/10.48550/arXiv.2105.04663)
- Inference Serving Frameworks
  - [Orca](https://www.usenix.org/conference/osdi22/presentation/yu)
  - [LMDeploy](https://github.com/InternLM/lmdeploy)
  - [LightLLM](https://github.com/ModelTC/lightllm)
- Inference Acceleration Frameworks
  - [vLLM](https://github.com/vllm-project/vllm)
  - [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)
  - [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
  - [Text Generation Inference](https://github.com/huggingface/text-generation-inference)
  - [Lit-LLaMA](https://github.com/Lightning-AI/lit-llama)
  - [fastllm](https://github.com/ztxz16/fastllm)
  - [InferLLM](https://github.com/MegEngine/InferLLM)
  - [OpenPPL](https://openppl.ai/home)
  - [DeepSpeed-FastGen](https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-fastgen/chinese/README.md)
  - [ExLlama](https://github.com/turboderp/exllama)