LLM Inference Technology Stack
January 2, 2024
Continuously updated……
Optimization Techniques
- Model Compression
  - Pruning
  - Quantization
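To make the compression entry concrete: weight quantization maps floating-point weights onto a low-bit integer grid plus a scale factor. Below is a minimal NumPy sketch of symmetric per-tensor int8 quantization; it illustrates the general idea only, not the specific scheme used by any library listed here.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ≈ scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
# Rounding to the nearest grid point bounds the error by half a step.
assert np.max(np.abs(w - w_hat)) <= scale / 2 + 1e-6
```

Production schemes (e.g. per-channel scales, GPTQ, AWQ) refine this basic recipe to reduce accuracy loss at 4-bit and below.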
- GPU Memory Optimization
  - [PagedAttention](https://doi.org/10.48550/arXiv.2309.06180)
  - Quantized K/V Cache
  - [Multi-Query Attention (MQA)](https://doi.org/10.48550/arXiv.1911.02150)
  - [Grouped-Query Attention (GQA)](https://doi.org/10.48550/arXiv.2305.13245)
  - [FlashAttention](https://doi.org/10.48550/arXiv.2205.14135)
  - [FlashAttention-2](https://doi.org/10.48550/arXiv.2307.08691)
  - [Flash-Decoding](https://crfm.stanford.edu/2023/10/12/flashdecoding.html)
  - [FlashDecoding++](https://doi.org/10.48550/arXiv.2311.01282)
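MQA and GQA shrink the K/V cache by letting several query heads share one K/V head (MQA is the extreme case of a single K/V head). A minimal NumPy sketch of the head-sharing idea, with assumed toy shapes rather than any model's real configuration:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each K/V head serves n_q_heads // n_kv_heads query heads, so the
    K/V cache shrinks by that factor versus standard multi-head attention."""
    n_q_heads, seq, d = q.shape
    group = n_q_heads // n_kv_heads
    # Broadcast each K/V head across its group of query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))
k = rng.standard_normal((2, 4, 16))  # 4x smaller K/V cache than 8 KV heads
v = rng.standard_normal((2, 4, 16))
out = grouped_query_attention(q, k, v, n_kv_heads=2)
assert out.shape == (8, 4, 16)
```

With `n_kv_heads=1` this reduces to MQA; with `n_kv_heads == n_q_heads` it is ordinary multi-head attention.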
- Scheduling Optimization
  - [Dynamic Batching](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html#dynamic-batcher)
  - [Async Serving](https://towardsdatascience.com/async-calls-for-chains-with-langchain-3818c16062ed)
  - [Iteration Batching (a.k.a. continuous batching)](https://friendli.ai/blog/llm-iteration-batching/)
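The key idea of iteration (continuous) batching is that scheduling happens per decode step rather than per request: as soon as one sequence finishes, a waiting request takes its batch slot. A toy Python simulation of the scheduler loop, where token budgets stand in for hitting EOS (names and shapes are illustrative, not any serving framework's API):

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    """Iteration-level scheduling: after every decode step, finished
    sequences leave the batch and waiting requests join immediately,
    instead of waiting for the whole static batch to drain.
    `requests` maps request id -> number of tokens left to generate."""
    waiting = deque(requests.items())
    running = {}   # id -> tokens still to generate
    trace = []     # batch composition at each iteration
    while waiting or running:
        # Admit new requests up to the batch-size limit.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        trace.append(sorted(running))
        # One decode step for every running sequence.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # slot frees up for the next iteration
    return trace

trace = continuous_batching({"A": 3, "B": 1, "C": 2}, max_batch=2)
# "C" replaces "B" as soon as "B" finishes, without waiting for "A":
# trace == [["A", "B"], ["A", "C"], ["A", "C"]]
```

Under static batching the same workload would run "C" only after both "A" and "B" completed, leaving a batch slot idle for two iterations.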
- High-Performance Kernels
  - Operator Fusion
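Operator fusion combines adjacent element-wise kernels into one, so the tensor is read and written once instead of once per operator. Real fusion happens inside a compiler or hand-written kernel (e.g. in XLA or TensorRT-LLM); the NumPy sketch below only emulates the single-pass structure with in-place ufuncs to show the memory-traffic intuition:

```python
import numpy as np

def unfused(x, bias):
    # Two "kernels": each materializes a full intermediate tensor.
    y = x + bias                 # kernel 1: read x, write y
    return np.maximum(y, 0.0)    # kernel 2: read y, write output

def fused_bias_relu(x, bias):
    # One "fused kernel": a single buffer, no separate intermediate.
    out = np.add(x, bias)
    return np.maximum(out, 0.0, out=out)  # in-place, reusing the buffer

x = np.random.randn(1024, 1024).astype(np.float32)
b = np.random.randn(1024).astype(np.float32)
# Fusion changes memory traffic, not the numerical result.
assert np.array_equal(unfused(x, b), fused_bias_relu(x, b))
```

On a GPU the fused version saves one full read and one full write of the intermediate tensor, which is why bias+activation (and similar epilogues) are standard fusion targets.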
- Model Compilation
  - [XLA](https://www.tensorflow.org/xla?hl=zh-cn)
  - [MLC LLM](https://llm.mlc.ai/)
Frameworks
- Parallel Training Frameworks
  - [DeepSpeed](https://github.com/microsoft/DeepSpeed)
  - [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
  - [Colossal-AI](https://github.com/hpcaitech/ColossalAI)
  - [Alpa](https://www.usenix.org/conference/osdi22/presentation/zheng-lianmin)
  - [GShard](https://doi.org/10.48550/arXiv.2006.16668)
  - [GSPMD](https://doi.org/10.48550/arXiv.2105.04663)
- Inference Serving Frameworks
  - [Orca](https://www.usenix.org/conference/osdi22/presentation/yu)
  - [LMDeploy](https://github.com/InternLM/lmdeploy)
  - [LightLLM](https://github.com/ModelTC/lightllm)
- Inference Acceleration Frameworks
  - [vLLM](https://github.com/vllm-project/vllm)
  - [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)
  - [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
  - [Text Generation Inference](https://github.com/huggingface/text-generation-inference)
  - [Lit-LLaMA](https://github.com/Lightning-AI/lit-llama)
  - [fastllm](https://github.com/ztxz16/fastllm)
  - [InferLLM](https://github.com/MegEngine/InferLLM)
  - [OpenPPL](https://openppl.ai/home)
  - [DeepSpeed-FastGen](https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-fastgen/chinese/README.md)
  - [ExLlama](https://github.com/turboderp/exllama)