Teaching
System for Artificial Intelligence
Course Description
This course focuses on the design and implementation of the systems that underpin modern artificial intelligence applications. Students will learn the design principles behind state-of-the-art machine learning systems and methods for systematic performance optimization. Core topics span the full stack of key technologies: modern AI computing hardware architectures and programming paradigms, deep learning frameworks, machine learning compilers, and cluster-scale distributed training and inference. Through theoretical study and a series of hands-on projects, students will master a systematic methodology for turning AI models into production-grade services.
Learning Objectives
- Understand Core Principles: Gain a deep understanding of the system design principles that support modern AI applications (especially Large Language Models), and systematically master the key full-stack technologies, from the underlying hardware architectures and programming paradigms to the higher-level deep learning frameworks and compilers.
- Master Optimization Techniques: Learn the key performance optimization methods for machine learning systems, including how to scale computation effectively, reduce memory footprint, and offload and schedule tasks efficiently across heterogeneous computing resources (such as CPUs, GPUs, and NPUs).
- Develop Practical Skills: Through theoretical study and a series of hands-on projects, master a systematic methodology for turning AI models into stable, efficient production-grade services, and develop the ability to design, implement, and deploy modern machine learning systems.
- Connect with Cutting-Edge Fields: Through frontier case studies such as the training and serving of Large Language Models (LLMs), become familiar with the latest technologies and challenges in industry, laying a solid foundation of knowledge and skills for future application and research in machine learning systems.
Syllabus
- 07 Machine Learning Compilation
- 11 Parallelization and Training I: Data Parallelism, ZeRO, FSDP, Pipeline Parallelism
- 12 Parallelization and Training II: Tensor, Sequence, Expert Parallelism
- 13 Parallelization and Training III: Activation Checkpointing, Mixed Precision, Checkpoint Saving
- 16 Post-Training: Reinforcement Learning for LLMs
Assessment
- 1st unexcused absence: Warning
- 2nd unexcused absence: Grade halved
- 3rd unexcused absence: Grade for this component becomes 0
- Project Proposal: 5%
- Technical Report: 20%
- Project Presentation: 20%