System for Artificial Intelligence

Course Description

This course focuses on the system design and implementation techniques that underpin modern artificial intelligence applications. Students will learn the design principles behind state-of-the-art machine learning systems and systematic performance optimization. Core topics cover key technologies across the full stack, from modern AI computing hardware architectures and programming paradigms, through deep learning frameworks and compilers, to distributed training and inference on clusters. Through theoretical study and a series of hands-on projects, students will master a systematic methodology for turning AI models into production-grade services.

Learning Objectives

  1. Understand Core Principles: Gain a deep understanding of the system design principles that support modern AI applications (especially Large Language Models) and systematically master the key full-stack technologies, from the underlying hardware architecture and programming paradigms to the upper-level deep learning frameworks and compilers.
  2. Master Optimization Techniques: Learn and master key performance optimization methods for machine learning systems, including how to effectively scale computation, reduce memory footprint, and perform efficient task offloading and scheduling on heterogeneous computing resources (such as CPUs, GPUs, and NPUs).
  3. Develop Practical Skills: Through theoretical study and a series of hands-on projects, master the systematic methodology for transforming AI models into stable and efficient production-grade services, and possess the ability to design, implement, and deploy modern machine learning systems.
  4. Connect with the Cutting Edge: Through frontier case studies such as the training and serving of Large Language Models (LLMs), students will become familiar with the latest technologies and challenges in industry, laying a solid foundation of knowledge and skills for future application and research in the field of machine learning systems.

Syllabus

I Foundations (3 lectures)
  • 00 Course Introduction
  • 01 Introduction to System for AI
  • 02 Automatic Differentiation and Deep Learning Framework
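The core idea of lecture 02, reverse-mode automatic differentiation, can be shown with a minimal toy sketch (a micrograd-style scalar graph for illustration only, not the implementation of any particular framework):

```python
# Toy reverse-mode autodiff: each op records its parents and a closure that
# propagates gradients backward via the chain rule.
class Value:
    def __init__(self, data, _parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = _parents
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad       # d(out)/d(self) = 1
            other.grad += out.grad      # d(out)/d(other) = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

x = Value(2.0)
y = Value(3.0)
z = x * y + x          # z = x*y + x
z.backward()
print(x.grad, y.grad)  # dz/dx = y + 1 = 4.0, dz/dy = x = 2.0
```

Frameworks such as PyTorch apply the same recipe to tensors, with the graph built dynamically as operations execute.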
II Hardware Acceleration & Programming (4 lectures)
  • 03 Hardware Acceleration
  • 04 GPU Architecture and CUDA Programming
  • 05 CUDA Case Study: Matrix Multiplication on GPU
  • 06 NPU Architecture and Ascend C Programming
III Machine Learning Compilation (1 lecture)
  • 07 Machine Learning Compilation
IV LLM Fundamentals & Distributed Computing (3 lectures)
  • 08 Introduction to LLMs and General Optimizations
  • 09 Attention Optimizations (FlashAttention, FlashMLA)
  • 10 Introduction to Distributed Computing
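The numerical trick behind FlashAttention (lecture 09) is the online softmax: attention can be computed tile by tile over K/V while rescaling running statistics, so the full score matrix never needs to be materialized. A NumPy sketch of the idea (illustrative only, not a fused GPU kernel):

```python
import numpy as np

def naive_attention(Q, K, V):
    # Standard attention: materializes the full n x n score matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    # Online-softmax attention: stream over K/V tiles, keeping only a
    # running max m, running denominator l, and unnormalized output O.
    d, n = Q.shape[-1], K.shape[0]
    O = np.zeros_like(Q)
    m = np.full(Q.shape[0], -np.inf)   # running row max
    l = np.zeros(Q.shape[0])           # running softmax denominator
    for j in range(0, n, block):
        S = Q @ K[j:j+block].T / np.sqrt(d)
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)      # rescale factor for old statistics
        P = np.exp(S - m_new[:, None])
        l = alpha * l + P.sum(axis=-1)
        O = alpha[:, None] * O + P @ V[j:j+block]
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V))
```

On a GPU, the tiles live in fast on-chip memory, which is what turns this algebraic identity into a large bandwidth saving.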
V LLM Parallelization & Training Techniques (3 lectures)
  • 11 Parallelization and Training I: Data Parallelism, ZeRO, FSDP, Pipeline Parallelism
  • 12 Parallelization and Training II: Tensor, Sequence, Expert Parallelism
  • 13 Parallelization and Training III: Activation Checkpointing, Mixed Precision, Checkpoint Saving
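Data parallelism, the starting point of lecture 11, can be simulated in a few lines: each worker holds a weight replica, computes a gradient on its shard of the batch, and the gradients are all-reduced (averaged). For a mean loss and equal shard sizes, the averaged gradient equals the full-batch gradient (a NumPy sketch; real systems use NCCL/collective communication rather than a list):

```python
import numpy as np

def grad_mse(w, X, y):
    # Gradient of mean((X @ w - y)**2): 2/N * X.T @ (X @ w - y)
    return 2.0 / len(X) * X.T @ (X @ w - y)

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 4))
y = rng.standard_normal(32)
w = rng.standard_normal(4)

full_grad = grad_mse(w, X, y)          # single-worker reference

n_workers = 4
shards = zip(np.split(X, n_workers), np.split(y, n_workers))
local_grads = [grad_mse(w, Xs, ys) for Xs, ys in shards]
allreduced = np.mean(local_grads, axis=0)   # simulated all-reduce

assert np.allclose(full_grad, allreduced)
```

ZeRO and FSDP start from this picture and additionally shard the optimizer state, gradients, and parameters themselves across the workers to cut per-device memory.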
VI LLM Serving Techniques (2 lectures)
  • 14 Serving I: Prefill/Decode, KV Cache, Paged KV Cache
  • 15 Serving II: Quantization, Continuous Batching, Speculative Decoding
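The KV cache of lecture 14 exploits the fact that in autoregressive decoding, each new token's attention needs only its own query against the keys/values of the prefix, which can be cached instead of recomputed every step. A single-head NumPy sketch (illustrative only; `Wq`/`Wk`/`Wv` are toy projection matrices):

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    s = q @ K.T / np.sqrt(d)
    p = np.exp(s - s.max())
    return (p / p.sum()) @ V

tokens = rng.standard_normal((5, d))   # hidden states, one per decode step

# Without a cache: step t re-projects K/V for the whole prefix.
no_cache = [attend(x @ Wq, tokens[:t+1] @ Wk, tokens[:t+1] @ Wv)
            for t, x in enumerate(tokens)]

# With a cache: each step projects only the new token and appends it.
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
cached = []
for x in tokens:
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    cached.append(attend(x @ Wq, K_cache, V_cache))

assert np.allclose(no_cache, cached)
```

Paged KV caching (as in vLLM) keeps this cache in fixed-size blocks so that memory for many concurrent requests can be allocated and freed like virtual-memory pages.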
VII Post-Training & Project (1 lecture)
  • 16 Post-Training: Reinforcement Learning for LLMs

Assessment

Class Participation (10%)
  • 1st unexcused absence: warning
  • 2nd unexcused absence: grade halved
  • 3rd unexcused absence: grade becomes 0

Assignments (45%)
Three programming assignments, each worth 15%.

Course Project (45%)
Groups of 2-3 students:
  • Proposal: 5%
  • Technical Report: 20%
  • Presentation: 20%

References & Resources