System for Artificial Intelligence

Course Description

This course focuses on the system design and implementation techniques that underpin modern artificial intelligence applications. Students will learn the design principles behind state-of-the-art machine learning systems and systematic performance optimization. Core topics cover key technologies across the full stack, from modern AI computing hardware architectures and programming paradigms, through deep learning frameworks and compilers, to distributed training and inference on clusters. Through theoretical study and a series of hands-on projects, students will master a systematic methodology for turning AI models into production-grade services.

Learning Objectives

  1. Understand Core Principles: Gain a deep understanding of the system design principles that support modern AI applications (especially Large Language Models) and systematically master the key full-stack technologies, from the underlying hardware architecture and programming paradigms to the upper-level deep learning frameworks and compilers.
  2. Master Optimization Techniques: Learn and master key performance optimization methods for machine learning systems, including how to effectively scale computation, reduce memory footprint, and perform efficient task offloading and scheduling on heterogeneous computing resources (such as CPUs, GPUs, and NPUs).
  3. Develop Practical Skills: Through theoretical study and a series of hands-on projects, master the systematic methodology for transforming AI models into stable and efficient production-grade services, and possess the ability to design, implement, and deploy modern machine learning systems.
  4. Connect with the Cutting Edge: Through frontier case studies such as the training and serving of Large Language Models (LLMs), students will become familiar with the latest technologies and challenges in industry, laying a solid foundation of knowledge and skills for future application and research in the field of machine learning systems.

Syllabus

I Foundations (3 lectures)
  • 00 Course Introduction
  • 01 Introduction to System for AI
  • 02 Automatic Differentiation and Deep Learning Framework
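The core idea of lecture 02, reverse-mode automatic differentiation, can be shown with a minimal toy sketch (a micrograd-style scalar graph for illustration only, not the implementation of any particular framework):

```python
# Toy reverse-mode autodiff: each op records its parents and a closure that
# propagates gradients backward via the chain rule.
class Value:
    def __init__(self, data, _parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = _parents
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad       # d(out)/d(self) = 1
            other.grad += out.grad      # d(out)/d(other) = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

x = Value(2.0)
y = Value(3.0)
z = x * y + x          # z = x*y + x
z.backward()
print(x.grad, y.grad)  # dz/dx = y + 1 = 4.0, dz/dy = x = 2.0
```

Frameworks such as PyTorch apply the same recipe to tensors, with the graph built dynamically as operations execute.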
II Hardware Acceleration & Programming (4 lectures)
  • 03 Hardware Acceleration
  • 04 GPU Architecture and CUDA Programming
  • 05 CUDA Case Study: Matrix Multiplication on GPU
  • 06 NPU Architecture and Ascend C Programming
III Machine Learning Compilation (1 lecture)
  • 07 Machine Learning Compilation
IV LLM Fundamentals & Distributed Computing (3 lectures)
  • 08 Introduction to LLMs and General Optimizations
  • 09 Attention Optimizations (FlashAttention, FlashMLA)
  • 10 Introduction to Distributed Computing
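The numerical trick behind FlashAttention (lecture 09) is the online softmax: attention can be computed tile by tile over K/V while rescaling running statistics, so the full score matrix never needs to be materialized. A NumPy sketch of the idea (illustrative only, not a fused GPU kernel):

```python
import numpy as np

def naive_attention(Q, K, V):
    # Standard attention: materializes the full n x n score matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    # Online-softmax attention: stream over K/V tiles, keeping only a
    # running max m, running denominator l, and unnormalized output O.
    d, n = Q.shape[-1], K.shape[0]
    O = np.zeros_like(Q)
    m = np.full(Q.shape[0], -np.inf)   # running row max
    l = np.zeros(Q.shape[0])           # running softmax denominator
    for j in range(0, n, block):
        S = Q @ K[j:j+block].T / np.sqrt(d)
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)      # rescale factor for old statistics
        P = np.exp(S - m_new[:, None])
        l = alpha * l + P.sum(axis=-1)
        O = alpha[:, None] * O + P @ V[j:j+block]
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V))
```

On a GPU, the tiles live in fast on-chip memory, which is what turns this algebraic identity into a large bandwidth saving.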
V LLM Parallelization & Training Techniques (3 lectures)
  • 11 Parallelization and Training I: Data Parallelism, ZeRO, FSDP, Pipeline Parallelism
  • 12 Parallelization and Training II: Tensor, Sequence, Expert Parallelism
  • 13 Parallelization and Training III: Activation Checkpointing, Mixed Precision, Checkpoint Saving
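Data parallelism, the starting point of lecture 11, can be simulated in a few lines: each worker holds a weight replica, computes a gradient on its shard of the batch, and the gradients are all-reduced (averaged). For a mean loss and equal shard sizes, the averaged gradient equals the full-batch gradient (a NumPy sketch; real systems use NCCL/collective communication rather than a list):

```python
import numpy as np

def grad_mse(w, X, y):
    # Gradient of mean((X @ w - y)**2): 2/N * X.T @ (X @ w - y)
    return 2.0 / len(X) * X.T @ (X @ w - y)

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 4))
y = rng.standard_normal(32)
w = rng.standard_normal(4)

full_grad = grad_mse(w, X, y)          # single-worker reference

n_workers = 4
shards = zip(np.split(X, n_workers), np.split(y, n_workers))
local_grads = [grad_mse(w, Xs, ys) for Xs, ys in shards]
allreduced = np.mean(local_grads, axis=0)   # simulated all-reduce

assert np.allclose(full_grad, allreduced)
```

ZeRO and FSDP start from this picture and additionally shard the optimizer state, gradients, and parameters themselves across the workers to cut per-device memory.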
VI LLM Serving Techniques (2 lectures)
  • 14 Serving I: Prefill/Decode, KV Cache, Paged KV Cache
  • 15 Serving II: Quantization, Continuous Batching, Speculative Decoding
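The KV cache of lecture 14 exploits the fact that in autoregressive decoding, each new token's attention needs only its own query against the keys/values of the prefix, which can be cached instead of recomputed every step. A single-head NumPy sketch (illustrative only; `Wq`/`Wk`/`Wv` are toy projection matrices):

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    s = q @ K.T / np.sqrt(d)
    p = np.exp(s - s.max())
    return (p / p.sum()) @ V

tokens = rng.standard_normal((5, d))   # hidden states, one per decode step

# Without a cache: step t re-projects K/V for the whole prefix.
no_cache = [attend(x @ Wq, tokens[:t+1] @ Wk, tokens[:t+1] @ Wv)
            for t, x in enumerate(tokens)]

# With a cache: each step projects only the new token and appends it.
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
cached = []
for x in tokens:
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    cached.append(attend(x @ Wq, K_cache, V_cache))

assert np.allclose(no_cache, cached)
```

Paged KV caching (as in vLLM) keeps this cache in fixed-size blocks so that memory for many concurrent requests can be allocated and freed like virtual-memory pages.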
VII Post-Training & Project (1 lecture)
  • 16 Post-Training: Reinforcement Learning for LLMs

Assessment

Class Participation (10%)
  • 1st unexcused absence: warning
  • 2nd unexcused absence: grade halved
  • 3rd unexcused absence: grade becomes 0

Assignments (45%)
Three programming assignments, each worth 15%.

Course Project (45%)
Groups of 2-3 students:
  • Proposal: 5%
  • Technical Report: 20%
  • Presentation: 20%

References & Resources