Logistics
Date: Friday, May 23rd, 2025
Location: Northwestern University, Mudd Library, 3rd floor (Room 3514), 2233 Tech Dr, Evanston, IL 60208.
Parking: Attendees driving to the workshop can park in the North Campus Parking Garage, 2311 N Campus Dr #2300, Evanston, IL 60208. Campus map: https://maps.northwestern.edu/
Free parking passes for the designated NU parking garage will be provided at the workshop. Please remember to ask for a pass before leaving the workshop.
Viewing Link: https://northwestern.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=aaa69a4f-b0f0-4e7d-9401-b2e5013ded91
YouTube:
Description:
This workshop will bring together researchers and practitioners to discuss recent advances in energy-efficient machine learning (ML). As ML models grow in scale and complexity, optimizing their energy consumption has become a critical research challenge. Topics will include model compression, quantization, hardware-aware neural architectures, sustainable AI frameworks, and energy-efficient inference techniques. The workshop will feature invited talks and ample time for networking.
8:30 Coffee and Pastries
8:45-9:00 Opening and introductions
9:00-9:45 Keynote 1: Zhiyu Cheng (NVIDIA)
9:45-10:30 Inna Partin-Vaisband (UIC)
10:30-10:45 Break
10:45-11:30 Keynote 2: Manzil Zaheer (Google Research)
11:30-12:15 Kexin Pei (U Chicago)
12:15-1:45 Lunch
1:45-2:30 Keynote 3: Mosharaf Chowdhury (U Michigan)
2:30-3:15 Bing Liu (UIC)
3:15-3:30 Break
3:30-4:15 Tian Li (U Chicago)
4:15-5:00 Poster session
Organizers:
- Natasha Devroye (UIC)
- Ermin Wei (Northwestern University)
- Ren Wang (IIT)
- Tian Li (University of Chicago)
Abstracts:
Speaker: Zhiyu Cheng
Title: FP4 quantization and its real-world applications on LLMs and diffusion models
Abstract: As large language models (LLMs) and diffusion models grow in complexity, efficient inference has become a pressing concern. In this talk, we introduce FP4 quantization, a low-precision quantization technique that substantially reduces memory usage and computational costs with minimal accuracy trade-offs. We begin by discussing the FP4 numerical format on NVIDIA Blackwell GPUs. Next, we delve into the quantization workflow, highlighting both post-training quantization (PTQ) and quantization-aware training (QAT) algorithms, along with practical recipes and best practices for successful implementation on LLMs and diffusion models. We then present quantitative and qualitative results to illustrate FP4 quantization's impact on real-world generative AI applications. Finally, we introduce the NVIDIA TensorRT Model Optimizer, detailing its capabilities for FP4 quantization and streamlined deployment through inference frameworks such as TensorRT-LLM, SGLang, and vLLM.
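To make the format concrete: FP4 in the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit) represents only the magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}, so quantizing a tensor amounts to scaling it into that range and rounding to the nearest representable value. The NumPy sketch below simulates this ("fake quantization"); it illustrates the general idea only, not NVIDIA's implementation, and its per-tensor scale is a simplifying assumption (production recipes typically use finer-grained, per-block scales).

    import numpy as np

    # Magnitudes representable in FP4 E2M1 (sign, 2 exponent bits, 1 mantissa bit).
    FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

    def fake_quantize_fp4(w: np.ndarray) -> np.ndarray:
        """Simulate FP4 quantization: scale into [-6, 6], snap to grid, rescale."""
        scale = np.abs(w).max() / 6.0  # per-tensor scale; real recipes use per-block scales
        if scale == 0.0:
            return w.copy()
        magnitudes = np.abs(w) / scale
        # Round each magnitude to the nearest representable FP4 value.
        idx = np.abs(magnitudes[..., None] - FP4_GRID).argmin(axis=-1)
        return np.sign(w) * FP4_GRID[idx] * scale

    w = np.random.randn(4, 4).astype(np.float32)
    print(np.abs(w - fake_quantize_fp4(w)).max())  # worst-case rounding error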
Bio: Zhiyu Cheng is a manager at NVIDIA, where he drives algorithm and software development to optimize deep learning model inference, including large language models (LLMs), vision language models (VLMs), and diffusion models, on NVIDIA's latest platforms. He has over 11 years of industry experience in efficient deep learning, spanning roles at NXP, Xilinx, Baidu, and OmniML (acquired by NVIDIA). Zhiyu has a record of over 30 published papers and patents. He holds a Ph.D. in electrical and computer engineering from the University of Illinois Chicago, with a thesis in the field of information theory.
************
Speaker: Inna Partin-Vaisband
Title: Analog AI at the Edge: Training to the Rescue
Abstract: Analog and mixed‑signal integrated circuits offer compact, energy‑efficient AI inference at the edge by eliminating costly data transfers and memory bottlenecks. These advantages, however, are challenged by sensitivity to process‑voltage‑temperature variations, device noise, and analog non‑idealities. This talk presents an online training framework that continuously calibrates analog models on‑chip, compensating for variations and noise without ADC/DAC overhead. Demonstrated on a multilayer perceptron for image classification, the method achieves accuracy comparable to a 6‑bit resolution digital classifier with only a fraction of the power and area. The approach generalizes to convolutional neural networks and complex benchmarks—enabling robust, energy‑efficient edge AI for diverse applications. Future directions toward deeper networks, advanced devices, and system‑level integration will also be discussed.
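On-chip calibration is a hardware technique, but a common software analogue conveys the intuition: inject noise that emulates analog non-idealities into the forward pass during training, so the learned weights tolerate similar perturbations at inference time. The PyTorch sketch below shows that generic noise-injection idea; it is not the on-chip framework described above, and the Gaussian noise model and noise_std value are assumptions for illustration.

    import torch
    import torch.nn as nn

    class NoisyLinear(nn.Linear):
        """Linear layer that perturbs its weights during training to emulate
        analog weight noise (a crude software stand-in for PVT variation and
        device noise; illustrative only)."""
        def __init__(self, in_features, out_features, noise_std=0.05):
            super().__init__(in_features, out_features)
            self.noise_std = noise_std

        def forward(self, x):
            if self.training:
                noisy_w = self.weight + self.noise_std * torch.randn_like(self.weight)
                return nn.functional.linear(x, noisy_w, self.bias)
            return super().forward(x)

    # A small MLP trained with these layers learns weights robust to the injected noise.
    model = nn.Sequential(NoisyLinear(784, 128), nn.ReLU(), NoisyLinear(128, 10))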
Bio: Inna Partin‑Vaisband is an Associate Professor of Electrical and Computer Engineering and an Adjunct Professor of Computer Science at the University of Illinois Chicago. She earned her B.Sc. in Computer Science and M.Sc. in Electrical Engineering from the Technion–Israel Institute of Technology, and her Ph.D. in Electrical and Computer Engineering from the University of Rochester. Her research focuses on AI‑accelerated hardware, analog and mixed‑signal circuit design, hardware security, and integrated power delivery, with applications in edge‑inference and chiplet‑based systems. She is the author of On‑Chip Power Delivery and Management (4th Ed.), and her distributed on‑chip power‑supply architectures have been deployed in commercial mobile SoCs. Her work on chiplet‑based systems was featured in "The Chiplet Revolution" article in Communications of the ACM (2024). Dr. Partin‑Vaisband serves as an Associate Editor for Microelectronics Journal and IEEE Transactions on CPMT, and is a recipient of the 2022 Google Research Scholar Award and the 2023 NSF CAREER Award.
************
Speaker: Mosharaf Chowdhury
Title: Toward Energy-Optimal AI Systems
Abstract: Generative AI adoption and its energy consumption are skyrocketing. For instance, training GPT-3, a precursor to ChatGPT, consumed an estimated 1.3 GWh of electricity in 2020. By 2022, Amazon trained a large language model (LLM) that consumed 11.9 GWh, enough to power over a thousand U.S. households for a year. AI inference consumes even more energy, because a model trained once serves millions. This surge has broad implications. First, energy-intensive AI workloads inflate carbon offsetting costs for entities with Net Zero commitments. Second, power delivery is now the gating factor in building new AI supercomputers. Finally, these energy demands hinder deploying AI services in places without high-capacity electricity grids, leading to inequitable access to AI services.
In this talk, I will introduce the ML Energy Initiative, our effort to understand AI’s energy consumption and build a sustainable future by curtailing AI’s runaway energy demands. I will introduce tools to precisely measure AI’s energy consumption and findings from using them on open-weights models, algorithms to find and navigate the Pareto frontier of AI’s energy consumption, and the tradeoff between performance and energy consumption during model training. I will also touch upon our solutions to make AI systems failure-resilient to reduce energy waste from idling. This talk is a call to arms to collaboratively build energy-optimal AI systems for a sustainable and equitable future.
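As one concrete example of such tooling, the initiative's open-source Zeus library wraps GPU energy counters around arbitrary code regions. A minimal sketch of that measurement pattern, assuming the zeus Python package, an NVIDIA GPU at index 0, and a hypothetical run_inference_batch() workload:

    from zeus.monitor import ZeusMonitor

    def run_inference_batch():
        ...  # hypothetical placeholder; substitute real model inference

    # Measure wall-clock time and GPU energy consumed by a code region.
    monitor = ZeusMonitor(gpu_indices=[0])
    monitor.begin_window("inference")
    run_inference_batch()
    measurement = monitor.end_window("inference")
    print(f"{measurement.time:.1f} s, {measurement.total_energy:.1f} J")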
Bio: Mosharaf Chowdhury is an Associate Professor of Computer Science and Engineering at the University of Michigan, Ann Arbor, where he leads the SymbioticLab. His current research focuses on improving the efficiency of AI/ML workloads, specifically optimizing their energy consumption through the ML Energy Initiative. Major open-source projects from his team include Infiniswap, the first scalable memory disaggregation solution; FedScale, a planetary-scale AI/ML platform; TPP, the tiered memory manager in the Linux kernel v5.18 onward; and Zeus, the first energy-optimal Generative AI stack. In the past, Mosharaf invented coflows and was one of the original creators of Apache Spark. He has received numerous individual awards, fellowships, and paper awards from NSDI, OSDI, ATC, and MICRO.
************
Speaker: Bing Liu
Title: Accurate and Sustainable Continual Learning
Abstract: Continual learning (CL) seeks to empower AI systems to learn tasks incrementally, a vital capability for developing more advanced and adaptive intelligent behaviors. However, the persistent challenge of catastrophic forgetting has significantly constrained the accuracy of current CL methods, keeping them far below the theoretical upper bound achieved by joint training. This limitation has largely hindered, or even prevented, the practical adoption of CL in real-world applications. By leveraging large foundation models, we recently proposed a simple yet effective CL approach that not only matches the accuracy of joint training but is also remarkably energy efficient. This not only unlocks the potential for real-world CL applications but also provides profound insights into the foundational principles of AI.
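The abstract does not detail the proposed method, but the general recipe it builds on (a frozen foundation model paired with a lightweight incremental learner) can be illustrated with a deliberately generic baseline: a nearest-class-mean classifier over frozen features, where adding a new class never modifies old ones and so cannot forget them. The sketch below is that standard baseline, not the speaker's approach:

    import numpy as np

    class NearestClassMean:
        """Class-incremental classifier over frozen foundation-model features.
        Per-class means are updated incrementally; learning a new class never
        touches old ones, so this layer cannot catastrophically forget."""
        def __init__(self):
            self.sums, self.counts = {}, {}

        def update(self, features: np.ndarray, labels: np.ndarray) -> None:
            for f, y in zip(features, labels):
                y = int(y)
                self.sums[y] = self.sums.get(y, 0.0) + f
                self.counts[y] = self.counts.get(y, 0) + 1

        def predict(self, features: np.ndarray) -> np.ndarray:
            classes = sorted(self.sums)
            means = np.stack([self.sums[c] / self.counts[c] for c in classes])
            dists = ((features[:, None, :] - means[None, :, :]) ** 2).sum(-1)
            return np.array(classes)[dists.argmin(axis=1)]

    feats, labels = np.random.randn(100, 512), np.random.randint(0, 5, 100)
    clf = NearestClassMean()
    clf.update(feats, labels)          # learn initial classes
    print(clf.predict(feats[:3]))      # new classes can be added with further update() calls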
Bio: Bing Liu is a Distinguished Professor and Peter L. and Deborah K. Wexler Professor of Computing at the University of Illinois Chicago (UIC). He earned his Ph.D. in Artificial Intelligence from the University of Edinburgh. His current research interests include continual or lifelong learning, learning to reason, dialogue systems, machine learning, and natural language processing. He is the author of several books on these topics and has also received multiple Test-of-Time awards for his papers. Liu is the 2018 recipient of the ACM SIGKDD Innovation Award and a Fellow of ACM, AAAI, and IEEE.
************
Parking visual for NU: [campus parking map image]