Logistics

Date: Friday, May 23rd, 2025

Location: Northwestern University, Mudd Library, 3rd floor (Room 3514), 2233 Tech Dr, Evanston, IL 60208.

Parking: Attendees driving to the workshop can park in the North Campus parking garage, 2311 N Campus Dr #2300, Evanston, IL 60208 (https://maps.northwestern.edu/txt/facility/646). Exit the garage on the side opposite the car entrance and you will see Mudd Library directly in front of you across a grassy lawn. In the library lobby, take the elevator on your right to the 3rd floor.

Parking passes for free parking in the designated NU parking garage will be provided at the workshop. Please remember to ask for a pass before leaving the workshop.

Registration link

Viewing Link: https://northwestern.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=aaa69a4f-b0f0-4e7d-9401-b2e5013ded91

YouTube: 

 

Description:

This workshop will bring together researchers and practitioners to discuss recent advances in energy-efficient machine learning (ML). As ML models grow in scale and complexity, optimizing their energy consumption has become a critical research challenge. Topics will include model compression, quantization, hardware-aware neural architectures, sustainable AI frameworks, and energy-efficient inference techniques. The workshop will feature invited talks and ample time for networking.

 
Schedule: 

8:30 Coffee and Pastries

8:45-9:00 Opening and introductions

9:00-9:45 Keynote 1: Zhiyu Cheng (NVIDIA)

9:45-10:30 Inna Partin-Vaisband (UIC)

10:30-10:45 Break

10:45-11:30 Keynote 2: Manzil Zaheer (Google Research)

11:30-12:15 Kexin Pei (U Chicago)

12:15-1:45 Lunch 

1:45-2:30 Keynote 3: Mosharaf Chowdhury (U Michigan)

2:30-3:15 Bing Liu (UIC)

3:15-3:30 Break

3:30-4:15 Tian Li (U Chicago)

4:15-5:00 Poster session

 

Organizers:

  • Natasha Devroye (UIC)
  • Ermin Wei (Northwestern University)
  • Ren Wang (IIT)
  • Tian Li (University of Chicago)

Abstracts: 

Speaker: Zhiyu Cheng

Title: FP4 quantization and its real-world applications on LLMs and diffusion models

Abstract: As large language models (LLMs) and diffusion models grow in complexity, efficient inference has become a pressing concern. In this talk, we introduce FP4 quantization — a low-precision quantization technique that substantially reduces memory usage and computational costs with minimal accuracy trade-offs. We begin by discussing the FP4 numerical format on Nvidia Blackwell GPUs. Next, we delve into the quantization workflow, highlighting both post-training quantization (PTQ) and quantization-aware training (QAT) algorithms, along with practical recipes and best practices for successful implementation on LLMs and diffusion models. We then present quantitative and qualitative results to illustrate FP4 quantization’s impact on real-world generative AI applications. Finally, we introduce the NVIDIA TensorRT Model Optimizer, detailing its capabilities for FP4 quantization and streamlined deployment through inference frameworks such as TensorRT-LLM, SGLang and vLLM.
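
As a rough illustration of the kind of rounding FP4 post-training quantization performs, here is a minimal NumPy sketch that simulates 4-bit E2M1 quantization of a weight tensor. The value grid and the simple per-tensor scaling are illustrative assumptions for exposition, not NVIDIA's production recipe in TensorRT Model Optimizer.

```python
import numpy as np

# Representable magnitudes of a 4-bit E2M1 float (sign is handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(weights):
    """Round a weight tensor to the nearest FP4 value after per-tensor scaling."""
    scale = max(np.abs(weights).max() / FP4_GRID.max(), 1e-12)  # map max |w| to 6.0
    flat = (weights / scale).ravel()
    idx = np.abs(np.abs(flat)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    q = (np.sign(flat) * FP4_GRID[idx]).reshape(weights.shape)
    return q, scale

def dequantize_fp4(q, scale):
    return q * scale

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_fp4(w)
print("mean absolute quantization error:", np.abs(w - dequantize_fp4(q, s)).mean())
```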

Bio: Zhiyu Cheng is a manager at NVIDIA, where he focuses on driving algorithm and software development to optimize inference for deep learning models, including large language models (LLMs), vision language models (VLMs), and diffusion models, on NVIDIA's latest platforms. He has over 11 years of industry experience in efficient deep learning, spanning roles at NXP, Xilinx, Baidu, and OmniML (acquired by NVIDIA). Zhiyu has a record of over 30 published papers and patents. He holds a Ph.D. in electrical and computer engineering from the University of Illinois at Chicago, with a thesis in the field of information theory.

************

Speaker: Inna Partin-Vaisband

Title: Analog AI at the Edge: Training to the Rescue

Abstract: Analog and mixed‑signal integrated circuits offer compact, energy‑efficient AI inference at the edge by eliminating costly data transfers and memory bottlenecks. These advantages, however, are challenged by sensitivity to process‑voltage‑temperature variations, device noise, and analog non‑idealities. This talk presents an online training framework that continuously calibrates analog models on‑chip, compensating for variations and noise without ADC/DAC overhead. Demonstrated on a multilayer perceptron for image classification, the method achieves accuracy comparable to a 6‑bit resolution digital classifier with only a fraction of the power and area. The approach generalizes to convolutional neural networks and complex benchmarks—enabling robust, energy‑efficient edge AI for diverse applications. Future directions toward deeper networks, advanced devices, and system‑level integration will also be discussed.
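
As a purely software-level analogy to training under analog non-idealities, the sketch below trains a small PyTorch MLP while injecting Gaussian weight perturbations during the forward pass, so the learned weights tolerate variation. The noise model and layer are illustrative assumptions, not the on-chip calibration framework described in the talk.

```python
import torch
import torch.nn as nn

class NoisyLinear(nn.Linear):
    """Linear layer whose weights are perturbed on each forward pass during training."""
    def __init__(self, in_features, out_features, noise_std=0.05):
        super().__init__(in_features, out_features)
        self.noise_std = noise_std

    def forward(self, x):
        w = self.weight
        if self.training:  # inject "analog" noise only while training
            w = w + self.noise_std * torch.randn_like(w)
        return nn.functional.linear(x, w, self.bias)

model = nn.Sequential(NoisyLinear(784, 128), nn.ReLU(), NoisyLinear(128, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))  # toy stand-in for image data
for _ in range(5):
    opt.zero_grad()
    nn.functional.cross_entropy(model(x), y).backward()
    opt.step()
model.eval()  # at inference, the learned weights are used without added noise
```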

Bio:

Inna Partin‑Vaisband is an Associate Professor of Electrical and Computer Engineering and an Adjunct Professor of Computer Science at the University of Illinois Chicago. She earned her B.Sc. in Computer Science and M.Sc. in Electrical Engineering from the Technion–Israel Institute of Technology, and her Ph.D. in Electrical and Computer Engineering from the University of Rochester. Her research focuses on AI‑accelerated hardware, analog and mixed‑signal circuit design, hardware security, and integrated power delivery, with applications in edge‑inference and chiplet‑based systems. She is the author of On‑Chip Power Delivery and Management (4th Ed.), and her distributed on‑chip power‑supply architectures have been deployed in commercial mobile SoCs. Her work on chiplet‑based systems was featured in “The Chiplet Revolution” article in Communications of the ACM (2024). Dr. Partin‑Vaisband serves as an Associate Editor for Microelectronics Journal and IEEE Transactions on CPMT, and is a recipient of the 2022 Google Research Scholar Award and the 2023 NSF CAREER Award.

************

Speaker: Kexin Pei
Title: Analyzing and Optimizing Software via Robust and Sample-Efficient Learning
Abstract: Language Models (LMs) have shown exciting applications in software engineering and software security. These applications heavily depend on their understanding of program semantics for precise code reasoning. However, the existing data-driven training paradigm suffers from inefficient learning due to the inherent symbolic nature of program semantics. In this talk, I will first introduce SymC, our recent work in tackling the challenge of efficiently teaching code semantics to LMs by incorporating code symmetries into the LM architecture. We define code symmetries as semantics-preserving transformations, where forming a code symmetry group enables precise and efficient reasoning of code semantics. Our solution, SymC, develops a novel variant of self-attention that is provably equivariant to code symmetries from the permutation group defined over the program dependence graph. Our results on five program analysis tasks suggest that LMs that encode the code structural prior via the code symmetry group generalize better and faster. Next, I will describe our effort in learning efficiency-improving code editing. Specifically, we constrain the LM’s code reasoning space as explicit code editing rules and employ LLMs as an inductive rule learner to extract efficiency-improving code editing rules from the training code pairs as concise meta-rule sets. Such rule sets will be manifested as code-editing steps to augment the training samples further. We demonstrate that our approach outperforms the state-of-the-art in several critical code editing tasks, including those that aim to improve code efficiency.
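
The toy check below illustrates the symmetry idea in its simplest form: plain single-head self-attention without positional encodings is equivariant to permutations of its input tokens. This is only a generic illustration; SymC's contribution is an attention variant that is provably equivariant to the code symmetry group defined over the program dependence graph.

```python
import torch

def self_attention(x, wq, wk, wv):
    """Single-head self-attention with no positional encodings."""
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v

torch.manual_seed(0)
n, d = 6, 8                                  # 6 "tokens", dimension 8
x = torch.randn(n, d)
wq, wk, wv = (torch.randn(d, d) for _ in range(3))

perm = torch.randperm(n)                     # a random permutation of the tokens
out = self_attention(x, wq, wk, wv)
out_of_permuted = self_attention(x[perm], wq, wk, wv)

# Permuting the inputs permutes the outputs identically (equivariance).
print(torch.allclose(out[perm], out_of_permuted, atol=1e-5))  # True
```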

************

Speaker: Mosharaf Chowdhury

Title: Toward Energy-Optimal AI Systems

Abstract: Generative AI adoption and its energy consumption are skyrocketing. For instance, training GPT-3, a precursor to ChatGPT, consumed an estimated 1.3 GWh of electricity in 2020. By 2022, Amazon trained a large language model (LLM) that consumed 11.9 GWh, enough to power over a thousand U.S. households for a year. AI inference consumes even more energy, because a model trained once serves millions. This surge has broad implications. First, energy-intensive AI workloads inflate carbon offsetting costs for entities with Net Zero commitments. Second, power delivery is now the gating factor toward building new AI supercomputers. Finally, this hinders deploying AI services in places without high-capacity electricity grids, leading to inequitable access to AI services.

In this talk, I will introduce the ML Energy Initiative, our effort to understand AI’s energy consumption and build a sustainable future by curtailing AI’s runaway energy demands. I will introduce tools to precisely measure AI’s energy consumption and findings from using them on open-weights models, algorithms to find and navigate the Pareto frontier of AI’s energy consumption, and the tradeoff between performance and energy consumption during model training. I will also touch upon our solutions to make AI systems failure-resilient to reduce energy waste from idling. This talk is a call to arms to collaboratively build energy-optimal AI systems for a sustainable and equitable future.
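
For readers who want a concrete starting point for energy measurement, the sketch below reads NVIDIA's cumulative GPU energy counter through pynvml around a block of work. It is a generic illustration (it assumes a Volta-or-newer GPU and the pynvml bindings), not the ML Energy Initiative's own tooling such as Zeus.

```python
import time
import pynvml  # NVIDIA's NVML Python bindings (nvidia-ml-py / pynvml)

def gpu_energy_joules(fn, device_index=0):
    """Run fn() and return the energy GPU `device_index` consumed, in joules.

    Uses NVML's cumulative energy counter (reported in millijoules); raises an
    NVMLError on GPUs or drivers that do not expose it.
    """
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        start_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
        fn()
        end_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
        return (end_mj - start_mj) / 1000.0
    finally:
        pynvml.nvmlShutdown()

print(f"{gpu_energy_joules(lambda: time.sleep(2.0)):.1f} J over 2 s of idling")
```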

Bio: Mosharaf Chowdhury is an Associate Professor of Computer Science and Engineering at the University of Michigan, Ann Arbor, where he leads the SymbioticLab. His current research focuses on improving the efficiency of AI/ML workloads, specifically optimizing their energy consumption through the ML Energy Initiative. Major open-source projects from his team include Infiniswap, the first scalable memory disaggregation solution; FedScale, a planetary-scale AI/ML platform; TPP, the tiered memory manager in the Linux kernel v5.18 onward; and Zeus, the first energy-optimal Generative AI stack. In the past, Mosharaf invented coflows and was one of the original creators of Apache Spark. He has received numerous individual awards, fellowships, and paper awards from NSDI, OSDI, ATC, and MICRO.

************

Speaker: Bing Liu

Title: Accurate and Sustainable Continual Learning

Abstract: Continual learning (CL) seeks to empower AI systems to learn tasks incrementally, a vital capability for developing more advanced and adaptive intelligent behaviors. However, the persistent challenge of catastrophic forgetting has significantly constrained the accuracy of current CL methods, keeping them far below the theoretical upper bound achieved by joint training. This limitation has largely hindered—or even prevented—the practical adoption of CL in real-world applications. By leveraging large foundation models, we recently proposed a simple yet effective CL approach that not only matches the accuracy of joint training but is also remarkably energy efficient. This not only unlocks the potential for real-world CL applications but also provides profound insights into the foundational principles of AI.
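
As one generic way to make the foundation-model angle concrete, the sketch below implements class-incremental learning on top of a frozen backbone by storing a mean-feature prototype per class; earlier classes are never overwritten, so there is no catastrophic forgetting. It illustrates the general recipe only and is not necessarily the specific approach proposed in the talk.

```python
import numpy as np

class PrototypeClassifier:
    """Class-incremental learner: one mean-feature prototype per class."""
    def __init__(self):
        self.prototypes = {}                               # class id -> mean feature

    def learn_task(self, feats, labels):
        """feats: (n, d) features from a frozen backbone for one new task."""
        for c in np.unique(labels):
            self.prototypes[int(c)] = feats[labels == c].mean(axis=0)

    def predict(self, feats):
        classes = sorted(self.prototypes)
        protos = np.stack([self.prototypes[c] for c in classes])       # (C, d)
        dists = ((feats[:, None, :] - protos[None]) ** 2).sum(axis=-1)
        return np.array(classes)[dists.argmin(axis=1)]

rng = np.random.default_rng(0)
cl = PrototypeClassifier()
cl.learn_task(rng.normal(size=(20, 16)), np.repeat([0, 1], 10))        # task 1
cl.learn_task(rng.normal(size=(20, 16)) + 3.0, np.repeat([2, 3], 10))  # task 2, no forgetting
print(cl.predict(rng.normal(size=(5, 16)) + 3.0))                      # mostly classes 2 and 3
```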

Bio: Bing Liu is a Distinguished Professor and Peter L. and Deborah K. Wexler Professor of Computing at the University of Illinois Chicago (UIC). He earned his Ph.D. in Artificial Intelligence from the University of Edinburgh. His current research interests include continual or lifelong learning, learning to reason, dialogue systems, machine learning, and natural language processing. He is the author of several books on these topics and has also received multiple Test-of-Time awards for his papers. Liu is the 2018 recipient of the ACM SIGKDD Innovation Award and a Fellow of ACM, AAAI, and IEEE.

************

Speaker: Tian Li
Title: Efficient Distributed Optimization under Heavy-Tailed Noise
Abstract:  Distributed optimization has become the default training paradigm in modern machine learning due to the growing scale of models and datasets. To mitigate communication overhead, local updates are often applied before global aggregation, resulting in a nested optimization approach with inner and outer steps. However, heavy-tailed stochastic gradient noise remains a significant challenge, particularly in attention-based models, hindering effective training. In this work, we propose TailOPT, an efficient framework designed to address heavy-tailed noise by leveraging adaptive optimization or clipping techniques. We establish convergence guarantees for the TailOPT framework under heavy-tailed noise with potentially unbounded gradient variance and local updates. Among its variants, we highlight a memory and communication efficient instantiation which we call Bi2Clip, which performs coordinate-wise clipping at both the inner and outer optimizers, achieving adaptive-like performance (e.g., Adam) without the cost of maintaining or transmitting additional gradient statistics. Empirically, TailOPT, including Bi2Clip, demonstrates superior performance on several language tasks and models, outperforming state-of-the-art methods.
Bio: Tian Li is an Assistant Professor in the Computer Science Department and the Data Science Institute at the University of Chicago. Her research interests are in optimization, trustworthy machine learning, and collaborative learning. She has spent time at Microsoft Research Asia, Google Research, and Meta Foundational AI Research Labs. She was invited to participate in the EECS Rising Stars Workshop and has been recognized as a Rising Star in Machine Learning/Data Science by multiple institutions. Her team won first place in the Privacy-Enhancing Technologies (PETs) Challenge featured by the White House. She received her PhD in Computer Science from Carnegie Mellon University and BS degrees in Computer Science and Economics from Peking University.
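
The sketch below shows the general shape of the idea described in the abstract above: coordinate-wise clipping applied at both the inner (client) and outer (server) optimizers during local-update training, in the spirit of Bi2Clip. The thresholds, learning rates, and toy regression objective are illustrative assumptions, not the exact TailOPT algorithm.

```python
import numpy as np

def cclip(v, tau):
    """Coordinate-wise clipping to the interval [-tau, tau]."""
    return np.clip(v, -tau, tau)

rng = np.random.default_rng(0)
d, n_clients, rounds, local_steps = 10, 4, 50, 5
w_true = rng.normal(size=d)
clients = []
for _ in range(n_clients):
    X = rng.normal(size=(200, d))
    y = X @ w_true + rng.standard_t(df=2, size=200)   # heavy-tailed label noise
    clients.append((X, y))

w = np.zeros(d)
for _ in range(rounds):
    deltas = []
    for X, y in clients:                              # inner loop: local updates per client
        w_local = w.copy()
        for _ in range(local_steps):
            i = rng.integers(len(y), size=32)
            grad = 2 * X[i].T @ (X[i] @ w_local - y[i]) / len(i)
            w_local -= 0.02 * cclip(grad, tau=1.0)    # inner coordinate-wise clip
        deltas.append(w_local - w)
    w += cclip(np.mean(deltas, axis=0), tau=0.5)      # outer coordinate-wise clip at aggregation
print("parameter error:", np.linalg.norm(w - w_true))
```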

Parking visual for NU:

 
