LLM Compression Explained: Build Faster, Efficient AI Models

Tech

Ready to become a certified watsonx AI Assistant Engineer? Register now and use code IBMTechYT20 for 20% off of your exam → https://ibm.biz/Bdpsig
Learn more about Small Language Models here → https://ibm.biz/Bdpsih

Shrink massive AI models with ease! ⚡ Cedric Clyburn explains LLM compression and quantization techniques to optimize performance. Learn how to deploy scalable AI with cutting-edge methods for real-world applications!

AI news moves fast. Sign up for a monthly newsletter for AI updates from IBM → https://ibm.biz/BdpsiV

#llm #aioptimization #scalableai

Comments 22

julianafliegner841 2 months ago

Video starts at 5:27

aimée.foucher 2 months ago

Great stuff; thanks.

brianmartin440 2 months ago

Very impressive

benitosolorzano76 2 months ago

Great explanation! God bless everyone!

angela.patterson 2 months ago

I ran Qwen3 on my laptop with no GPU and only 8 GB of RAM, and it ran smoothly. Of course, my OS is Linux.

hans-heinrich.segebahn 2 months ago

Nice, so I can get 2+2=5 faster instead of 2+2=4. Juhu, the user will be happy, right? Imagine a calculator with an option to run faster but with no guarantee that the calculations are correct!

zoé_rousset 2 months, 1 week ago

Thanks for watching, folks! Apologies for the small verbal slip there; I meant to say 10 GPUs :)

robert.maldonado 2 months, 1 week ago

How about TinyML? Is it related to model compression and quantization?

grégoire_louis 2 months, 1 week ago

Informative video, thanks! 😊

francisca_gonzález 2 months, 1 week ago

thanks

maríadelcarmenuribe801 2 months, 1 week ago

I would love to see a video about Vulkan, particularly its ability to run on a wide variety of hardware.

charles_renard 2 months, 1 week ago

My life is better thanks to IBM Technology. Thank you <3

mohammed.barrett 2 months, 1 week ago

You should also mention the downside: accuracy drops and errors may appear. Thanks for the valuable information. Even though I have nothing to do with programming, I still benefit because I'm interested in technology in general.

agathe_marion 2 months, 1 week ago

Perfect!

mariacecíliaalbuquerque427 2 months, 1 week ago

Here I would stress that everything is a tradeoff. The video did say that you lose accuracy through quantization, but I still feel it has to be repeated over and over: by doing this you are losing accuracy, and there might be use cases where accuracy across broad categories of topics is very important. Also, in most cases you need an AI model that is a specialist in just a few topics, so reducing the number of model parameters would also be a path to think about. As for the optimization itself, how far to scale down the size of each parameter would be a nice topic to touch upon.
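
The tradeoff this comment describes can be illustrated with a toy round trip. This is a minimal sketch, assuming simple symmetric uniform quantization (not necessarily the scheme discussed in the video); the `weights` list is made-up data and both function names are hypothetical.

```python
def quantize_dequantize(values, bits):
    """Round values to a signed `bits`-wide integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for 8 bits, 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        return list(values)                 # all zeros: nothing to quantize
    return [round(v / scale) * scale for v in values]


def max_abs_error(values, bits):
    """Largest absolute difference between original and restored values."""
    restored = quantize_dequantize(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))


# Made-up "weights" purely for illustration.
weights = [0.013, -0.42, 0.27, 0.9, -0.731, 0.05]

err8 = max_abs_error(weights, 8)   # half the storage of 16-bit weights
err4 = max_abs_error(weights, 4)   # a quarter of the storage, coarser grid
print(f"8-bit max error: {err8:.5f}")
print(f"4-bit max error: {err4:.5f}")
```

Fewer bits per parameter means a coarser grid, so the rounding error grows as the storage shrinks. How much that matters for a real model depends on the model and the task.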

paulvaleon6 2 months, 1 week ago

This is one of the few explanations that treats LLMs as real deployment systems where inference cost and latency dominate, especially in multi-agent architectures where delays compound across nodes. I’ve been building similar AI systems, and the quantization trade-offs here match exactly what makes the difference between a prototype and a production-ready system.

christopher_thompson 2 months, 1 week ago

Easy to understand. Thank you

gerolfechoing17 2 months, 1 week ago

Uh, 800GB divided by 80GB/GPU equals ten GPUs by my arithmetic... And that's just the weights. You'd need more for your KV cache, I think. So something like at least twelve 80GB GPUs.
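
The arithmetic in this comment can be checked with a short sketch. The 800 GB and 80 GB figures come from the comment; the 160 GB allowance for KV cache and other runtime memory is a hypothetical number chosen only to reproduce the commenter's "at least twelve" estimate, and `gpus_needed` is an illustrative helper, not part of any library.

```python
import math


def gpus_needed(weights_gb, gpu_memory_gb=80, extra_gb=0):
    """Minimum number of GPUs whose combined memory holds the weights plus extras."""
    return math.ceil((weights_gb + extra_gb) / gpu_memory_gb)


weights_only = gpus_needed(800)                   # weights alone
with_kv_cache = gpus_needed(800, extra_gb=160)    # assumed 160 GB of headroom
print(weights_only, with_kv_cache)
```

This counts memory capacity only. It says nothing about throughput, interconnect, or how the model is actually sharded across devices.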

carmen.vigil 2 months, 1 week ago

800 GB = 10 A100s (minimum), not five. That is also a $12,000-$15,000 per month pile of GPUs, before you do anything with it.

christy_cooper 2 months, 1 week ago

Nice, positive info, thx.