LLM Compression Explained: Build Faster, Efficient AI Models

Tech

Ready to become a certified watsonx AI Assistant Engineer? Register now and use code IBMTechYT20 for 20% off of your exam → https://ibm.biz/Bdpsig

Learn more about Small Language Models here → https://ibm.biz/Bdpsih

Shrink massive AI models with ease! ⚡ Cedric Clyburn explains LLM compression and quantization techniques to optimize performance. Learn how to deploy scalable AI with cutting-edge methods for real-world applications!

AI news moves fast. Sign up for a monthly newsletter for AI updates from IBM → https://ibm.biz/BdpsiV

#llm #aioptimization #scalableai

Comments 22

mariacecíliaalbuquerque427 3 months, 3 weeks ago

Here I would stress that everything is a tradeoff. While the video did say that quantization costs you accuracy, I still feel it has to be repeated over and over: by doing this you are losing accuracy, and there may be use cases where accuracy across broad categories of topics is very important. Also, in most cases you need an AI model that is a specialist in just a few topics or tasks, so reducing the number of model parameters is another path to think about. As for the optimization itself, how far to scale down the size of each parameter would be a nice topic to touch upon.
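
The size-versus-accuracy tradeoff raised in this comment can be illustrated with a minimal sketch. It assumes plain symmetric uniform quantization and a short list of made-up weight values; the function names and numbers are hypothetical and are not taken from the video or from any particular library.

```python
def quantize_roundtrip(values, bits):
    """Quantize values to signed `bits`-bit integers, then map them back to floats."""
    qmax = 2 ** (bits - 1) - 1                  # largest integer level, e.g. 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax  # width of one quantization step
    if scale == 0:
        return list(values)                     # all zeros: nothing to quantize
    return [round(v / scale) * scale for v in values]


def mean_abs_error(original, approx):
    """Average absolute difference between original and reconstructed values."""
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


# Hypothetical weights, chosen only for illustration.
weights = [0.82, -0.41, 0.07, -0.93, 0.25, 0.6]

for bits in (8, 4, 2):
    error = mean_abs_error(weights, quantize_roundtrip(weights, bits))
    print(f"{bits}-bit: mean absolute error = {error:.4f}")
```

For these sample values the error grows as the bit width shrinks, which is the tradeoff in miniature: fewer bits per parameter means a smaller model but a coarser approximation of each weight. Real quantization schemes are more elaborate than this sketch.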

zoé_rousset 3 months, 3 weeks ago

Thanks for watching, folks! Apologies for the small verbal typo there, I meant to say 10 GPUs :)

udarshsolara37 3 months, 3 weeks ago

Great video, thanks, looking for more

christopher_thompson 3 months, 3 weeks ago

Easy to understand. Thank you

charles_renard 3 months, 3 weeks ago

My life is better thanks to IBM Technology. Thank you <3

paulvaleon6 3 months, 3 weeks ago

This is one of the few explanations that treats LLMs as real deployment systems where inference cost and latency dominate, especially in multi-agent architectures where delays compound across nodes. I’ve been building similar AI systems, and the quantization trade-offs here match exactly what makes the difference between a prototype and a production-ready system.

grégoire_louis 3 months, 3 weeks ago

Informative video, thanks! 😊

maríadelcarmenuribe801 3 months, 3 weeks ago

I would love to see a video about Vulkan, particularly with respect to its ability to run on a wide variety of hardware.

christy_cooper 3 months, 3 weeks ago

Nice positive info thx

brianmartin440 3 months, 2 weeks ago

Very impressive

francisca_gonzález 3 months, 3 weeks ago

thanks

agathe_marion 3 months, 3 weeks ago

Perfect!

mohammed.barrett 3 months, 3 weeks ago

You should also mention the downside: accuracy drops and errors may appear. Thanks for the valuable information. Although I have nothing to do with programming, I still benefit because I am interested in technology in general.

robert.maldonado 3 months, 3 weeks ago

How about TinyML? Is it related to model compression and quantization?

angela.patterson 3 months, 2 weeks ago

I ran Qwen3 on my laptop with no GPU and only 8 GB of RAM, and it ran smoothly. Of course, my OS is Linux.

hans-heinrich.segebahn 3 months, 3 weeks ago

Nice, so I can get 2+2=5 faster instead of 2+2=4, yay. Users will be happy, right? Imagine a calculator with an option to run faster but with no guarantee that the calculations are correct!

gerolfechoing17 3 months, 3 weeks ago

Uh, 800GB divided by 80GB/GPU equals ten GPUs by my arithmetic.... And that's just the weights. You'd need more for your KV cache, I think. So something like at least 12 80GB GPUs.

carmen.vigil 3 months, 3 weeks ago

800 GB = 10 A100s (minimum), not five. That is also a $12,000 to $15,000 per month pile of GPUs, before you do anything with it.
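
The GPU-count arithmetic in the two comments above can be sketched in a few lines. The 800 GB of weights and 80 GB per GPU come from the comments; the 160 GB of extra memory for the KV cache and the 16-bit starting precision are assumptions made here for illustration only, and `gpus_needed` is a hypothetical helper name.

```python
import math


def gpus_needed(weights_gb, gpu_memory_gb, extra_gb=0.0):
    """Smallest whole number of GPUs whose combined memory holds weights plus extras."""
    return math.ceil((weights_gb + extra_gb) / gpu_memory_gb)


# Weights only: 800 GB spread over 80 GB GPUs.
print(gpus_needed(800, 80))                # 10

# Assumed 160 GB of headroom for the KV cache (not a figure from the video).
print(gpus_needed(800, 80, extra_gb=160))  # 12

# If the 800 GB figure is for 16-bit weights (an assumption), 4-bit weights
# take a quarter of the space: 800 * 4 / 16 = 200 GB.
four_bit_gb = 800 * 4 / 16
print(gpus_needed(four_bit_gb, 80))        # 3
```

Rounding up matters here: 200 GB does not fit on two 80 GB GPUs, so the count is three rather than 2.5.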

reecehopkins473 3 months, 3 weeks ago

Writing partial words and abbreviations on the whiteboard isn’t as helpful as writing the whole word. I recommend either writing the full words on the board or not writing at all.

benitosolorzano76 3 months, 2 weeks ago

Great explanation! God bless everyone!
