Ready to become a certified watsonx AI Assistant Engineer? Register now and use code IBMTechYT20 for 20% off of your exam → https://ibm.biz/Bdpsig
Learn more about Small Language Models here → https://ibm.biz/Bdpsih
Shrink massive AI models with ease! ⚡ Cedric Clyburn explains LLM compression and quantization techniques to optimize performance. Learn how to deploy scalable AI with cutting-edge methods for real-world applications!
AI news moves fast. Sign up for a monthly newsletter for AI updates from IBM → https://ibm.biz/BdpsiV
#llm #aioptimization #scalableai
Video starts at 5:27
Great stuff; thanks.
Very impressive
Great explanation! God bless everyone!
I ran Qwen3 on my laptop without a GPU and with only 8 GB of RAM, and it ran smoothly. Of course, my OS is Linux.
Nice, so I can get 2+2=5 faster instead of 2+2=4. Juhu, the user will be happy, right? Imagine a calculator with an option to run faster but with no guarantee that the calculations are correct!
Thanks for watching, folks! Apologies for the small verbal typo there; I meant to say 10 GPUs :)
How about TinyML? Is it related to model compression and quantization?
Informative video, thanks! 😊
thanks
I would love to see a video about Vulkan, particularly with respect to its ability to run on a wide variety of different hardware.
My life is better thanks to IBM Technology. Thank you <3
You should also mention the downside: accuracy decreases and errors may appear. Thanks for the valuable information. Although I have nothing to do with programming, I still benefit because I am interested in technology in general.
Perfect!
Here I would stress that everything is a tradeoff. While the video did say that you lose accuracy through quantization, I still feel it has to be repeated over and over, because there may be use cases where accuracy across broad categories of topics is very important. Also, in most cases you need an AI model that is a specialist in just a few topics or tasks, so reducing the number of model parameters would also be a path to consider. As for the optimization itself, how much to scale down the size of each parameter would be a nice topic to touch upon.
This is one of the few explanations that treats LLMs as real deployment systems where inference cost and latency dominate, especially in multi-agent architectures where delays compound across nodes. I’ve been building similar AI systems, and the quantization trade-offs here match exactly what makes the difference between a prototype and a production-ready system.
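To make the tradeoff in the comment above a bit more concrete, here is a minimal sketch of how round-trip error grows as each parameter is stored in fewer bits. It assumes simple symmetric uniform quantization with a single scale for the whole list of values, which is not necessarily the scheme discussed in the video. The "weights" are random numbers made up for illustration, and `quantize_roundtrip` and `mean_abs_error` are hypothetical helper names.

```python
import random

def quantize_roundtrip(values, bits):
    """Quantize floats to signed `bits`-bit integers with one shared scale, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1                      # largest representable integer level
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # float value of one integer step
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [q * scale for q in quantized]

def mean_abs_error(values, bits):
    """Average absolute difference between the original and the restored values."""
    restored = quantize_roundtrip(values, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

random.seed(0)
# Made-up stand-in for model weights (assumption: small, roughly Gaussian values).
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

for bits in (8, 4, 2):
    print(f"{bits}-bit: mean abs error = {mean_abs_error(weights, bits):.6f}")
```

On this toy data the error rises as the bit width drops, which is the accuracy cost the comment is pointing at. How much that matters for a real model depends on the model, the quantization method, and the task.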
Easy to understand. Thank you
Uh, 800GB divided by 80GB/GPU equals ten GPUs by my arithmetic.... And that's just the weights. You'd need more for your KV cache, I think. So something like at least 12 80GB GPUs.
800 GB = 10 A100s (minimum), not five. That is also a $12,000 to $15,000 per month pile of GPUs, before you do anything with it.
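The arithmetic in the two comments above can be checked with a short sketch. It takes the 800 GB of weights and 80 GB per GPU from the comments. The 20% headroom for the KV cache and other runtime memory is purely an assumption, chosen to match the "at least 12" guess, and `gpus_needed` is a hypothetical helper name.

```python
import math

def gpus_needed(weights_gb, gpu_memory_gb, overhead_percent=0):
    """Smallest number of GPUs whose combined memory holds the weights plus extra headroom."""
    total_gb = weights_gb * (100 + overhead_percent) / 100
    return math.ceil(total_gb / gpu_memory_gb)

# Weights only: 800 GB spread over 80 GB GPUs.
print(gpus_needed(800, 80))

# Assumption: 20% extra for KV cache and activations (not a figure from the video).
print(gpus_needed(800, 80, overhead_percent=20))
```

With these inputs the weights alone need 10 GPUs, and the assumed 20% headroom brings the count to 12. Real headroom depends on batch size, context length, and the serving stack.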
Nice positive info thx