Math Behind ML: KL Divergence, Wasserstein Distance, and Total Variation Distance Explained


Three ways to measure how different two probability distributions are, and why the choice can make or break your model.

KL divergence is cheap to compute and shows up everywhere from variational inference to reinforcement learning, but it is asymmetric and becomes infinite as soon as the first distribution puts mass where the second has none. Total variation distance is a true mathematical metric with clean theory, but it saturates at its maximum value of 1 as soon as the two distributions have disjoint supports, and it ignores the geometry of the underlying space. Wasserstein distance, also known as Earth Mover's Distance, measures the minimum cost of moving probability mass from one distribution to the other, so it keeps growing as the distributions move further apart. That idea had a major impact on generative modeling and led to Wasserstein GANs.

In this video we derive all three from first principles, expose the failure mode of each one, and work out which measure to reach for depending on the problem you're solving.

⏱️ Chapters:
00:00 Introduction
00:13 KL Divergence
01:51 Total Variation Distance
03:23 Wasserstein / Earth Mover's distance
05:23 Recap

📚 Topics covered:
- Kullback-Leibler divergence
- Total variation distance
- Wasserstein / Earth Mover's distance
- Optimal transport theory
- Wasserstein GANs (WGAN)
- Failure modes of each measure
- Practical guidance for ML practitioners
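The failure modes described above can be seen in a small numerical sketch. The Python below is not taken from the video; it is a minimal illustration that assumes discrete distributions on a shared, evenly spaced 1D grid, and the function names (kl_divergence, total_variation, wasserstein_1d, point_mass) are hypothetical helpers chosen for this example. On such a grid the 1-Wasserstein distance reduces to the area between the two cumulative distribution functions, which is what the code computes.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions on a shared support."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def wasserstein_1d(p, q, spacing=1.0):
    """1-Wasserstein distance on an evenly spaced 1D grid.

    Assumes both distributions live on the same grid; in that case W1
    equals the area between the two cumulative distribution functions.
    """
    cdf_gap = 0.0
    total = 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap) * spacing
    return total


def point_mass(index, size):
    """All probability mass on a single grid cell."""
    return [1.0 if i == index else 0.0 for i in range(size)]


if __name__ == "__main__":
    # Overlapping case: KL is finite but depends on argument order.
    p, q = [0.5, 0.5], [0.9, 0.1]
    print("KL(p||q):", kl_divergence(p, q))
    print("KL(q||p):", kl_divergence(q, p))

    # Disjoint case: slide a point mass away from grid cell 0.
    base = point_mass(0, 10)
    for shift in (1, 5, 9):
        moved = point_mass(shift, 10)
        print(
            shift,
            kl_divergence(base, moved),    # inf for every shift
            total_variation(base, moved),  # 1.0 for every shift
            wasserstein_1d(base, moved),   # equals the shift
        )
```

In the disjoint case, KL divergence and total variation return the same value no matter how far the point mass has moved, so they carry no information about which direction reduces the gap. The Wasserstein value grows with the shift, which is the property that makes it a useful training signal in this setting.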