Self-attention is the single most important idea in modern AI, and most tutorials get it wrong. In this video, you will see exactly how self-attention works: from the raw sentence "The cat sat" all the way to the final output vector Z, built step by step with animated Manim visuals and real matrix math.

━━━━━━━━━━━━━━━━━━━━━━
Timestamps:
━━━━━━━━━━━━━━━━━━━━━━
0:06 Why Self-Attention
1:44 How Self-Attention Works (Mathematical Explanation)
9:13 Attention Heatmap
10:12 Full Self-Attention Pipeline
11:22 Outro

━━━━━━━━━━━━━━━━━━━━━━
WHAT YOU WILL LEARN
━━━━━━━━━━━━━━━━━━━━━━
- Why sequential models (RNNs) fail at long-range dependencies and how self-attention solves this
- The full math behind Q, K, V projections, scaled dot-product attention (Q·Kᵀ / √dₖ), and softmax normalisation
- How to read an attention heatmap and understand what the model is actually "looking at"

━━━━━━━━━━━━━━━━━━━━━━
WHO THIS IS FOR
━━━━━━━━━━━━━━━━━━━━━━
This breakdown is for anyone who has heard of Transformers, ChatGPT, or large language models and wants to understand the actual mechanism, not just the metaphors. Prior knowledge of basic linear algebra (matrix multiplication) is helpful but not required. Every step is shown visually.

━━━━━━━━━━━━━━━━━━━━━━
MORE FROM APPLIE AI LAB
━━━━━━━━━━━━━━━━━━━━━━
Subscribe to Visual AI for weekly deep-dives into AI and machine learning concepts.
Next up: Multi-Head Attention explained the same way.

#SelfAttention #AttentionMechanism #TransformerArchitecture #DeepLearning #NeuralNetworks #NaturalLanguageProcessing #MachineLearning #AIExplained #LargeLanguageModels #ManimAnimation
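For readers who want to follow the pipeline in code, here is a minimal sketch of the steps listed in the description (Q, K, V projections, Q·Kᵀ / √dₖ scaling, softmax, output Z) for the sentence "The cat sat". It is not the code used in the video. The embedding size `d_model = 4`, the projection size `d_k = 2`, the random embeddings `x`, and the random weights `w_q`, `w_k`, `w_v` are all assumptions made for illustration; in a trained model these weights would be learned rather than random.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (no masking)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # One score per (query token, key token) pair, scaled by sqrt(d_k).
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    # Each row becomes a probability distribution over the tokens.
    weights = [softmax(row) for row in scores]
    # Output Z: each row is a weighted mix of the value vectors.
    return weights, matmul(weights, v)


def random_matrix(rows, cols):
    """Hypothetical stand-in for learned parameters or embeddings."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
tokens = ["The", "cat", "sat"]
d_model, d_k = 4, 2  # assumed sizes, chosen only to keep the example small

x = random_matrix(len(tokens), d_model)  # made-up token embeddings
w_q = random_matrix(d_model, d_k)
w_k = random_matrix(d_model, d_k)
w_v = random_matrix(d_model, d_k)

weights, z = self_attention(x, w_q, w_k, w_v)
for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

The printed rows are the kind of values an attention heatmap visualises: one row per token, with each row summing to 1.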
Great video
I don't know why people think self-attention is elegant. It's a clunky and computationally inefficient mechanism for (crudely) simulating semantic comprehension (although it does work). I'd be more impressed with a network which really understood that streets can't get tired (to use the sentence in this video).
Hi, thanks very much for such a nice and simple explanation. Have you removed that LoRA visualization video? Could you please bring that back?
thank you. I finally understand
Bro, your playlist shows 6 videos but there are only 3 videos. Please provide the other 3 videos too.
Awesome video. Great examples and illustrations.
Excellent explanation. I am figuring this stuff out to improve my work with AI.
Excellent video. Helped explain Self Attention really well. Thank you for creating this
I am trying to understand the declaration made in the first 30 seconds. In my understanding, I would not associate 'it' with 'animal' w/o thinking or w/o paying attention to all the words and making sense of what is being said. To me, it means paying attention to the last word 'tired' and then working out which of the words, e.g. 'street' or 'animal', sounds appropriate to relate to the word 'it'. At that moment the act of association becomes 'automatic', and not before that. Am I wrong? I am thinking: what if the statement was 'The animal didn't cross the street because it was too crowded'?
I missed one thing: does vector X contain similarity coefficients of, let's say, the word "cat" to "caterpillar" and to "catastrophe"?
Great explanation
Excellent! Please create a full course on this
I just don't get how this works in a dynamic way. I mean, how does it predict the next word? The above setup looks like a static contextual understanding. Predicting the next word means the context changes, right? And how does the model know everything up to the previous cutoff point and then predict?
It's Great
Great video! The only part I didn't understand is how we get the Q, K and V vectors. I mean, they are the source of all grammar and we do not know how to get them.
Thanks
lovely! i was expecting u to show the final attention matrix for the original sentence
Why is the K matrix transposed?
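On the transpose question, a small shape check may help. This is a sketch with made-up numbers, not values from the video: `q` and `k` are assumed to hold one row per token (3 tokens, `d_k = 2`). Multiplying `q` by `k` directly fails because the inner dimensions do not match, while multiplying by the transpose of `k` produces a 3×3 grid whose entry `[i][j]` is the dot product of query `i` with key `j`.

```python
def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows, checking shapes first."""
    if len(a[0]) != len(b):
        raise ValueError(
            f"shape mismatch: {len(a)}x{len(a[0])} @ {len(b)}x{len(b[0])}"
        )
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Made-up query and key matrices: 3 tokens, 2 numbers per token.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

try:
    matmul(q, k)  # 3x2 @ 3x2: inner dimensions 2 and 3 do not match
    untransposed_ok = True
except ValueError:
    untransposed_ok = False

scores = matmul(q, transpose(k))  # 3x2 @ 2x3 gives one score per token pair
print(untransposed_ok)  # False
print(scores)           # [[1.0, 3.0, 5.0], [2.0, 4.0, 6.0], [3.0, 7.0, 11.0]]
```

For example, `scores[2][1]` is 7.0 because query 2, `[1, 1]`, dotted with key 1, `[3, 4]`, gives 3 + 4.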
I'm really not understanding where the values inside the matrices come from. I am soooo confused about this. Where does the value matrix get its values, the ones we finally multiply? Is the model already trained, so it has those values at the beginning? But how did that happen? And how did the key and query matrices get their values??? I understand the flow, but not where the values come from and how they were generated.
For anyone curious, this video and code basic's transformer video are the two best explanations I have seen. They have truly helped me understand the concepts.