Machine Learning Notes -- 3Blue1Brown Deep Learning (Completed)

Posted on June 29, 2024 (Zhuhai)
Machine Learning

Deep Learning https://space.bilibili.com/88461692/channel/seriesdetail?sid=1528929 https://www.3blue1brown.com/topics/neural-networks

Deep Learning: The Structure of Neural Networks, Part 1 ver 2.0

https://www.bilibili.com/video/BV1bx411M7Zx/

But what is a Neural Network? An overview of what a neural network is, introduced in the context of recognizing hand-written digits. Chapter 1, October 5, 2017

Deep Learning: Gradient Descent, Part 2 ver 0.9 beta

https://www.bilibili.com/video/BV1Ux411j7ri/

Gradient descent, how neural networks learn. An overview of gradient descent in the context of neural networks. This is a method used widely throughout machine learning for optimizing how a computer performs on certain tasks. Chapter 2, October 16, 2017

Analyzing our neural network. Chapter 3, October 16, 2017

Deep Learning: The Backpropagation Algorithm (First Half / Second Half), Part 3 ver 0.9 beta

https://www.bilibili.com/video/BV16x411V7Qg/

What is backpropagation really doing? An overview of backpropagation, the algorithm behind how neural networks learn. Chapter 4, November 3, 2017

Backpropagation calculus. The math of backpropagation, the algorithm by which neural networks learn. Chapter 5, November 3, 2017

What Is GPT? An Intuitive Explanation of the Transformer | Deep Learning, Chapter 5

https://www.bilibili.com/video/BV13z421U7cs/

Embedding
Key
Query
Value
Output
Up-projection
Down-projection
Unembedding

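The first layer of GPT embeds words as vectors (Embedding). The embedding space does not only represent words; the vectors in it can also carry contextual information. The last layer of GPT decodes a vector back into a word (Unembedding), using a Softmax function with temperature.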
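To make "Softmax with temperature" concrete, here is a minimal sketch in plain Python. The function name softmax_with_temperature and the example logits are made up for illustration and are not taken from the video; the only idea it relies on is that the logits are divided by the temperature before the usual softmax.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores (logits) into probabilities.

    The logits are divided by `temperature` before the usual softmax.
    (Function name and signature are hypothetical, for illustration only.)
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next tokens.
logits = [2.0, 1.0, 0.1]
low_t = softmax_with_temperature(logits, temperature=0.5)
high_t = softmax_with_temperature(logits, temperature=5.0)
print(low_t)
print(high_t)
```

With a low temperature most of the probability goes to the highest-scoring token; with a high temperature the distribution becomes flatter, so less likely tokens are sampled more often.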

But what is a GPT? Visual intro to Transformers | Deep learning, chapter 5. A visual introduction to transformers. This chapter focusses on the overall structure, and word embeddings. April 1, 2024

An Intuitive Explanation of the Attention Mechanism, the Core of the Transformer | Deep Learning, Chapter 6

https://www.bilibili.com/video/BV1TZ421j7Ke/

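A naive reading of the attention between two words is that it measures the similarity of the two word embeddings directly. What is actually computed is the similarity between the two words after each has been projected, one into the query (Q) space and the other into the key (K) space. Without these projections, two identical words would always be the most similar to each other.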

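So the Q and K matrices really describe two spaces: the query space Q and the space being queried, K. Q maps each word x to a direction, and K maps every other word to a direction; when the two mapped directions line up, the two words match.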
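The contrast between raw embedding similarity and similarity after the Q and K projections can be sketched with a toy example. Everything here is assumed for illustration: the 2-dimensional embeddings, the matrices W_Q and W_K, and the helper names. The words "fluffy" and "creature" are only used as labels.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 2-d embeddings: axis 0 = "adjective-like", axis 1 = "noun-like".
embeddings = {"fluffy": [1.0, 0.0], "creature": [0.0, 1.0]}

# Hypothetical projections: W_Q turns a noun into the query
# "is there an adjective?", W_K turns an adjective into the key
# "I am an adjective". Both land on the same direction [1, 0].
W_Q = [[0.0, 1.0], [0.0, 0.0]]
W_K = [[1.0, 0.0], [0.0, 0.0]]

def raw_score(query_word, key_word):
    """Naive similarity: dot product of the embeddings themselves."""
    return dot(embeddings[query_word], embeddings[key_word])

def qk_score(query_word, key_word):
    """Attention-style score: dot product of the Q and K projections."""
    q = matvec(W_Q, embeddings[query_word])
    k = matvec(W_K, embeddings[key_word])
    return dot(q, k)

print(raw_score("creature", "creature"), raw_score("creature", "fluffy"))
print(qk_score("creature", "creature"), qk_score("creature", "fluffy"))
```

In this toy setup the raw score makes "creature" match itself best, while the projected score lets "creature" attend to "fluffy" instead, which is the behaviour the separate Q and K spaces are meant to allow.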

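To really understand what the large network inside a Transformer is doing, I recommend this web post from Anthropic: https://transformer-circuits.pub/2021/framework/index.html It was after reading one of their articles that I started to think of the output matrix multiplied by the value matrix as a low-rank map from the embedding space to itself. Seeing it that way made the concept clearer, at least for me.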
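The low-rank remark can be written out as a short derivation. The symbols are assumptions chosen for this note: d_model is the embedding dimension, d_head is the much smaller per-head dimension, W_V is the value matrix and W_O is the output matrix.

```latex
W_V \in \mathbb{R}^{d_{\text{head}} \times d_{\text{model}}}, \qquad
W_O \in \mathbb{R}^{d_{\text{model}} \times d_{\text{head}}}, \qquad
W_O W_V \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}

\operatorname{rank}(W_O W_V)
  \le \min\bigl(\operatorname{rank}(W_O),\ \operatorname{rank}(W_V)\bigr)
  \le d_{\text{head}} \ll d_{\text{model}}
```

The product takes an embedding vector to another vector in the same embedding space, but because it passes through a d_head-dimensional bottleneck its rank can be at most d_head.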