Transformer on 边个濑椰的博客

CS224N

Wed, 28 Jan 2026 14:31:08 +0800

Intro

These are my study notes for Stanford CS 224N | Natural Language Processing with Deep Learning (Spring 2024). The main topics covered in the course are:

Word vectors, RNNs, LSTMs, Seq2Seq models, machine translation, attention, Transformers, and more.

What is this course about?

Natural language processing (NLP) or computational linguistics is one of the most important technologies of the information age. Applications of NLP are everywhere because people communicate almost everything in language: web search, advertising, emails, customer service, language translation, virtual agents, medical reports, politics, etc. In the 2010s, deep learning (or neural network) approaches obtained very high performance across many different NLP tasks, using single end-to-end neural models that did not require traditional, task-specific feature engineering. In the 2020s amazing further progress was made through the scaling of Large Language Models, such as ChatGPT. In this course, students will gain a thorough introduction to both the basics of Deep Learning for NLP and the latest cutting-edge research on Large Language Models (LLMs). Through lectures, assignments and a final project, students will learn the necessary skills to design, implement, and understand their own neural network models, using the Pytorch framework.

Word Vectors

$ vector(”King”)- vector(”Man”) + vector(”Woman”) $

results in a vector that is closest to the vector representation of the word Queen

How do we have usable meaning in a computer?

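The slides introduce several approaches:

WordNet
one-hot vectors
word vectors

The first two are useful in practice, but each has clear drawbacks.

WordNet relies entirely on synonym sets and hypernym sets to define relations between words; it is complex to build and cannot absorb new vocabulary in time.
One-hot vectors assign every word its own symbol. Although the representation is numeric, similar words have no mathematical relationship (the dot product of any two distinct one-hot vectors is 0).

This leads to a famous quote: “You shall know a word by the company it keeps”

It also leads to the idea of word vectors: each word is mapped to a single dense vector. A word may have several senses, but its vector is roughly an average over those senses, and the dot product measures how related two word vectors are.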

Word2vec

original word2vec paper

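Word2vec captures the idea of word vectors well: it computes the similarity between a center word and its neighboring words and turns it into a probability distribution: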

We have a large corpus (“body”) of text: a long list of words
Every word in a fixed vocabulary is represented by a vector
Go through each position t in the text, which has a center word c and context (“outside”) words o
Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
Keep adjusting the word vectors to maximize this probability

Objective Function

How do we compute the likelihood? For each position $t = 1, \dots, T$, predict context words within a window of fixed size $m$, given center word $w_t$. Data likelihood:

$$ L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \neq 0}} P(w_{t+j} \mid w_t; \theta) $$

To make this easier to compute and optimize, we take the average negative log of the likelihood, which turns the product into a sum. The objective function $J(\theta)$ is the (average) negative log likelihood:

$$ J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t; \theta) $$

Minimizing $J(\theta)$ is equivalent to maximizing predictive accuracy.

Prediction Function

$$ P(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w \in V} \exp(u_w^T v_c)} $$

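$u_o^T v_c$: the dot product compares the similarity of o and c, $u^T v = u \cdot v = \sum_{i=1}^{n} u_i v_i$. A larger dot product means the two words lie closer in the vector space (are more semantically related) and yields a larger probability.
$\exp()$: maps any real number to a positive number. Because the exponential grows quickly, it amplifies larger dot products, so strongly related words receive more weight.
$\sum_{w \in V} \exp(u_w^T v_c)$: the denominator sums over every word in the vocabulary $V$. This normalization makes the probabilities of all possible outputs sum to 1, giving a proper probability distribution.

This is an application of the softmax function: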

$$ \text{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)} = p_i $$

The softmax function maps arbitrary values $x_i$ to a probability distribution $p_i$.
“max”: it amplifies the probability of the largest $x_i$.
“soft”: it still assigns some probability to smaller $x_i$.
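
To make the prediction function concrete, here is a minimal sketch that computes $P(o \mid c)$ with a softmax over dot products. The three words and their 2-dimensional vectors are hypothetical values picked only for illustration; they do not come from the course.

```python
import math

def dot(u, v):
    """Dot product u^T v of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Map a list of real scores to a probability distribution."""
    m = max(scores)  # subtracting the max avoids overflow and does not change the result
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vectors (assumptions for illustration only).
v_c = [1.0, 0.5]  # center-word vector
U = {             # outside-word vectors u_w
    "bank": [0.9, 0.4],
    "river": [0.2, 0.1],
    "zebra": [-0.5, -0.3],
}

words = list(U)
probs = dict(zip(words, softmax([dot(U[w], v_c) for w in words])))
for w in words:
    print(w, round(probs[w], 3))
```

The word with the largest dot product gets the largest probability ("max"), while the others still keep a non-zero share ("soft").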

To Train the Model

To train a model, we gradually adjust parameters to minimize a loss.

$$ \theta = \begin{bmatrix} v_{aardvark} \\ v_a \\ \vdots \\ v_{zebra} \\ u_{aardvark} \\ u_a \\ \vdots \\ u_{zebra} \end{bmatrix} \in \mathbb{R}^{2dV} $$

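If the word vectors have dimension $d$ and the vocabulary has $V$ words, there are $2dV$ parameters in total.
Every word has two vectors: $\theta$ holds both representations (the center-word vector $v$ and the context-word vector $u$) for every word in the vocabulary.
We compute the gradient with respect to all of these parameters to optimize the model.

The gradient comes from taking partial derivatives of the softmax-based loss. The derivation goes as follows: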

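Initial loss (negative log likelihood of one word pair). This is the basic loss for a single center word $c$ and a single context word $o$:
$$ \text{Loss} = -\log P(o \mid c) $$
Expand with the softmax definition. Substitute the formula for $P(o \mid c)$ and use the log of a quotient to turn the division into a subtraction:
$$ \text{Loss} = -\log \left( \frac{\exp(u_o^T v_c)}{\sum_{w \in V} \exp(u_w^T v_c)} \right) = -u_o^T v_c + \log \sum_{w \in V} \exp(u_w^T v_c) $$
Differentiate the first term. Take the partial derivative of the dot product with respect to the center-word vector $v_c$:
$$ \frac{\partial}{\partial v_c} (u_o^T v_c) = u_o $$
Differentiate the second term. Apply the chain rule to the $\log \sum \exp(\dots)$ form:
$$ \frac{\partial}{\partial v_c} \log \sum_{w \in V} \exp(u_w^T v_c) = \frac{1}{\sum_{w \in V} \exp(u_w^T v_c)} \cdot \sum_{x \in V} \left[ \exp(u_x^T v_c) \cdot u_x \right] $$
Rewrite as an expectation. Regroup the previous result to recover the probability $P(x \mid c)$:
$$ \sum_{x \in V} \left[ \frac{\exp(u_x^T v_c)}{\sum_{w \in V} \exp(u_w^T v_c)} \right] u_x = \sum_{x \in V} P(x \mid c) u_x $$
Final gradient. Combine the two parts (the first term enters with a minus sign) to get the gradient used to update $v_c$: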
$$ \frac{\partial \text{Loss}}{\partial v_c} = -u_o + \sum_{x \in V} P(x \mid c) u_x $$
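
As a sanity check on the derivation, the sketch below compares the analytic gradient $-u_o + \sum_x P(x \mid c) u_x$ with a central-difference estimate. The tiny vocabulary of three outside vectors and the value of `v_c` are hypothetical numbers chosen for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def loss(v_c, U, o):
    """-log P(o | c) with a softmax over all outside vectors in U."""
    scores = [dot(u, v_c) for u in U]
    return -scores[o] + math.log(sum(math.exp(s) for s in scores))

def analytic_grad(v_c, U, o):
    """Gradient from the derivation: -u_o + sum_x P(x | c) u_x."""
    scores = [dot(u, v_c) for u in U]
    z = sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / z for s in scores]
    grad = [-U[o][k] for k in range(len(v_c))]
    for p, u in zip(probs, U):
        for k in range(len(v_c)):
            grad[k] += p * u[k]
    return grad

def numeric_grad(v_c, U, o, h=1e-5):
    """Central differences, one coordinate of v_c at a time."""
    grad = []
    for k in range(len(v_c)):
        plus = list(v_c)
        plus[k] += h
        minus = list(v_c)
        minus[k] -= h
        grad.append((loss(plus, U, o) - loss(minus, U, o)) / (2 * h))
    return grad

# Hypothetical outside vectors and center vector (assumptions).
U = [[0.3, -0.2], [0.1, 0.8], [-0.6, 0.4]]
v_c = [0.5, -0.1]
g_analytic = analytic_grad(v_c, U, o=1)
g_numeric = numeric_grad(v_c, U, o=1)
print(g_analytic, g_numeric)
```

The two gradients should agree to several decimal places, which is what the formula predicts.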

Gradient Descent

Gradient descent update (matrix form)

$$ \theta^{new} = \theta^{old} - \alpha \nabla_{\theta} J(\theta) $$

Gradient descent update (single-parameter form)

$$ \theta_j^{new} = \theta_j^{old} - \alpha \frac{\partial}{\partial \theta_j^{old}} J(\theta) $$

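$\alpha$ (alpha): step size or learning rate.

In practice, however, we do not use the update above directly; we use stochastic gradient descent (SGD).

The objective $J(\theta)$ is a function of every window in the corpus. Computing the full-corpus gradient $\nabla_{\theta} J(\theta)$ for each update is extremely expensive, and you would wait a very long time before making a single update.
SGD instead repeatedly samples windows and updates the parameters after each window (or small batch of windows).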

Skip-gram Model with Negative Sampling

negative sampling paper

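$$ P(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w \in V} \exp(u_w^T v_c)} $$

With the standard softmax, the denominator sums over every word in the vocabulary, which is too expensive to compute.

Skip-gram with negative sampling avoids summing over all words. Instead it trains binary logistic regressions that prefer a true (center word, context word) pair over random pairs. In practice $K$ negative samples are drawn, so the cost per pair drops to $O(K)$.

$$ J_{neg-sample}(u_o, v_c, U) = -\log \sigma(u_o^T v_c) - \sum_{k \in \{K \text{ sampled indices}\}} \log \sigma(-u_k^T v_c) $$

Here $\sigma(x)=\frac{1}{1+e^{-x}}$ is the sigmoid function; it pushes the probability of a true pair toward 1 and that of a sampled negative pair toward 0.

Sampling negatives directly from the unigram distribution $U(w)$ would pick rare words such as "zebra" too seldom and frequent words such as "the" too often, so the unigram probability is raised to the $3/4$ power:

$$ P(w)=U(w)^{3/4}/Z $$

This raises the sampling probability of rare words.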
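
The two formulas above can be sketched in a few lines. The word counts and vectors below are hypothetical, and `neg_sampling_loss` and `sampling_distribution` are names I made up for this illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def neg_sampling_loss(u_o, v_c, negatives):
    """-log sigma(u_o^T v_c) - sum_k log sigma(-u_k^T v_c)."""
    loss = -math.log(sigmoid(dot(u_o, v_c)))
    for u_k in negatives:
        loss -= math.log(sigmoid(-dot(u_k, v_c)))
    return loss

def sampling_distribution(counts, power=0.75):
    """P(w) = U(w)^(3/4) / Z; normalizing raw counts gives the same result."""
    weights = {w: c ** power for w, c in counts.items()}
    z = sum(weights.values())
    return {w: x / z for w, x in weights.items()}

# Hypothetical corpus counts (assumption).
counts = {"the": 1000, "cat": 50, "zebra": 2}
total = sum(counts.values())
unigram = {w: c / total for w, c in counts.items()}
smoothed = sampling_distribution(counts)

# A well-aligned true pair with a misaligned negative gives a small loss,
# and swapping the roles gives a large one.
good = neg_sampling_loss([1.0, 1.0], [1.0, 1.0], [[-1.0, -1.0]])
bad = neg_sampling_loss([-1.0, -1.0], [1.0, 1.0], [[1.0, 1.0]])
print(unigram["zebra"], smoothed["zebra"], good, bad)
```

With these counts the smoothed probability of "zebra" is several times its unigram probability, while that of "the" drops slightly.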

GloVe

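original GloVe paper (Global Vectors for Word Representation)

Co-occurrence Matrix

Building a co-occurrence matrix is simple: fix a window length, then count how often each pair of words co-occurs within that window (the example in the figure above uses a window of 1, counting only adjacent words). However:

The dimension of the vectors grows with the vocabulary, so storage grows and the matrix becomes very sparse; models built on it are not robust.
Function words occur extremely often but carry little information.
The counts do not reflect the relation between word distance and word relatedness.

How can we reduce the dimensionality? The classic method is singular value decomposition (SVD). (I asked an AI and still did not fully understand the theory, and Assignment 1 handles it with a single sklearn function.)

$$ X = U \Sigma V^T $$

$X$ (co-occurrence matrix): size $|V| \times |V|$ (the vocabulary size). Each entry $X_{ij}$ is the number of times word $i$ and word $j$ co-occur in the corpus.
$U$ and $V$ (orthogonal matrices): their columns are mutually orthogonal unit vectors (orthonormal). In NLP, each row of $U$ is usually taken as the raw "embedding" of the corresponding word.
$\Sigma$ (diagonal matrix): the diagonal values $\sigma_1, \sigma_2, \dots$ are the singular values. They are sorted from largest to smallest and indicate how important each dimension is (variance / information).

Dimensionality reduction then means shrinking the length-$|V|$ word vectors obtained from the co-occurrence matrix to length-$K$ vectors by keeping only the top $K$ singular values.

What really encodes meaning is not the co-occurrence probability itself but the ratio of co-occurrence probabilities:

Ratios of co-occurrence probabilities can encode meaning components, and we want to capture them as linear meaning components in the word vector space!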

	$x = \text{solid}$	$x = \text{gas}$	$x = \text{water}$	$x = \text{fashion}$
$P(x\mid\text{ice})$	$1.9 \times 10^{-4}$	$6.6 \times 10^{-5}$	$3.0 \times 10^{-3}$	$1.7 \times 10^{-5}$
$P(x\mid\text{steam})$	$2.2 \times 10^{-5}$	$7.8 \times 10^{-4}$	$2.2 \times 10^{-3}$	$1.8 \times 10^{-5}$
$\dfrac{P(x\mid\text{ice})}{P(x\mid\text{steam})}$	$8.9$	$8.5 \times 10^{-2}$	$1.36$	$0.96$

Analogies

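Word vectors are mathematically powerful, but using them for analogies in practice reveals quite a few problems:

The example below asks $woman + grandfather - man = ?$. The obvious and most likely answer is grandmother, so why do the other candidates the program returns, such as granddaughter and mother, score almost as high?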


# Run this cell to answer the analogy -- man : grandfather :: woman : x
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'grandfather'], negative=['man']))


[('grandmother', 0.7608445286750793),
 ('granddaughter', 0.7200808525085449),
 ('daughter', 0.7168302536010742),
 ('mother', 0.7151536345481873),
 ('niece', 0.7005682587623596),
 ('father', 0.6659888029098511),
 ('aunt', 0.6623409390449524),
 ('grandson', 0.6618767976760864),
 ('grandparents', 0.644661009311676),
 ('wife', 0.6445354223251343)]

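The assignment gives no reference answer, but with the help of an AI I learned that this comes down to "semantic clustering".

The "neighborhood effect" of meaning
- In the vector space, related words tend to gather into a "cluster". Computing $\vec{w} + \vec{g} - \vec{m}$ simply locates a point in that space.
- Words like granddaughter also carry the "female" and "kinship" attributes and often appear in contexts similar to those of grandfather or grandmother, so they lie very close along the "kinship" directions.
Overlapping dimensions
- Word vectors usually have a few hundred dimensions. Subtracting man and adding grandfather cannot fully erase the similarity in the other dimensions.
- daughter, mother, and grandmother share most of their features: [+female], [+human], [+kin].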

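The next example is an "incorrect analogy". The expected answer is clearly socks, so why does the model seem to ignore glove and hand and output a list of "-square" terms? These clearly relate not to foot as a body part but to foot as the unit of length.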


pprint.pprint(wv_from_bin.most_similar(positive=['foot', 'glove'], negative=['hand']))


[('45,000-square', 0.4922032654285431),
 ('15,000-square', 0.4649604558944702),
 ('10,000-square', 0.4544755816459656),
 ('6,000-square', 0.44975775480270386),
 ('3,500-square', 0.444133460521698),
 ('700-square', 0.44257497787475586),
 ('50,000-square', 0.4356396794319153),
 ('3,000-square', 0.43486514687538147),
 ('30,000-square', 0.4330596923828125),
 ('footed', 0.43236875534057617)]

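Interference from polysemy
- As noted above, foot is also the unit of length, so pairing it with square is reasonable.
Bias in the training corpus
- foot has several senses, yet every output uses the unit sense. This suggests that in the training corpus foot mostly co-occurs with "...-square".
Word choice
- Even these "...-square" outputs score below 0.5, which suggests that the model found no strongly related word and most likely failed to capture the relation among glove, hand, and foot.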

Neural Network

A neural network = running several logistic regressions at the same time.

CS231n Deep Learning on Network Architectures

CS231n Deep Learning for Computer Vision on Backprop

Structure

Non-linearities

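Why do neural networks need non-linearities?

Key idea: neural networks perform function approximation, e.g. regression or classification.
- Without non-linearities, a deep neural network can only compute a linear transformation.
- Extra layers collapse: stacked linear layers reduce to a single linear transformation, $W_1 W_2 x = Wx$.
- With non-linearities, a network with more layers can approximate more complex functions!

The pair of plots at the lower left: the left plot shows a linear classifier (straight decision boundary only), which cannot separate the complex distribution of red and green points; the right plot shows a non-linear classifier (curved boundary), which separates the data perfectly.
The three curve plots on the right: as the target function becomes more complex, only a non-linear model can fit the undulating blue points (the observed data).

The common non-linear functions were all covered in the "智能计算系统" (Intelligent Computing Systems) course, so I will not repeat them here.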

Gradients

derivatives.pdf

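A gradient is, simply put, the derivative with respect to a variable. For example, given:

$$ f(x)=x^3 $$

its gradient is:

$$ \frac{df}{dx}=3x^2 $$

This is of course a very simple example. In practice we apply the chain rule at scale and compute gradients with matrices (Jacobian matrices).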

Chain Rule

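In single-variable calculus, if $y = f(u)$ and $u = g(x)$, then:

$$ \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} $$

In a neural network, however, each layer is a vector $\mathbf{h}, \mathbf{z} \in \mathbb{R}^n$. When we extend this logic to vectors, the multiplication becomes matrix multiplication.

For multiple variables, we multiply Jacobian matrices:

Suppose $\mathbf{h} = f(\mathbf{z})$ and $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$. The partial derivatives below are Jacobian matrices: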

$$ \frac{\partial \mathbf{h}}{\partial \mathbf{x}} = \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{x}} $$

Matrix Calculus

As the following shows, this Jacobian has non-zero entries only on the diagonal:

$$ \begin{aligned} \left( \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \right)_{ij} &= \frac{\partial h_i}{\partial z_j} = \frac{\partial}{\partial z_j} f(z_i) \quad && \text{definition of Jacobian} \\ &= \begin{cases} f'(z_i) & \text{if } i = j \\ 0 & \text{if otherwise} \end{cases} \quad && \text{regular 1-variable derivative} \end{aligned} $$

$$ \frac{\partial \mathbf h}{\partial \mathbf z} = \begin{pmatrix} f'(z_1) & 0 & \cdots & 0 \\ 0 & f'(z_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & f'(z_n) \end{pmatrix} = \operatorname{diag}(f'(\mathbf z)) $$

Another commonly used Jacobian:

$$ \frac{\partial}{\partial \mathbf{u}}(\mathbf{u}^T \mathbf{h})=\mathbf h^T $$

Suppose $\mathbf{u}$ and $\mathbf{h}$ are both $n$-dimensional column vectors:

$$ \mathbf{u} = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}, \quad \mathbf{h} = \begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_n \end{bmatrix} $$

Their inner product (the expression inside the parentheses) is a scalar:

$$ f = \mathbf{u}^T \mathbf{h} = u_1 h_1 + u_2 h_2 + \dots + u_n h_n = \sum_{i=1}^n u_i h_i $$

We differentiate with respect to the vector $\mathbf{u}$. By the definition of the Jacobian, we take the derivative with respect to each element $u_k$ of $\mathbf{u}$:

$$ \frac{\partial f}{\partial u_k} = \frac{\partial}{\partial u_k} (u_1 h_1 + \dots + u_k h_k + \dots + u_n h_n) $$

No term other than $u_k h_k$ contains $u_k$, so the derivatives of all other terms with respect to $u_k$ are $0$:

$$ \frac{\partial f}{\partial u_k} = h_k $$

By the Jacobian convention, the derivative of a scalar with respect to a column vector is a row vector:

$$ \frac{\partial f}{\partial \mathbf{u}} = \begin{bmatrix} \frac{\partial f}{\partial u_1} & \frac{\partial f}{\partial u_2} & \dots & \frac{\partial f}{\partial u_n} \end{bmatrix} = \begin{bmatrix} h_1 & h_2 & \dots & h_n \end{bmatrix} = \mathbf{h}^T $$

Write out the Jacobians

$$ \begin{aligned} \frac{\partial s}{\partial \mathbf{b}} &= \frac{\partial s}{\partial \mathbf{h}} \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{b}} \\ &= \mathbf{u}^T \text{diag}(f'(\mathbf{z})) \mathbf{I} \\ &= \mathbf{u}^T \odot f'(\mathbf{z}) \end{aligned} $$

$\odot$ = Hadamard product = element-wise multiplication of 2 vectors to give vector

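Variable	Meaning in the network	Notes
$s$	Score / loss value	The final scalar output (e.g. a cross-entropy loss). We want to know how it changes with the parameters.
$\mathbf{b}$	Bias vector	The bias of the current layer, one of the parameters the network learns.
$\mathbf{z}$	Pre-activation (logits)	The result of the linear combination, $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$.
$\mathbf{h}$	Activation (hidden state)	The output of applying the non-linear activation to $\mathbf{z}$, $\mathbf{h} = f(\mathbf{z})$.
$\mathbf{u}^T$	Upstream gradient ($\frac{\partial s}{\partial \mathbf{h}}$)	The derivative of the loss with respect to this layer's output; the signal "backpropagated" from higher layers.
$f'(\mathbf{z})$	Derivative of the activation function	E.g. the derivative of ReLU or sigmoid. It determines which neurons are active.
$\mathbf{I}$	Identity matrix	Because $\mathbf{z} = \dots + \mathbf{b}$, the derivative of $\mathbf{z}$ with respect to $\mathbf{b}$ is 1 (the identity matrix in matrix form).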

Re-using Computation

Error signal (upstream gradient) $\boldsymbol{\delta}$:

$$ \boldsymbol{\delta} = \frac{\partial s}{\partial \mathbf{h}} \frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \mathbf{u}^T \odot f'(\mathbf{z}) $$

Computing $\boldsymbol{\delta}$ once and reusing it simplifies the remaining gradients:

Gradient with respect to the weight matrix $\mathbf{W}$:
$$ \frac{\partial s}{\partial \mathbf{W}} = \boldsymbol{\delta} \frac{\partial \mathbf{z}}{\partial \mathbf{W}} $$
Gradient with respect to the bias vector $\mathbf{b}$:
$$ \frac{\partial s}{\partial \mathbf{b}} = \boldsymbol{\delta} \frac{\partial \mathbf{z}}{\partial \mathbf{b}} = \boldsymbol{\delta} $$

Shape Convention

Suppose the weights are $\mathbf{W} \in \mathbb{R}^{n \times m}$ and the output is a scalar $s$ (the loss). By the strict mathematical definition, $\frac{\partial s}{\partial \mathbf{W}}$ is a $1 \times nm$ row vector (a Jacobian). In that form, however, the update $\theta^{new} = \theta^{old} - \alpha \nabla_{\theta} J(\theta)$ cannot be applied directly because the shapes do not match.

For convenience, we adopt the convention that the shape of a gradient equals the shape of the parameter. So $\frac{\partial s}{\partial \mathbf{W}}$ is also an $n \times m$ matrix:

$$ \frac{\partial s}{\partial \mathbf{W}} = \begin{bmatrix} \frac{\partial s}{\partial W_{11}} & \dots & \frac{\partial s}{\partial W_{1m}} \\ \vdots & \ddots & \vdots \\ \frac{\partial s}{\partial W_{n1}} & \dots & \frac{\partial s}{\partial W_{nm}} \end{bmatrix} $$

$$ \frac{\partial s}{\partial \mathbf{W}} = \boldsymbol{\delta}^T \mathbf{x}^T $$

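So which "shape" should we use when presenting a derivative?

Always follow the shape convention:

Approach: do not insist on the strict Jacobian definition; keep track of the dimensions of every variable instead.
Key trick: use the dimensions to decide when a term must be transposed or when the order of a matrix product must change, so that the gradient computed at each layer has exactly the shape of that layer's parameters.
An important fact about $\boldsymbol{\delta}$: the error signal $\boldsymbol{\delta}$ that arrives at a hidden layer has the same dimension as the number of neurons in that layer (the dimension of its activation vector).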
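
To see the shape convention in action, here is a small sketch for $s = \mathbf{u}^T f(\mathbf{W}\mathbf{x} + \mathbf{b})$. I assume $f = \tanh$ and use made-up numbers with $n = 2$ hidden units and $m = 3$ inputs; neither choice comes from the notes above.

```python
import math

def forward(W, b, u, x):
    """Return the score s = u^T tanh(Wx + b) and the pre-activation z."""
    z = [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]
    h = [math.tanh(zi) for zi in z]
    s = sum(ui * hi for ui, hi in zip(u, h))
    return s, z

def gradients(W, b, u, x):
    """Gradients in the shape convention: grad_W is n x m, grad_b has length n."""
    _, z = forward(W, b, u, x)
    # delta = u ⊙ f'(z); for f = tanh, f'(z) = 1 - tanh(z)^2
    delta = [ui * (1.0 - math.tanh(zi) ** 2) for ui, zi in zip(u, z)]
    grad_b = list(delta)
    grad_W = [[di * xj for xj in x] for di in delta]  # outer product delta^T x^T
    return grad_W, grad_b

# Hypothetical values (assumptions for illustration).
W = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.5]]
b = [0.05, -0.1]
u = [1.5, -2.0]
x = [1.0, 2.0, -1.0]
grad_W, grad_b = gradients(W, b, u, x)

# Central-difference check of a single entry, W[1][2].
h_step = 1e-5
W_plus = [row[:] for row in W]
W_plus[1][2] += h_step
W_minus = [row[:] for row in W]
W_minus[1][2] -= h_step
numeric = (forward(W_plus, b, u, x)[0] - forward(W_minus, b, u, x)[0]) / (2 * h_step)
print(grad_W[1][2], numeric)
```

`grad_W` has the same $2 \times 3$ shape as `W`, so it can be subtracted from `W` directly in a gradient descent step.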

Backpropagation

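Computing each function step by step, from the inputs to the output, is forward propagation.

For a single node in backpropagation, $\text{downstream gradient} = \text{upstream gradient} \times \text{local gradient}$.

For a node with multiple inputs, the upstream gradient is the same for all inputs and each input has its own local gradient, but the formula is unchanged.

Here is a concrete multi-input example.

Building on it, suppose the input y changes to 2.1. Then $a = x + y = 3.1$, $b = \max(y, z) = y = 2.1$, and $a \times b = 6.51$.

So a change of 0.1 in y changes the result by 0.51, which gives a gradient estimate of $\frac{\Delta f}{\Delta y}=5.1$.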
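
The same example can be traced in code. This is a sketch for $f = (x + y) \cdot \max(y, z)$; $x = 1$ follows from $a = 3.1$ at $y = 2.1$, while $z = 0$ is my assumption (any $z$ smaller than $y$ behaves the same way).

```python
def forward_backward(x, y, z):
    """Forward and backward pass for f = (x + y) * max(y, z)."""
    # Forward pass.
    a = x + y
    b = max(y, z)
    f = a * b
    # Backward pass, starting from the upstream gradient df/df = 1.
    df_da = b  # local gradient of a * b with respect to a
    df_db = a  # local gradient of a * b with respect to b
    # The '+' node passes its upstream gradient to both inputs unchanged.
    df_dx = df_da
    df_dy_via_a = df_da
    # The 'max' node routes its upstream gradient to the larger input only.
    df_dy_via_b = df_db if y > z else 0.0
    df_dz = df_db if z > y else 0.0
    # y feeds two nodes, so its two gradient contributions are summed.
    df_dy = df_dy_via_a + df_dy_via_b
    return f, (df_dx, df_dy, df_dz)

f0, grads = forward_backward(1.0, 2.0, 0.0)
f1, _ = forward_backward(1.0, 2.1, 0.0)
estimate = (f1 - f0) / 0.1
print(f0, grads, f1, estimate)
```

Backpropagation gives $\partial f / \partial y = 5$ exactly at $y = 2$; the value 5.1 is the finite-difference estimate for a step of 0.1, and it approaches 5 as the step shrinks.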

Implementations

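In theory, given the symbolic form of the forward pass, a computer could derive the backward pass automatically. In modern frameworks, however, the local derivative of each operation is written by hand by whoever implements it, which makes the system more efficient and stable than a fully automatic approach.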


class MultiplyGate(object):
    def forward(self, x, y):
        z = x * y
        self.x = x  # must keep these around!
        self.y = y
        return z

    def backward(self, dz):
        dx = self.y * dz  # [dz/dx * dL/dz]
        dy = self.x * dz  # [dz/dy * dL/dz]
        return [dx, dy]

Numeric Gradient Checking

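When you derive and implement backpropagation by hand, this is the standard way to verify that both the math and the code are correct:

$$ f'(x) \approx \frac{f(x + h) - f(x - h)}{2h} $$

It needs only the forward function $f(x)$ and no elaborate derivation, so it is hard to get wrong.
It requires two forward passes (plus $h$ and minus $h$) for every single parameter, so it is very slow.
It suits local tests: do not check an entire large network; check one layer or a small set of parameters (e.g. a $3 \times 3$ matrix).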
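
As a minimal sketch of such a check, the code below compares the hand-written backward pass of a multiply gate with the central-difference formula. The input values and the step size `h` are arbitrary choices of mine.

```python
class MultiplyGate:
    """z = x * y with a hand-written backward pass."""

    def forward(self, x, y):
        self.x = x  # cache the inputs for the backward pass
        self.y = y
        return x * y

    def backward(self, dz):
        return [self.y * dz, self.x * dz]

def numeric_derivative(f, x, h=1e-5):
    """Central difference (f(x + h) - f(x - h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2 * h)

gate = MultiplyGate()
x, y = 3.0, -4.0
gate.forward(x, y)
dx, dy = gate.backward(1.0)  # upstream gradient of 1

# Vary one input at a time, holding the other fixed.
num_dx = numeric_derivative(lambda t: MultiplyGate().forward(t, y), x)
num_dy = numeric_derivative(lambda t: MultiplyGate().forward(x, t), y)
print(dx, num_dx, dy, num_dy)
```

If the analytic and numeric values disagree by more than a small tolerance, either the derivation or the code has a bug.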

Dependency Parsing

Syntactic Structure

Phrase structure organizes words into nested constituents.

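We can define our own grammar for how phrases are formed. For example, a noun phrase can be "determiner + adjective + noun" or "determiner + noun + prepositional phrase", and a prepositional phrase can be "preposition + noun ...".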

Dependency structure shows which words depend on (modify, attach to, or are arguments of) which other words

Language is prone to ambiguity, and prepositional phrases in English make it even more so. For example:

"Scientists count whales from space" can be read as "Scientists [count] [whales from space]" or as "Scientists [count whales] [from space]".

Dependency Grammar and Treebanks

Dependency syntax postulates that syntactic structure consists of relations between lexical items, normally binary asymmetric relations (“arrows”) called dependencies

The figure below shows a fairly old dependency structure.

An arrow connects a head (governor, superior, regent) with a dependent (modifier, inferior, subordinate)

Usually, dependencies form a tree (a connected, acyclic, single-root graph)

Annotated Data

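At first, building a treebank seems much slower than writing grammar rules by hand, and less useful. Annotating data manually is indeed tedious, but in hindsight it has major advantages:

Reusability: one annotated dataset can be used to train many parsers and part-of-speech (POS) taggers.
Broad coverage: hand-written rules tend to cover only a few intuitive examples, whereas annotated real text covers the complex ways language is actually used.
Frequencies and distributional information: it tells the machine which structures are more common, which helps probabilistic models decide more accurately.
A way to evaluate NLP systems: without this gold standard we could not measure a model's accuracy (e.g. its LAS/UAS scores).

The dependency relations in the example figure mean the following: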

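Abbreviation	Meaning	In short
nsubj	nominal subject	the doer of the action (e.g. I think)
nsubjpass	passive nominal subject	the subject of a passive clause (e.g. city called)
ccomp	clausal complement	the whole clause that follows the verb (e.g. think …)
advmod	adverbial modifier	modifies the verb for degree, question, etc. (e.g. Why)
amod	adjectival modifier	an adjective that modifies a noun (e.g. famous goat)
compound	compound	a noun that modifies a noun (e.g. goat trainer)
det	determiner	points to words such as a, the, any
case	case marking	points to prepositions such as in, at
conj	conjunct	words joined by or, and (e.g. trainer or something)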

Dependency Conditioning Preferences

During parsing, the parser uses "dependency conditioning preferences" to judge whether two words stand in a dependency relation:

Bilexical affinities: the dependency (e.g. [discussion → issues]) is plausible.
Dependency distance: most (but not all) dependencies hold between nearby words.
Intervening material: dependencies rarely span intervening verbs or punctuation.
Valency of heads: how many dependents does a head usually have, and on which side?

Projectivity

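If the words of a sentence are laid out in linear order and all dependency arcs are drawn above the words, the parse is "projective" when no two arcs cross. If arcs do cross, the parse is non-projective, which indicates long-distance displacement or overlapping structures.

Non-projective sentences are nevertheless common in real text, for example "Who did Bill buy the coffee from yesterday".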

Transition-Based Dependency Parser

A transition-based dependency parser keeps a stack, a buffer, and a set of arcs, and it has three operations:

Start: $\sigma = [ROOT], \beta = w_1, ..., w_n, A = \emptyset$

Shift: $\sigma, w_i | \beta, A \Rightarrow \sigma | w_i, \beta, A $
Left-$Arc_r$: $\sigma | w_i | w_j, \beta, A \Rightarrow \sigma | w_j, \beta, A \cup \{r(w_j, w_i)\} $
Right-$Arc_r$: $\sigma | w_i | w_j, \beta, A \Rightarrow \sigma | w_i, \beta, A \cup \{r(w_i, w_j)\}$

Finish: $\sigma = [w], \beta = \emptyset$

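σ (sigma) is the stack, which holds the words currently being processed or waiting to be attached.
β (beta) is the buffer, which holds the input words not yet processed.
A is the set of dependency arcs built so far.
Left-$Arc_r$ and Right-$Arc_r$ are the two reduce operations; each makes one word a dependent of the other, attaching to the left or to the right.

Here is an example: analysis of “I ate fish”

After "I" and "ate" have been shifted onto the stack, Left-Arc creates an arc from the top of the stack to the second element, making "ate" the head and "I" its dependent, and then removes "I" from the stack.
Shift moves "fish" from the buffer onto the stack.
Right-Arc creates an arc from the second element to the top of the stack, making "ate" the head and "fish" its dependent, and then removes "fish" from the stack.
The final Right-Arc makes [ROOT] point to "ate". Once "ate" is popped, only the root remains and the parse is complete.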
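
The walk-through above can be replayed with a small simulator. This is only a sketch: it applies a transition sequence that I supply by hand (a real parser predicts each transition), it omits the relation labels $r$, and the function name `parse` is my own.

```python
def parse(words, transitions):
    """Apply arc-standard transitions; return the final stack, buffer and arcs."""
    stack = ["ROOT"]
    buffer = list(words)
    arcs = []  # (head, dependent) pairs, unlabeled
    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))
        elif t == "LEFT-ARC":
            dependent = stack.pop(-2)  # second element depends on the top
            arcs.append((stack[-1], dependent))
        elif t == "RIGHT-ARC":
            dependent = stack.pop()    # top depends on the second element
            arcs.append((stack[-1], dependent))
        else:
            raise ValueError(f"unknown transition: {t}")
    return stack, buffer, arcs

transitions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"]
stack, buffer, arcs = parse(["I", "ate", "fish"], transitions)
print(stack, buffer, arcs)
```

The run ends in the Finish configuration: the stack holds only ROOT, the buffer is empty, and the arcs are ate → I, ate → fish, and ROOT → ate.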

Evaluation of Dependency Parsing

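Dependency parsers are evaluated with UAS (unlabeled attachment score) and LAS (labeled attachment score). A concrete example is “[ROOT] She saw the video lecture.”; in the table, "Gold" is the reference annotation and "Parsed" is the parser's output:

UAS measures whether each word's head was found correctly. In this example, the head assigned to the third word, "the", differs from the gold annotation.
LAS measures whether the relation type is also correct, i.e. both the head and the relation label must match. In this example only "She" and "saw" match the gold annotation.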

Neural dependency parsing

More than 95% of parsing time is consumed by feature computation

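So we can use a neural network to speed up feature computation. The method is still the transition-based parser described above; by using deep-learning ingredients such as dense vector representations and non-linearities, it became the first neural-network-based dependency parser (2014).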

Recurrent Neural Networks

Language Modeling

A language model, simply put, takes text (words) as input and outputs probabilities.

$$ \begin{aligned} P(\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(T)}) &= P(\boldsymbol{x}^{(1)}) \times P(\boldsymbol{x}^{(2)} | \boldsymbol{x}^{(1)}) \times \dots \times P(\boldsymbol{x}^{(T)} | \boldsymbol{x}^{(T-1)}, \dots, \boldsymbol{x}^{(1)}) \\ &= \prod_{t=1}^{T} \underbrace{P(\boldsymbol{x}^{(t)} | \boldsymbol{x}^{(t-1)}, \dots, \boldsymbol{x}^{(1)})}_{\text{This is what our LM provides}} \end{aligned} $$

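$P(\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(T)})$ is the probability of an entire sequence (such as a sentence). By factoring the joint probability into a product of conditional probabilities (the chain rule), we can compute the probability of the sequence. The core task of a language model is to predict the probability of the next token $\boldsymbol{x}^{(t)}$ given the preceding context $\boldsymbol{x}^{(t-1)}, \dots, \boldsymbol{x}^{(1)}$.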

n-gram Language Models

An n-gram is a chunk of n consecutive words; n is the number of words that form one unit. An n-gram language model is built as follows:

First, make a Markov assumption: the word $x^{(t+1)}$ depends only on the preceding $n-1$ words.
$$ P(x^{(t+1)} | x^{(t)}, \dots, x^{(1)}) = P(x^{(t+1)} | \underbrace{x^{(t)}, \dots, x^{(t-n+2)}}_{n-1 \text{ words}}) \quad \text{(assumption)} $$
By the definition of conditional probability, this can be written as the ratio of an $n$-gram probability to an $(n-1)$-gram probability:
$$ = \frac{P(x^{(t+1)}, x^{(t)}, \dots, x^{(t-n+2)}) \leftarrow \text{prob of a n-gram}}{P(x^{(t)}, \dots, x^{(t-n+2)}) \leftarrow \text{prob of a (n-1)-gram}} \quad \text{(definition of conditional prob)} $$
These probabilities are approximated by counting how often the word sequences occur in a large text corpus:
$$ \approx \frac{\text{count}(x^{(t+1)}, x^{(t)}, \dots, x^{(t-n+2)})}{\text{count}(x^{(t)}, \dots, x^{(t-n+2)})} \quad \text{(statistical approximation)} $$

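For example, suppose we have a 4-gram language model and want to predict the word that fills the final blank:

“as the proctor started the clock, the students opened their ……”

We keep only the last three words, “students opened their”:

$$ P(w\mid students\ opened\ their)=\frac{count(students\ opened\ their\ w)}{count(students\ opened\ their)} $$

Depending on the corpus, “students opened their books” may well be the most frequent continuation, while “... exams”, which fits the context better, is less frequent.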
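
The counting estimate is easy to sketch. The toy corpus below is invented for illustration, and `ngram_counts` and `next_word_probs` are names of my own choosing.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def next_word_probs(tokens, context):
    """Estimate P(w | context) as count(context + w) / count(context)."""
    context = tuple(context)
    n = len(context) + 1
    numerators = ngram_counts(tokens, n)
    total = ngram_counts(tokens, n - 1)[context]
    if total == 0:
        return {}  # unseen prefix: a real model would back off to a shorter context
    return {gram[-1]: c / total
            for gram, c in numerators.items() if gram[:-1] == context}

# Toy corpus (assumption for illustration).
corpus = ("students opened their books . students opened their exams . "
          "students opened their books .").split()
probs = next_word_probs(corpus, ["students", "opened", "their"])
print(probs)
```

In this toy corpus "books" follows the prefix twice and "exams" once, so the model prefers "books" regardless of any earlier context such as "proctor".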

Problems with n-gram Language Models

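Estimating probabilities from counts leads to sparsity problems:

If “students opened their $w$” never occurs in the training data, then the probability of that word $w$ is 0.

Fix (smoothing): add a small value $\delta$ to the count of every word $w \in V$.
If the prefix “students opened their” never occurs in the training data, we cannot compute the probability of any $w$, because the denominator is 0.

Fix (backoff): instead of the full prefix, condition on a shorter context (e.g. “opened their”).

There are also storage problems:

The counts of all n-grams seen in the corpus must be stored.
If n increases, the corpus (and with it the table of stored counts) must grow considerably as well.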

A fixed-window neural Language Model

Input layer (words / one-hot vectors): the inputs are the one-hot vectors $\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)}$.
Embedding layer (concatenated word embeddings): each word is mapped to a dense vector (embedding), and the embeddings are concatenated:
$$ \boldsymbol{e} = [\boldsymbol{e}^{(1)}; \boldsymbol{e}^{(2)}; \boldsymbol{e}^{(3)}; \boldsymbol{e}^{(4)}] $$
Hidden layer: a linear transformation with weight matrix $W$ and bias $\boldsymbol{b}_1$, followed by an activation function $f$ (typically tanh or ReLU):
$$ \boldsymbol{h} = f(W\boldsymbol{e} + \boldsymbol{b}_1) $$
Output layer (output distribution): a weight matrix $U$ and bias $\boldsymbol{b}_2$, followed by a softmax that produces a probability distribution $\hat{\boldsymbol{y}}$ over the vocabulary $V$:
$$ \hat{\boldsymbol{y}} = \text{softmax}(U\boldsymbol{h} + \boldsymbol{b}_2) \in \mathbb{R}^{|V|} $$

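Improvements over the n-gram approach:

No sparsity problem: the model no longer relies on exact counts and generalizes to unseen word sequences through similarity in the vector space.
Less storage: there is no need to store the frequencies of all observed $n$-grams, only the model parameters.

Problems that remain:

Fixed window:
- A fixed context window is usually too small.
- Enlarging the window enlarges the weight matrix $W$ proportionally.
- However large the window, it can never capture long-range dependencies beyond it.
No symmetry: the inputs $\boldsymbol{x}^{(1)}$ and $\boldsymbol{x}^{(2)}$ are multiplied by completely different weights in $W$, so the model does not process each position in a consistent way.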

RNN Language Model

The Unreasonable Effectiveness of Recurrent Neural Networks

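Advantages of RNNs:

They can process inputs of any length.
In theory, the computation at step $t$ can use information from many steps back.
Fixed model size: a longer input does not add parameters.
Symmetry: the same weights are applied at every step, so inputs are processed consistently.

Disadvantages of RNNs:

Slow computation: the recurrence is sequential, so the time steps cannot be processed in parallel.
In practice, it is hard to access information from many steps back (the vanishing/exploding gradient problem).

Train an RNN Language Model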

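Get a large text corpus, which is a sequence of words $\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(T)}$.
Feed the sequence into the RNN-LM and compute the output distribution $\hat{\boldsymbol{y}}^{(t)}$ for every time step $t$. That is, at each position the model predicts a probability distribution over the next word, given the words seen so far.
Each step $t$ yields a loss: the cross-entropy between the predicted distribution $\hat{\boldsymbol{y}}^{(t)}$ and the true next word $\boldsymbol{y}^{(t)}$ (the one-hot vector for $\boldsymbol{x}^{(t+1)}$):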
$$ J^{(t)}(\theta) = CE(\boldsymbol{y}^{(t)}, \hat{\boldsymbol{y}}^{(t)}) = - \sum_{w \in V} \boldsymbol{y}^{(t)}_w \log \hat{\boldsymbol{y}}^{(t)}_w = - \log \hat{\boldsymbol{y}}^{(t)}_{\boldsymbol{x}_{t+1}} $$
To get the loss over the whole training set, average the losses over all steps:
$$ J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta) = \frac{1}{T} \sum_{t=1}^{T} - \log \hat{\boldsymbol{y}}^{(t)}_{\boldsymbol{x}_{t+1}} $$
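The loss is computed with teacher forcing: the next input is not the word the model actually predicted at the previous step but the true word from the corpus.
Computing the loss and gradients over the entire corpus $\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(T)}$ at once is extremely expensive in memory. In practice we therefore treat $\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(T)}$ as a sentence or a document, use stochastic gradient descent to compute the loss and gradients on a small chunk of data, and update the parameters immediately.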

Backpropagation for RNN

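RNN parameters are trained with backpropagation through time: gradients are propagated backward over the time steps $i = t, \dots, 0$ and accumulated along the way.

Because $\boldsymbol{W}_h$ is shared across all time steps (the same weights), the total gradient is the sum of the gradients contributed at each time step.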

$$ \frac{\partial J^{(t)}}{\partial \boldsymbol{W}_h} = \sum_{i=1}^{t} \left. \frac{\partial J^{(t)}}{\partial \boldsymbol{W}_h} \right|_{(i)} \frac{\partial \boldsymbol{W}_h|_{(i)}}{\partial \boldsymbol{W}_h} = \sum_{i=1}^{t} \left. \frac{\partial J^{(t)}}{\partial \boldsymbol{W}_h} \right|_{(i)} $$

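As the sequence grows, full backpropagation becomes very expensive and is prone to vanishing or exploding gradients. In practice, to keep training efficient, backpropagation is usually truncated after about 20 time steps.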

Exploding Gradient

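Exploding gradients:

This happens when the eigenvalues of $W_h$ (roughly, the magnitude of the weights) are greater than 1.
As the number of time steps $T$ grows, the gradient grows exponentially.
The weights are updated by too large a step, the network becomes very unstable, parameter values may overflow (become NaN), and training collapses.

The remedy is gradient clipping: if the norm of the gradient exceeds a preset threshold before the parameter update, i.e. $\|\hat{\boldsymbol{g}}\| \ge threshold$, scale the gradient down. Clipping keeps the update in the same direction but takes a smaller step: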

$$ \hat{\boldsymbol{g}} \leftarrow \frac{threshold}{\|\hat{\boldsymbol{g}}\|} \hat{\boldsymbol{g}} $$
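
The clipping rule is short enough to sketch directly. The gradient values and the threshold of 1.0 are made-up numbers, and `clip_gradient` is my own helper name.

```python
import math

def clip_gradient(g, threshold):
    """Rescale g to norm `threshold` if its L2 norm is at least `threshold`."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm >= threshold:
        return [x * threshold / norm for x in g]
    return list(g)

clipped = clip_gradient([3.0, 4.0], threshold=1.0)    # norm 5: scaled down
unchanged = clip_gradient([0.3, 0.4], threshold=1.0)  # norm 0.5: left as is
print(clipped, unchanged)
```

The clipped gradient points in the same direction as the original one; only its length changes.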

Vanishing Gradient

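Vanishing gradients:

This happens when the eigenvalues of $W_h$ are less than 1, or when the derivative of the activation function ($f$, e.g. tanh) is less than 1.
The gradient shrinks exponentially with the number of steps it is propagated back.
This is the RNN disadvantage noted earlier: in practice it is hard to access information from many steps back. When the gradient becomes tiny, updates driven by distant time steps almost stop, and the model "forgets" long-range context.

For a vanilla RNN it is simply too hard to learn to preserve information over many time steps, because the hidden state $\boldsymbol{h}^{(t)}$ is constantly rewritten:

$$ \boldsymbol{h}^{(t)} = \sigma(\boldsymbol{W}_h \boldsymbol{h}^{(t-1)} + \boldsymbol{W}_x \boldsymbol{x}^{(t)} + \boldsymbol{b}) $$

So we introduce a separate memory, as in the LSTM, or create direct connections, as in the attention mechanism.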

Long Short-Term Memory

Understanding LSTM Networks – colah’s blog

Forget gate: controls what is kept and what is forgotten from the previous cell state.

$$ \boldsymbol{f}^{(t)} = \sigma (\boldsymbol{W}_f \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_f \boldsymbol{x}^{(t)} + \boldsymbol{b}_f) $$

Input gate: controls which parts of the new cell content are written to the cell.

$$ \boldsymbol{i}^{(t)} = \sigma (\boldsymbol{W}_i \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_i \boldsymbol{x}^{(t)} + \boldsymbol{b}_i) $$

Output gate: controls which parts of the cell are output to the hidden state.

$$ \boldsymbol{o}^{(t)} = \sigma (\boldsymbol{W}_o \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_o \boldsymbol{x}^{(t)} + \boldsymbol{b}_o) $$

New cell content: the new content to be written to the cell (the "candidate" content).

$$ \tilde{\boldsymbol{c}}^{(t)} = \tanh (\boldsymbol{W}_c \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_c \boldsymbol{x}^{(t)} + \boldsymbol{b}_c) $$

Cell state: erase ("forget") some content from the previous cell state and write ("input") some new cell content.

$$ \boldsymbol{c}^{(t)} = \boldsymbol{f}^{(t)} \odot \boldsymbol{c}^{(t-1)} + \boldsymbol{i}^{(t)} \odot \tilde{\boldsymbol{c}}^{(t)} $$

Hidden state: read ("output") some content from the cell.

$$ \boldsymbol{h}^{(t)} = \boldsymbol{o}^{(t)} \odot \tanh \boldsymbol{c}^{(t)} $$
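
The six equations translate into a single step function. Below is a plain-Python sketch with a one-dimensional cell; the parameter values are artificial, chosen so that the forget gate is close to 1 and the input gate close to 0, and the helper names (`layer`, `lstm_step`) are mine.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def layer(params, name, h_prev, x, activation):
    """activation(W h_prev + U x + b) for the named gate or candidate."""
    W, U, b = params[name]
    pre = [wh + ux + bi for wh, ux, bi in zip(matvec(W, h_prev), matvec(U, x), b)]
    return [activation(p) for p in pre]

def lstm_step(params, h_prev, c_prev, x):
    f = layer(params, "f", h_prev, x, sigmoid)          # forget gate
    i = layer(params, "i", h_prev, x, sigmoid)          # input gate
    o = layer(params, "o", h_prev, x, sigmoid)          # output gate
    c_tilde = layer(params, "c", h_prev, x, math.tanh)  # new cell content
    c = [fj * cj + ij * tj for fj, cj, ij, tj in zip(f, c_prev, i, c_tilde)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c

# Artificial parameters (assumptions): zero weights, biases chosen so that
# the forget gate is about 1 and the input gate is about 0.
zero = [[0.0]]
params = {
    "f": (zero, zero, [20.0]),
    "i": (zero, zero, [-20.0]),
    "o": (zero, zero, [0.0]),
    "c": (zero, zero, [0.0]),
}
h, c = [0.0], [0.7]
for _ in range(50):
    h, c = lstm_step(params, h, c, x=[1.0])
print(h, c)
```

After 50 steps the cell value is still about 0.7, which illustrates how a forget gate near 1 and an input gate near 0 preserve information across many time steps.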

Step-by-Step LSTM Walk Through

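In the diagram above, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, such as vector addition, and the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied, with the copies going to different locations.

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is like a conveyor belt. It runs straight down the entire chain with only some minor linear interactions, so information can easily flow along it unchanged.
The LSTM can add information to or remove information from the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed of a sigmoid neural network layer and a pointwise multiplication.

The first step in an LSTM is to decide what information to throw away from the cell state. This decision is made by a sigmoid layer called the "forget gate layer". It looks at $h_{t-1}$ and $x_t$ and outputs a number between 0 and 1 for each number in the cell state $C_{t-1}$; 1 means "keep this completely" and 0 means "discard this completely".
Return to the example of a language model that tries to predict the next word from all the previous ones. In such a problem, the cell state might include the gender of the current subject so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.

The next step is to decide what new information to store in the cell state. This has two parts. First, a sigmoid layer called the "input gate layer" decides which values to update; next, a $\tanh$ layer creates a vector of new candidate values $\tilde{C}_t$ that could be added to the state. In the following step these two are combined to update the state.
In the language model example, we want to add the gender of the new subject to the cell state, replacing the old one we are forgetting.

It is now time to update the old cell state $C_{t-1}$ into the new cell state $C_t$. The previous steps already decided what to do; we only need to do it.
We multiply the old state by $f_t$, forgetting what we decided to forget. Then we add $i_t * \tilde{C}_t$: the new candidate values, scaled by how much we decided to update each state value.

In the language model example, this is where we actually drop the information about the old subject's gender and add the new information, as decided in the previous steps.

Finally, we need to decide what to output. The output is based on the cell state but is a filtered version of it.

First, we run a sigmoid layer that decides which parts of the cell state to output. Then we put the cell state through $\tanh$ (pushing the values to between -1 and 1) and multiply it by the output of the sigmoid gate, so that we output only the parts we decided to.
In the language model example, since the model has just seen a subject, it might want to output information relevant to a verb in case that is what comes next. For example, it might output whether the subject is singular or plural, so that we know what form a following verb should be conjugated into.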

How does LSTM solve vanishing gradients

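The LSTM architecture makes it much easier for an RNN to preserve information over many time steps. For example, if the forget gate for a cell dimension is set to 1 and the input gate to 0, the information in that cell dimension is preserved indefinitely.

By contrast, it is harder for a vanilla RNN to learn a recurrent weight matrix $W_h$ that preserves information in the hidden state.
The LSTM does not guarantee that gradients never vanish or explode, but it gives the model an easier way to learn long-distance dependencies. More generally, a model can create more direct, more linear pass-through connections: ResNet, DenseNet, and similar architectures all add direct connections between modules or layers to varying degrees.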

Bidirectional RNNs

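A conventional unidirectional RNN or LSTM has a clear limitation: when processing a sequence it can only "look left", i.e. use only the past context. In many NLP tasks (such as sentiment classification, named entity recognition, or whole-sentence understanding), the meaning of the current word also depends on the context to its right (the "future").
To address this, researchers introduced a bidirectional architecture (usually built from LSTMs, i.e. a BiLSTM):
- Forward RNN: processes the input from left to right and computes hidden states $\overrightarrow{h}_t$.
- Backward RNN: processes the same input from right to left and computes hidden states $\overleftarrow{h}_t$.
- Concatenated state: at each time step $t$, the forward and backward hidden states are concatenated into the final representation of that position, $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$. Each word's representation thus contains information from both its left and its right context.
A bidirectional LSTM is a powerful feature extractor, but it applies only when the entire input sequence is available at once (e.g. classifying a whole text, or encoding the source sentence in translation). It cannot be used for conventional language modeling: the task there is to predict the next word, and letting the model see the "future" on the right would defeat autoregressive prediction.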

Neural Machine Translation

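Neural machine translation (NMT) was the first big success of deep learning in NLP. NMT is mainly based on the sequence-to-sequence (Seq2Seq) architecture, whose core consists of two RNNs (usually LSTMs): an encoder and a decoder.
The encoder reads the source sentence. It produces no translation output while reading; it only keeps updating its hidden state. After the encoder has processed the last word of the source sentence, its final hidden state is taken as a condensed representation of the whole sentence. It acts as an "information bottleneck", because the full meaning of the source sentence has to be compressed into this single fixed-size vector.
The decoder LSTM is essentially a conditional language model. Its initial hidden state is not random or all zeros but is set to the "bottleneck" vector produced by the encoder, so everything the decoder generates is conditioned on the semantic vector of the source sentence. At each time step it outputs the most probable word given the current hidden state and feeds that word in as the next input, repeating until it generates the end-of-sentence token <EOS>.

PyTorch Basics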

Thu, 22 Jan 2026 10:49:28 +0800

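These notes cover basic PyTorch usage, based mainly on the Bilibili course by 小土堆 (https://www.bilibili.com/video/BV1hE411t7RN/) with help from Gemini. Besides hands-on practice, they also touch on the principles behind some deep learning models.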

Dataset

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__, supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__, which is expected to return the size of the dataset by many ~torch.utils.data.Sampler implementations and the default options of ~torch.utils.data.DataLoader. Subclasses could also optionally implement __getitems__, for speedup batched samples loading. This method accepts list of indices of samples of batch and returns list of samples.

note

~torch.utils.data.DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.

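Custom Dataset

In practice, suppose we have a dataset (taking fer2013, which I used before, as the example):

First define a data class that inherits from Dataset and define __init__():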


from torch.utils.data import Dataset
from PIL import Image  # needed by __getitem__ below
import os

class MyData(Dataset):
    def __init__(self, root_dir, label_dir):
        self.root_dir = root_dir  # e.g. fer2013/test
        self.label_dir = label_dir  # e.g. angry
        self.path = os.path.join(self.root_dir, self.label_dir)  # joins paths portably across operating systems
        self.img_path = os.listdir(self.path)  # list of image file names in the label folder, e.g. angry

Override __getitem__()

    def __getitem__(self, index):
        img_name = self.img_path[index]
        img_item_path = os.path.join(self.root_dir, self.label_dir, img_name)  # full path of one image
        img = Image.open(img_item_path)
        label = self.label_dir
        return img, label  # one sample (image) from the dataset and its label

Override __len__()

    def __len__(self):
        return len(self.img_path)

Example instantiation

root_dir = "fer2013/train"
angry_label_dir = "angry"
angry_dataset = MyData(root_dir, angry_label_dir)

Combining Datasets

disgust_label_dir = "disgust"
disgust_dataset = MyData(root_dir, disgust_label_dir)

train_dataset = angry_dataset + disgust_dataset  # '+' concatenates the two datasets (returns a ConcatDataset)

TensorBoard

SummaryWriter

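Create the log directory
Initialize the writer object
Generate the event file
- This file (events.out……) is the actual data store. When writer.add_scalar is called later, the data is not drawn on screen directly; it is appended to this file.
- After your code has run and you execute tensorboard --logdir=logs, the TensorBoard backend server reads this file and renders the charts in the browser.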

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("logs")  # creates a folder named "logs" in the current working directory

Note that each experiment can log to its own folder, e.g. writer = SummaryWriter("logs/lr0.01_batch32"), and after changing the learning rate, writer = SummaryWriter("logs/lr0.001_batch32").

add_scalar()

The plotting function add_scalar()

(method) def add_scalar(
    tag: Any,
    scalar_value: Any,  # y-axis of the chart
    global_step: Any | None = None,  # x-axis of the chart
    walltime: Any | None = None,
    new_style: bool = False,
    double_precision: bool = False
) -> None

# Example
for i in range(100):
    writer.add_scalar("y=2x", 2*i, i)
writer.close()

add_image()

The image-logging function add_image():

(method) def add_image(
    tag: Any,
    img_tensor: Any,
    global_step: Any | None = None,
    walltime: Any | None = None,
    dataformats: str = "CHW"
) -> None

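Note the type requirement on the argument: img_tensor (torch.Tensor, numpy.ndarray, or string/blobname): Image data. A PIL image must therefore be converted first, for example with numpy. Note also the shape requirement: Tensor with (1, H, W), (H, W), (H, W, 3) is also suitable as long as corresponding dataformats argument is passed, e.g. CHW, HWC, HW. These are three layouts of image data that differ in the order of channels, height, and width.

import numpy as np
from PIL import Image

img = Image.open(image_path)  # image_path: path to one image file
img_array = np.array(img)
print(f"image shape: {img_array.shape}")
# choose dataformats from the printed shape: a 2-D grayscale array is 'HW'
writer.add_image("test", img_array, 1, dataformats='HW')
writer.close()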

add_graph()

Common Transforms

Most of the items below are image-processing methods.

ToTensor

Convert a PIL Image or ndarray to tensor and scale the values accordingly.

This transform does not support torchscript.

Converts a PIL Image or numpy.ndarray (H x W x C) in the range [0, 255] to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0] if the PIL Image belongs to one of the modes (L, LA, P, I, F, RGB, YCbCr, RGBA, CMYK, 1) or if the numpy.ndarray has dtype = np.uint8

In the other cases, tensors are returned without scaling.

note

Because the input image is scaled to [0.0, 1.0], this transformation should not be used when transforming target image masks. See the references for implementing the transforms for image masks.

ToTensor converts a PIL image into a torch tensor, e.g. one of shape torch.Size([1, 48, 48]).

from torchvision import transforms

trans_totensor = transforms.ToTensor()
img_tensor = trans_totensor(img)  # img is the PIL image opened earlier
print(img_tensor.shape)
writer.add_image("ToTensor", img_tensor)
writer.close()

Normalize

Normalize a tensor image with mean and standard deviation.

This transform does not support PIL Image.

Given mean: (mean[1],...,mean[n]) and std: (std[1],..,std[n]) for n channels, this transform will normalize each channel of the input torch.*Tensor, i.e.,

output[channel] = (input[channel] - mean[channel]) / std[channel]

note:

This transform acts out of place, i.e., it does not mutate the input tensor.

Args:

mean (sequence): Sequence of means for each channel.

std (sequence): Sequence of standard deviations for each channel.

inplace(bool,optional): Bool to make this operation in-place.

The math behind Normalize is:

$$ output = \frac{input-mean}{std} $$

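The mean parameter controls the center. After ToTensor, pixel values lie in $[0, 1]$ with a center of about $0.5$. With mean = 0.5, subtracting $0.5$ moves the center to $0$, so the range $[0, 1]$ becomes $[-0.5, 0.5]$.
The std parameter controls the scale. With std = 0.5, dividing by $0.5$ stretches the range from $[-0.5, 0.5]$ to $[-1, 1]$.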
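The two steps can be checked by hand on a few pixel values. This is a minimal sketch in plain Python; the helper `normalize` is hypothetical and only mirrors the formula, it is not the torchvision implementation.

```python
def normalize(values, mean, std):
    """Apply output = (input - mean) / std to every value."""
    return [(v - mean) / std for v in values]


# Representative pixel values after ToTensor, all inside [0, 1].
pixels = [0.0, 0.25, 0.5, 1.0]
normalized = normalize(pixels, mean=0.5, std=0.5)
print(normalized)  # the endpoints 0 and 1 map to -1 and 1
```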

Resize

Resize the input image to the given size.

If the image is torch Tensor, it is expected to have […, H, W] shape, where … means a maximum of two leading dimensions

Args:

size (sequence or int): Desired output size. If size is a sequence like (h, w), output size will be matched to this. If size is an int, smaller edge of the image will be matched to this number. i.e, if height > width, then image will be rescaled to (size * height / width, size).

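Resize accepts both PIL images and Tensors. For a Tensor, the expected shape is [..., H, W], where ... covers [C, H, W] (a single image) or [B, C, H, W] (a batch of images).
Write size as a sequence, e.g. Resize((512, 512)), to get exactly that output size. With a single value, e.g. Resize(512), the shorter edge becomes 512 and the longer edge is scaled proportionally.

print(img.size)
trans_resize = transforms.Resize((224, 224))
img_resize = trans_resize(img)
print(img_resize)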
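The single-int rule is easier to see as arithmetic. This is a rough sketch that follows the docstring formula (size * height / width, size); the helper `resize_int_output_size` is hypothetical, and torchvision's own rounding may differ in edge cases.

```python
def resize_int_output_size(height, width, size):
    """Output (h, w) when `size` is a single int: the shorter edge becomes `size`."""
    if height <= width:
        return size, int(size * width / height)
    return int(size * height / width), size


print(resize_int_output_size(48, 48, 224))    # square image
print(resize_int_output_size(300, 400, 150))  # landscape: height is the shorter edge
print(resize_int_output_size(400, 300, 150))  # portrait: width is the shorter edge
```

In every case the aspect ratio is preserved, unlike Resize((h, w)), which forces the exact shape.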

Compose

Composes several transforms together. This transform does not support torchscript.

Please, see the note below.

Args:

transforms (list of Transform objects): list of transforms to compose.

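Compose is a pipeline class that chains several transforms together.
In deep learning, images usually go through a fixed series of steps (for example: resize -> convert to Tensor -> normalize). Without Compose you would have to call several functions by hand for every image, which makes the code very repetitive.
Compose takes a list, and the operations run in order, so the output type of each operation must be a valid input for the next one.

from torchvision import transforms

# preprocessing for the training set
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),       # VGG16's standard input is 224x224
    transforms.RandomHorizontalFlip(),   # data augmentation: random horizontal flip
    transforms.ToTensor(),               # scales values to [0.0, 1.0]
    transforms.Normalize([0.5], [0.5])   # standardizes values to [-1.0, 1.0]
])
img_tensor = train_transform(img)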
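The idea behind Compose is small enough to sketch in plain Python. The class `SimpleCompose` below is a hypothetical re-implementation for illustration only, and the two lambdas merely imitate the scaling done by ToTensor and Normalize on a single number.

```python
class SimpleCompose:
    """Hypothetical minimal version of the idea behind transforms.Compose."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, x):
        # Feed the output of each step into the next one, in list order.
        for t in self.transforms:
            x = t(x)
        return x


pipeline = SimpleCompose([
    lambda v: v / 255.0,        # like ToTensor: scale [0, 255] to [0, 1]
    lambda v: (v - 0.5) / 0.5,  # like Normalize(0.5, 0.5): shift to [-1, 1]
])
print(pipeline(255), pipeline(0))
```

Swapping the two steps would give a different result, which is why the order of the list matters.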

Using PyTorch Datasets

For example, to load a computer-vision dataset, we can download it directly from within the program:

import torchvision

train_set = torchvision.datasets.CIFAR10(root="./dataset...", train=True, download=True)

test_set = torchvision.datasets.CIFAR10(root="./dataset...", train=False, download=True)

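The root parameter is where the dataset is stored, train selects the training split (True) or the test split (False), and download controls whether the data is downloaded to local disk (a download link is shown).
The exact parameters can differ from dataset to dataset...
If the dataset has already been downloaded, copy it into the project's dataset folder; running the program then skips the download.

print(test_set.classes)  # all classes of the test set

img, target = test_set[0]
print(img)
print(target)
print(test_set.classes[target])  # class name of the first test sample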

DataLoader

Data loader combines a dataset and a sampler, and provides an iterable over the given dataset.

The torch.utils.data.DataLoader supports both map-style and iterable-style datasets with single- or multi-process loading, customizing loading order and optional automatic batching (collation) and memory pinning.

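When training a model, we cannot push the whole dataset into memory at once. DataLoader provides:

Batching: packs the images into groups (batches).
Shuffling: reshuffles at the start of every training round (epoch), so the model cannot memorize the data order.
Parallel loading: uses multiple CPU cores to prepare the next batch ahead of time, so the GPU does not wait.

Parameter	Typical values	Description
dataset	a custom Dataset	Required. Tells the DataLoader which "warehouse" to fetch data from.
batch_size	16, 32, 64…	Number of samples per batch. Larger batches usually train faster but use more GPU memory. For FER expression recognition, 32 or 64 are common.
shuffle	`True` / `False`	Whether to shuffle. Usually `True` for the training set (adds randomness) and `False` for the test set.
num_workers	0, 2, 4, 8…	Multi-process loading. `0` uses only the main process (slow). Larger values can speed up reading. Rule of thumb: about half the number of CPU cores.
drop_last	`True` / `False`	Drops the leftover samples. With 100 images and `batch_size=32`, the last 4 do not fill a batch; `True` discards them so every batch has the same size.
pin_memory	`True`	Page-locked memory. When training on a GPU, `True` can speed up transfers from host memory to GPU memory.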
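The drop_last row can be verified with a little arithmetic. This is a minimal sketch; the helper `num_batches` is hypothetical and just reproduces the batch count for the 100-image example.

```python
import math


def num_batches(dataset_size, batch_size, drop_last):
    """Number of batches one pass over the dataset yields."""
    if drop_last:
        return dataset_size // batch_size          # incomplete last batch is discarded
    return math.ceil(dataset_size / batch_size)    # incomplete last batch is kept


print(num_batches(100, 32, drop_last=False))  # the last batch holds only 4 images
print(num_batches(100, 32, drop_last=True))   # those 4 images are dropped
print(100 - 32 * num_batches(100, 32, drop_last=True))  # how many images are dropped
```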

For example, use DataLoader on the CIFAR10 data:

import torchvision
from torch.utils.data import DataLoader

test_data = torchvision.datasets.CIFAR10("./dataset", train=False, transform=torchvision.transforms.ToTensor(), download=True)

test_loader = DataLoader(dataset=test_data, batch_size=4, shuffle=True, num_workers=0, drop_last=False)

Combine a loop with TensorBoard to log the images used at every step of every epoch.
In the code below, step + epoch * len(test_loader) is a global step. Alternatively, you can treat each epoch as its own group.

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("dataloader")
for epoch in range(2):
    step = 0
    for data in test_loader:
        imgs, targets = data
        writer.add_images("test_data_batch", imgs, step + epoch * len(test_loader))
        step = step + 1
writer.close()

nn.Module

Base class for all neural network modules.

Your models should also subclass this class.

Modules can also contain other Modules, allowing to nest them in a tree structure.

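In PyTorch, everything from a simple linear layer to a complex VGG16 or Transformer is an nn.Module. It is the base class of all neural network modules.
nn.Module supports nesting. When you call model.to("cuda") on a large model, PyTorch walks down this "tree" and moves every sub-layer to the GPU automatically.
Whenever a layer is assigned to self.xxx in __init__, PyTorch automatically registers its weights and biases and adds them to the list of parameters to optimize.

When writing an nn.Module subclass, you must override __init__() and forward().

__init__(self)
- Define the network layers here (convolution, pooling, fully connected, and so on).
- You must call super().__init__(). It initializes the parent class's internal state; without it, assigning a layer to self raises an error, so the layers cannot be registered and the model cannot be trained.
forward(self, x)
- Defines the data flow: which layer the image passes through first, which one next.
- Do not call forward by hand. Just run model(input) and PyTorch triggers forward for you.

import torch
from torch import nn

class myModule(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, input):
        output = input + 1  # simply add 1 to the input and return it
        return output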
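The automatic registration described above can be observed directly. This is a minimal sketch; the module `TinyNet` and its layer sizes are hypothetical and chosen only to keep the numbers small.

```python
from torch import nn


class TinyNet(nn.Module):  # hypothetical example module
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)  # assigning to self registers the layer

    def forward(self, x):
        return self.fc(x)


net = TinyNet()
names = [name for name, _ in net.named_parameters()]
total = sum(p.numel() for p in net.parameters())
print(names)  # parameter names are prefixed with the attribute name
print(total)  # 4*2 weights plus 2 biases
```

These are exactly the tensors an optimizer receives from net.parameters().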

Convolution (Conv)

$$ \text{out}(N_i, C_{\text{out}_j}) = \text{bias}(C_{\text{out}_j}) + \sum_{k = 0}^{C_{\text{in}} - 1} \text{weight}(C_{\text{out}_j}, k) \star \text{input}(N_i, k) $$

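Convolution animations: conv_arithmetic/README.md at master · vdumoulin/conv_arithmetic

Parameter	Meaning	Effect
in_channels	Input channels	Usually 3 (RGB) for color images, 1 for grayscale.
out_channels	Output channels	Number of kernels. Each kernel produces one output feature map.
kernel_size	Kernel size	Size of the feature-extraction "window". Commonly 3 or 5 (VGG mostly uses 3).
stride	Stride	How far the window moves per step. Default 1. A larger stride gives a smaller output.
padding	Padding	Adds zeros around the image. `'same'` keeps the size unchanged (stride 1 only); `'valid'` adds no padding.
dilation	Dilated convolution	Spacing between kernel elements. Enlarges the receptive field without adding parameters.
bias	Bias	Whether to add a constant offset to the result. Enabled by default.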

The padding parameter of a convolution matters: without zero padding around the border, each convolution makes the image smaller.
Shape formula:

$$ H_{out} = \left\lfloor\frac{H_{in} + 2 \times \text{padding}[0] - \text{dilation}[0] \times (\text{kernel\_size}[0] - 1) - 1}{\text{stride}[0]} + 1\right\rfloor $$

$W_{out}$ is computed the same way.
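The formula is easy to check numerically. This is a minimal sketch for one spatial dimension; the helper `conv_out_size` is hypothetical and simply transcribes the formula above.

```python
def conv_out_size(size_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one dimension, following the shape formula above."""
    return (size_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


print(conv_out_size(32, kernel_size=3))                       # no padding: shrinks by 2
print(conv_out_size(32, kernel_size=5, padding=2))            # padding 2 keeps the size
print(conv_out_size(32, kernel_size=3, stride=2, padding=1))  # stride 2 halves the size
```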

import torchvision
from torch import nn
from torch.nn import Conv2d
from torch.utils.data import DataLoader

dataset = torchvision.datasets.CIFAR10("./dataset", train=False, transform=torchvision.transforms.ToTensor(), download=True)
dataloader = DataLoader(dataset, batch_size=64)

class myModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = Conv2d(in_channels=3, out_channels=6, kernel_size=3, stride=1, padding=0)

    def forward(self, x):
        x = self.conv1(x)
        return x

mymodule = myModule()  # instantiate the model

step = 0
for data in dataloader:
    imgs, targets = data
    output = mymodule(imgs)
    print(imgs.shape)
    # torch.Size([64, 3, 32, 32])
    print(output.shape)
    # torch.Size([64, 6, 30, 30]): the convolution changes the channel count to 6,
    # so the output cannot be displayed directly as an image

    step = step + 1

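(Max) Pooling: MaxPool

The logic of max pooling is very simple: within one window (kernel), keep only the largest value and discard the rest.
This keeps the most salient input features while reducing the amount of data, which speeds up training.

Parameter	What is special
kernel_size	Window size. `2` is common (merging each $2 \times 2$ region).
stride	Defaults to kernel_size! This differs from convolution. With `kernel_size=2` the default stride is also `2`, so windows do not overlap.
ceil_mode	Very important. Default is `False` (round down). With `True` (round up), a window that extends past the border is still kept as long as it contains data, instead of being discarded.
padding	Padding. Note that pooling pads with negative infinity ($-\infty$), so a padded position is never chosen as the maximum.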

import torch
from torch import nn
from torch.nn import MaxPool2d

input = torch.tensor([[1, 2, 0, 3, 1],
                      [0, 1, 2, 3, 1],
                      [1, 2, 1, 0, 0],
                      [5, 2, 3, 1, 1],
                      [2, 1, 0, 1, 1]], dtype=float)

input = torch.reshape(input, (-1, 1, 5, 5))
print(input.shape)

class myModule(nn.Module):
    def __init__(self):
        super().__init__()
        # ceil_mode=False keeps a result only where the window fits completely
        # (a full 3x3 here); ceil_mode=True also keeps partial windows at the border.
        self.maxpool1 = MaxPool2d(kernel_size=3, ceil_mode=True)

    def forward(self, input):
        output = self.maxpool1(input)
        return output

mymodule = myModule()
output = mymodule(input)
print(output)

torch.Size([1, 1, 5, 5])
tensor([[[[2., 3.],
 [5., 1.]]]], dtype=torch.float64)
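The 2x2 result above follows from how ceil_mode changes the output-size arithmetic. This is a rough sketch for one dimension, assuming dilation 1; the helper `pool_out_size` is hypothetical, and the extra check reflects the rule that the last window must start inside the input or its left padding.

```python
import math


def pool_out_size(size_in, kernel_size, stride=None, padding=0, ceil_mode=False):
    """Output length of max pooling along one dimension (dilation assumed to be 1)."""
    stride = stride or kernel_size  # stride defaults to the kernel size
    span = size_in + 2 * padding - kernel_size
    if ceil_mode:
        out = math.ceil(span / stride) + 1
        # A window that would start beyond the input (plus left padding) is dropped.
        if (out - 1) * stride >= size_in + padding:
            out -= 1
        return out
    return span // stride + 1


print(pool_out_size(5, 3, ceil_mode=True))   # partial border window is kept
print(pool_out_size(5, 3, ceil_mode=False))  # only the one full 3x3 window remains
```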

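Loss Functions and Backpropagation

A loss function measures the gap between the actual output and the target, and provides the basis for backpropagation and parameter updates. In classification tasks, cross-entropy is the usual choice:

loss_func = nn.CrossEntropyLoss()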
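To make the loss concrete, the sketch below compares nn.CrossEntropyLoss with a by-hand computation on one sample. The logits and target are made-up numbers; the key assumption worth noting is that the loss takes raw scores (logits) and a class index, not probabilities.

```python
import torch
from torch import nn

logits = torch.tensor([[2.0, 0.5, 0.1]])  # raw scores for 3 classes, batch of 1
target = torch.tensor([0])                 # index of the correct class

loss = nn.CrossEntropyLoss()(logits, target)

# By hand: negative log of the softmax probability of the correct class.
manual = -torch.log_softmax(logits, dim=1)[0, target[0]]
print(loss.item(), manual.item())
```

Both numbers agree, which shows that softmax is applied inside the loss and should not be added to the model output.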

Optimizer

torch.optim — PyTorch 2.10 documentation

Example code:

for input, target in dataset:
    optimizer.zero_grad()            # reset the gradients
    output = model(input)
    loss = loss_fn(output, target)   # compute the loss
    loss.backward()                  # backpropagation
    optimizer.step()                 # update the parameters
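The same loop can be run end to end on a toy problem. This is a minimal sketch under made-up assumptions: a single nn.Linear layer fits y = 2x + 1 with SGD, and the learning rate and step count are arbitrary choices for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.tensor([[0.0], [1.0], [2.0], [3.0]])
y = 2 * x + 1  # toy target: y = 2x + 1

first_loss = None
for step in range(200):
    optimizer.zero_grad()        # reset gradients left over from the previous step
    loss = loss_fn(model(x), y)
    loss.backward()              # compute gradients
    optimizer.step()             # update weight and bias
    if first_loss is None:
        first_loss = loss.item()

print(first_loss, loss.item())  # the loss shrinks as the fit improves
```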

PyTorch in Practice: CIFAR10

A simple classification model for the CIFAR10 image dataset.

First, get to know Sequential. nn.Sequential is a special subclass of nn.Module that implements the forward logic for you by running its layers in order. Note: each argument is a layer instance, so separate them with commas. Sequential simplifies both the model definition and forward().

from torch import nn
from torch.nn import Conv2d, MaxPool2d, Flatten, Linear, Sequential

class CIFAR10_Simple(nn.Module):
    def __init__(self):
        super().__init__()
        # Choosing padding: picture the 5x5 kernel centered on pixel (0, 0);
        # it reaches 2 cells beyond the image, so padding=2 keeps the size.
        # That is the quick check; to be exact, plug the values into the
        # shape formula (see the Conv section).
        self.model_s = Sequential(
            Conv2d(in_channels=3, out_channels=32, kernel_size=5, padding=2),
            nn.ReLU(),
            MaxPool2d(kernel_size=2),
            Conv2d(in_channels=32, out_channels=32, kernel_size=5, padding=2),
            nn.ReLU(),
            MaxPool2d(2),
            Conv2d(32, 64, 5, padding=2),
            nn.ReLU(),
            MaxPool2d(2),
            Flatten(),
            Linear(1024, 64),
            nn.ReLU(),
            Linear(64, 10)
        )

    def forward(self, x):
        # Without Sequential, each layer would be its own attribute, called in turn:
        # x = self.conv1(x)
        # x = self.maxpool1(x)
        # x = self.conv2(x)
        # x = self.maxpool2(x)
        # x = self.conv3(x)
        # x = self.maxpool3(x)
        # x = self.flatten(x)
        # x = self.linear1(x)
        # x = self.linear2(x)
        x = self.model_s(x)
        return x
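The 1024 in Linear(1024, 64) is not arbitrary; it is the flattened size after the three conv and pool stages. This is a minimal sketch of the arithmetic, assuming the 32x32 CIFAR10 input and the layer settings used in the model.

```python
size, channels = 32, 3  # CIFAR10 images are 3 x 32 x 32

for out_channels in (32, 32, 64):
    size = (size + 2 * 2 - (5 - 1) - 1) // 1 + 1  # 5x5 conv, padding 2: size unchanged
    size = size // 2                               # 2x2 max pool halves the size
    channels = out_channels

flatten_features = channels * size * size
print(size, channels, flatten_features)
```

If the input resolution or any kernel, padding, or pooling setting changes, this number has to be recomputed.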

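Before building the model, set up the DataLoaders for the dataset
- Prepare train_data and test_data separately

import torchvision
from torchvision import transforms as tf
from torch.utils.data import DataLoader

dataset_transform = tf.Compose([tf.ToTensor(), tf.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
train_data = torchvision.datasets.CIFAR10("./dataset", transform=dataset_transform, download=True)
test_data = torchvision.datasets.CIFAR10("./dataset", train=False, transform=dataset_transform, download=True)
# --------
train_loader = DataLoader(dataset=train_data, batch_size=64, shuffle=True, drop_last=True)
# Keep every test sample: accuracy is later divided by len(test_data),
# so dropping the last partial batch would understate it.
test_loader = DataLoader(dataset=test_data, batch_size=64, shuffle=False, drop_last=False)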

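Once the model is built, run a quick check that the output shape is what we expect:

cifar = CIFAR10_Simple()
print(cifar)
input = torch.ones((64, 3, 32, 32))  # dummy batch with the same shape as the dataset images
output = cifar(input)
print(output.shape)  # expected: torch.Size([64, 10])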

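Basic setup before training
- Define the TensorBoard writer
- Set the device so training runs on the GPU when available
- Instantiate the model to train
- Define the loss function
- Set up the optimizer

writer = SummaryWriter("./logs")
writer.add_graph(cifar, input)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = CIFAR10_Simple().to(device)
loss_func = nn.CrossEntropyLoss()  # cross-entropy loss
# optimize the parameters of the model that is actually trained; lr is the learning rate
optim = torch.optim.SGD(model.parameters(), lr=0.01)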

Training
- The standard optimizer boilerplate
- Log the training loss

total_step = 0
for epoch in range(10):
    # --- training ---
    model.train()
    for data in train_loader:
        imgs, targets = data
        outputs = model(imgs.to(device))
        loss = loss_func(outputs, targets.to(device))

        optim.zero_grad()
        loss.backward()
        optim.step()

        # log the training loss
        writer.add_scalar("Train_Loss", loss.item(), total_step)
        total_step += 1

Evaluation
- Runs once per epoch
- with torch.no_grad(): turns off gradient tracking
- Collect the performance metrics

    # --- evaluation (still inside the `for epoch` loop) ---
    model.eval()
    total_test_loss = 0
    total_accuracy = 0
    with torch.no_grad():  # no gradients are needed for testing, which saves compute
        for data in test_loader:
            imgs, targets = data
            imgs, targets = imgs.to(device), targets.to(device)
            outputs = model(imgs)

            # accumulate the total loss
            loss = loss_func(outputs, targets)
            total_test_loss += loss.item()

            # accuracy: argmax(1) gives the index of the highest-scoring class
            accuracy = (outputs.argmax(1) == targets).sum().item()
            total_accuracy += accuracy
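The argmax-based accuracy count can be traced on a tiny batch. This is a minimal sketch; the scores and labels below are made-up values for four samples and three classes.

```python
import torch

outputs = torch.tensor([[0.1, 0.7, 0.2],
                        [0.8, 0.1, 0.1],
                        [0.2, 0.3, 0.5],
                        [0.6, 0.3, 0.1]])
targets = torch.tensor([1, 0, 1, 0])

predictions = outputs.argmax(1)  # index of the largest score in each row
correct = (predictions == targets).sum().item()
print(predictions.tolist(), correct, correct / len(targets))
```

Only the third sample is misclassified, so three of the four predictions count as correct.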

Visualization

    # write to TensorBoard (still inside the `for epoch` loop)
    writer.add_scalar("Test_Loss", total_test_loss / len(test_loader), epoch)
    writer.add_scalar("Test_Accuracy", total_accuracy / len(test_data), epoch)

    print(f"Epoch {epoch+1} finished, accuracy: {total_accuracy / len(test_data)}")

writer.close()