《第九章》

9.1图像增广

为了在预测时得到确定的结果，通常只将图像增广应用在训练样本上，而不在预测时使用含随机操作的图像增广。

图像增广：将图片翻转，缩放扩大，随机截取，调整色调，明亮等操作以生成新的数据集。

import gc
gc.collect()  # 清理内存

练习

不使用图像增广训练模型：train_with_data_aug(no_aug, no_aug)。比较有无图像增广时的训练准确率和测试准确率。该对比实验能否支持图像增广可以应对过拟合这一论断？为什么？

答：支持，正如下述训练，训练集拟合程度提高更快，且测试集结果也不如有图像增广的高。

train_with_data_aug(no_aug, no_aug)
——————————————————————————————结果——————————————————————
training on [gpu(0)]
epoch 1, loss 1.3485, train acc 0.522, test acc 0.556, time 62.9 sec
epoch 2, loss 0.7872, train acc 0.722, test acc 0.705, time 65.0 sec
epoch 3, loss 0.5654, train acc 0.802, test acc 0.738, time 67.0 sec
epoch 4, loss 0.4175, train acc 0.853, test acc 0.777, time 67.9 sec
epoch 5, loss 0.3043, train acc 0.895, test acc 0.789, time 67.8 sec
epoch 6, loss 0.2183, train acc 0.923, test acc 0.799, time 68.2 sec
epoch 7, loss 0.1547, train acc 0.946, test acc 0.810, time 68.5 sec
epoch 8, loss 0.1150, train acc 0.960, test acc 0.799, time 68.9 sec
epoch 9, loss 0.0814, train acc 0.972, test acc 0.809, time 69.1 sec
epoch 10, loss 0.0725, train acc 0.974, test acc 0.806, time 70.2 sec

在基于CIFAR-10数据集的模型训练中增加不同的图像增广方法。观察实现结果。

答：每轮的耗时不变，测试集效率有提高，且测试集拟合的也比较慢。

complex_aug = gdata.vision.transforms.Compose([
    gdata.vision.transforms.RandomFlipLeftRight(),
    gdata.vision.transforms.RandomHue(0.5),
    gdata.vision.transforms.ToTensor()])
train_with_data_aug(complex_aug, no_aug)
————————————————————————————————结果————————————————————————
training on [gpu(0)]
epoch 1, loss 1.5822, train acc 0.446, test acc 0.496, time 69.3 sec
epoch 2, loss 0.9240, train acc 0.673, test acc 0.676, time 68.4 sec
epoch 3, loss 0.6791, train acc 0.764, test acc 0.739, time 68.6 sec
epoch 4, loss 0.5490, train acc 0.810, test acc 0.728, time 69.8 sec
epoch 5, loss 0.4555, train acc 0.842, test acc 0.777, time 70.5 sec
epoch 6, loss 0.3836, train acc 0.868, test acc 0.762, time 70.2 sec
epoch 7, loss 0.3227, train acc 0.889, test acc 0.795, time 69.8 sec
epoch 8, loss 0.2728, train acc 0.906, test acc 0.807, time 70.0 sec
epoch 9, loss 0.2392, train acc 0.918, test acc 0.823, time 70.8 sec
epoch 10, loss 0.1931, train acc 0.934, test acc 0.820, time 70.1 sec

查阅MXNet文档，Gluon的transforms模块还提供了哪些图像增广方法？

方法

涵义

transforms.Compose

Sequentially composes multiple transforms.

transforms.Cast

Cast inputs to a specific data type

transforms.ToTensor

Converts an image NDArray or batch of image NDArray to a tensor NDArray.

transforms.Normalize

Normalize an tensor of shape (C x H x W) or (N x C x H x W) with mean and standard deviation.

transforms.RandomResizedCrop

Crop the input image with random scale and aspect ratio.

transforms.CenterCrop

Crops the image src to the given size by trimming on all four sides and preserving the center of the image.

transforms.Resize

Resize an image or a batch of image NDArray to the given size.

transforms.RandomFlipLeftRight

Randomly flip the input image left to right with a probability of p(0.5 by default).

transforms.RandomFlipTopBottom

Randomly flip the input image top to bottom with a probability of p(0.5 by default).

transforms.RandomBrightness

Randomly jitters image brightness with a factor chosen from [max(0, 1 - brightness), 1 + brightness].

transforms.RandomContrast

Randomly jitters image contrast with a factor chosen from [max(0, 1 - contrast), 1 + contrast].

transforms.RandomSaturation

Randomly jitters image saturation with a factor chosen from [max(0, 1 - saturation), 1 + saturation].

transforms.RandomHue

Randomly jitters image hue with a factor chosen from [max(0, 1 - hue), 1 + hue].

transforms.RandomColorJitter

Randomly jitters the brightness, contrast, saturation, and hue of an image.

transforms.RandomLighting

Add AlexNet-style PCA-based noise to an image.

9.2微调

迁移学习：下载预训练过的模型参数，而后以较小的学习率微调隐藏层，以较大的学习率从头学习输出层。

Gluon的model_zoo包提供了常用的预训练模型。

GluonCV工具包有更多计算机视觉的预训练模型。

微调步骤：

加载预训练模型
定义微调模型，features直接套用预加载的，output重新初始化，并设置多倍学习率。
训练函数，记得设置本机的ctx。

pretrained_net = model_zoo.vision.resnet18_v2(pretrained=True)

finetune_net = model_zoo.vision.resnet18_v2(classes=2)
finetune_net.features = pretrained_net.features
finetune_net.output.initialize(init.Xavier())
# output中的模型参数将在迭代中使用10倍大的学习率
finetune_net.output.collect_params().setattr('lr_mult', 10)

def train_fine_tuning(net, learning_rate, batch_size=128, num_epochs=5):
    train_iter = gdata.DataLoader(
        train_imgs.transform_first(train_augs), batch_size, shuffle=True)
    test_iter = gdata.DataLoader(
        test_imgs.transform_first(test_augs), batch_size)
    ctx = d2l.try_all_gpus()
    net.collect_params().reset_ctx(ctx)
    net.hybridize()
    loss = gloss.SoftmaxCrossEntropyLoss()
    trainer = gluon.Trainer(net.collect_params(), 'sgd', {
        'learning_rate': learning_rate, 'wd': 0.001})
    d2l.train(train_iter, test_iter, net, loss, trainer, ctx, num_epochs)

练习

不断增大finetune_net的学习率。准确率会有什么变化？
答：学习率调到0.05，准确率显著降低，且提高的也很慢。说明较大程度改变了原来的模型参数，可能会造成糟糕的结果。
进一步调节对比试验中finetune_net和scratch_net的超参数。它们的精度是不是依然有区别？
答：增加了scratch_netnum_epochs为10，后面周期的准确率增加还是没有赶上finetune_net。说明两者还是有很多不同的，不仅仅在迭代周期上。题目中的精度不是很明白在问什么。

将finetune_net.features中的参数固定为源模型的参数而不在训练中迭代，结果会怎样？你可能会用到以下代码。

答：结果训练集的准确率提高不多，甚至还有时下降了，测试的准确率也基本上不变。

finetune_net.features.collect_params().setattr('grad_req', 'null')
————————————————输出——————————————————
training on [gpu(0)]
epoch 1, loss 0.4164, train acc 0.824, test acc 0.849, time 13.2 sec
epoch 2, loss 0.4104, train acc 0.820, test acc 0.848, time 13.3 sec
epoch 3, loss 0.4065, train acc 0.812, test acc 0.849, time 13.1 sec
epoch 4, loss 0.3911, train acc 0.820, test acc 0.850, time 13.1 sec
epoch 5, loss 0.3945, train acc 0.822, test acc 0.846, time 13.0 sec

9.3目标检测和边界框

目标检测：辨认出目标所在的位置

边界框：给目标加上一个方框

# bbox是bounding box的缩写
# 左上角的（x,y）,右下角的（x,y）
dog_bbox, cat_bbox = [60, 45, 378, 516], [400, 112, 655, 493]
def bbox_to_rect(bbox, color):  # 本函数已保存在d2lzh包中方便以后使用
    # 将边界框(左上x, 左上y, 右下x, 右下y)格式转换成matplotlib格式：
    # ((左上x, 左上y), 宽, 高)
    return d2l.plt.Rectangle(
        xy=(bbox[0], bbox[1]), width=bbox[2]-bbox[0], height=bbox[3]-bbox[1],
        fill=False, edgecolor=color, linewidth=2)

fig = d2l.plt.imshow(img)
# axes是坐标轴
fig.axes.add_patch(bbox_to_rect(dog_bbox, 'blue'))
fig.axes.add_patch(bbox_to_rect(cat_bbox, 'red'));

练习

找一些图像，尝试标注其中目标的边界框。比较标注边界框与标注类别所花时间的差异。
答：手动标注的肯定很慢，自动标注的还没学到。

9.4锚框

锚框：以每个像素为中心生成多个大小和宽高比（aspect ratio）不同的边界框。

生成锚框：设定大小s_1——s_n，宽高比r_1——r_n.我们通常只对包含s_1或r_1的组合感兴趣。

(s_1, r_1), (s_1, r_2), \ldots, (s_1, r_m), (s_2, r_1), (s_3, r_1), \ldots, (s_n, r_1).

以相同像素为中心的锚框的数量为n+m−1。对于整个输入图像有wh(n+m-1)个锚框。

img = image.imread('../img/catdog.jpg').asnumpy()
h, w = img.shape[0:2] # 高，宽
X = nd.random.uniform(shape=(1, 3, h, w))  # 构造输入数据
# 生成锚框y的形状为（批量大小，锚框个数，4)
# 锚框个数=wh(n+m-1)，4为左上右下坐标
# n为sizes个数，m为ratios个数
Y = contrib.nd.MultiBoxPrior(X, sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5])
# 变为（图像高，图像宽，以相同像素为中心的锚框个数，4）
# 可以通过指定像素位置来获取所有以该像素为中心的锚框
boxes = Y.reshape((h, w, 5, 4))
boxes[250, 250, 0, :]

交并比：也叫Jaccard系数，衡量两个集合的相似度。

J(\mathcal{A},\mathcal{B}) = \frac{\left|\mathcal{A} \cap \mathcal{B}\right|}{\left| \mathcal{A} \cup \mathcal{B}\right|}.

标注训练集的锚框的类别和偏移值：为每个锚框标注两类标签：一是锚框所含目标的类别，简称类别；二是真实边界框相对锚框的偏移量，简称偏移量（offset）。每次取相似度矩阵X中的最大值（且大于阈值），并将所在行和列丢弃。

ground_truth = nd.array([[0, 0.1, 0.08, 0.52, 0.92],
                         [1, 0.55, 0.2, 0.9, 0.88]])
anchors = nd.array([[0, 0.1, 0.2, 0.3], [0.15, 0.2, 0.4, 0.4],
                    [0.63, 0.05, 0.88, 0.98], [0.66, 0.45, 0.8, 0.8],
                    [0.57, 0.3, 0.92, 0.9]])
# 为锚框标注类别和偏移量,交并比小于阈值（默认为0.5）
labels = contrib.nd.MultiBoxTarget(anchors.expand_dims(axis=0),# (1,2,5)
                                 ground_truth.expand_dims(axis=0), # (1,5,4)
                                 nd.zeros((1, 3, 5)))# 结果为(批量，类别，锚框)
labels[2] # 返回结果第三项为类别
# 第二项为掩码（mask）变量，形状为(批量大小, 锚框个数的四倍)。
labels[1] # 0可以在计算目标函数之前过滤掉负类的偏移量。
# 第一项是为每个锚框标注的四个偏移量，其中负类锚框的偏移量标注为0。
labels[0]

非极大值抑制：先为图像生成多个锚框，并为这些锚框一一预测类别和偏移量。随后，我们根据锚框及其预测偏移量得到预测边界框。移除相似的预测边界框。常用的方法叫作非极大值抑制（non-maximum suppression，NMS）。

# 可以移除相似的预测边界框。非极大值抑制
anchors = nd.array([[0.1, 0.08, 0.52, 0.92], [0.08, 0.2, 0.56, 0.95],
                    [0.15, 0.3, 0.62, 0.91], [0.55, 0.2, 0.9, 0.88]])
# 假设预测偏移量全是0：预测边界框即锚框
offset_preds = nd.array([0] * anchors.size)
cls_probs = nd.array([[0] * 4,  # 背景的预测概率
                      [0.9, 0.8, 0.7, 0.1],  # 狗的预测概率
                      [0.1, 0.2, 0.3, 0.9]])  # 猫的预测概率
# MultiBoxDetection函数来执行非极大值抑制并设阈值为0.5
output = contrib.ndarray.MultiBoxDetection(
    cls_probs.expand_dims(axis=0), offset_preds.expand_dims(axis=0),
    anchors.expand_dims(axis=0), nms_threshold=0.5)
output
# （0为狗，1为猫）-1表示背景或在非极大值抑制中被移除。第二个元素是预测边界框的置信度。
——————————————输出——————————————
[[[ 0.    0.9   0.1   0.08  0.52  0.92]
  [ 1.    0.9   0.55  0.2   0.9   0.88]
  [-1.    0.8   0.08  0.2   0.56  0.95]
  [-1.    0.7   0.15  0.3   0.62  0.91]]]
<NDArray 1x4x6 @cpu(0)>
—————————————根据矩阵输出图形——————————————
# 除掉类别为-1的预测边界框，并可视化非极大值抑制保留的结果。
fig = d2l.plt.imshow(img)
for i in output[0].asnumpy():
    if i[0] == -1:
        continue
    label = ('dog=', 'cat=')[int(i[0])] + str(i[1])
    show_bboxes(fig.axes, [nd.array(i[2:]) * bbox_scale], label)

np.set_printoptions(2)：设置NDArray小数点后只打印2位。

numpy中的expand_dims(axis=0)：在第'axis'维，加一个维度出来，原先的维度向’右边‘推。

练习

改变MultiBoxPrior函数中sizes和ratios的取值，观察生成的锚框的变化。
答：修改后，以相同像素为中心的锚框个数记得修改。
```
# 将锚框变量y的形状变为（图像高，图像宽，以相同像素为中心的锚框个数，4）
boxes = Y.reshape((h, w, 5, 4))
```
构造交并比为0.5的两个边界框，观察它们的重合度。
答：重合度占两个边框之和的一半。
按本节定义的为锚框标注偏移量的方法（常数采用默认值），验证偏移量labels[0]的输出结果。
答：没错。
$\left( \frac{ \frac{x_b - x_a}{w_a} - \mu_x }{\sigma_x}, \frac{ \frac{y_b - y_a}{h_a} - \mu_y }{\sigma_y}, \frac{ \log \frac{w_b}{w_a} - \mu_w }{\sigma_w}, \frac{ \log \frac{h_b}{h_a} - \mu_h }{\sigma_h}\right)\\ \mu_x = \mu_y = \mu_w = \mu_h = 0, \sigma_x=\sigma_y=0.1, \sigma_w=\sigma_h=0.2$
修改“标注训练集的锚框”与“输出预测边界框”两小节中的变量anchors，结果有什么变化？
答：锚框发生改变。根据坐标位置而改变。

9.5多尺度目标检测

减少锚框：在输入图像中均匀采样一小部分像素，并以采样的像素为中心生成锚框。

# 在任一图像上均匀采样fmap_h行fmap_w列个像素，并分别以它们为中心
# 生成大小为s（假设列表s长度为1）的不同宽高比（ratios）的锚框。
def display_anchors(fmap_w, fmap_h, s):
    fmap = nd.zeros((1, 10, fmap_w, fmap_h))  # 前两维的取值不影响输出结果
    anchors = contrib.nd.MultiBoxPrior(fmap, sizes=s, ratios=[1, 2, 0.5])
    bbox_scale = nd.array((w, h, w, h))
    d2l.show_bboxes(d2l.plt.imshow(img.asnumpy()).axes,
                    anchors[0] * bbox_scale)

display_anchors(fmap_w=4, fmap_h=4, s=[0.15])

可以在多个尺度下生成不同数量和不同大小的锚框，从而在多个尺度下检测不同大小的目标。

用输入图像在某个感受野区域内的信息来预测输入图像上与该区域相近的锚框的类别和偏移量。

练习

给定一张输入图像，设特征图变量的形状为1×ci×h×w，其中ci、h和w分别为特征图的个数、高和宽。你能想到哪些将该变量变换为锚框的类别和偏移量的方法？输出的形状分别是什么？
答：卷积？形状不知。我们可以将特征图在相同空间位置的ci个单元变换为以该位置为中心生成的a个锚框的类别和偏移量。本质上，我们用输入图像在某个感受野区域内的信息来预测输入图像上与该区域位置相近的锚框的类别和偏移量。

9.6目标检测数据集（皮卡丘）

numpy的transpose()：将维度进行置换，两维的为转置矩阵。

imgs = (batch.data[0][0:10].transpose((0, 2, 3, 1))) / 255
# 将10*3*256*256的数组变为10*256*256*3的数组再除以255

# 本函数已保存在d2lzh包中方便以后使用
# 返回num_rows, num_cols的坐标轴。
def show_images(imgs, num_rows, num_cols, scale=2):
    figsize = (num_cols * scale, num_rows * scale)
    _, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize)
    for i in range(num_rows):
        for j in range(num_cols):
            axes[i][j].imshow(imgs[i * num_cols + j].asnumpy())
            axes[i][j].axes.get_xaxis().set_visible(False)
            axes[i][j].axes.get_yaxis().set_visible(False)
    return axes
axes = d2l.show_images(imgs, 2, 5).flatten()

练习

查阅MXNet文档，image.ImageDetIter和image.CreateDetAugmenter这两个类的构造函数有哪些参数？它们的意义是什么？
- image.ImageDetIter的Parameters
  - batch_size**,** data_shape两个一定要指定。
  - aug_list (list or None) – Augmenter list for generating distorted images
  - batch_size (int) – Number of examples per batch.
  - data_shape (tuple) – Data shape in (channels, height, width) format. For now, only RGB image with 3 channels is supported.
  - path_imgrec (str) – Path to image record file (.rec). Created with tools/im2rec.py or bin/im2rec.
  - path_imglist (str) – Path to image list (.lst). Created with tools/im2rec.py or with custom script. Format: Tab separated record of index, one or more labels and relative_path_from_root.
  - imglist (list) – A list of images with the label(s). Each item is a list [imagelabel: float or list of float, imgpath].
  - path_root (str) – Root folder of image files.
  - path_imgidx (str) – Path to image index file. Needed for partition and shuffling when using .rec source.
  - shuffle (bool) – Whether to shuffle all images at the start of each iteration or not. Can be slow for HDD.
  - part_index (int) – Partition index.
  - num_parts (int) – Total number of partitions.
  - data_name (str) – Data name for provided symbols.
  - label_name (str) – Name for detection labels
  - last_batch_handle (str**, optional) – How to handle the last batch. This parameter can be ‘pad’(default), ‘discard’ or ‘roll_over’. If ‘pad’, the last batch will be padded with data starting from the begining If ‘discard’, the last batch will be discarded If ‘roll_over’, the remaining elements will be rolled over to the next iteration
  - kwargs – More arguments for creating augmenter. See mx.image.CreateDetAugmenter.
  - image.CreateDetAugmenter的Parameters
    data_shape (tuple of int) – Shape for output data
    resize (int) – Resize shorter edge if larger than 0 at the begining
    rand_crop (float) – [0, 1], probability to apply random cropping
    rand_pad (float) – [0, 1], probability to apply random padding
    rand_gray (float) – [0, 1], probability to convert to grayscale for all channels
    rand_mirror (bool) – Whether to apply horizontal flip to image with probability 0.5
    mean (np.ndarray or None) – Mean pixel values for [r, g, b]
    std (np.ndarray or None) – Standard deviations for [r, g, b]
    brightness (float) – Brightness jittering range (percent)
    contrast (float) – Contrast jittering range (percent)
    saturation (float) – Saturation jittering range (percent)
    hue (float) – Hue jittering range (percent)
    pca_noise (float) – Pca noise level (percent)
    inter_method (int**, default=2(Area-based**)) –
    Interpolation method for all resizing operations
    Possible values: 0: Nearest Neighbors Interpolation. 1: Bilinear interpolation. 2: Area-based (resampling using pixel area relation). It may be a preferred method for image decimation, as it gives moire-free results. But when the image is zoomed, it is similar to the Nearest Neighbors method. (used by default). 3: Bicubic interpolation over 4x4 pixel neighborhood. 4: Lanczos interpolation over 8x8 pixel neighborhood. 9: Cubic for enlarge, area for shrink, bilinear for others 10: Random select from interpolation method metioned above. Note: When shrinking an image, it will generally look best with AREA-based interpolation, whereas, when enlarging an image, it will generally look best with Bicubic (slow) or Bilinear (faster but still looks OK).
    min_object_covered (float) – The cropped area of the image must contain at least this fraction of any bounding box supplied. The value of this parameter should be non-negative. In the case of 0, the cropped area does not need to overlap any of the bounding boxes supplied.
    min_eject_coverage (float) – The minimum coverage of cropped sample w.r.t its original size. With this constraint, objects that have marginal area after crop will be discarded.
    aspect_ratio_range (tuple of floats) – The cropped area of the image must have an aspect ratio = width / height within this range.
    area_range (tuple of floats) – The cropped area of the image must contain a fraction of the supplied image within in this range.
    max_attempts (int) – Number of attempts at generating a cropped/padded region of the image of the specified constraints. After max_attempts failures, return the original image.
    pad_val (float) – Pixel value to be filled when padding is enabled. pad_val will automatically be subtracted by mean and divided by std if applicable.

9.7单发多框检测

numpy的flatten()：默认将数组按行变换展开。返回的是拷贝，而ravel()会修改数据。

>>> a = np.array([[1,2], [3,4]])
>>> a.flatten()   # 默认参数为"C"，即按照行进行重组
array([1, 2, 3, 4])
>>> a.flatten('F') # 按照列进行重组
array([1, 3, 2, 4])

填充为1的3×3卷积层不改变特征图的形状。

感受野计算公式：

单发多框检测模型（single shot multibox detection，SSD）：一共包含5个模块，每个模块输出的特征图既用来生成锚框，又用来预测这些锚框的类别和偏移量。第一模块为基础网络块，第二模块至第四模块为高和宽减半块，第五模块使用全局最大池化层将高和宽降到1。

# 0.边框预测层， 每个锚框4个偏移量
def bbox_predictor(num_anchors):
    return nn.Conv2D(num_anchors * 4, kernel_size=3, padding=1)
# 0.类别预测层
def cls_predictor(num_anchors, num_classes):
    return nn.Conv2D(num_anchors * (num_classes + 1), kernel_size=3,
                     padding=1)
# 0.转换维度后扁平化
def flatten_pred(pred):
    return pred.transpose((0, 2, 3, 1)).flatten()
# 0.多尺度连结预测结果。
# 将预测结果转化为(批量大小, 高 × 宽 × 通道数)，之后在维度1上连结
def concat_preds(preds):
    return nd.concat(*[flatten_pred(p) for p in preds], dim=1)
# 1.基础网络块用来从原始图像中抽取特征。
# 该网络串联3个高和宽减半块，并逐步将通道数翻倍。
def base_net():
    blk = nn.Sequential()
    for num_filters in [16, 32, 64]:
        blk.add(down_sample_blk(num_filters))
    return blk
# 2.宽高减半块，需要先于基础网络块定义
# 宽高减半，可以改变通道数，每个感受野6*6
def down_sample_blk(num_channels):
    blk = nn.Sequential()
    for _ in range(2):
        blk.add(nn.Conv2D(num_channels, kernel_size=3, padding=1),
                nn.BatchNorm(in_channels=num_channels), # 批量归一化
                nn.Activation('relu'))
    blk.add(nn.MaxPool2D(2))
    return blk
# 3.全局最大池化层
# 4.完整SSD模型
def get_blk(i):
    if i == 0:
        blk = base_net()
    elif i == 4:
        blk = nn.GlobalMaxPool2D()
    else:
        blk = down_sample_blk(128)
    return blk
# 5.前向计算，返回（特征图Y，锚框anchors，预测类别，预测偏移量）
def blk_forward(X, blk, size, ratio, cls_predictor, bbox_predictor):
    Y = blk(X)
    anchors = contrib.ndarray.MultiBoxPrior(Y, sizes=size, ratios=ratio)
    cls_preds = cls_predictor(Y)
    bbox_preds = bbox_predictor(Y)
    return (Y, anchors, cls_preds, bbox_preds)
# 6.定义大小，宽高比，锚框数
sizes = [[0.2, 0.272], [0.37, 0.447], [0.54, 0.619], [0.71, 0.79],
         [0.88, 0.961]]
ratios = [[1, 2, 0.5]] * 5
num_anchors = len(sizes[0]) + len(ratios[0]) - 1
# 7.定义模型类
class TinySSD(nn.Block):
    def __init__(self, num_classes, **kwargs):
        super(TinySSD, self).__init__(**kwargs)
        self.num_classes = num_classes
        # 设置每层的神经网络层，类别预测函数，偏移预测函数
        for i in range(5):
            # 即赋值语句self.blk_i = get_blk(i)
            setattr(self, 'blk_%d' % i, get_blk(i))
            setattr(self, 'cls_%d' % i, cls_predictor(num_anchors,num_classes))
            setattr(self, 'bbox_%d' % i, bbox_predictor(num_anchors))

    def forward(self, X):
        anchors, cls_preds, bbox_preds = [None] * 5, [None] * 5, [None] * 5
        # 计算每层的（特征图Y，锚框anchors，预测类别，预测偏移量）
        for i in range(5):
            # getattr(self, 'blk_%d' % i)即访问self.blk_i
            X, anchors[i], 
            cls_preds[i], 
            bbox_preds[i] = blk_forward(X,getattr(self, 'blk_%d' % i), 
                                        sizes[i], ratios[i],
                                        getattr(self, 'cls_%d' % i), 
                                        getattr(self, 'bbox_%d' % i))
        # reshape函数中的0表示保持批量大小不变
        return (nd.concat(*anchors, dim=1),
                concat_preds(cls_preds).reshape((0, -1, self.num_classes + 1)), 
                concat_preds(bbox_preds))
# 8.定义损失函数
# 有关锚框类别的损失，交叉熵损失函数
cls_loss = gloss.SoftmaxCrossEntropyLoss()
# 正类锚框偏移量的损失，L1 范数损失，即预测值与真实值之间差的绝对值。
bbox_loss = gloss.L1Loss()
# 总的损失函数
def calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels, bbox_masks):
    cls = cls_loss(cls_preds, cls_labels)
    bbox = bbox_loss(bbox_preds * bbox_masks, bbox_labels * bbox_masks)
    return cls + bbox
# 沿用准确率评价分类结果
def cls_eval(cls_preds, cls_labels):
    # 由于类别预测结果放在最后一维，argmax需要指定最后一维
    return (cls_preds.argmax(axis=-1) == cls_labels).sum().asscalar()
# L1 范数损失，我们用平均绝对误差评价边界框的预测结果。
def bbox_eval(bbox_preds, bbox_labels, bbox_masks):
    return ((bbox_labels - bbox_preds) * bbox_masks).abs().sum().asscalar()
# 9.训练模型
for epoch in range(20):
    acc_sum, mae_sum, n, m = 0.0, 0.0, 0, 0
    train_iter.reset()  # 从头读取数据
    start = time.time()
    for batch in train_iter:
        X = batch.data[0].as_in_context(ctx)
        Y = batch.label[0].as_in_context(ctx)
        with autograd.record():
            # 生成多尺度的锚框，为每个锚框预测类别和偏移量
            anchors, cls_preds, bbox_preds = net(X)
            # 为每个锚框标注类别和偏移量
            bbox_labels, bbox_masks, cls_labels = contrib.nd.MultiBoxTarget(
                anchors, Y, cls_preds.transpose((0, 2, 1)))
            # 根据类别和偏移量的预测和标注值计算损失函数
            l = calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels,
                          bbox_masks)
        l.backward()
        trainer.step(batch_size)
        acc_sum += cls_eval(cls_preds, cls_labels)
        n += cls_labels.size
        mae_sum += bbox_eval(bbox_preds, bbox_labels, bbox_masks)
        m += bbox_labels.size

    if (epoch + 1) % 5 == 0:
        print('epoch %2d, class err %.2e, bbox mae %.2e, time %.1f sec' % (
            epoch + 1, 1 - acc_sum / n, mae_sum / m, time.time() - start))

单发多框检测在训练中根据类别和偏移量的预测和标注值分别计算损失函数，类别可以用交叉熵损失、焦点损失；偏移量可以用L1范数损失、平滑L1范数损失。

练习

限于篇幅，实验中忽略了单发多框检测的一些实现细节。你能从以下几个方面进一步改进模型吗？

偏移量预测改进平滑L1范数：

def calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels, bbox_masks):
    cls = cls_loss(cls_preds, cls_labels)
    # bbox = bbox_loss(bbox_preds * bbox_masks, bbox_labels * bbox_masks)
    bbox = nd.smooth_l1(bbox_preds * bbox_masks - bbox_labels * bbox_masks, scale=0.3).mean(axis=1) # 通过预测和标签的差值，作为x输入平滑L1范数函数
    return cls + bbox

类别预测改进为焦点损失：

# 1.焦点损失函数定义，x为真实类别j的预测概率
def focal_loss(gamma, x):
    return -(1 - x) ** gamma * x.log()
# 2.softmax函数，转化为概率。
def softmax(X):
    X_exp = X.exp()
    partition = X_exp.sum(axis=1, keepdims=True)
    return X_exp / partition 
# 3.计算总的损失函数
def calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels, bbox_masks):
    # 这一步使用了取巧的方法，因为不知道怎么通过cls_preds和cls_labels来求概率
    # 所以利用交叉熵公式-logpi反求出pi,再将其作为参数x输入焦点损失函数
    # 但是貌似效果不好，暂无进一步解决方案。
    cls = cls_loss(cls_preds, cls_labels)
    cls = focal_loss(1.0 , (-cls).exp())
    bbox = nd.smooth_l1(bbox_preds * bbox_masks - bbox_labels * bbox_masks, scale=0.3).mean(axis=1)
    return cls + bbox

9.8区域卷积神经网络（R-CNN)

R-CNN模型：

Fast R-CNN：

Faster R-CNN：

使用填充为1的3×3卷积层变换卷积神经网络的输出，并将输出通道数记为c。这样，卷积神经网络为图像抽取的特征图中的每个单元均得到一个长度为cc的新特征。
以特征图每个单元为中心，生成多个不同大小和宽高比的锚框并标注它们。
用锚框中心单元长度为cc的特征分别预测该锚框的二元类别（含目标还是背景）和边界框。
使用非极大值抑制，从预测类别为目标的预测边界框中移除相似的结果。最终输出的预测边界框即兴趣区域池化层所需要的提议区域。

Mask R-CNN：

# 兴趣池化层，只池化感兴趣的提议地方。
X = nd.arange(16).reshape((1, 1, 4, 4))
rois = nd.array([[0, 0, 0, 20, 20], [0, 0, 10, 30, 30]])
# 由于X的高宽是图像0.1，所以两个提议区域中的坐标先按spatial_scale自乘0.1，然后分别标出兴趣区域
nd.ROIPooling(X, rois, pooled_size=(2, 2), spatial_scale=0.1)

练习

了解GluonCV工具包中有关本节中各个模型的实现 [6]。

答：详细教程：https://gluon-cv.mxnet.io/model_zoo/detection.html

# Faster R-CNN 
!pip install gluoncv
from matplotlib import pyplot as plt
import gluoncv
from gluoncv import model_zoo, data, utils
# 预训练模型
net = model_zoo.get_model('faster_rcnn_resnet50_v1b_voc', pretrained=True)
# 下载一张图片
im_fname = utils.download('https://github.com/dmlc/web-data/blob/master/' +
                          'gluoncv/detection/biking.jpg?raw=true',
                          path='biking.jpg')
# 转换成加载图片                          
x, orig_img = data.transforms.presets.rcnn.load_test(im_fname)
# 前向计算
box_ids, scores, bboxes = net(x)
# 展示结果
ax = utils.viz.plot_bbox(orig_img, bboxes[0], scores[0], box_ids[0], class_names=net.classes)

plt.show()

9.9语义分割和数据集

# 下载voc_pascal数据集，本函数已保存在d2lzh包中方便以后使用
def download_voc_pascal(data_dir='../data'):
    voc_dir = os.path.join(data_dir, 'VOCdevkit/VOC2012')
    url = ('http://host.robots.ox.ac.uk/pascal/VOC/voc2012'
           '/VOCtrainval_11-May-2012.tar')
    sha1 = '4e443f8a2eca6b1dac8a6c57641b67dd40621a49'
    fname = gutils.download(url, data_dir, sha1_hash=sha1)
    with tarfile.open(fname, 'r') as f:
        f.extractall(data_dir)
    return voc_dir
# 读取voc_pascal数据集的输入图和标签到内存，本函数已保存在d2lzh包中方便以后使用
def read_voc_images(root=voc_dir, is_train=True):
    txt_fname = '%s/ImageSets/Segmentation/%s' % (
        root, 'train.txt' if is_train else 'val.txt')
    with open(txt_fname, 'r') as f:
        images = f.read().split()
    features, labels = [None] * len(images), [None] * len(images)
    for i, fname in enumerate(images):
        features[i] = image.imread('%s/JPEGImages/%s.jpg' % (root, fname))
        labels[i] = image.imread(
            '%s/SegmentationClass/%s.png' % (root, fname))
    return features, labels
# 使用数据集
voc_train = VOCSegDataset(True, crop_size, voc_dir, colormap2label)
voc_test = VOCSegDataset(False, crop_size, voc_dir, colormap2label)
batch_size = 64
num_workers = 0 if sys.platform.startswith('win32') else 4
train_iter = gdata.DataLoader(voc_train, batch_size, shuffle=True,
                              last_batch='discard', num_workers=num_workers)
test_iter = gdata.DataLoader(voc_test, batch_size, last_batch='discard',
                             num_workers=num_workers)

练习

回忆“图像增广”一节中的内容。哪些在图像分类中使用的图像增广方法难以用于语义分割？
答：裁剪而没有放大到同样大小的不行。

9.10FCN全卷积网络

矩阵乘法实现卷积：看下方转置卷积教程。

转置卷积：可参考https://blog.csdn.net/tsyccnh/article/details/87357447，讲的更形象。转置卷积就是将卷积的结果乘以一个权重，而变回卷积之前的形状，不能恢复到原始数值。如果步幅为s、填充为s/2（假设s/2为整数）、卷积核的高和宽为2s，转置卷积核将输入的高和宽分别放大s倍。

上采样：放大。常用双线性插值的方法。

下采样：缩小。

FCN模型：全卷积网络先使用卷积神经网络抽取图像特征，然后通过1×1卷积层将通道数变换为类别个数，最后通过转置卷积层将特征图的高和宽变换为输入图像的尺寸。模型输出与输入图像的高和宽相同，并在空间位置上一一对应：最终输出的通道包含了该空间位置像素的类别预测。

X[::3]，从第0个开始，每隔三个显示。

练习

用矩阵乘法来实现卷积运算是否高效？为什么？
答：不高效，还要涉及矩阵的变换，
如果将转置卷积层改用Xavier随机初始化，结果有什么变化？
答：结果是准确率卡在了0.729左右，后续迭代，损失值下降，但是没有提高准确率。
调节超参数，能进一步提升模型的精度吗？
答：只提高学习率会导致overfit；只提高batch_size导致内存溢出。且由于占用内存和时间过大，调参显得很麻烦，考虑使用云计算平台。
预测测试图像中所有像素的类别。
答：预测像素的类别？不理解题意。如果是预测所有图像，提高n的值就可以了。
全卷积网络的论文中还使用了卷积神经网络的某些中间层的输出 [1]。试着实现这个想法。
答：第一次看此书，先不看论文。下次看pytorch实现的版本，再看论文。

9.11样式迁移

样式迁移（style transfer）：使用卷积神经网络自动将某图像中的样式应用在另一图像之上。

内容损失（content loss）使合成图像与内容图像在内容特征上接近。

样式损失（style loss）令合成图像与样式图像在样式特征上接近

总变差损失（total variation loss）则有助于减少合成图像中的噪点。

卷积层参数使用预训练模型来提取特征。

第一和第三卷积层输出作为样式特征

第二卷积层输出作为内容特征。

模型参数为合成图像。

正向传播（实线）计算损失，反向传播（虚线）迭代模型参数。

拉姆矩阵：格拉姆矩阵（Gram matrix）XX⊤∈Rc×c中i行j列的元素xij即向量xi与xj的内积，它表达了通道i和通道j上样式特征的相关性。(假设该输出的样本数为1，通道数为c，高和宽分别为h和w，我们可以把输出变换成c行hw列的矩阵X。)

练习

选择不同的内容和样式层，输出有什么变化？
答：选择了最后一个卷积层[34]作为内容层，内容损失很快降低到0.86左右。输出图像肉眼看不出变换。样式和内容层使用了[2,7,12,14,16,21,23,25,28,30,32], [34]，迭代后，样式层损失仅降低到7左右。输出图像还是肉眼看不出差距。
调整损失函数中的权值超参数，输出是否保留更多内容或减少更多噪点？
答：直觉上，提高内容损失的权值超参数，可以让他更容易被惩罚，所以内容损失降的更低，从而保留更多的内容。由于不清楚如何辨别内容和噪点的数量多少，故无实验。
替换实验中的内容图像和样式图像，你能创作出更有趣的合成图像吗？
答：选了两张动漫人物图，合成结果基本上保留内容图像，可能风格一致吧，损失也很低。
选了一张动漫人物图和一张风景图，合成结果很多噪点，损失值很高。

最后更新于4年前

这有帮助吗？