Implementing our network to classify digits

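Alright, let's write a program that learns how to recognize handwritten digits, using stochastic gradient descent and the MNIST training data. The first thing we need is to get the MNIST data. If you're a git user then you can obtain the data by cloning the code repository for this book,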

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

If you don't use git then you can download the data and code as an archive from the same repository.

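Incidentally, when I described the MNIST data earlier, I said it was split into 60,000 training images and 10,000 test images. That's the official MNIST description. Actually, we're going to split the data a little differently. We'll leave the test images as is, but split the 60,000-image MNIST training set into two parts: a set of 50,000 images, which we'll use to train our neural network, and a separate 10,000-image validation set. We won't use the validation data in this chapter, but later in the book we'll find it useful in figuring out how to set certain hyper-parameters of the neural network, things like the learning rate, which aren't directly selected by our learning algorithm. Although the validation data isn't part of the original MNIST specification, many people use MNIST in this fashion, and the use of validation data is common in neural networks. When I refer to the "MNIST training data" from now on, I'll be referring to our 50,000-image data set, not the original 60,000-image data set¹.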

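¹ As noted earlier, the MNIST data set is based on two data sets collected by NIST, the United States' National Institute of Standards and Technology. To construct MNIST, the NIST data sets were stripped down and put into a more convenient format by Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. The data set in my repository is in a form that makes it easy to load and manipulate the MNIST data in Python. I obtained this particular form of the data from the LISA machine learning laboratory at the University of Montreal.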

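Apart from the MNIST data we also need a Python library called Numpy, for doing fast linear algebra. If you don't already have Numpy installed, you'll need to install it before running the code.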

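Before giving a full listing, let me explain the core features of the neural network code. The centerpiece is a Network class, which we use to represent a neural network. Here's the code we use to initialize a Network object: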

class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x) 
                        for x, y in zip(sizes[:-1], sizes[1:])]

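In this code, the list sizes contains the number of neurons in the respective layers. So, for example, if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code: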

net = Network([2, 3, 1])

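The biases and weights in the Network object are all initialized randomly, using the Numpy np.random.randn function to generate Gaussian distributions with mean 0 and standard deviation 1. This random initialization gives our stochastic gradient descent algorithm a place to start from. In later chapters we'll find better ways of initializing the weights and biases, but this will do for now. Note that the Network initialization code assumes that the first layer of neurons is an input layer, and sets no biases for those neurons, since biases are only ever used in computing the outputs from later layers.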
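
To make the resulting shapes concrete, here is a small sketch, assuming only Numpy, that rebuilds the two lists from the initializer for sizes = [2, 3, 1] outside the class. The seed call and the names bias_shapes and weight_shapes are illustrative and not part of the book's code.

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed only so that reruns match

sizes = [2, 3, 1]
# One bias column vector per non-input layer.
biases = [np.random.randn(y, 1) for y in sizes[1:]]
# One weight matrix per pair of adjacent layers, shaped (next, previous).
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

bias_shapes = [b.shape for b in biases]
weight_shapes = [w.shape for w in weights]
print(bias_shapes)    # [(3, 1), (1, 1)]
print(weight_shapes)  # [(3, 2), (1, 3)]
```

There is no entry for the input layer in either list, which is why a three-layer network ends up with two bias vectors and two weight matrices.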

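Note also that the biases and weights are stored as lists of Numpy matrices. So, for example, net.weights[1] is a Numpy matrix storing the weights connecting the second and third layers of neurons. (It's not the first and second layers, since Python's list indexing starts at 0.) Since net.weights[1] is rather verbose, let's just denote that matrix $w$. It's a matrix such that $w_{jk}$ is the weight for the connection between the $k^{\rm th}$ neuron in the second layer and the $j^{\rm th}$ neuron in the third layer. This ordering of the $j$ and $k$ indices may seem strange: surely it would make more sense to swap the $j$ and $k$ indices around? The big advantage of using this ordering is that it means that the vector of activations of the third layer of neurons is: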

$\begin{eqnarray} a' = \sigma(w a + b). \tag{22}\end{eqnarray}$

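There's quite a bit going on in this equation, so let's unpack it piece by piece. $a$ is the vector of activations of the second layer of neurons. To obtain $a'$ we multiply $a$ by the weight matrix $w$, and add the vector $b$ of biases. We then apply the function $\sigma$ elementwise to every entry in the vector $w a + b$. (This is called vectorizing the function $\sigma$.) It's easy to verify that Equation (22) gives the same result as our earlier rule, Equation (4), for computing the output of a sigmoid neuron.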

Exercise

Write out Equation (22) in component form, and verify that it gives the same result as the rule (4) for computing the output of a sigmoid neuron.

With all this in mind, it's easy to write code computing the output from a Network instance. We begin by defining the sigmoid function:

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

Note that when the input $z$ is a vector or Numpy array, Numpy automatically applies the function sigmoid elementwise, that is, in vectorized form.
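
As a quick illustration of that elementwise behaviour, the sketch below applies the same sigmoid definition to a small column vector. The input values are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([[-2.0], [0.0], [2.0]])  # a (3, 1) column vector
out = sigmoid(z)                      # applied to each entry separately
print(out.shape)   # (3, 1)
print(out[1, 0])   # 0.5, since sigmoid(0) = 1 / (1 + 1)
```

The output has the same shape as the input, so the function can be dropped into Equation (22) without any explicit loop over neurons.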

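We then add a feedforward method to the Network class, which, given an input $a$ for the network, returns the corresponding output². All the method does is apply Equation (22) for each layer: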

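² It is assumed that the input a is an (n, 1) Numpy ndarray, not an (n,) vector. Here, n is the number of inputs to the network. If you try to use an (n,) vector as input you'll get strange results. Although using an (n,) vector appears the more natural choice, using an (n, 1) ndarray makes it particularly easy to modify the code to feedforward multiple inputs at once, and that is sometimes convenient.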

    def feedforward(self, a):
        """Return the output of the network if "a" is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a
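
The requirement that inputs be (n, 1) arrays rather than (n,) vectors can be seen in a short sketch of a single layer's computation. The weight and bias values below are placeholders chosen for illustration; only the shapes matter.

```python
import numpy as np

w = np.ones((3, 2))    # weights into a 3-neuron layer from 2 inputs
b = np.zeros((3, 1))   # biases stored as a column, as in Network

col = np.array([[1.0], [2.0]])  # shape (2, 1): the assumed input format
flat = np.array([1.0, 2.0])     # shape (2,): a plain vector
batch = np.ones((2, 5))         # five inputs stacked as columns

good = np.dot(w, col) + b    # (3, 1) + (3, 1) -> (3, 1)
bad = np.dot(w, flat) + b    # (3,) + (3, 1) broadcasts to (3, 3)
many = np.dot(w, batch) + b  # (3, 5): one output column per input
print(good.shape)  # (3, 1)
print(bad.shape)   # (3, 3)
print(many.shape)  # (3, 5)
```

With an (n,) input, Numpy's broadcasting silently produces a square matrix instead of raising an error, which is one way the strange results can arise. The last line shows why the column convention extends naturally to several inputs at once.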

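Of course, the main thing we want our Network objects to do is to learn. To that end we'll give them an SGD method which implements stochastic gradient descent. Here's the code. It's a little mysterious in a few places, but I'll break it down after the listing.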

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent.  The "training_data" is a list of tuples
        "(x, y)" representing the training inputs and the desired
        outputs.  The other non-optional parameters are
        self-explanatory.  If "test_data" is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out.  This is useful for
        tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)

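The training_data is a list of tuples (x, y) representing the training inputs and corresponding desired outputs. The variables epochs and mini_batch_size are what you'd expect: the number of epochs to train for, and the size of the mini-batches to use when sampling. eta is the learning rate, $\eta$. If the optional argument test_data is supplied, then the program will evaluate the network after each epoch of training, and print out partial progress. This is useful for tracking progress, but slows things down substantially.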

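The code works as follows. In each epoch, it starts by randomly shuffling the training data, and then partitions it into mini-batches of the appropriate size. This is an easy way of sampling randomly from the training data. Then for each mini_batch we apply a single step of gradient descent. This is done by the code self.update_mini_batch(mini_batch, eta), which updates the network weights and biases according to a single iteration of gradient descent, using just the training data in mini_batch. Here's the code for the update_mini_batch method: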

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw 
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb 
                       for b, nb in zip(self.biases, nabla_b)]

Most of the work is done by the line

delta_nabla_b, delta_nabla_w = self.backprop(x, y)

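This invokes something called the backpropagation algorithm, which is a fast way of computing the gradient of the cost function. So update_mini_batch works simply by computing these gradients for every training example in the mini_batch, and then updating self.weights and self.biases appropriately.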
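
The shuffle-and-split step in SGD and the averaged update in update_mini_batch can be sketched in isolation. This is a toy version written for Python 3: integers stand in for the (x, y) pairs, a single number stands in for a weight matrix, and the gradient values are hypothetical placeholders for what self.backprop would return.

```python
import random

random.seed(1)  # assumption: fixed seed only for repeatability

training_data = list(range(10))  # stand-ins for (x, y) training pairs
mini_batch_size = 4

# Shuffle, then slice into consecutive chunks, as SGD does each epoch.
random.shuffle(training_data)
mini_batches = [training_data[k:k + mini_batch_size]
                for k in range(0, len(training_data), mini_batch_size)]
batch_sizes = [len(mb) for mb in mini_batches]
print(batch_sizes)  # [4, 4, 2]

# Averaged update for one scalar "weight".
eta = 3.0
w = 1.0
grads = [0.2, 0.4, 0.6]  # hypothetical per-example gradients
w_new = w - (eta / len(grads)) * sum(grads)
print(w_new)  # approximately -0.2
```

Because the step is scaled by eta / len(mini_batch), the final, shorter mini-batch is averaged over its own size when the training set does not divide evenly.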

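I'm not going to show the code for self.backprop right now. We'll study how backpropagation works in the next chapter, including the code for self.backprop. For now, just assume that it behaves as claimed, returning the appropriate gradient for the cost associated to the training example x.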

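Let's look at the full program, including the documentation strings, which I omitted above. Apart from self.backprop the program is self-explanatory: all the heavy lifting is done in self.SGD and self.update_mini_batch, which we've already discussed. The self.backprop method makes use of a few extra functions to help in computing the gradient, namely sigmoid_prime, which computes the derivative of the $\sigma$ function, and self.cost_derivative, which I won't describe here. You can get the gist of these (and perhaps the details) just by looking at the code and documentation strings. We'll look at them in detail in the next chapter. Note that while the program appears lengthy, much of the code is documentation strings intended to make the code easy to understand. In fact, the program contains just 74 lines of non-whitespace, non-comment code. All the code may be found on GitHub, in the book's repository.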

"""
network.py
~~~~~~~~~~

A module to implement the stochastic gradient descent learning
algorithm for a feedforward neural network.  Gradients are calculated
using backpropagation.  Note that I have focused on making the code
simple, easily readable, and easily modifiable.  It is not optimized,
and omits many desirable features.
"""

#### Libraries
# Standard library
import random

# Third-party libraries
import numpy as np

class Network(object):

    def __init__(self, sizes):
        """The list ``sizes`` contains the number of neurons in the
        respective layers of the network.  For example, if the list
        was [2, 3, 1] then it would be a three-layer network, with the
        first layer containing 2 neurons, the second layer 3 neurons,
        and the third layer 1 neuron.  The biases and weights for the
        network are initialized randomly, using a Gaussian
        distribution with mean 0, and variance 1.  Note that the first
        layer is assumed to be an input layer, and by convention we
        won't set any biases for those neurons, since biases are only
        ever used in computing the outputs from later layers."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent.  The ``training_data`` is a list of tuples
        ``(x, y)`` representing the training inputs and the desired
        outputs.  The other non-optional parameters are
        self-explanatory.  If ``test_data`` is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out.  This is useful for
        tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the
        gradient for the cost function C_x.  ``nabla_b`` and
        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
        to ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def evaluate(self, test_data):
        """Return the number of test inputs for which the neural
        network outputs the correct result. Note that the neural
        network's output is assumed to be the index of whichever
        neuron in the final layer has the highest activation."""
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y)

#### Miscellaneous functions
def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
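
One way to convince yourself that sigmoid_prime really is the derivative of sigmoid is a finite-difference check. The sketch below repeats the two function definitions from the listing and compares them numerically; the step size h and the sample points are arbitrary choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

z = np.linspace(-5.0, 5.0, 11)  # sample points
h = 1e-5                        # finite-difference step
# Central difference approximation to the derivative of sigmoid.
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
max_err = np.max(np.abs(numeric - sigmoid_prime(z)))
print(max_err < 1e-8)  # True
```

The same kind of check is a common way to test gradient code in general, although here it covers only the scalar derivative and not self.backprop.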

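How well does the program recognize handwritten digits? Well, let's start by loading in the MNIST data. I'll do this using a little helper program, mnist_loader.py, to be described below. We execute the following commands in a Python shell,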

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()

Of course, this could also be done in a separate Python program, but if you're following along it's probably easiest to do in a Python shell.

After loading the MNIST data, we'll set up a Network with $30$ hidden neurons. We do this after importing the Python program listed above, which is named network,

>>> import network
>>> net = network.Network([784, 30, 10])

Finally, we'll use stochastic gradient descent to learn from the MNIST training_data over 30 epochs, with a mini-batch size of 10, and a learning rate of $\eta = 3.0$,

>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

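Note that if you're running the code as you read along, it will take some time to execute: for a typical machine (as of 2015) it will likely take a few minutes to run. I suggest you set things running, continue to read, and periodically check the output from the code. If you're in a rush you can speed things up by decreasing the number of epochs, by decreasing the number of hidden neurons, or by using only part of the training data. Note that production code would be much, much faster: these Python scripts are intended to help you understand how neural nets work, not to be high-performance code! And, of course, once we've trained a network it can be run very quickly indeed, on almost any computing platform. For example, once we've learned a good set of weights and biases for a network, it can easily be ported to run in Javascript in a web browser, or as a native app on a mobile device. In any case, here is a partial transcript of the output of one training run of the neural network. The transcript shows the number of test images correctly recognized by the neural network after each epoch of training. As you can see, after just a single epoch this has reached 9,129 out of 10,000, and the number continues to grow,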

Epoch 0: 9129 / 10000
Epoch 1: 9295 / 10000
Epoch 2: 9348 / 10000
...
Epoch 27: 9528 / 10000
Epoch 28: 9542 / 10000
Epoch 29: 9534 / 10000

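The trained network gives us a classification rate of about $95$ percent, $95.42$ percent at its peak ("Epoch 28")! That's quite encouraging as a first attempt. I should warn you, however, that if you run the code then your results are not necessarily going to be quite the same as mine, since we'll be initializing our network using (different) random weights and biases. To generate results in this chapter I've taken best-of-three runs.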

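Let's rerun the above experiment, changing the number of hidden neurons to 100. As was the case earlier, if you're running the code as you read along, you should be warned that it takes quite a while to execute (on my machine this experiment takes tens of seconds for each training epoch), so it's wise to continue reading in parallel while the code executes.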

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

Sure enough, this improves the results to 96.59 percent. At least in this case, using more hidden neurons helps us get better results³.

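³ Reader feedback indicates quite some variation in results for this experiment, and some training runs give results quite a bit worse. Using the techniques introduced in Chapter 3 will greatly reduce the variation in performance across different training runs for our networks.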

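Of course, to obtain these accuracies I had to make specific choices for the number of epochs of training, the mini-batch size, and the learning rate, $\eta$. As I mentioned above, these are known as hyper-parameters for our neural network, in order to distinguish them from the parameters (weights and biases) learnt by our learning algorithm. If we choose our hyper-parameters poorly, we can get bad results. Suppose, for example, that we'd chosen the learning rate to be $\eta = 0.001$,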

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 0.001, test_data=test_data)

The results are much less encouraging,

Epoch 0: 1139 / 10000
Epoch 1: 1136 / 10000
Epoch 2: 1135 / 10000
...
Epoch 27: 2101 / 10000
Epoch 28: 2123 / 10000
Epoch 29: 2142 / 10000

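However, you can see that the performance of the network is getting slowly better over time. That suggests increasing the learning rate, say to $\eta = 0.01$. If we do that, we get better results, which suggests increasing the learning rate again. (If making a change improves things, try doing more!) If we do that several times over, we'll end up with a learning rate of something like $\eta = 1.0$ (and perhaps fine tune to $3.0$), which is close to our earlier experiments. So even though we initially made a poor choice of hyper-parameters, we at least got enough information to help us improve our choice of hyper-parameters.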

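In general, debugging a neural network can be challenging. This is especially true when the initial choice of hyper-parameters produces results no better than random noise. Suppose we try the successful 30 hidden neuron network architecture from earlier, but with the learning rate changed to $\eta = 100.0$: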

>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 100.0, test_data=test_data)

At this point we've actually gone too far, and the learning rate is too high:

Epoch 0: 1009 / 10000
Epoch 1: 1009 / 10000
Epoch 2: 1009 / 10000
Epoch 3: 1009 / 10000
...
Epoch 27: 982 / 10000
Epoch 28: 982 / 10000
Epoch 29: 982 / 10000

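Now imagine that we were coming to this problem for the first time. Of course, we know from our earlier experiments that the right thing to do is to decrease the learning rate. But if we were coming to this problem for the first time then there wouldn't be much in the output to guide us on what to do. We might worry not only about the learning rate, but about every other aspect of our neural network. We might wonder if we've initialized the weights and biases in a way that makes it hard for the network to learn? Or maybe we don't have enough training data to get meaningful learning? Perhaps we haven't run for enough epochs? Or maybe it's impossible for a neural network with this architecture to learn to recognize handwritten digits? Maybe the learning rate is too low? Or, maybe, the learning rate is too high? When you're coming to a problem for the first time, you're not always sure.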

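The lesson to take away from this is that debugging a neural network is not trivial, and, just as for ordinary programming, there is an art to it. You need to learn that art of debugging in order to get good results from neural networks. More generally, we need to develop heuristics for choosing good hyper-parameters and a good architecture. We'll discuss all these at length through the book, including how I chose the hyper-parameters above.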
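
The three learning-rate regimes from the experiments above can be reproduced on a toy problem. The sketch below runs plain gradient descent on the one-variable cost $C(v) = v^2$, which is only an analogy: it is not the MNIST network, and the function name descend and the particular rates are choices made for this illustration.

```python
def descend(eta, steps=30, v=1.0):
    """Run gradient descent on C(v) = v**2 and return the final v."""
    for _ in range(steps):
        v = v - eta * 2 * v  # the gradient of v**2 is 2*v
    return v

for eta in (0.001, 0.3, 1.5):
    # Distance from the minimum at v = 0 after 30 steps.
    print(eta, abs(descend(eta)))
```

With the smallest rate the iterate has barely moved from its starting point after 30 steps, with the middle rate it is essentially at the minimum, and with the largest rate each step overshoots and the iterate grows without bound. On this toy cost the failure at a large rate is easy to diagnose; on a real network, as described above, the output gives far less guidance.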

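Exercise

Try creating a network with just two layers, an input and an output layer with 784 and 10 neurons respectively, and no hidden layer. Train the network using stochastic gradient descent. What classification accuracy can you achieve?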

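Earlier, I skipped over the details of how the MNIST data is loaded. It's pretty straightforward. For completeness, here's the code. The data structures used to store the MNIST data are described in the documentation strings. It's straightforward stuff: tuples and lists of Numpy ndarray objects (think of them as vectors if you're not familiar with ndarrays):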

"""
mnist_loader
~~~~~~~~~~~~

A library to load the MNIST image data.  For details of the data
structures that are returned, see the doc strings for ``load_data``
and ``load_data_wrapper``.  In practice, ``load_data_wrapper`` is the
function usually called by our neural network code.
"""

#### Libraries
# Standard library
import cPickle
import gzip

# Third-party libraries
import numpy as np

def load_data():
    """Return the MNIST data as a tuple containing the training data,
    the validation data, and the test data.

    The ``training_data`` is returned as a tuple with two entries.
    The first entry contains the actual training images.  This is a
    numpy ndarray with 50,000 entries.  Each entry is, in turn, a
    numpy ndarray with 784 values, representing the 28 * 28 = 784
    pixels in a single MNIST image.

    The second entry in the ``training_data`` tuple is a numpy ndarray
    containing 50,000 entries.  Those entries are just the digit
    values (0...9) for the corresponding images contained in the first
    entry of the tuple.

    The ``validation_data`` and ``test_data`` are similar, except
    each contains only 10,000 images.

    This is a nice data format, but for use in neural networks it's
    helpful to modify the format of the ``training_data`` a little.
    That's done in the wrapper function ``load_data_wrapper()``, see
    below.
    """
    f = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()
    return (training_data, validation_data, test_data)

def load_data_wrapper():
    """Return a tuple containing ``(training_data, validation_data,
    test_data)``. Based on ``load_data``, but the format is more
    convenient for use in our implementation of neural networks.

    In particular, ``training_data`` is a list containing 50,000
    2-tuples ``(x, y)``.  ``x`` is a 784-dimensional numpy.ndarray
    containing the input image.  ``y`` is a 10-dimensional
    numpy.ndarray representing the unit vector corresponding to the
    correct digit for ``x``.

    ``validation_data`` and ``test_data`` are lists containing 10,000
    2-tuples ``(x, y)``.  In each case, ``x`` is a 784-dimensional
    numpy.ndarray containing the input image, and ``y`` is the
    corresponding classification, i.e., the digit values (integers)
    corresponding to ``x``.

    Obviously, this means we're using slightly different formats for
    the training data and the validation / test data.  These formats
    turn out to be the most convenient for use in our neural network
    code."""
    tr_d, va_d, te_d = load_data()
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = zip(training_inputs, training_results)
    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    validation_data = zip(validation_inputs, va_d[1])
    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
    test_data = zip(test_inputs, te_d[1])
    return (training_data, validation_data, test_data)

def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the jth
    position and zeroes elsewhere.  This is used to convert a digit
    (0...9) into a corresponding desired output from the neural
    network."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e
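
The two conversions that load_data_wrapper applies to each training example can be tried on their own. This sketch repeats vectorized_result from the listing and uses a blank array as a hypothetical stand-in for one MNIST image, so it runs without the data file.

```python
import numpy as np

def vectorized_result(j):
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

image = np.zeros(784)            # hypothetical blank 28 * 28 image
x = np.reshape(image, (784, 1))  # column vector, as the wrapper builds
y = vectorized_result(7)         # desired output for the digit 7
print(x.shape, y.shape)   # (784, 1) (10, 1)
print(int(np.argmax(y)))  # 7
```

Taking np.argmax of the desired output recovers the digit, which mirrors how evaluate reads off the network's answer. Note that the listing above is Python 2 code; under Python 3, zip returns an iterator, so the wrapper's results would need to be wrapped in list(...) before SGD could call len and random.shuffle on them.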

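I said above that our program gets pretty good results. What does that mean? Good compared to what? It's informative to have some simple (non-neural-network) baseline tests to compare against, to understand what it means to perform well. The simplest baseline of all, of course, is to randomly guess the digit. That'll be right about ten percent of the time. We're doing much better than that!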

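What about a less trivial baseline? Let's try an extremely simple idea: we'll look at how dark an image is. For instance, an image of a $2$ will typically be quite a bit darker than an image of a $1$, simply because more pixels are blackened out.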

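This suggests using the training data to compute average darknesses for each digit, $0, 1, 2,\ldots, 9$. When presented with a new image, we compute how dark the image is, and then guess that it's whichever digit has the closest average darkness. This is a simple procedure, and is easy to code up, so I won't explicitly write out the code. If you're interested it's in the GitHub repository. But it's a big improvement over random guessing, getting $2,225$ of the $10,000$ test images correct, i.e., $22.25$ percent accuracy.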
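
For readers who want to see the shape of the idea without opening the repository, here is a minimal sketch of a darkness baseline. It is not the repository's program: the function names avg_darknesses and guess_digit and the four-pixel "images" are invented for this illustration, and darkness is taken to be the sum of pixel values.

```python
from collections import defaultdict
import numpy as np

def avg_darknesses(images, labels):
    """Return {digit: mean total darkness} over the training images."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for image, digit in zip(images, labels):
        totals[int(digit)] += float(np.sum(image))
        counts[int(digit)] += 1
    return {d: totals[d] / counts[d] for d in totals}

def guess_digit(image, avgs):
    """Return the digit whose average darkness is closest to the image's."""
    darkness = float(np.sum(image))
    return min(avgs, key=lambda d: abs(avgs[d] - darkness))

# Tiny made-up data: four-pixel "images" for the digits 1 and 2.
images = np.array([[0.1, 0.0, 0.1, 0.0],
                   [0.2, 0.0, 0.0, 0.0],
                   [0.9, 0.8, 0.7, 0.6],
                   [1.0, 0.9, 0.8, 0.7]])
labels = np.array([1, 1, 2, 2])
avgs = avg_darknesses(images, labels)
print(guess_digit(np.array([0.8, 0.8, 0.8, 0.8]), avgs))  # 2
print(guess_digit(np.array([0.0, 0.1, 0.0, 0.0]), avgs))  # 1
```

The images and labels arguments follow the layout that load_data returns for training_data, an array of images and a parallel array of digit values, so the same two functions could in principle be pointed at the real data.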

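It's not difficult to find other ideas which achieve accuracies in the $20$ to $50$ percent range. If you work a bit harder you can get up over $50$ percent. But to get much higher accuracies it helps to use established machine learning algorithms. Let's try using one of the best known algorithms, the support vector machine or SVM. If you're not familiar with SVMs, not to worry, we're not going to need to understand the details of how SVMs work. Instead, we'll use a Python library called scikit-learn, which provides a simple Python interface to a fast C-based library for SVMs known as LIBSVM.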

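If we run scikit-learn's SVM classifier using the default settings, then it gets $9,435$ of $10,000$ test images correct. (The code is available in the GitHub repository.) That's a big improvement over our naive approach of classifying an image based on how dark it is. Indeed, it means that the SVM is performing roughly as well as our neural networks, just a little worse. In later chapters we'll introduce new techniques that enable us to improve our neural networks so that they perform much better than the SVM.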

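That's not the end of the story, however. The $9,435$ of $10,000$ result is for scikit-learn's default settings for SVMs. SVMs have a number of tunable parameters, and it's possible to search for parameters which improve this out-of-the-box performance. I won't explicitly do this search, but instead refer you to a blog post by Andreas Mueller if you'd like to know more. Mueller shows that with some work optimizing the SVM's parameters it's possible to get the performance up above $98.5$ percent accuracy. In other words, a well-tuned SVM only makes an error on about one digit in $70$. That's pretty good! Can neural networks do better?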

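In fact, they can. At present, well-designed neural networks outperform every other technique for solving MNIST, including SVMs. The current (2013) record is classifying 9,979 of 10,000 images correctly. This was done by Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. We'll see most of the techniques they used later in the book. At that level the performance is close to human-equivalent, and is arguably better, since quite a few of the MNIST images are difficult even for humans to recognize with confidence.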

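With images that hard to classify in the MNIST data set, it's remarkable that neural networks can accurately classify all but $21$ of the $10,000$ test images. Usually, when programming we believe that solving a complicated problem like recognizing the MNIST digits requires a sophisticated algorithm. But even the neural networks in the Wan et al. paper just mentioned involve quite simple algorithms, variations on the algorithm we've seen in this chapter. All the complexity is learned, automatically, from the training data. In some sense, the moral of both our results and those in more sophisticated papers is that for some problems: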

sophisticated algorithm $\leq$ simple learning algorithm + good training data.
