Preface
To start, I attach the standard LSTM explanation, which contains the diagrams you are probably looking for.
This article is a set of notes on the LSTM material I have been reading over the last couple of days. The goal is to get the inputs, outputs, and network construction straight, including the TensorFlow code for building a multi-layer LSTM. A quick search turns up countless LSTM tutorials, but they always felt unclear to me (perhaps because I only skimmed them), so I am writing down the parts I believe I now understand, in the hope that it helps anyone else who is still figuring this out. As usual, we go RNN –> LSTM.
The meaning of the symbols used in the figures below (figure taken from the web):
RNN
First, the diagram that everyone posts (taken from the web). All figures taken from the web are marked as such; if there is any infringement, contact me and I will remove them.
In my experience, people without a formal background (like me) usually jump straight to LSTM. In fact, once you understand the classic RNN, LSTM works in just the same way. Based on this, I drew an easier-to-understand diagram below (assuming you are already familiar with traditional feed-forward networks):
This figure is drawn from my own understanding; any resemblance to other figures is purely coincidental. The RNN in this article uses only two layers, and I have labeled all the dimensions: $\omega_i$ denotes the weights of the $i$-th RNN layer, and $h_{t}^{i}$ denotes the hidden-layer output of the $i$-th RNN layer at time step $t$. I think this figure already makes clear how a multi-layer RNN propagates forward. It is worth noting that one RNN cell is just a fully-connected layer; the two layers above are two RNN cells, and deeper networks simply stack more of them. The weights, however, are shared across time: no matter how many time steps $t$ you unroll the two RNN cells above, there are still only two sets of weights.
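To make the weight sharing concrete, here is a small illustrative sketch (all sizes are made up) of a two-layer vanilla RNN unrolled over several time steps; the same two weight sets are reused at every step:
import numpy as np

batch, seq_length, input_dim, h1_dim, h2_dim = 4, 5, 3, 8, 6   # illustrative sizes
x = np.random.randn(seq_length, batch, input_dim)

# one weight set per RNN layer, shared across all time steps
W1, U1 = np.random.randn(input_dim, h1_dim), np.random.randn(h1_dim, h1_dim)
W2, U2 = np.random.randn(h1_dim, h2_dim), np.random.randn(h2_dim, h2_dim)

h1, h2 = np.zeros((batch, h1_dim)), np.zeros((batch, h2_dim))
for t in range(seq_length):                   # the same W1/U1/W2/U2 are used at every t
    h1 = np.tanh(x[t] @ W1 + h1 @ U1)         # layer 1
    h2 = np.tanh(h1 @ W2 + h2 @ U2)           # layer 2 consumes layer 1's output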
LSTM
Here are a few LSTM diagrams you have probably seen many times (the following five images are taken from the web).
I really did not want to redraw this one; it is too complicated, and structurally it is the same as the RNN. So the plan was to only mark the key points and the dimensions, this time following how you would actually program it with batched training. That was the plan, but I still ended up spending two hours drawing a figure anyway:
In the figure, $n_i$ denotes the number of hidden units of the $i$-th LSTM cell, and $\omega_j^i$ denotes the weight matrix at the $j$-th position (position meaning "forget gate, output gate, etc.") of the $i$-th LSTM cell (the $i$-th layer). A single LSTM cell has 4 weight matrices in total, each of shape [input dim of the current LSTM cell + current number of hidden units, current number of hidden units]; $b_j^i$ denotes the bias at the $j$-th position of the $i$-th LSTM cell. The input $x_t$ has two dimensions: the first dimension $s$ is batch_size, and $m$ is the input dimension. In practice the input tensor actually has three dimensions, (batch_size, seq_length, input_dim), but each LSTM cell only receives a (batch_size, input_dim) slice at each time step $t$, seq_length times in total.
One more remark: $h_{t}^1$, the output of the first cell, goes to two places. One copy is passed to the next LSTM cell at the current time step as its input; the other is fed back to the same cell at the next time step (as the hidden-state input used to compute $h_{t+1}^1$). $C_t^1$, by contrast, is only fed back to the same cell at the next time step (used to compute $C_{t+1}^1$).
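To make the shapes concrete, here is a small illustrative sketch (the sizes are made up) of how one gate's weight matrix acts on the concatenation of the per-step input slice and the previous hidden state:
import numpy as np

batch_size, seq_length, input_dim, hidden = 32, 10, 8, 20   # illustrative sizes
x = np.random.randn(batch_size, seq_length, input_dim)      # full input tensor
h_prev = np.zeros((batch_size, hidden))                      # h_{t-1} of this layer

x_t = x[:, 0, :]                                   # (batch_size, input_dim) slice at t = 0
concat = np.concatenate([x_t, h_prev], axis=1)     # (batch_size, input_dim + hidden)
W_f = np.random.randn(input_dim + hidden, hidden)  # one of the 4 gate weight matrices
b_f = np.zeros(hidden)
f_t = 1.0 / (1.0 + np.exp(-(concat @ W_f + b_f)))  # forget gate output, (batch_size, hidden)
print(f_t.shape)                                   # (32, 20)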
Programming notes
Implementation with the TensorFlow API
First, reference link 1. The code in this reference has a small problem: it swaps seq_length (the number of recurrence steps) and input_dim (the input dimension).
Next, reference links 2 and 3, which explain the tf.nn.dynamic_rnn() function in detail.
Finally, reference link 4, which explains the correct way to use tf.nn.rnn_cell.MultiRNNCell().
With that, the LSTM that used to confuse me so much is covered. Not so hard, right? Below are the things to watch out for when programming:
###################### Build a multi-layer LSTM network ################
# The first LSTM cell has 20 hidden units, the second LSTM cell has 30 hidden units
hidden_size = [20, 30]
# Stack the LSTM layers
cell = tf.nn.rnn_cell.MultiRNNCell([tf.nn.rnn_cell.BasicLSTMCell(i) for i in hidden_size])
# Use the TensorFlow API to unroll the stacked LSTM cells into an RNN and compute the forward pass
# (X is the input placeholder of shape (batch_size, seq_length, input_dim))
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)
# Notes on dynamic_rnn()
outputs, states = tf.nn.dynamic_rnn(
    cell,
    inputs,
    sequence_length=None,
    initial_state=None,
    dtype=None,
    parallel_iterations=None,
    swap_memory=False,
    time_major=False,
    scope=None
)
# cell -- the recurrent cell; can be an LSTM, vanilla RNN, or GRU cell
# inputs -- the training data, shape (batch_size, seq_length, input_dim)
# time_major -- defaults to False; controls the layout of the inputs and outputs tensors. If True, they
#   should be (and will be) [max_time, batch_size, depth]; if False, [batch_size, max_time, depth]
#   (depth is input_dim for inputs and hidden_size for outputs). In other words, time_major=True means
#   the first dimension of the input/output tensors is max_time, otherwise it is batch_size.
# outputs -- the output of the last RNN (LSTM/GRU) layer at every time step, i.e. h in the figure;
#   the shape is determined by time_major
# states -- the final state, i.e. the state produced at the last step of the sequence. In general states
#   has shape [batch_size, cell.output_size], but when the cell is a BasicLSTMCell the state has shape
#   [2, batch_size, cell.output_size], where the 2 corresponds to the LSTM's cell state and hidden state
#   (see reference link 2 for details)
# Take the output of the last time step and connect it to a fully-connected layer; note that outputs contains h, not C
outputs = outputs[:, -1, :]
# Connect the output to a fully-connected layer to make predictions, i.e. add one more fully-connected layer on top of the LSTM output
predictions = tf.contrib.layers.fully_connected(outputs, 1, activation_fn=None)
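As a cross-check, here is a minimal sketch that continues the snippet above and assumes every sequence in the batch has the full length. For a MultiRNNCell built from BasicLSTMCells, states is a tuple of LSTMStateTuple(c, h), one per layer:
# states[-1] is the LSTMStateTuple of the top LSTM layer
last_h = states[-1].h   # (batch_size, hidden_size[-1]); for full-length sequences this equals the last-step slice taken above
last_c = states[-1].c   # the final cell state C, which is not contained in outputs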
From here, once you have the predictions, designing the loss function is up to you.
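For example, a minimal sketch for a regression target, assuming a label placeholder y of shape (batch_size, 1) that is not defined in the snippet above:
y = tf.placeholder(tf.float32, [None, 1])           # hypothetical target placeholder
loss = tf.reduce_mean(tf.square(predictions - y))   # mean squared error
train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)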
One last remark about how to use tf.nn.rnn_cell.MultiRNNCell(). I have indeed seen people write it the following way; according to reference 4 this is problematic, because the layers end up sharing their weights, and the results will presumably not be as good:
# Not recommended: every LSTM layer has the same number of hidden units, and the weights are shared
basic_cell = tf.nn.rnn_cell.BasicLSTMCell(rnn_unit)
multi_cell = tf.nn.rnn_cell.MultiRNNCell([basic_cell]*layer_num)
The TensorFlow documentation recommends writing it like this:
# num_units holds the number of hidden units of each LSTM cell
num_units = [128, 64]
cells = [tf.nn.rnn_cell.BasicLSTMCell(num_units=n) for n in num_units]
stacked_rnn_cell = tf.nn.rnn_cell.MultiRNNCell(cells)
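To confirm that each layer gets its own kernel and bias in this form, you can list the trainable variables after building the graph; a minimal sketch (the placeholder shape is illustrative):
import tensorflow as tf

num_units = [128, 64]
cells = [tf.nn.rnn_cell.BasicLSTMCell(num_units=n) for n in num_units]
stacked_rnn_cell = tf.nn.rnn_cell.MultiRNNCell(cells)

x = tf.placeholder(tf.float32, [None, 10, 8])   # (batch_size, seq_length, input_dim)
outputs, states = tf.nn.dynamic_rnn(stacked_rnn_cell, x, dtype=tf.float32)

# Expect one kernel/bias pair per layer, e.g. .../cell_0/basic_lstm_cell/kernel and .../cell_1/...
for v in tf.trainable_variables():
    print(v.name, v.shape)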
Manual implementation
The implementation below builds an LSTM without calling the TensorFlow LSTM API directly. Note that at time_step=0 people usually feed zero matrices (i.e. H and C are initialized at t0 as non-trainable all-zero matrices), whereas here I use trainable weight matrices instead (the red part in the figure). In general, you would simply use non-trainable zero matrices as the initial state.
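For comparison, the usual zero initialization looks like the sketch below (X is assumed to be the (batch_size, seq_length, input_dim) input placeholder; the layer sizes mirror self.width from the code below). With tf.nn.dynamic_rnn you also get this behaviour by default when you pass only dtype:
cell = tf.nn.rnn_cell.MultiRNNCell(
    [tf.nn.rnn_cell.BasicLSTMCell(n) for n in [32, 32, 64]])
batch_size = tf.shape(X)[0]
init_state = cell.zero_state(batch_size, tf.float32)   # non-trainable all-zero h0 and c0 for every layer
outputs, states = tf.nn.dynamic_rnn(cell, X, initial_state=init_state)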
The code below is based on:
TF: 1.13.1
python: 3.6.8
Platform: Windows10 + Anaconda + Pycharm
# Author: 烟酒僧
# Date: 2020-03-29
# Function: Realize LSTM NN through basic TF API
import tensorflow as tf
import numpy as np


class LSTM(object):
    def __init__(self, input_dim):
        # input_dim is the dimension of inputs
        self.input_dim = input_dim
        # layer widths: input_dim, then the hidden sizes of the three LSTM layers
        self.width = [input_dim, 32, 32, 64]
        # time step of LSTM
        self.time_step = 30
        # containers for all weights/biases, filled by lstm_init()
        self.weights = dict()
        self.biases = dict()
        self.lstm_init()
        self.output = self.encoder_lstm()
    def lstm_init(self):
        '''
        initialize all weights
        :return: None
        '''
        self.x = tf.placeholder(tf.float32, [self.time_step, None, self.input_dim])
        widthes = self.width
        weights = dict()
        biases = dict()
        # weights for h0 and c0
        for i in np.arange(len(widthes) - 1):
            weights['WEH%d' % (i + 1)] = self.weight_variable([widthes[i], widthes[i + 1]], var_name='WEH%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
            weights['WEC%d' % (i + 1)] = self.weight_variable([widthes[i], widthes[i + 1]], var_name='WEC%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
        # weights for LSTM
        for i in np.arange(len(widthes) - 1):
            weights['LSTMWf%d' % (i + 1)] = self.weight_variable([widthes[i] + widthes[i + 1], widthes[i + 1]], var_name='LSTMWf%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
            weights['LSTMWi%d' % (i + 1)] = self.weight_variable([widthes[i] + widthes[i + 1], widthes[i + 1]], var_name='LSTMWi%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
            weights['LSTMWc%d' % (i + 1)] = self.weight_variable([widthes[i] + widthes[i + 1], widthes[i + 1]], var_name='LSTMWc%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
            weights['LSTMWo%d' % (i + 1)] = self.weight_variable([widthes[i] + widthes[i + 1], widthes[i + 1]], var_name='LSTMWo%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
            biases['LSTMbf%d' % (i + 1)] = self.bias_variable([widthes[i + 1]], var_name='LSTMbf%d' % (i + 1))
            biases['LSTMbi%d' % (i + 1)] = self.bias_variable([widthes[i + 1]], var_name='LSTMbi%d' % (i + 1))
            biases['LSTMbc%d' % (i + 1)] = self.bias_variable([widthes[i + 1]], var_name='LSTMbc%d' % (i + 1))
            biases['LSTMbo%d' % (i + 1)] = self.bias_variable([widthes[i + 1]], var_name='LSTMbo%d' % (i + 1))
        self.weights.update(weights)
        self.biases.update(biases)
    def encoder_lstm(self):
        # self.x has shape (time_step, None, input_dim)
        # h0 and c0 are built from the first input slice through trainable weights (see the note above)
        input_h = tf.squeeze(self.x[0, :, :])
        input_c = tf.squeeze(self.x[0, :, :])
        widthes = self.width
        num_lstm = len(widthes) - 1
        h, c = [], []
        for i in np.arange(num_lstm):
            input_h = tf.nn.tanh(tf.matmul(input_h, self.weights['WEH%d' % (i + 1)]))
            h.append(input_h)
            input_c = tf.nn.tanh(tf.matmul(input_c, self.weights['WEC%d' % (i + 1)]))
            c.append(input_c)
        output = []
        for i in np.arange(self.time_step):
            x = tf.squeeze(self.x[i, :, :])
            # multi-layer LSTM: advance every layer by one time step
            h, c = self.lstm_one_shift(x, h, c, num_lstm)
            output.append(h[-1])
        return output
    def lstm_one_shift(self, x, h, c, num_lstm):
        '''
        forward one time step for the multi-layer LSTM
        :param x: the input at this time step (the raw input for the first layer; for deeper layers it is the hidden output of the previous layer)
        :param h: list of hidden states of each LSTM layer from the previous time step
        :param c: list of cell states of each LSTM layer from the previous time step
        :param num_lstm: the number of LSTM layers
        :return: the updated lists h and c
        '''
        for j in np.arange(num_lstm):
            temp_h, temp_c = self.lstm_unit(x, h[j], c[j], j + 1)
            h[j] = temp_h
            c[j] = temp_c
            # the hidden output of this layer becomes the input of the next layer
            x = h[j]
        return h, c
    def lstm_unit(self, x, h, c, k):
        '''
        Compute a single LSTM cell for one time step. n2 denotes the hidden size of the current layer
        :param x: the input, shape = (b, m), where b is the batch size and m is the input dim (or the previous layer's hidden size)
        :param h: the hidden state of this layer from the previous time step, shape = (b, n2)
        :param c: the cell state of this layer from the previous time step, shape = (b, n2)
        :param k: use the weights and biases of the k-th LSTM layer
        :return: h = (b, n2), c = (b, n2)
        '''
        input = tf.concat([h, x], axis=1)
        f = tf.nn.sigmoid(tf.nn.xw_plus_b(input, self.weights['LSTMWf%d' % k], self.biases['LSTMbf%d' % k]))  # forget gate
        i = tf.nn.sigmoid(tf.nn.xw_plus_b(input, self.weights['LSTMWi%d' % k], self.biases['LSTMbi%d' % k]))  # input gate
        C = tf.nn.tanh(tf.nn.xw_plus_b(input, self.weights['LSTMWc%d' % k], self.biases['LSTMbc%d' % k]))     # candidate cell state
        c = tf.multiply(f, c) + tf.multiply(i, C)                                                              # new cell state
        O = tf.nn.sigmoid(tf.nn.xw_plus_b(input, self.weights['LSTMWo%d' % k], self.biases['LSTMbo%d' % k]))  # output gate
        h = tf.multiply(O, tf.nn.tanh(c))  # h_t = o_t * tanh(c_t): uses the updated cell state c, not the candidate C
        return h, c
    def weight_variable(self, shape, var_name, distribution='tn', scale=0.1, first_guess=0):
        """Create a variable for a weight matrix.
        Arguments:
            shape -- array giving shape of output weight variable
            var_name -- string naming weight variable
            distribution -- string for which distribution to use for random initialization (default 'tn')
            scale -- (for tn distribution): standard deviation of normal distribution before truncation (default 0.1)
            first_guess -- (for tn distribution): array of first guess for weight matrix, added to tn dist. (default 0)
        Returns:
            a TensorFlow variable for a weight matrix
        Raises ValueError if distribution is a filename but shape of data in file does not match input shape
        """
        if distribution == 'tn':
            initial = tf.truncated_normal(shape, stddev=scale, dtype=tf.float32)
        elif distribution == 'xavier':
            scale = 4 * np.sqrt(6.0 / (shape[0] + shape[1]))
            initial = tf.random_uniform(shape, minval=-scale, maxval=scale, dtype=tf.float32)
        elif distribution == 'dl':
            # see page 295 of Goodfellow et al's DL book
            # divide by sqrt of m, where m is number of inputs
            scale = 1.0 / np.sqrt(shape[0])
            initial = tf.random_uniform(shape, minval=-scale, maxval=scale, dtype=tf.float32)
        elif distribution == 'he':
            # from He, et al. ICCV 2015 (referenced in Andrew Ng's class)
            # divide by m, where m is number of inputs
            scale = np.sqrt(2.0 / shape[0])
            initial = tf.random_normal(shape, mean=0, stddev=scale, dtype=tf.float32)
        elif distribution == 'glorot_bengio':
            # see page 295 of Goodfellow et al's DL book
            scale = np.sqrt(6.0 / (shape[0] + shape[1]))
            initial = tf.random_uniform(shape, minval=-scale, maxval=scale, dtype=tf.float32)
        else:
            initial = np.loadtxt(distribution, delimiter=',', dtype=np.float32)
            if (initial.shape[0] != shape[0]) or (initial.shape[1] != shape[1]):
                raise ValueError(
                    'Initialization for %s is not correct shape. Expecting (%d,%d), but find (%d,%d) in %s.' % (
                        var_name, shape[0], shape[1], initial.shape[0], initial.shape[1], distribution))
        return tf.Variable(initial, name=var_name)
    def bias_variable(self, shape, var_name, distribution=''):
        """Create a variable for a bias vector.
        Arguments:
            shape -- array giving shape of output bias variable
            var_name -- string naming bias variable
            distribution -- string for which distribution to use for random initialization (file name) (default '')
        Returns:
            a TensorFlow variable for a bias vector
        """
        if distribution:
            initial = np.genfromtxt(distribution, delimiter=',', dtype=np.float32)
        else:
            initial = tf.constant(0.0, shape=shape, dtype=tf.float32)
        return tf.Variable(initial, name=var_name)
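A hypothetical usage sketch (the random data, batch size, and session boilerplate below are mine, not part of the original post), just to show the expected shapes:
if __name__ == '__main__':
    model = LSTM(input_dim=8)                      # builds the placeholder self.x and all weights
    data = np.random.randn(model.time_step, 16, 8).astype(np.float32)   # (time_step, batch_size, input_dim)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        out = sess.run(model.output, feed_dict={model.x: data})
        print(len(out), out[0].shape)              # time_step outputs, each of shape (16, 64)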
There is actually another form, which I have seen people use (there is code for it on GitHub as well): instead of calling the TensorFlow API directly, the weight matrices are defined by hand and the computation is written out with matrix multiplications.
The network structure is the same. The difference from the version above is that this form does not simply feed the previous hidden state in through the concatenation (see the figure above); instead, the hidden state from time $t-1$ is connected to time $t$ through its own weight matrices, so each layer (each LSTM cell) effectively has 4 extra weight matrices. Note that the two forms are mathematically equivalent: multiplying the concatenation $[h_{t-1}, x_t]$ by one stacked matrix gives the same result as $W_{h}h_{t-1} + W_{x}x_t$ with two separate matrices. From my debugging, if you build the LSTM directly with TensorFlow as in my code above, TensorFlow does not use the formulation below:
\[
\begin{aligned}
\text{Input Gate: } & i_t=\sigma\left(W_{xi}x_t+W_{hi}h_{t-1}+b_i\right)\\
\text{Forget Gate: } & f_t=\sigma\left(W_{xf}x_t+W_{hf}h_{t-1}+b_f\right)\\
\text{Output Gate: } & o_t=\sigma\left(W_{xo}x_t+W_{ho}h_{t-1}+b_o\right)\\
\text{Input Modulation Gate: } & g_t=\tanh\left(W_{xc}x_t+W_{hc}h_{t-1}+b_c\right)\\
& c_t=f_t\otimes c_{t-1}+i_t\otimes g_t\\
& h_t=o_t\otimes \tanh\left(c_t\right)
\end{aligned}
\]
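For completeness, a minimal sketch of one LSTM step in that separate-weights form (all names here are illustrative, not from the original post or from TensorFlow's API):
def lstm_step_separate(x_t, h_prev, c_prev, Wx, Wh, b):
    # Wx: dict of (input_dim, hidden) matrices, Wh: dict of (hidden, hidden) matrices,
    # b: dict of (hidden,) biases, each keyed by 'i', 'f', 'o', 'c'
    i = tf.sigmoid(tf.matmul(x_t, Wx['i']) + tf.matmul(h_prev, Wh['i']) + b['i'])   # input gate
    f = tf.sigmoid(tf.matmul(x_t, Wx['f']) + tf.matmul(h_prev, Wh['f']) + b['f'])   # forget gate
    o = tf.sigmoid(tf.matmul(x_t, Wx['o']) + tf.matmul(h_prev, Wh['o']) + b['o'])   # output gate
    g = tf.tanh(tf.matmul(x_t, Wx['c']) + tf.matmul(h_prev, Wh['c']) + b['c'])      # input modulation gate
    c_t = f * c_prev + i * g
    h_t = o * tf.tanh(c_t)
    return h_t, c_t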
For more content, you can visit my Zhihu homepage.