LSTM Explained, with a TensorFlow Implementation

A multi-layer LSTM implemented both with the TensorFlow API and by hand

Posted by XYQ on March 29, 2020

Preface

First, a link to the standard LSTM walkthrough, which contains the diagrams you are probably looking for.

This post is a set of notes from my last two days of studying LSTMs. The goal is to get the inputs, outputs, and network construction straight, including code that builds a multi-layer LSTM network with TensorFlow. A quick search turns up countless LSTM explanations, but they never felt clear to me, perhaps because I only skimmed them, so I am writing down the parts I now believe I understand, in the hope that it helps anyone else still working through this. As usual, the order is RNN first, then LSTM.

The meaning of the symbols used in the figures below (figure taken from the web):

RNN

First, the figure everyone posts (taken from the web). I will mark every figure that comes from the web; if there is any infringement, contact me and I will remove it.

classic RNN

In my experience, people like me without a formal background tend to jump straight to LSTMs. In fact, once you understand the classic RNN, the LSTM works exactly the same way. Below I draw an easier-to-follow figure based on it (assuming you already understand ordinary feed-forward networks):

This figure is drawn from my own understanding; any resemblance to other figures is coincidental. The RNN in this post uses only two layers, and I have labeled every dimension: $\omega_i$ denotes the weight of the $i$-th RNN layer, and $h_{t}^{i}$ denotes the hidden state of the $i$-th RNN layer at time $t$. I think this figure makes the forward propagation of a multi-layer RNN clear. It is worth noting that an RNN cell is just a fully connected layer: the two layers above are two RNN cells, and more layers simply stack in the same way. The weights are shared across time, so no matter how many time steps $t$ you unroll, the two RNN cells above still contain only two weight matrices. A minimal sketch of this unrolled computation follows.
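To make the weight sharing concrete, here is a small NumPy sketch (my own addition, not from the original figure) of a two-layer vanilla RNN unrolled over time; the same two weight matrices W1 and W2 are reused at every time step:

import numpy as np

batch, seq_len, input_dim, n1, n2 = 4, 5, 3, 8, 6
x = np.random.randn(seq_len, batch, input_dim)

W1 = np.random.randn(input_dim + n1, n1) * 0.1   # layer-1 weights, shared across all t
W2 = np.random.randn(n1 + n2, n2) * 0.1          # layer-2 weights, shared across all t
h1 = np.zeros((batch, n1))                       # initial hidden states
h2 = np.zeros((batch, n2))

for t in range(seq_len):
    # each RNN cell is just a fully connected layer applied to [input, previous hidden state]
    h1 = np.tanh(np.concatenate([x[t], h1], axis=1) @ W1)
    h2 = np.tanh(np.concatenate([h1, h2], axis=1) @ W2)
# h2 is the top-layer hidden state at the last time step, shape (batch, n2)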

LSTM

Here are a few of the LSTM figures everyone has seen (the following five images come from the web).

Honestly I did not want to draw this one; it is too complex and works just like the RNN, so I only pick out the key points and label the dimensions, this time from a programming point of view with batched training. At least that was the plan, but I still ended up spending two hours on the figure, and it was exhausting:

In the figure, $n_i$ is the number of hidden units of the $i$-th LSTM cell, and $\omega_j^i$ is the weight of the $j$-th position (a "position" means the forget gate, output gate, and so on) in the $i$-th LSTM cell (the $i$-th layer). One LSTM cell holds four weight matrices in total, each with shape [input dimension of the current cell + number of hidden units of the current cell, number of hidden units of the current cell]; $b_j^i$ is the bias of the $j$-th position in the $i$-th LSTM cell. The input $x_t$ has two dimensions: the first dimension $s$ is batch_size and $m$ is the input dimension. In practice the input tensor has three dimensions, (batch_size, seq_length, input_dim), but each LSTM cell only receives a (batch_size, input_dim) slice at each time step $t$, and it receives seq_length such slices in total.

One more point: $h_{t}^1$, the output of the first cell, goes to two places: it is passed to the next LSTM cell at the current time step as its input, and it is fed back into the same cell at the next time step, where it is used to compute $h_{t+1}^1$; $C_t^1$ is only fed back to the same cell at the next time step, where it is used to compute $C_{t+1}^1$.
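To keep the shapes above straight, here is a tiny helper (purely illustrative, not part of any TensorFlow API) that returns the four gate-weight shapes of a layer whose input dimension is m and whose hidden size is n:

def lstm_gate_shapes(m, n):
    # every gate (forget, input, candidate, output) uses a weight of shape
    # [input dim + hidden size, hidden size] and a bias of shape [hidden size]
    w_shape = (m + n, n)
    b_shape = (n,)
    return {gate: (w_shape, b_shape) for gate in ('f', 'i', 'c', 'o')}

print(lstm_gate_shapes(10, 20))  # every gate: weight (30, 20), bias (20,)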

Programming Notes

Implementation with the TensorFlow API

First, reference link 1. The code in this reference has a small problem: it swaps seq_length (the number of recurrence steps) and input_dim (the input dimension).

Next, reference links 2 and 3, which explain the tf.nn.dynamic_rnn() function in detail.

Finally, reference link 4, which explains the correct way to use tf.nn.rnn_cell.MultiRNNCell().

That covers the LSTM that once confused me so much. Not so hard, right? Below are the points to watch out for when programming:

###################### Build a multi-layer LSTM network ################
# the first LSTM cell has 20 hidden units, the second LSTM cell has 30
hidden_size = [20, 30]
# build the stacked (multi-layer) LSTM structure
cell = tf.nn.rnn_cell.MultiRNNCell([tf.nn.rnn_cell.BasicLSTMCell(i) for i in hidden_size])
# connect the stacked LSTM into a recurrent network with the TensorFlow API and compute its forward pass;
# X is the input placeholder with shape (batch_size, seq_length, input_dim), defined by your data
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)
# points to note about dynamic_rnn()
outputs, states = tf.nn.dynamic_rnn(
    cell,
    inputs,
    sequence_length=None,
    initial_state=None,
    dtype=None,
    parallel_iterations=None,
    swap_memory=False,
    time_major=False,
    scope=None
)
# cell -- the recurrent cell; it can be an LSTM, vanilla RNN, or GRU cell
# inputs -- the training input, shape: (batch_size, seq_length, input_dim)
# time_major -- defaults to False; controls the shape layout of the inputs and outputs tensors.
#               If True, they must be (and will be) [max_time, batch_size, depth]; if False,
#               [batch_size, max_time, depth]. In other words, time_major=True means the first
#               dimension is max_time, otherwise it is batch_size.
# outputs -- the output of the last (top) RNN/LSTM/GRU layer at every time step, i.e. h in the figure;
#            its shape is determined by time_major
# states -- the final state, i.e. the state emitted after the last step of the sequence. In general states
#           has shape [batch_size, cell.output_size], but when the cell is a BasicLSTMCell its shape is
#           [2, batch_size, cell.output_size], where the 2 corresponds to the LSTM cell state and hidden
#           state (see reference link 2 for details)
# take the output of the last time step and feed it to a fully connected layer; note that outputs is h, not C
outputs = outputs[:, -1, :]
# connect the output to a fully connected layer for prediction, i.e. add a dense layer on top of the LSTM output
predictions = tf.contrib.layers.fully_connected(outputs, 1, activation_fn=None)

From this point on you have your predictions; how you design the loss function is up to you.
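For example, for a simple regression task a loss could look roughly like this (just a sketch; y is a hypothetical target placeholder and the learning rate is arbitrary):

# hypothetical target placeholder matching the shape of predictions
y = tf.placeholder(tf.float32, [None, 1])
loss = tf.losses.mean_squared_error(labels=y, predictions=predictions)
train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)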

One last note, on how to use tf.nn.rnn_cell.MultiRNNCell(): I have indeed seen people write it the way shown below, but according to reference 4 this is problematic, because the layers then share their weights, which presumably hurts performance.

# Not recommended: every LSTM layer then has the same number of hidden units, and the weights are shared
basic_cell = tf.nn.rnn_cell.BasicLSTMCell(rnn_unit)
multi_cell = tf.nn.rnn_cell.MultiRNNCell([basic_cell]*layer_num)

The TensorFlow documentation recommends writing it like this:

# num_units holds the hidden sizes of the individual LSTM cells
num_units = [128, 64]
cells = [tf.nn.rnn_cell.BasicLSTMCell(num_units=n) for n in num_units]
stacked_rnn_cell = tf.nn.rnn_cell.MultiRNNCell(cells)
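One more detail worth knowing: when the cell passed to tf.nn.dynamic_rnn() is a MultiRNNCell built from LSTM cells, the returned states is a tuple with one LSTMStateTuple per layer. A sketch (assuming the stacked_rnn_cell above and an illustrative input placeholder) of how to read out the final state of the top layer:

X = tf.placeholder(tf.float32, [None, None, 10])   # (batch_size, seq_length, input_dim); 10 chosen for illustration
outputs, states = tf.nn.dynamic_rnn(stacked_rnn_cell, X, dtype=tf.float32)
last_h = states[-1].h   # (batch_size, 64): hidden state h of the top layer at the last time step
last_c = states[-1].c   # (batch_size, 64): cell state C of the top layer
# without a sequence_length argument, outputs[:, -1, :] equals last_h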

Manual Implementation

The implementation below builds the LSTM without using the TensorFlow LSTM API directly. Note that at time_step = 0 people usually feed zero matrices (that is, H and C are initialized at t0 as non-trainable all-zero matrices), whereas I use some trainable weight matrices here instead, i.e. the red part of the figure; the usual choice is simply the non-trainable zero matrices, and a sketch of that common choice follows.
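For reference, the usual non-trainable zero initialization would look roughly like this (a sketch; batch_size and hidden_size stand in for your actual values):

import tensorflow as tf

batch_size, hidden_size = 8, 32                # illustrative values only
h0 = tf.zeros([batch_size, hidden_size])       # non-trainable initial hidden state H at t = 0
c0 = tf.zeros([batch_size, hidden_size])       # non-trainable initial cell state C at t = 0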

下面的代码基于:

TF: 1.13.1

python: 3.6.8

Platform: Windows10 + Anaconda + Pycharm

# Author: 烟酒僧
# Date: 2020-03-29
# Function: Realize LSTM NN through basic TF API
import tensorflow as tf
import numpy as np
class LSTM(object):
    def __init__(self, input_dim):
        # input_dim is the dimension of inputs
        self.input_dim = input_dim
        # layer widths: input dim followed by the hidden sizes of the three stacked LSTM layers
        self.width = [input_dim, 32, 32, 64]
        # time step of LSTM
        self.time_step = 30
        self.lstm_init()
        # keep the per-step outputs of the top layer so they can be used (e.g. fed to a loss) later
        self.output = self.encoder_lstm()

    def lstm_init(self):
        '''
        initialize all weights
        :return: None
        '''
        self.x = tf.placeholder(tf.float32, [self.time_step, None, self.input_dim])
        widthes = self.width
        weights = dict()
        biases = dict()
        # weights for h0 and c0
        for i in np.arange(len(widthes) - 1):
            weights['WEH%d' % (i + 1)] = self.weight_variable([widthes[i], widthes[i + 1]], var_name='WEH%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
            weights['WEC%d' % (i + 1)] = self.weight_variable([widthes[i], widthes[i + 1]], var_name='WEC%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
        # weights for LSTM
        for i in np.arange(len(widthes) - 1):
            weights['LSTMWf%d' % (i + 1)] = self.weight_variable([widthes[i] + widthes[i + 1], widthes[i + 1]], var_name='LSTMWf%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
            weights['LSTMWi%d' % (i + 1)] = self.weight_variable([widthes[i] + widthes[i + 1], widthes[i + 1]], var_name='LSTMWi%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
            weights['LSTMWc%d' % (i + 1)] = self.weight_variable([widthes[i] + widthes[i + 1], widthes[i + 1]], var_name='LSTMWc%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
            weights['LSTMWo%d' % (i + 1)] = self.weight_variable([widthes[i] + widthes[i + 1], widthes[i + 1]], var_name='LSTMWo%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
            biases['LSTMbf%d' % (i + 1)] = self.bias_variable([widthes[i + 1]], var_name='LSTMbf%d' % (i + 1))
            biases['LSTMbi%d' % (i + 1)] = self.bias_variable([widthes[i + 1]], var_name='LSTMbi%d' % (i + 1))
            biases['LSTMbc%d' % (i + 1)] = self.bias_variable([widthes[i + 1]], var_name='LSTMbc%d' % (i + 1))
            biases['LSTMbo%d' % (i + 1)] = self.bias_variable([widthes[i + 1]], var_name='LSTMbo%d' % (i + 1))
        # store the created variables on the instance; they are used by encoder_lstm / lstm_unit
        self.weights = weights
        self.biases = biases

    def encoder_lstm(self):
        # build the initial h/c of every layer by projecting x at t = 0 through the trainable
        # WEH/WEC matrices, then unroll the stacked LSTM over all time steps
        input_h = tf.squeeze(self.x[0, :, :])
        input_c = tf.squeeze(self.x[0, :, :])
        widthes = self.width
        num_lstm = len(widthes) - 1
        h, c = [], []
        for i in np.arange(num_lstm):
            input_h = tf.nn.tanh(tf.matmul(input_h, self.weights['WEH%d' % (i + 1)]))
            h.append(input_h)
            input_c = tf.nn.tanh(tf.matmul(input_c, self.weights['WEC%d' % (i + 1)]))
            c.append(input_c)
        output = []
        for i in np.arange(self.time_step):
            x = tf.squeeze(self.x[i, :, :])
            # multi-LSTM
            h, c = self.lstm_one_shift(x, h, c, num_lstm)
            output.append(h[-1])
        return output

    def lstm_one_shift(self, x, h, c, num_lstm):
        '''
        forward one time step for the stacked (multi-layer) LSTM
        :param x: the input at this time step (for layer 1 the raw input; inside the loop it becomes the hidden state of the layer below)
        :param h: list of the hidden states of every layer from the previous time step
        :param c: list of the cell states of every layer from the previous time step
        :param num_lstm: the number of stacked LSTM layers
        :return: the updated lists h and c
        '''
        for j in np.arange(num_lstm):
            temp_h, temp_c = self.lstm_unit(x, h[j], c[j], j + 1)
            h[j] = temp_h
            c[j] = temp_c
            x = h[j]
        return h, c

    def lstm_unit(self, x, h, c, k):
        '''
        Compute a single LSTM layer for one time step. m denotes the dimension of this layer's input and n2 the size of its hidden layer.
        :param x: the input, shape = (b, m), where b is the batch size and m is the input dim (or the hidden size of the layer below)
        :param h: the hidden state of this layer at the previous time step, shape = (b, n2)
        :param c: the cell state of this layer at the previous time step, shape = (b, n2)
        :param k: use the weights and biases of the k-th LSTM layer
        :return: h1 = (b, n2), c1 = (b, n2)
        '''
        input = tf.concat([h, x], axis=1)
        f = tf.nn.sigmoid(tf.nn.xw_plus_b(input, self.weights['LSTMWf%d' % k], self.biases['LSTMbf%d' % k]))
        i = tf.nn.sigmoid(tf.nn.xw_plus_b(input, self.weights['LSTMWi%d' % k], self.biases['LSTMbi%d' % k]))
        C = tf.nn.tanh(tf.nn.xw_plus_b(input, self.weights['LSTMWc%d' % k], self.biases['LSTMbc%d' % k]))
        c = tf.multiply(f, c) + tf.multiply(i, C)
        O = tf.nn.sigmoid(tf.nn.xw_plus_b(input, self.weights['LSTMWo%d' % k], self.biases['LSTMbo%d' % k]))
        # the new hidden state uses tanh of the updated cell state c, not the candidate C
        h = tf.multiply(O, tf.nn.tanh(c))
        return h, c


    def weight_variable(self, shape, var_name, distribution='tn', scale=0.1, first_guess=0):
        """Create a variable for a weight matrix.
        Arguments:
            shape -- array giving shape of output weight variable
            var_name -- string naming weight variable
            distribution -- string for which distribution to use for random initialization (default 'tn')
            scale -- (for tn distribution): standard deviation of normal distribution before truncation (default 0.1)
            first_guess -- (for tn distribution): array of first guess for weight matrix, added to tn dist. (default 0)

        Returns:
            a TensorFlow variable for a weight matrix
        Raises ValueError if distribution is filename but shape of data in file does not match input shape
        """
        if distribution == 'tn':
            initial = tf.truncated_normal(shape, stddev=scale, dtype=tf.float32)
        elif distribution == 'xavier':
            scale = 4 * np.sqrt(6.0 / (shape[0] + shape[1]))
            initial = tf.random_uniform(shape, minval=-scale, maxval=scale, dtype=tf.float32)
        elif distribution == 'dl':
            # see page 295 of Goodfellow et al's DL book
            # divide by sqrt of m, where m is number of inputs
            scale = 1.0 / np.sqrt(shape[0])
            initial = tf.random_uniform(shape, minval=-scale, maxval=scale, dtype=tf.float32)
        elif distribution == 'he':
            # from He, et al. ICCV 2015 (referenced in Andrew Ng's class)
            # divide by m, where m is number of inputs
            scale = np.sqrt(2.0 / shape[0])
            initial = tf.random_normal(shape, mean=0, stddev=scale, dtype=tf.float32)
        elif distribution == 'glorot_bengio':
            # see page 295 of Goodfellow et al's DL book
            scale = np.sqrt(6.0 / (shape[0] + shape[1]))
            initial = tf.random_uniform(shape, minval=-scale, maxval=scale, dtype=tf.float32)
        else:
            initial = np.loadtxt(distribution, delimiter=',', dtype=np.float32)
            if (initial.shape[0] != shape[0]) or (initial.shape[1] != shape[1]):
                raise ValueError(
                    'Initialization for %s is not correct shape. Expecting (%d,%d), but find (%d,%d) in %s.' % (
                        var_name, shape[0], shape[1], initial.shape[0], initial.shape[1], distribution))
        return tf.Variable(initial, name=var_name)

    def bias_variable(self, shape, var_name, distribution=''):
        """Create a variable for a bias vector.
        Arguments:
            shape -- array giving shape of output bias variable
            var_name -- string naming bias variable
            distribution -- string for which distribution to use for random initialization (file name) (default '')
        Returns:
            a TensorFlow variable for a bias vector
        """
        if distribution:
            initial = np.genfromtxt(distribution, delimiter=',', dtype=np.float32)
        else:
            initial = tf.constant(0.0, shape=shape, dtype=tf.float32)
        return tf.Variable(initial, name=var_name)
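A minimal usage sketch of the class above (my own addition, with random data purely as a shape check):

if __name__ == '__main__':
    model = LSTM(input_dim=10)                                          # three stacked LSTM layers with 32, 32, 64 hidden units
    feed = np.random.randn(model.time_step, 8, 10).astype(np.float32)   # (time_step, batch, input_dim)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        outs = sess.run(model.output, feed_dict={model.x: feed})        # list of per-step top-layer hidden states
        print(len(outs), outs[-1].shape)                                # 30 (8, 64)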

There is actually another form that I have seen people use, with code available on GitHub as well: instead of calling the TensorFlow (or similar) APIs directly, the weight arrays are defined by hand and the forward pass is written out with matrix multiplications.

The network structure is the same. The difference from the approach above is that this form does not simply concatenate the input with the previous hidden state (see the figure above); instead, the hidden state from time $t-1$ and the input at time $t$ are each multiplied by their own weight matrices, which amounts to every layer (every LSTM cell) having four additional weight arrays. (Mathematically the two forms are equivalent: multiplying the concatenation $[h_{t-1}, x_t]$ by one stacked matrix gives the same result as $W_{x}x_t + W_{h}h_{t-1}$.) From my debugging, if you build the LSTM directly with TensorFlow as in my code above, TensorFlow does not use the formulation below:

\[
\begin{aligned}
\text{Input gate: } & i_t=\sigma\left( W_{xi}x_t+W_{hi}h_{t-1}+b_i \right)\\
\text{Forget gate: } & f_t=\sigma\left( W_{xf}x_t+W_{hf}h_{t-1}+b_f \right)\\
\text{Output gate: } & o_t=\sigma\left( W_{xo}x_t+W_{ho}h_{t-1}+b_o \right)\\
\text{Input modulation gate: } & g_t=\tanh\left( W_{xc}x_t+W_{hc}h_{t-1}+b_c \right)\\
& c_t=f_t\otimes c_{t-1}+i_t\otimes g_t\\
& h_t=o_t\otimes \tanh\left( c_t \right)
\end{aligned}
\]
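As a quick sanity check (my own sketch, not from the referenced GitHub code), the following shows that the separate-weight form above and the concatenation form used in lstm_unit are the same computation, as long as the stacked matrix is the concatenation of W_h and W_x:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

batch, m, n = 4, 10, 20
x = np.random.randn(batch, m)          # input at time t
h = np.random.randn(batch, n)          # hidden state at time t-1
W_xf = np.random.randn(m, n) * 0.1     # input-to-gate weights
W_hf = np.random.randn(n, n) * 0.1     # hidden-to-gate weights
b_f = np.zeros(n)

f_separate = sigmoid(x @ W_xf + h @ W_hf + b_f)                 # separate-weight form (forget gate)
W_f = np.concatenate([W_hf, W_xf], axis=0)                      # stacked weight, as used in lstm_unit above
f_concat = sigmoid(np.concatenate([h, x], axis=1) @ W_f + b_f)  # concatenated form
print(np.allclose(f_separate, f_concat))                        # True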

For more content, visit my Zhihu homepage.