
Multi-layer LSTM: implementing it with the TF API and by hand

classic RNN


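I drew this figure from my own understanding; any resemblance to other figures is purely coincidental. The RNN in this article uses only two layers, and I have labeled every dimension. $\omega_i$ denotes the weights of the $i$-th RNN layer, and $h_{t}^{i}$ denotes the hidden state of the $i$-th RNN layer at time $t$ (its width is that layer's number of hidden units). I think the figure makes it clear how a multi-layer RNN propagates forward. Note that one RNN cell is essentially one fully connected layer, so the two layers above are two RNN cells, and deeper networks simply stack more cells in the same way. The weights are shared across time steps: no matter how many steps $t$ you unroll, the two RNN cells above have only two sets of weights.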
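The forward pass described above can be sketched in NumPy. This is a minimal illustration rather than the code behind the figure: the hidden sizes 6 and 8, the `tanh` activation, the `[h, x]` concatenation order and the helper name `rnn_forward` are all assumptions made for the example.

```python
import numpy as np

def rnn_forward(x, params, h0):
    """Run a stacked vanilla RNN over a sequence.

    x      -- array of shape (seq_length, batch_size, input_dim)
    params -- one (W, b) pair per layer, W has shape (in_dim + hidden, hidden)
    h0     -- list of initial hidden states, one (batch_size, hidden) array per layer
    """
    h = list(h0)
    top_outputs = []
    for x_t in x:                               # loop over time steps
        layer_in = x_t
        for i, (W, b) in enumerate(params):     # the same W, b are reused at every t
            h[i] = np.tanh(np.concatenate([h[i], layer_in], axis=1) @ W + b)
            layer_in = h[i]                     # output of layer i feeds layer i + 1
        top_outputs.append(h[-1])
    return np.stack(top_outputs), h

rng = np.random.default_rng(0)
seq_length, batch_size, input_dim = 5, 3, 4
hidden = [6, 8]                                 # assumed hidden sizes of the two layers
dims = [input_dim] + hidden
params = [(rng.normal(scale=0.1, size=(dims[i] + dims[i + 1], dims[i + 1])),
           np.zeros(dims[i + 1])) for i in range(len(hidden))]
h0 = [np.zeros((batch_size, n)) for n in hidden]
x = rng.normal(size=(seq_length, batch_size, input_dim))

outputs, last_h = rnn_forward(x, params, h0)
print(outputs.shape)   # (5, 3, 8): one top-layer hidden state per time step
print(len(params))     # 2: still two weight sets after five time steps
```

However long the sequence is, `params` keeps exactly one `(W, b)` pair per layer, which is the weight sharing the paragraph refers to.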




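In the figure, $n_i$ is the number of hidden units of the $i$-th LSTM cell, and $\omega_j^i$ is the weight at position $j$ of the $i$-th LSTM cell (layer $i$), where "position" means the forget gate, the output gate and so on. Each LSTM cell has four weight arrays in total, all with shape [input dimension of the current LSTM cell + number of hidden units of the current cell, number of hidden units of the current cell]. $b_j^i$ is the bias at position $j$ of the $i$-th LSTM cell. The input $x_t$ has two dimensions: $s$ is the batch_size and $m$ is the input dimension. In actual code the input has three dimensions, (batch_size, seq_length, input_dim), but at each time step t an LSTM cell only receives a (batch_size, input_dim) slice, and it receives such a slice seq_length times in total.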
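To make the shapes concrete, here is a small NumPy sketch of a single LSTM cell stepping through a sequence. It is only an illustration: the sizes are made up, and the helper names `sigmoid` and `lstm_step` and the gate keys `'f'`, `'i'`, `'c'`, `'o'` are assumptions, not names from the figure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W and b are dicts keyed by 'f', 'i', 'c', 'o'."""
    z = np.concatenate([h_prev, x_t], axis=1)   # (batch, hidden + input_dim)
    f = sigmoid(z @ W['f'] + b['f'])            # forget gate
    i = sigmoid(z @ W['i'] + b['i'])            # input gate
    g = np.tanh(z @ W['c'] + b['c'])            # candidate cell state
    o = sigmoid(z @ W['o'] + b['o'])            # output gate
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
batch_size, seq_length, input_dim, hidden = 2, 7, 5, 3
# four weight arrays, each of shape (input_dim + hidden, hidden)
W = {k: rng.normal(scale=0.1, size=(input_dim + hidden, hidden)) for k in 'fico'}
b = {k: np.zeros(hidden) for k in 'fico'}

X = rng.normal(size=(batch_size, seq_length, input_dim))
h = np.zeros((batch_size, hidden))
c = np.zeros((batch_size, hidden))
for t in range(seq_length):
    # the cell sees one (batch_size, input_dim) slice per time step
    h, c = lstm_step(X[:, t, :], h, c, W, b)

n_params = sum(w.size for w in W.values()) + sum(v.size for v in b.values())
print(h.shape, n_params)   # (2, 3) 108
```

The parameter count is 4 x (5 + 3) x 3 for the weights plus 4 x 3 for the biases, and it does not depend on seq_length.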



Implementation with the TensorFlow API


First, here are the references: reference 2 and reference 3. The latter explains the tf.nn.dynamic_rnn() function in detail.



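import tensorflow as tf

###################### Build the multi-layer LSTM network ################
# X is the input tensor with shape (batch_size, seq_length, input_dim)
# The first LSTM cell has 20 hidden units, the second LSTM cell has 30
hidden_size = [20, 30]
# Stack the LSTM cells into a multi-layer structure
cell = tf.nn.rnn_cell.MultiRNNCell([tf.nn.rnn_cell.BasicLSTMCell(i) for i in hidden_size])
# Use the TensorFlow API to unroll the stacked cells into an RNN and compute the forward pass
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)
# Notes on outputs, states = tf.nn.dynamic_rnn(cell, inputs, ...):
# cell -- the recurrent cell; it can be an LSTM, a plain RNN or a GRU cell
# inputs -- the input data, shape (batch_size, seq_length, input_dim)
# time_major -- defaults to False; controls the layout of the inputs and outputs tensors. If True, both are [max_time, batch_size, depth], i.e. the first dimension is max_time. If False, both are [batch_size, max_time, depth], i.e. the first dimension is batch_size.
# outputs -- the output of the last (top) RNN/LSTM/GRU cell at every time step, i.e. h in the figure; its layout is determined by time_major
# states -- the final state, i.e. the state after the last time step of the sequence. In general its shape is [batch_size, cell.state_size]. For a BasicLSTMCell the state is a pair (c, h), each of shape [batch_size, num_units], corresponding to the LSTM cell state and hidden state; with MultiRNNCell, states holds one such pair per layer. (See reference 2 for details.)
# Take the output at the last time step and feed it to a fully connected layer. Note that outputs holds h, not C.
outputs = outputs[:, -1, :]
# Add a fully connected layer on top of the LSTM output to make the prediction.
predictions = tf.contrib.layers.fully_connected(outputs, 1, activation_fn=None)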

At this point you have the predictions; how to design the loss function is up to you.
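As a hedged illustration of the last two steps, the NumPy sketch below mimics the tensor layouts only; it does not call TensorFlow. It shows that taking the last time step is `outputs[:, -1, :]` in the batch-major layout and `outputs[-1]` in the time-major layout, then applies a made-up linear layer and a mean squared error. The MSE loss, the random targets `y` and the dense weights are assumptions chosen for the example, not something the original prescribes.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_length, hidden = 4, 6, 30

# stand-in for the `outputs` tensor with time_major=False: (batch, time, hidden)
outputs = rng.normal(size=(batch_size, seq_length, hidden))
# the same data in the time_major=True layout: (time, batch, hidden)
outputs_tm = np.transpose(outputs, (1, 0, 2))

last = outputs[:, -1, :]        # last time step, batch-major layout
last_tm = outputs_tm[-1]        # last time step, time-major layout
print(np.array_equal(last, last_tm))   # True

# hypothetical linear output layer without activation
W = rng.normal(scale=0.1, size=(hidden, 1))
b = np.zeros(1)
predictions = last @ W + b      # (batch_size, 1)

# one possible loss: mean squared error against made-up targets
y = rng.normal(size=(batch_size, 1))
mse = np.mean((predictions - y) ** 2)
print(predictions.shape)        # (4, 1)
```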


# Not recommended: every LSTM layer gets the same number of hidden units, and the layers share one set of weights
basic_cell = tf.nn.rnn_cell.BasicLSTMCell(rnn_unit)
multi_cell = tf.nn.rnn_cell.MultiRNNCell([basic_cell]*layer_num)

The TensorFlow documentation recommends writing it this way:

# num_units holds the number of hidden units of each LSTM cell
num_units = [128, 64]
cells = [tf.nn.rnn_cell.BasicLSTMCell(num_units=n) for n in num_units]
stacked_rnn_cell = tf.nn.rnn_cell.MultiRNNCell(cells)
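The difference between the two spellings comes down to plain Python list semantics, which can be checked without TensorFlow. In the sketch below, `DummyCell` is a hypothetical stand-in for an RNN cell that only owns a list of weights.

```python
class DummyCell:
    """Hypothetical stand-in for an RNN cell; it only owns a weight list."""
    def __init__(self, num_units):
        self.num_units = num_units
        self.weights = [0.0] * num_units

# list multiplication repeats a reference to ONE cell object
shared = [DummyCell(4)] * 2
# a list comprehension builds a NEW cell object per layer
separate = [DummyCell(n) for n in [4, 4]]

shared[0].weights[0] = 1.0
separate[0].weights[0] = 1.0

print(shared[0] is shared[1])        # True
print(shared[1].weights[0])          # 1.0, "layer 2" changed as well
print(separate[0] is separate[1])    # False
print(separate[1].weights[0])        # 0.0, layer 2 is untouched
```

The list comprehension also lets each layer have its own size, which the multiplied list cannot do.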


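The implementation below builds the LSTM without using the TensorFlow LSTM API directly. Note that an LSTM normally starts from zero matrices at time_step=0, that is, H and C are initialized at t0 as non-trainable all-zero matrices. Here I instead compute the initial states with some trainable weight matrices, which is the red part of the figure.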
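The two ways of producing the initial state can be compared in a small NumPy sketch. It is an illustration under assumptions: the layer widths mirror the `[input_dim, 32, 32, 64]` list used in the code of this article, while the batch size, the input dimension, the uniform initialization and the name `WEH` for the list of matrices are made up for the example. Only the hidden state is shown; the cell state would be handled the same way with its own matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, input_dim = 8, 10
width = [input_dim, 32, 32, 64]          # input size followed by three hidden sizes

# usual choice: non-trainable zero initial state for every layer
zero_h = [np.zeros((batch_size, n)) for n in width[1:]]

# alternative: derive the initial state of each layer from the first input x0
# through a chain of (in a real model, trainable) matrices
x0 = rng.normal(size=(batch_size, input_dim))
WEH = [rng.uniform(-1, 1, size=(width[i], width[i + 1])) / np.sqrt(width[i])
       for i in range(len(width) - 1)]
learned_h, a = [], x0
for W in WEH:
    a = np.tanh(a @ W)                   # output of one matrix feeds the next
    learned_h.append(a)

print([h.shape for h in learned_h])      # [(8, 32), (8, 32), (8, 64)]
```

Both variants give one state per layer with the same shapes, so the rest of the recurrence does not need to change.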


TF: 1.13.1

python: 3.6.8

Platform: Windows10 + Anaconda + Pycharm

# Author: 烟酒僧
# Date: 2020-03-29
# Function: Realize LSTM NN through basic TF API
import tensorflow as tf
import numpy as np
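class LSTM(object):
    def __init__(self, input_dim):
        # input_dim is the dimension of inputs
        self.input_dim = input_dim
        # layer widths: the input dimension followed by the hidden size of each LSTM layer
        self.width = [input_dim, 32, 32, 64]
        # time step of LSTM
        self.time_step = 30
        # create the placeholder and all variables before building the graph
        self.lstm_init()
        self.output = self.encoder_lstm()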

    def lstm_init(self):
        """Create the input placeholder and initialize all weights.

        :return: None
        """
        self.x = tf.placeholder(tf.float32, [self.time_step, None, self.input_dim])
        widthes = self.width
        weights = dict()
        biases = dict()
        # weights for h0 and c0
        for i in np.arange(len(widthes) - 1):
            weights['WEH%d' % (i + 1)] = self.weight_variable([widthes[i], widthes[i + 1]], var_name='WEH%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
            weights['WEC%d' % (i + 1)] = self.weight_variable([widthes[i], widthes[i + 1]], var_name='WEC%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
        # weights for LSTM
        for i in np.arange(len(widthes) - 1):
            weights['LSTMWf%d' % (i + 1)] = self.weight_variable([widthes[i] + widthes[i + 1], widthes[i + 1]], var_name='LSTMWf%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
            weights['LSTMWi%d' % (i + 1)] = self.weight_variable([widthes[i] + widthes[i + 1], widthes[i + 1]], var_name='LSTMWi%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
            weights['LSTMWc%d' % (i + 1)] = self.weight_variable([widthes[i] + widthes[i + 1], widthes[i + 1]], var_name='LSTMWc%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
            weights['LSTMWo%d' % (i + 1)] = self.weight_variable([widthes[i] + widthes[i + 1], widthes[i + 1]], var_name='LSTMWo%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
            biases['LSTMbf%d' % (i + 1)] = self.bias_variable([widthes[i + 1]], var_name='LSTMbf%d' % (i + 1))
            biases['LSTMbi%d' % (i + 1)] = self.bias_variable([widthes[i + 1]], var_name='LSTMbi%d' % (i + 1))
            biases['LSTMbc%d' % (i + 1)] = self.bias_variable([widthes[i + 1]], var_name='LSTMbc%d' % (i + 1))
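            biases['LSTMbo%d' % (i + 1)] = self.bias_variable([widthes[i + 1]], var_name='LSTMbo%d' % (i + 1))
        # keep the dictionaries on the instance so the other methods can look the variables up
        self.weights = weights
        self.biases = biases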

    def encoder_lstm(self):
        # self.x has shape (time_step, batch_size, input_dim);
        # slicing one time step already gives shape (batch_size, input_dim)
        input_h = self.x[0, :, :]
        input_c = self.x[0, :, :]
        widthes = self.width
        num_lstm = len(widthes) - 1
        h, c = [], []
        # compute trainable initial states h0, c0 for every layer from the first input
        for i in np.arange(num_lstm):
            input_h = tf.nn.tanh(tf.matmul(input_h, self.weights['WEH%d' % (i + 1)]))
            input_c = tf.nn.tanh(tf.matmul(input_c, self.weights['WEC%d' % (i + 1)]))
            h.append(input_h)
            c.append(input_c)
        output = []
        for i in np.arange(self.time_step):
            x = self.x[i, :, :]
            # multi-LSTM
            h, c = self.lstm_one_shift(x, h, c, num_lstm)
            # collect the hidden state of the top layer at every time step
            output.append(h[-1])
        return output

    def lstm_one_shift(self, x, h, c, num_lstm):
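        """Forward one time step through the stacked LSTM layers.

        :param x: the input at this time step; for every layer after the first, the input is the hidden state of the previous LSTM layer
        :param h: list holding the hidden state of each LSTM layer at the previous time step
        :param c: list holding the cell state of each LSTM layer at the previous time step
        :param num_lstm: the number of stacked LSTM layers
        :return: the updated lists h and c
        """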
        for j in np.arange(num_lstm):
            temp_h, temp_c = self.lstm_unit(x, h[j], c[j], j + 1)
            h[j] = temp_h
            c[j] = temp_c
            x = h[j]
        return h, c

    def lstm_unit(self, x, h, c, k):
        """Compute a single LSTM cell for one time step.

        n2 denotes the number of hidden units of the current LSTM layer.
        :param x: the input, shape = (b, m), where b is the batch size and m is the dim of the input or of the previous layer's hidden state
        :param h: the hidden state of this LSTM layer at the previous time step, shape = (b, n2)
        :param c: the cell state of this LSTM layer at the previous time step, shape = (b, n2)
        :param k: index of the LSTM layer whose weights and biases are used
        :return: h1 = (b, n2), c1 = (b, n2)
        """
        # concatenate the previous hidden state and the input: shape (b, n2 + m)
        concat = tf.concat([h, x], axis=1)
        # forget gate
        f = tf.nn.sigmoid(tf.nn.xw_plus_b(concat, self.weights['LSTMWf%d' % k], self.biases['LSTMbf%d' % k]))
        # input gate
        i = tf.nn.sigmoid(tf.nn.xw_plus_b(concat, self.weights['LSTMWi%d' % k], self.biases['LSTMbi%d' % k]))
        # candidate cell state
        C = tf.nn.tanh(tf.nn.xw_plus_b(concat, self.weights['LSTMWc%d' % k], self.biases['LSTMbc%d' % k]))
        # new cell state
        c = tf.multiply(f, c) + tf.multiply(i, C)
        # output gate
        O = tf.nn.sigmoid(tf.nn.xw_plus_b(concat, self.weights['LSTMWo%d' % k], self.biases['LSTMbo%d' % k]))
        # the new hidden state uses the updated cell state c, not the candidate C
        h = tf.multiply(O, tf.nn.tanh(c))
        return h, c

    def weight_variable(self, shape, var_name, distribution='tn', scale=0.1, first_guess=0):
        """Create a variable for a weight matrix.

        Arguments:
            shape -- array giving shape of output weight variable
            var_name -- string naming weight variable
            distribution -- string for which distribution to use for random initialization, or the name of a CSV file to load (default 'tn')
            scale -- (for tn distribution): standard deviation of normal distribution before truncation (default 0.1)
            first_guess -- (for tn distribution): array of first guess for weight matrix, added to tn dist. (default 0)

        Returns:
            a TensorFlow variable for a weight matrix

        Raises ValueError if distribution is filename but shape of data in file does not match input shape
        """
        if distribution == 'tn':
            initial = tf.truncated_normal(shape, stddev=scale, dtype=tf.float32) + first_guess
        elif distribution == 'xavier':
            scale = 4 * np.sqrt(6.0 / (shape[0] + shape[1]))
            initial = tf.random_uniform(shape, minval=-scale, maxval=scale, dtype=tf.float32)
        elif distribution == 'dl':
            # see page 295 of Goodfellow et al's DL book
            # divide by sqrt of m, where m is number of inputs
            scale = 1.0 / np.sqrt(shape[0])
            initial = tf.random_uniform(shape, minval=-scale, maxval=scale, dtype=tf.float32)
        elif distribution == 'he':
            # from He, et al. ICCV 2015 (referenced in Andrew Ng's class)
            # divide by m, where m is number of inputs
            scale = np.sqrt(2.0 / shape[0])
            initial = tf.random_normal(shape, mean=0, stddev=scale, dtype=tf.float32)
        elif distribution == 'glorot_bengio':
            # see page 295 of Goodfellow et al's DL book
            scale = np.sqrt(6.0 / (shape[0] + shape[1]))
            initial = tf.random_uniform(shape, minval=-scale, maxval=scale, dtype=tf.float32)
        else:
            # otherwise treat `distribution` as the name of a CSV file holding the initial weights
            initial = np.loadtxt(distribution, delimiter=',', dtype=np.float32)
            if (initial.shape[0] != shape[0]) or (initial.shape[1] != shape[1]):
                raise ValueError(
                    'Initialization for %s is not correct shape. Expecting (%d,%d), but find (%d,%d) in %s.' % (
                        var_name, shape[0], shape[1], initial.shape[0], initial.shape[1], distribution))
        return tf.Variable(initial, name=var_name)

    def bias_variable(self, shape, var_name, distribution=''):
        """Create a variable for a bias vector.

        Arguments:
            shape -- array giving shape of output bias variable
            var_name -- string naming bias variable
            distribution -- name of a CSV file to load the initial bias from; if empty, the bias starts at zero (default '')

        Returns:
            a TensorFlow variable for a bias vector
        """
        if distribution:
            initial = np.genfromtxt(distribution, delimiter=',', dtype=np.float32)
        else:
            initial = tf.constant(0.0, shape=shape, dtype=tf.float32)
        return tf.Variable(initial, name=var_name)


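The network structure is the same. The difference from the above is that this form does not simply concatenate the hidden state with the input (see the figure above). Instead, the hidden layer at time t-1 is connected to time t through its own weights, so each layer (each LSTM cell) effectively has four more weight arrays. From my debugging, however, when the LSTM network is built directly with TensorFlow in the way shown in my code above, TensorFlow does not use the formulation below:

\(\begin{array}{l}
\text{Input gate: } i_t=\sigma \left( W_{xi}x_t+W_{hi}h_{t-1}+b_i \right)\\
\text{Forget gate: } f_t=\sigma \left( W_{xf}x_t+W_{hf}h_{t-1}+b_f \right)\\
\text{Output gate: } o_t=\sigma \left( W_{xo}x_t+W_{ho}h_{t-1}+b_o \right)\\
\text{Input modulation gate: } g_t=\tanh \left( W_{xc}x_t+W_{hc}h_{t-1}+b_c \right)\\
c_t=f_t\otimes c_{t-1}+i_t\otimes g_t\\
h_t=o_t\otimes \tanh \left( c_t \right)
\end{array}\)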
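One point worth checking numerically is how the split form relates to the concatenated form. The NumPy sketch below compares the pre-activation of a single gate written both ways; the sizes and variable names are made up, and the code uses the row-vector convention (x times W) of the code in this article rather than the column-vector convention of the equations. Under these assumptions, stacking $W_h$ on top of $W_x$ gives one matrix that reproduces the split computation exactly, so the two forms differ in how the weights are stored rather than in what they can compute.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, m, n = 3, 5, 4                     # batch size, input dim, hidden units

x = rng.normal(size=(batch, m))           # x_t
h = rng.normal(size=(batch, n))           # h_{t-1}
W_x = rng.normal(size=(m, n))             # input-to-gate weights
W_h = rng.normal(size=(n, n))             # hidden-to-gate weights
b = rng.normal(size=n)

# split form: separate matrices for x_t and h_{t-1}
split = x @ W_x + h @ W_h + b

# concatenated form: one matrix of shape (n + m, n), rows ordered like [h, x]
W = np.concatenate([W_h, W_x], axis=0)
fused = np.concatenate([h, x], axis=1) @ W + b

print(np.allclose(split, fused))          # True
```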
