Preface
To start, I attach the standard LSTM explanation, which contains the diagrams you are probably looking for.
This article is a set of notes on the LSTM material I have been reading over the last couple of days. The goal is to get the inputs, outputs, and network construction straight, including the TensorFlow code for building a multi-layer LSTM. A quick search turns up countless LSTM tutorials, but they always felt unclear to me (perhaps because I only skimmed them), so I am writing down the parts I believe I now understand, in the hope that it helps anyone else who is still figuring this out. As usual, we go RNN –> LSTM.
The meaning of the symbols used in the figures below (figure taken from the web):
RNN
First, the diagram that everyone posts (taken from the web). All figures taken from the web are marked as such; if there is any infringement, contact me and I will remove them.
In my experience, people without a formal background (like me) usually jump straight to LSTM. In fact, once you understand the classic RNN, LSTM works in just the same way. Based on this, I drew an easier-to-understand diagram below (assuming you are already familiar with traditional feed-forward networks):
This figure is drawn from my own understanding; any resemblance to other figures is purely coincidental. The RNN in this article uses only two layers, and I have labeled all the dimensions: $\omega_i$ denotes the weights of the $i$-th RNN layer, and $h_{t}^{i}$ denotes the hidden-layer output of the $i$-th RNN layer at time step $t$. I think this figure already makes clear how a multi-layer RNN propagates forward. It is worth noting that one RNN cell is just a fully-connected layer; the two layers above are two RNN cells, and deeper networks simply stack more of them. The weights, however, are shared across time: no matter how many time steps $t$ you unroll the two RNN cells above, there are still only two sets of weights.
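To make the weight sharing concrete, here is a small illustrative sketch (all sizes are made up) of a two-layer vanilla RNN unrolled over several time steps; the same two weight sets are reused at every step:
import numpy as np

batch, seq_length, input_dim, h1_dim, h2_dim = 4, 5, 3, 8, 6   # illustrative sizes
x = np.random.randn(seq_length, batch, input_dim)

# one weight set per RNN layer, shared across all time steps
W1, U1 = np.random.randn(input_dim, h1_dim), np.random.randn(h1_dim, h1_dim)
W2, U2 = np.random.randn(h1_dim, h2_dim), np.random.randn(h2_dim, h2_dim)

h1, h2 = np.zeros((batch, h1_dim)), np.zeros((batch, h2_dim))
for t in range(seq_length):                   # the same W1/U1/W2/U2 are used at every t
    h1 = np.tanh(x[t] @ W1 + h1 @ U1)         # layer 1
    h2 = np.tanh(h1 @ W2 + h2 @ U2)           # layer 2 consumes layer 1's output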
LSTM
Here are a few LSTM diagrams you have probably seen many times (the following five images are taken from the web).
I really did not want to redraw this one; it is too complicated, and structurally it is the same as the RNN. So the plan was to only mark the key points and the dimensions, this time following how you would actually program it with batched training. That was the plan, but I still ended up spending two hours drawing a figure anyway:
In the figure, $n_i$ denotes the number of hidden units of the $i$-th LSTM cell, and $\omega_j^i$ denotes the weight matrix at the $j$-th position (position meaning "forget gate, output gate, etc.") of the $i$-th LSTM cell (the $i$-th layer). A single LSTM cell has 4 weight matrices in total, each of shape [input dim of the current LSTM cell + current number of hidden units, current number of hidden units]; $b_j^i$ denotes the bias at the $j$-th position of the $i$-th LSTM cell. The input $x_t$ has two dimensions: the first dimension $s$ is batch_size, and $m$ is the input dimension. In practice the input tensor actually has three dimensions, (batch_size, seq_length, input_dim), but each LSTM cell only receives a (batch_size, input_dim) slice at each time step $t$, seq_length times in total.
One more remark: $h_{t}^1$, the output of the first cell, goes to two places. One copy is passed to the next LSTM cell at the current time step as its input; the other is fed back to the same cell at the next time step (as the hidden-state input used to compute $h_{t+1}^1$). $C_t^1$, by contrast, is only fed back to the same cell at the next time step (used to compute $C_{t+1}^1$).
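To make the shapes concrete, here is a small illustrative sketch (the sizes are made up) of how one gate's weight matrix acts on the concatenation of the per-step input slice and the previous hidden state:
import numpy as np

batch_size, seq_length, input_dim, hidden = 32, 10, 8, 20   # illustrative sizes
x = np.random.randn(batch_size, seq_length, input_dim)      # full input tensor
h_prev = np.zeros((batch_size, hidden))                      # h_{t-1} of this layer

x_t = x[:, 0, :]                                   # (batch_size, input_dim) slice at t = 0
concat = np.concatenate([x_t, h_prev], axis=1)     # (batch_size, input_dim + hidden)
W_f = np.random.randn(input_dim + hidden, hidden)  # one of the 4 gate weight matrices
b_f = np.zeros(hidden)
f_t = 1.0 / (1.0 + np.exp(-(concat @ W_f + b_f)))  # forget gate output, (batch_size, hidden)
print(f_t.shape)                                   # (32, 20)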
Programming notes
Implementation with the TensorFlow API
First, reference link 1. The code in this reference has a small problem: it swaps seq_length (the number of recurrence steps) and input_dim (the input dimension).
Next, reference links 2 and 3, which explain the tf.nn.dynamic_rnn() function in detail.
Finally, reference link 4, which explains the correct way to use tf.nn.rnn_cell.MultiRNNCell().
With that, the LSTM that used to confuse me so much is covered. Not so hard, right? Below are the things to watch out for when programming:
###################### Build a multi-layer LSTM network ################
# The first LSTM cell has 20 hidden units, the second LSTM cell has 30 hidden units
hidden_size = [20, 30]
# Stack the LSTM layers
cell = tf.nn.rnn_cell.MultiRNNCell([tf.nn.rnn_cell.BasicLSTMCell(i) for i in hidden_size])
# Use the TensorFlow API to unroll the stacked LSTM cells into an RNN and compute the forward pass
# (X is the input placeholder of shape (batch_size, seq_length, input_dim))
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)
# Notes on dynamic_rnn()
outputs, states = tf.nn.dynamic_rnn(
    cell,
    inputs,
    sequence_length=None,
    initial_state=None,
    dtype=None,
    parallel_iterations=None,
    swap_memory=False,
    time_major=False,
    scope=None
)
# cell -- the recurrent cell; can be an LSTM, vanilla RNN, or GRU cell
# inputs -- the training data, shape (batch_size, seq_length, input_dim)
# time_major -- defaults to False; controls the layout of the inputs and outputs tensors. If True, they
#   should be (and will be) [max_time, batch_size, depth]; if False, [batch_size, max_time, depth]
#   (depth is input_dim for inputs and hidden_size for outputs). In other words, time_major=True means
#   the first dimension of the input/output tensors is max_time, otherwise it is batch_size.
# outputs -- the output of the last RNN (LSTM/GRU) layer at every time step, i.e. h in the figure;
#   the shape is determined by time_major
# states -- the final state, i.e. the state produced at the last step of the sequence. In general states
#   has shape [batch_size, cell.output_size], but when the cell is a BasicLSTMCell the state has shape
#   [2, batch_size, cell.output_size], where the 2 corresponds to the LSTM's cell state and hidden state
#   (see reference link 2 for details)
# Take the output of the last time step and connect it to a fully-connected layer; note that outputs contains h, not C
outputs = outputs[:, -1, :]
# Connect the output to a fully-connected layer to make predictions, i.e. add one more fully-connected layer on top of the LSTM output
predictions = tf.contrib.layers.fully_connected(outputs, 1, activation_fn=None)
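As a cross-check, here is a minimal sketch that continues the snippet above and assumes every sequence in the batch has the full length. For a MultiRNNCell built from BasicLSTMCells, states is a tuple of LSTMStateTuple(c, h), one per layer:
# states[-1] is the LSTMStateTuple of the top LSTM layer
last_h = states[-1].h   # (batch_size, hidden_size[-1]); for full-length sequences this equals the last-step slice taken above
last_c = states[-1].c   # the final cell state C, which is not contained in outputs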
From here, once you have the predictions, designing the loss function is up to you.
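For example, a minimal sketch for a regression target, assuming a label placeholder y of shape (batch_size, 1) that is not defined in the snippet above:
y = tf.placeholder(tf.float32, [None, 1])           # hypothetical target placeholder
loss = tf.reduce_mean(tf.square(predictions - y))   # mean squared error
train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)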
One last remark about how to use tf.nn.rnn_cell.MultiRNNCell(). I have indeed seen people write it the following way; according to reference 4 this is problematic, because the layers end up sharing their weights, and the results will presumably not be as good:
# Not recommended: every LSTM layer has the same number of hidden units, and the weights are shared
basic_cell = tf.nn.rnn_cell.BasicLSTMCell(rnn_unit)
multi_cell = tf.nn.rnn_cell.MultiRNNCell([basic_cell]*layer_num)
The TensorFlow documentation recommends writing it like this:
# num_units holds the number of hidden units of each LSTM cell
num_units = [128, 64]
cells = [tf.nn.rnn_cell.BasicLSTMCell(num_units=n) for n in num_units]
stacked_rnn_cell = tf.nn.rnn_cell.MultiRNNCell(cells)
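To confirm that each layer gets its own kernel and bias in this form, you can list the trainable variables after building the graph; a minimal sketch (the placeholder shape is illustrative):
import tensorflow as tf

num_units = [128, 64]
cells = [tf.nn.rnn_cell.BasicLSTMCell(num_units=n) for n in num_units]
stacked_rnn_cell = tf.nn.rnn_cell.MultiRNNCell(cells)

x = tf.placeholder(tf.float32, [None, 10, 8])   # (batch_size, seq_length, input_dim)
outputs, states = tf.nn.dynamic_rnn(stacked_rnn_cell, x, dtype=tf.float32)

# Expect one kernel/bias pair per layer, e.g. .../cell_0/basic_lstm_cell/kernel and .../cell_1/...
for v in tf.trainable_variables():
    print(v.name, v.shape)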
Manual implementation
The implementation below builds an LSTM without calling the TensorFlow LSTM API directly. Note that at time_step=0 people usually feed zero matrices (i.e. H and C are initialized at t0 as non-trainable all-zero matrices), whereas here I use trainable weight matrices instead (the red part in the figure). In general, you would simply use non-trainable zero matrices as the initial state.
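For comparison, the usual zero initialization looks like the sketch below (X is assumed to be the (batch_size, seq_length, input_dim) input placeholder; the layer sizes mirror self.width from the code below). With tf.nn.dynamic_rnn you also get this behaviour by default when you pass only dtype:
cell = tf.nn.rnn_cell.MultiRNNCell(
    [tf.nn.rnn_cell.BasicLSTMCell(n) for n in [32, 32, 64]])
batch_size = tf.shape(X)[0]
init_state = cell.zero_state(batch_size, tf.float32)   # non-trainable all-zero h0 and c0 for every layer
outputs, states = tf.nn.dynamic_rnn(cell, X, initial_state=init_state)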
The code below is based on:
TF: 1.13.1
python: 3.6.8
Platform: Windows10 + Anaconda + Pycharm
# Author: 烟酒僧
# Date: 2020-03-29
# Function: Realize LSTM NN through basic TF API
import tensorflow as tf
import numpy as np


class LSTM(object):
    def __init__(self, input_dim):
        # input_dim is the dimension of inputs
        self.input_dim = input_dim
        # layer widths: input_dim, then the hidden sizes of the three LSTM layers
        self.width = [input_dim, 32, 32, 64]
        # time step of LSTM
        self.time_step = 30
        # containers for all weights/biases, filled by lstm_init()
        self.weights = dict()
        self.biases = dict()
        self.lstm_init()
        self.output = self.encoder_lstm()
    def lstm_init(self):
        '''
        initialize all weights
        :return: None
        '''
        self.x = tf.placeholder(tf.float32, [self.time_step, None, self.input_dim])
        widthes = self.width
        weights = dict()
        biases = dict()
        # weights for h0 and c0
        for i in np.arange(len(widthes) - 1):
            weights['WEH%d' % (i + 1)] = self.weight_variable([widthes[i], widthes[i + 1]], var_name='WEH%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
            weights['WEC%d' % (i + 1)] = self.weight_variable([widthes[i], widthes[i + 1]], var_name='WEC%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
        # weights for LSTM
        for i in np.arange(len(widthes) - 1):
            weights['LSTMWf%d' % (i + 1)] = self.weight_variable([widthes[i] + widthes[i + 1], widthes[i + 1]], var_name='LSTMWf%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
            weights['LSTMWi%d' % (i + 1)] = self.weight_variable([widthes[i] + widthes[i + 1], widthes[i + 1]], var_name='LSTMWi%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
            weights['LSTMWc%d' % (i + 1)] = self.weight_variable([widthes[i] + widthes[i + 1], widthes[i + 1]], var_name='LSTMWc%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
            weights['LSTMWo%d' % (i + 1)] = self.weight_variable([widthes[i] + widthes[i + 1], widthes[i + 1]], var_name='LSTMWo%d' % (i + 1), distribution='dl', scale=1 / widthes[i + 1])
            biases['LSTMbf%d' % (i + 1)] = self.bias_variable([widthes[i + 1]], var_name='LSTMbf%d' % (i + 1))
            biases['LSTMbi%d' % (i + 1)] = self.bias_variable([widthes[i + 1]], var_name='LSTMbi%d' % (i + 1))
            biases['LSTMbc%d' % (i + 1)] = self.bias_variable([widthes[i + 1]], var_name='LSTMbc%d' % (i + 1))
            biases['LSTMbo%d' % (i + 1)] = self.bias_variable([widthes[i + 1]], var_name='LSTMbo%d' % (i + 1))
        self.weights.update(weights)
        self.biases.update(biases)
    def encoder_lstm(self):
        # self.x has shape (time_step, None, input_dim)
        # h0 and c0 are built from the first input slice through trainable weights (see the note above)
        input_h = tf.squeeze(self.x[0, :, :])
        input_c = tf.squeeze(self.x[0, :, :])
        widthes = self.width
        num_lstm = len(widthes) - 1
        h, c = [], []
        for i in np.arange(num_lstm):
            input_h = tf.nn.tanh(tf.matmul(input_h, self.weights['WEH%d' % (i + 1)]))
            h.append(input_h)
            input_c = tf.nn.tanh(tf.matmul(input_c, self.weights['WEC%d' % (i + 1)]))
            c.append(input_c)
        output = []
        for i in np.arange(self.time_step):
            x = tf.squeeze(self.x[i, :, :])
            # multi-layer LSTM: advance every layer by one time step
            h, c = self.lstm_one_shift(x, h, c, num_lstm)
            output.append(h[-1])
        return output
    def lstm_one_shift(self, x, h, c, num_lstm):
        '''
        forward one time step for the multi-layer LSTM
        :param x: the input at this time step (the raw input for the first layer; for deeper layers it is the hidden output of the previous layer)
        :param h: list of hidden states of each LSTM layer from the previous time step
        :param c: list of cell states of each LSTM layer from the previous time step
        :param num_lstm: the number of LSTM layers
        :return: the updated lists h and c
        '''
        for j in np.arange(num_lstm):
            temp_h, temp_c = self.lstm_unit(x, h[j], c[j], j + 1)
            h[j] = temp_h
            c[j] = temp_c
            # the hidden output of this layer becomes the input of the next layer
            x = h[j]
        return h, c
    def lstm_unit(self, x, h, c, k):
        '''
        Compute a single LSTM cell for one time step. n2 denotes the hidden size of the current layer
        :param x: the input, shape = (b, m), where b is the batch size and m is the input dim (or the previous layer's hidden size)
        :param h: the hidden state of this layer from the previous time step, shape = (b, n2)
        :param c: the cell state of this layer from the previous time step, shape = (b, n2)
        :param k: use the weights and biases of the k-th LSTM layer
        :return: h = (b, n2), c = (b, n2)
        '''
        input = tf.concat([h, x], axis=1)
        f = tf.nn.sigmoid(tf.nn.xw_plus_b(input, self.weights['LSTMWf%d' % k], self.biases['LSTMbf%d' % k]))  # forget gate
        i = tf.nn.sigmoid(tf.nn.xw_plus_b(input, self.weights['LSTMWi%d' % k], self.biases['LSTMbi%d' % k]))  # input gate
        C = tf.nn.tanh(tf.nn.xw_plus_b(input, self.weights['LSTMWc%d' % k], self.biases['LSTMbc%d' % k]))     # candidate cell state
        c = tf.multiply(f, c) + tf.multiply(i, C)                                                              # new cell state
        O = tf.nn.sigmoid(tf.nn.xw_plus_b(input, self.weights['LSTMWo%d' % k], self.biases['LSTMbo%d' % k]))  # output gate
        h = tf.multiply(O, tf.nn.tanh(c))  # h_t = o_t * tanh(c_t): uses the updated cell state c, not the candidate C
        return h, c
    def weight_variable(self, shape, var_name, distribution='tn', scale=0.1, first_guess=0):
        """Create a variable for a weight matrix.
        Arguments:
            shape -- array giving shape of output weight variable
            var_name -- string naming weight variable
            distribution -- string for which distribution to use for random initialization (default 'tn')
            scale -- (for tn distribution): standard deviation of normal distribution before truncation (default 0.1)
            first_guess -- (for tn distribution): array of first guess for weight matrix, added to tn dist. (default 0)
        Returns:
            a TensorFlow variable for a weight matrix
        Raises ValueError if distribution is a filename but shape of data in file does not match input shape
        """
        if distribution == 'tn':
            initial = tf.truncated_normal(shape, stddev=scale, dtype=tf.float32)
        elif distribution == 'xavier':
            scale = 4 * np.sqrt(6.0 / (shape[0] + shape[1]))
            initial = tf.random_uniform(shape, minval=-scale, maxval=scale, dtype=tf.float32)
        elif distribution == 'dl':
            # see page 295 of Goodfellow et al's DL book
            # divide by sqrt of m, where m is number of inputs
            scale = 1.0 / np.sqrt(shape[0])
            initial = tf.random_uniform(shape, minval=-scale, maxval=scale, dtype=tf.float32)
        elif distribution == 'he':
            # from He, et al. ICCV 2015 (referenced in Andrew Ng's class)
            # divide by m, where m is number of inputs
            scale = np.sqrt(2.0 / shape[0])
            initial = tf.random_normal(shape, mean=0, stddev=scale, dtype=tf.float32)
        elif distribution == 'glorot_bengio':
            # see page 295 of Goodfellow et al's DL book
            scale = np.sqrt(6.0 / (shape[0] + shape[1]))
            initial = tf.random_uniform(shape, minval=-scale, maxval=scale, dtype=tf.float32)
        else:
            initial = np.loadtxt(distribution, delimiter=',', dtype=np.float32)
            if (initial.shape[0] != shape[0]) or (initial.shape[1] != shape[1]):
                raise ValueError(
                    'Initialization for %s is not correct shape. Expecting (%d,%d), but find (%d,%d) in %s.' % (
                        var_name, shape[0], shape[1], initial.shape[0], initial.shape[1], distribution))
        return tf.Variable(initial, name=var_name)
    def bias_variable(self, shape, var_name, distribution=''):
        """Create a variable for a bias vector.
        Arguments:
            shape -- array giving shape of output bias variable
            var_name -- string naming bias variable
            distribution -- string for which distribution to use for random initialization (file name) (default '')
        Returns:
            a TensorFlow variable for a bias vector
        """
        if distribution:
            initial = np.genfromtxt(distribution, delimiter=',', dtype=np.float32)
        else:
            initial = tf.constant(0.0, shape=shape, dtype=tf.float32)
        return tf.Variable(initial, name=var_name)
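A hypothetical usage sketch (the random data, batch size, and session boilerplate below are mine, not part of the original post), just to show the expected shapes:
if __name__ == '__main__':
    model = LSTM(input_dim=8)                      # builds the placeholder self.x and all weights
    data = np.random.randn(model.time_step, 16, 8).astype(np.float32)   # (time_step, batch_size, input_dim)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        out = sess.run(model.output, feed_dict={model.x: data})
        print(len(out), out[0].shape)              # time_step outputs, each of shape (16, 64)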
There is actually another form, which I have seen people use (there is code for it on GitHub as well): instead of calling the TensorFlow API directly, the weight matrices are defined by hand and the computation is written out with matrix multiplications.
The network structure is the same. The difference from the version above is that this form does not simply feed the previous hidden state in through the concatenation (see the figure above); instead, the hidden state from time $t-1$ is connected to time $t$ through its own weight matrices, so each layer (each LSTM cell) effectively has 4 extra weight matrices. Note that the two forms are mathematically equivalent: multiplying the concatenation $[h_{t-1}, x_t]$ by one stacked matrix gives the same result as $W_{h}h_{t-1} + W_{x}x_t$ with two separate matrices. From my debugging, if you build the LSTM directly with TensorFlow as in my code above, TensorFlow does not use the formulation below:
\[
\begin{aligned}
\text{Input Gate: } & i_t=\sigma\left(W_{xi}x_t+W_{hi}h_{t-1}+b_i\right)\\
\text{Forget Gate: } & f_t=\sigma\left(W_{xf}x_t+W_{hf}h_{t-1}+b_f\right)\\
\text{Output Gate: } & o_t=\sigma\left(W_{xo}x_t+W_{ho}h_{t-1}+b_o\right)\\
\text{Input Modulation Gate: } & g_t=\tanh\left(W_{xc}x_t+W_{hc}h_{t-1}+b_c\right)\\
& c_t=f_t\otimes c_{t-1}+i_t\otimes g_t\\
& h_t=o_t\otimes \tanh\left(c_t\right)
\end{aligned}
\]
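For completeness, a minimal sketch of one LSTM step in that separate-weights form (all names here are illustrative, not from the original post or from TensorFlow's API):
def lstm_step_separate(x_t, h_prev, c_prev, Wx, Wh, b):
    # Wx: dict of (input_dim, hidden) matrices, Wh: dict of (hidden, hidden) matrices,
    # b: dict of (hidden,) biases, each keyed by 'i', 'f', 'o', 'c'
    i = tf.sigmoid(tf.matmul(x_t, Wx['i']) + tf.matmul(h_prev, Wh['i']) + b['i'])   # input gate
    f = tf.sigmoid(tf.matmul(x_t, Wx['f']) + tf.matmul(h_prev, Wh['f']) + b['f'])   # forget gate
    o = tf.sigmoid(tf.matmul(x_t, Wx['o']) + tf.matmul(h_prev, Wh['o']) + b['o'])   # output gate
    g = tf.tanh(tf.matmul(x_t, Wx['c']) + tf.matmul(h_prev, Wh['c']) + b['c'])      # input modulation gate
    c_t = f * c_prev + i * g
    h_t = o * tf.tanh(c_t)
    return h_t, c_t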
For more content, you can visit my Zhihu homepage.