This is a word-level Recurrent Neural Network approach to generating new scripts based on the transcript of the entire Balance arc of The Adventure Zone, a well-known tabletop roleplaying and D&D podcast.
To begin, the entire script from The Adventure Zone Season 1: Balance is loaded and then transformed into lowercase characters for training.
# load in data
import helper
data_dir = './taz.txt'
text = helper.load_data(data_dir)
text = text.lower()
You can see that the entire script is now all lowercase text and each new line of dialogue is separated by a newline character, \n.
import numpy as np
view_line_range = (63, 85)
print('Number of unique words: {}'.format(len({word: None for word in text.split()})))
lines = text.split('\n')
print('Number of lines: {}'.format(len(lines)))
word_count_line = [len(line.split()) for line in lines]
print('Average number of words in each line: {}'.format(np.average(word_count_line)))
print()
print('The lines {} to {}:'.format(*view_line_range))
print('\n'.join(text.split('\n')[view_line_range[0]:view_line_range[1]]))
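To create a word embedding, the words are first transformed to encodings, or ids. The function below creates two dictionaries:
- vocab_to_int maps each word to an id.
- int_to_vocab maps each id back to its word.
These dictionaries are returned in the following tuple: (vocab_to_int, int_to_vocab)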
from collections import Counter
def create_lookup_tables(text):
    """
    Create lookup tables for vocabulary
    :param text: The text of tv scripts split into words
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    word_counts = Counter(text)
    # most frequent words get the smallest ids
    sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
    int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
    vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}
    # return tuple
    return (vocab_to_int, int_to_vocab)
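As a quick illustration of the frequency-ordered lookup described above, here is a minimal sketch on a made-up sentence. The helper name build_lookup_tables and the toy_ variables are hypothetical and exist only for this example; the logic mirrors create_lookup_tables.

```python
from collections import Counter

def build_lookup_tables(words):
    """Map words to ids by descending frequency (ties keep first-seen order)."""
    counts = Counter(words)
    sorted_vocab = sorted(counts, key=counts.get, reverse=True)
    int_to_vocab = dict(enumerate(sorted_vocab))
    vocab_to_int = {word: idx for idx, word in int_to_vocab.items()}
    return vocab_to_int, int_to_vocab

# toy input, not taken from the transcript
toy_words = "magnus rushes in and magnus rolls and magnus wins".split()
toy_vocab_to_int, toy_int_to_vocab = build_lookup_tables(toy_words)

# encode the words as ids, then decode them back
encoded = [toy_vocab_to_int[w] for w in toy_words]
decoded = [toy_int_to_vocab[i] for i in encoded]
print(encoded)
print(decoded == toy_words)
```

The most frequent word ("magnus") receives id 0, and decoding the ids recovers the original word list, which is the round trip the generator relies on later.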
The script is split into a word array using spaces as delimiters. However, punctuation marks like periods and exclamation marks can create multiple ids for the same word. For example, "bye" and "bye!" would generate two different word ids.
The function token_lookup will return a dictionary that will be used to tokenize symbols like "!" into "<EXCLAMATION_MARK>".
This dictionary will be used to tokenize the symbols and add a delimiter (space) around each one. This separates each symbol as its own word, making it easier for the neural network to predict the next word.
def token_lookup():
    """
    Generate a dict to turn punctuation into a token.
    :return: Tokenized dictionary where the key is the punctuation and the value is the token
    """
    token_dict = {
        ".": "<PERIOD>",
        ",": "<COMMA>",
        '"': "<QUOTATION_MARK>",
        ":": "<COLON>",
        ";": "<SEMICOLON>",
        "!": "<EXCLAMATION_MARK>",
        "?": "<QUESTION_MARK>",
        "(": "<LEFT_PAREN>",
        ")": "<RIGHT_PAREN>",
        "[": "<LEFT_BRACKET>",
        "]": "<RIGHT_BRACKET>",
        "{": "<LEFT_BRACE>",
        "}": "<RIGHT_BRACE>",
        "-": "<HYPHEN>",
        "–": "<EN_DASH>",
        "—": "<EM_DASH>",
        "\n": "<RETURN>",
        "&": "<AMPERSAND>",
        "…": "<ELLIPSIS>"
    }
    return token_dict
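The helper module that applies this dictionary is not shown in this notebook, so the following is only a sketch of how such a substitution could work, under the assumption that each symbol is replaced by its token padded with spaces. The function name tokenize_punctuation and the two-entry toy_token_dict are hypothetical.

```python
def tokenize_punctuation(text, token_dict):
    """Replace each punctuation mark with its token, padded by spaces."""
    for symbol, token in token_dict.items():
        text = text.replace(symbol, ' {} '.format(token))
    # splitting on whitespace now yields words and tokens as separate items
    return text.split()

# small stand-in for the full dictionary returned by token_lookup()
toy_token_dict = {"!": "<EXCLAMATION_MARK>", ",": "<COMMA>"}
print(tokenize_punctuation("bye, bye!", toy_token_dict))
# → ['bye', '<COMMA>', 'bye', '<EXCLAMATION_MARK>']
```

Both occurrences of "bye" now map to the same word, so they would share a single id in the lookup tables.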
# pre-process training data
helper.preprocess_and_save_data(data_dir, token_lookup, create_lookup_tables)
import helper
int_text, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()
import torch
# Check for a GPU
train_on_gpu = torch.cuda.is_available()
if not train_on_gpu:
    print('No GPU found.')
Starting with the preprocessed input data, TensorDataset will be used to provide a known format to our dataset in combination with DataLoader. These will handle batching, shuffling, and other dataset iteration functions.
data = TensorDataset(feature_tensors, target_tensors)
data_loader = torch.utils.data.DataLoader(data,
batch_size=batch_size)
The batch_data function batches words data into chunks of size batch_size using the TensorDataset and DataLoader classes.
For example, say we have these as input:
words = [1, 2, 3, 4, 5, 6, 7]
sequence_length = 4
Your first feature_tensor contains the values:
[1, 2, 3, 4]
And the corresponding target_tensor is just the next "word"/tokenized word value:
5
This will then continue with the second feature_tensor, target_tensor being:
[2, 3, 4, 5] # features
6 # target
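The windowing in this example can be checked in plain Python before any tensors are involved. This is a minimal sketch; the function name sliding_windows is hypothetical and only illustrates the indexing.

```python
def sliding_windows(words, sequence_length):
    """Pair every window of sequence_length ids with the id that follows it."""
    features, targets = [], []
    for i in range(len(words) - sequence_length):
        features.append(words[i:i + sequence_length])
        targets.append(words[i + sequence_length])
    return features, targets

window_features, window_targets = sliding_windows([1, 2, 3, 4, 5, 6, 7], 4)
print(window_features)  # → [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]]
print(window_targets)   # → [5, 6, 7]
```

Seven words with a sequence length of four give three feature/target pairs, because the last window must still leave one word to serve as its target.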
from torch.utils.data import TensorDataset, DataLoader
def batch_data(words, sequence_length, batch_size):
    """
    Batch the neural network data using DataLoader
    :param words: The word ids of the TV scripts
    :param sequence_length: The sequence length of each batch
    :param batch_size: The size of each batch; the number of sequences in a batch
    :return: DataLoader with batched data
    """
    text, labels = [], []
    # slide a window over the ids; the word after each window is its label
    for i in range(len(words)-sequence_length):
        text.append(words[i:i+sequence_length])
        labels.append(words[i+sequence_length])
    data = TensorDataset(torch.from_numpy(np.array(text)),
                         torch.from_numpy(np.array(labels)))
    dataloader = DataLoader(data, shuffle=True, batch_size=batch_size)
    # return a dataloader
    return dataloader
Below, some test text data is generated and a dataloader is defined using the function above. Then, a sample batch of inputs sample_x and targets sample_y are retrieved from the dataloader.
torch.Size([10, 5])
tensor([[ 28, 29, 30, 31, 32],
[ 21, 22, 23, 24, 25],
[ 17, 18, 19, 20, 21],
[ 34, 35, 36, 37, 38],
[ 11, 12, 13, 14, 15],
[ 23, 24, 25, 26, 27],
[ 6, 7, 8, 9, 10],
[ 38, 39, 40, 41, 42],
[ 25, 26, 27, 28, 29],
[ 7, 8, 9, 10, 11]])
torch.Size([10])
tensor([ 33, 26, 22, 39, 16, 28, 11, 43, 30, 12])
The sample_x should be of size (batch_size, sequence_length) or (10, 5) in this case and sample_y should just have one dimension: batch_size (10).
You should also notice that the targets, sample_y, are the next value in the ordered test_text data. So, for an input sequence [ 28, 29, 30, 31, 32] that ends with the value 32, the corresponding output should be 33.
# test dataloader
test_text = range(50)
t_loader = batch_data(test_text, sequence_length=5, batch_size=10)
data_iter = iter(t_loader)
sample_x, sample_y = next(data_iter)
print(sample_x.shape)
print(sample_x)
print()
print(sample_y.shape)
print(sample_y)
Implement an RNN using PyTorch's Module class. You may choose to use a GRU or an LSTM. To complete the RNN, you'll have to implement the following functions for the class:
- __init__ - The initialize function.
- init_hidden - The initialization function for an LSTM/GRU hidden state.
- forward - Forward propagation function.
The initialize function should create the layers of the neural network and save them to the class. The forward propagation function will use these layers to run forward propagation and generate an output and a hidden state.
The output of this model should be the word scores from the last time step, after a complete sequence has been processed. That is, for each input sequence of words, we only want to output the word scores for a single, most likely, next word.
lstm_output = lstm_output.contiguous().view(-1, self.hidden_dim)
# reshape into (batch_size, seq_length, output_size)
output = output.view(batch_size, -1, self.output_size)
# get last batch
out = output[:, -1]
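To make the shapes in this reshaping concrete, here is a small sketch with random values. The toy dimensions and the standalone fc layer are assumptions chosen for illustration; they stand in for the model's own LSTM output and linear layer.

```python
import torch

# toy sizes, chosen only for illustration
batch_size, seq_length, hidden_dim, output_size = 2, 3, 4, 5
lstm_output = torch.randn(batch_size, seq_length, hidden_dim)
fc = torch.nn.Linear(hidden_dim, output_size)

flat = lstm_output.contiguous().view(-1, hidden_dim)  # (6, 4): one row per time step
scores = fc(flat)                                     # (6, 5): word scores per time step
scores = scores.view(batch_size, -1, output_size)     # (2, 3, 5): regrouped by sequence
last_step = scores[:, -1]                             # (2, 5): scores at the final step only

print(flat.shape, scores.shape, last_step.shape)
```

Indexing with [:, -1] keeps one row of scores per sequence, taken from the final time step, which is why the model returns a (batch_size, output_size) tensor.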
import torch.nn as nn
class RNN(nn.Module):
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5):
        """
        Initialize the PyTorch RNN Module
        :param vocab_size: The number of input dimensions of the neural network (the size of the vocabulary)
        :param output_size: The number of output dimensions of the neural network
        :param embedding_dim: The size of embeddings, should you choose to use them
        :param hidden_dim: The size of the hidden layer outputs
        :param n_layers: The number of stacked LSTM layers
        :param dropout: dropout to add in between LSTM/GRU layers
        """
        super(RNN, self).__init__()
        # set class variables
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        # define model layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_size)

    def forward(self, nn_input, hidden):
        """
        Forward propagation of the neural network
        :param nn_input: The input to the neural network
        :param hidden: The hidden state
        :return: Two Tensors, the output of the neural network and the latest hidden state
        """
        batch_size = nn_input.size(0)
        nn_input = nn_input.long()
        embeds = self.embedding(nn_input)
        lstm_out, hidden = self.lstm(embeds, hidden)
        # stack the LSTM outputs so the linear layer sees one row per time step
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        out = self.fc(lstm_out)
        # reshape into (batch_size, seq_length, output_size) and keep the last time step
        out = out.view(batch_size, -1, self.output_size)
        out = out[:, -1]
        # return one batch of output word scores and the hidden state
        return out, hidden

    def init_hidden(self, batch_size):
        '''
        Initialize the hidden state of an LSTM/GRU
        :param batch_size: The batch_size of the hidden state
        :return: hidden state of dims (n_layers, batch_size, hidden_dim)
        '''
        weight = next(self.parameters()).data
        # initialize hidden state with zero weights, and move to GPU if available
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        return hidden
Use the RNN class you implemented to apply forward and back propagation. This function will be called, iteratively, in the training loop as follows:
loss, hidden = forward_back_prop(rnn, optimizer, criterion, inp, target, hidden)
And it should return the average loss over a batch and the hidden state returned by a call to rnn(inp, hidden). Recall that you can get this loss by computing it, as usual, and calling loss.item().
If a GPU is available, you should move your data to that GPU device, here.
def forward_back_prop(rnn, optimizer, criterion, inp, target, hidden):
    """
    Forward and backward propagation on the neural network
    :param rnn: The PyTorch Module that holds the neural network
    :param optimizer: The PyTorch optimizer for the neural network
    :param criterion: The PyTorch loss function
    :param inp: A batch of input to the neural network
    :param target: The target output for the batch of input
    :param hidden: The hidden state from the previous batch
    :return: The loss and the latest hidden state Tensor
    """
    # move data to GPU, if available
    if train_on_gpu:
        rnn.cuda()
        inp, target = inp.cuda(), target.cuda()
    # detach the hidden state so gradients do not flow back through earlier batches
    hidden = tuple([each.data for each in hidden])
    # perform backpropagation and optimization
    rnn.zero_grad()
    out, hidden = rnn(inp, hidden)
    loss = criterion(out, target)
    loss.backward(retain_graph=True)
    # clip gradients to limit the effect of exploding gradients
    nn.utils.clip_grad_norm_(rnn.parameters(), 5)
    optimizer.step()
    # return the loss over a batch and the hidden state produced by our model
    return loss.item(), hidden
With the structure of the network complete and data ready to be fed in the neural network, it's time to train it.
The training loop is implemented in the train_rnn function. This function will train the network over all the batches for the number of epochs given. The model progress will be shown every number of batches. This number is set with the show_every_n_batches parameter. You'll set this parameter along with other parameters in the next section.
def train_rnn(rnn, batch_size, optimizer, criterion, n_epochs, show_every_n_batches=100):
    best_loss = np.inf
    batch_losses = []
    report = 0
    losses = []
    rnn.train()
    print("Training for %d epoch(s)..." % n_epochs)
    for epoch_i in range(1, n_epochs + 1):
        # initialize hidden state
        hidden = rnn.init_hidden(batch_size)
        for batch_i, (inputs, labels) in enumerate(train_loader, 1):
            # make sure you iterate over completely full batches, only
            n_batches = len(train_loader.dataset)//batch_size
            if(batch_i > n_batches):
                break
            # forward, back prop
            loss, hidden = forward_back_prop(rnn, optimizer, criterion, inputs, labels, hidden)
            # record loss
            batch_losses.append(loss)
            # printing loss stats
            if batch_i % show_every_n_batches == 0:
                avg_loss = np.average(batch_losses)
                print('Epoch: {:>4}/{:<4} Loss: {}\n'.format(
                    epoch_i, n_epochs, avg_loss))
                losses.append([report, avg_loss])
                report += 1
                # save the model whenever the reported loss improves
                if avg_loss < best_loss:
                    helper.save_model('./save/trained_rnn', rnn)
                    print('New Best Loss: Model Saved')
                    best_loss = avg_loss
                batch_losses = []
    # returns a trained rnn
    return rnn, losses
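The loop above skips the final, partially filled batch by hand, since the hidden state is created for exactly batch_size sequences. As an alternative sketch (not what this notebook does), PyTorch's DataLoader accepts a drop_last argument that discards the incomplete batch automatically. The toy dataset below is made up for the example.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# ten toy samples with a batch size of four leaves two samples over
toy_data = TensorDataset(torch.arange(10).view(10, 1), torch.arange(10))
toy_loader = DataLoader(toy_data, batch_size=4, drop_last=True)

batch_sizes = [len(x) for x, _ in toy_loader]
print(batch_sizes)  # → [4, 4]
```

With drop_last=True the number of batches equals len(dataset) // batch_size, the same count the manual check computes as n_batches.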
Set and train the neural network with the following parameters:
- sequence_length is the length of a sequence.
- batch_size is the batch size.
- num_epochs is the number of epochs to train for.
- learning_rate is the learning rate for the Adam optimizer.
- vocab_size is the number of unique tokens in our vocabulary.
- output_size is the desired size of the output.
- embedding_dim is the embedding dimension; much, much smaller than the vocab_size.
- hidden_dim is the hidden dimension of the RNN.
- n_layers is the number of layers/cells in the RNN.
- show_every_n_batches is the number of batches at which the neural network should print its training progress.
If the network isn't getting your desired results, tweak these parameters and/or the layers in the RNN class.
# Data params
# Sequence Length
sequence_length = 12 # of words in a sequence
# Batch Size
batch_size = 128
# data loader - do not change
train_loader = batch_data(int_text, sequence_length, batch_size)
# Training parameters
# Number of Epochs
num_epochs = 20
# Learning Rate
learning_rate = 0.001
# Model parameters
# Vocab size
vocab_size = len(int_to_vocab) + 1
# Output size
output_size = len(int_to_vocab)
# Embedding Dimension
embedding_dim = 500
# Hidden Dimension
hidden_dim = 1200
# Number of RNN Layers
n_layers = 2
# Show stats for every n number of batches
show_every_n_batches = 1000
A number of different combinations of hyperparameter settings were tested one by one to determine which were most effective. show_every_n_batches was set to a low value, 150, and each combination was trained for 1 or 2 epochs to find promising candidates before committing to prolonged training.
In the next cells, the neural network is trained on the pre-processed data. Proper loss development isn't guaranteed, and the hyperparameters may need to be adjusted multiple times to find a good combination for the training data and task. In general, you may get better results with larger hidden_dim and n_layers values, but larger models take longer to train.
A respectable loss to aim for is less than 3.5.
Different sequence lengths should be experimented with as this length determines the size of the long range dependencies that a model can learn.
This project was originally trained on Udacity-provided servers. The active_session() from the following cell block keeps the connection alive while training completes. To replicate this on your own machine, remove this cell block and the "with active_session():" line from the training cell, and dedent the body of that cell.
import signal
from contextlib import contextmanager
import requests
DELAY = INTERVAL = 4 * 60 # interval time in seconds
MIN_DELAY = MIN_INTERVAL = 2 * 60
KEEPALIVE_URL = "https://nebula.udacity.com/api/v1/remote/keep-alive"
TOKEN_URL = "http://metadata.google.internal/computeMetadata/v1/instance/attributes/keep_alive_token"
TOKEN_HEADERS = {"Metadata-Flavor":"Google"}
def _request_handler(headers):
    def _handler(signum, frame):
        requests.request("POST", KEEPALIVE_URL, headers=headers)
    return _handler

@contextmanager
def active_session(delay=DELAY, interval=INTERVAL):
    """
    Example:
    from workspace_utils import active_session
    with active_session():
        # do long-running work here
    """
    token = requests.request("GET", TOKEN_URL, headers=TOKEN_HEADERS).text
    headers = {'Authorization': "STAR " + token}
    delay = max(delay, MIN_DELAY)
    interval = max(interval, MIN_INTERVAL)
    original_handler = signal.getsignal(signal.SIGALRM)
    try:
        # send a keep-alive request every `interval` seconds
        signal.signal(signal.SIGALRM, _request_handler(headers))
        signal.setitimer(signal.ITIMER_REAL, delay, interval)
        yield
    finally:
        # restore the previous handler and cancel the timer
        signal.signal(signal.SIGALRM, original_handler)
        signal.setitimer(signal.ITIMER_REAL, 0)

def keep_awake(iterable, delay=DELAY, interval=INTERVAL):
    """
    Example:
    from workspace_utils import keep_awake
    for i in keep_awake(range(5)):
        # do iteration with lots of work here
    """
    with active_session(delay, interval): yield from iterable
with active_session():
    # create model and move to gpu if available
    rnn = RNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5)
    if train_on_gpu:
        rnn.cuda()
    # defining loss and optimization functions for training
    optimizer = torch.optim.Adam(rnn.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()
    # training the model
    trained_rnn, losses = train_rnn(rnn, batch_size, optimizer, criterion, num_epochs, show_every_n_batches)
The training was cut off here before the full 20 epochs due to time constraints, but reasonable results should be available from the current loss values. As the training was interrupted manually, the losses were not returned by the train_rnn function, and have been manually copied and entered into a list for plotting below.
losses = [5.023212515354157, 4.523681572914123, 4.357919644832611, 4.256966039896011, 4.206574712276459, 4.169255956172943, 4.135705820322037, 4.114186443090439, 4.074571540594101, 3.841714834509068, 3.8300334751605987, 3.8445535378456115, 3.826541626930237, 3.838555276632309, 3.827766616344452, 3.839187493562698, 3.832147630214691, 3.842352229595184, 3.575420247612399, 3.5771549525260924, 3.5859152026176453, 3.61576287817955, 3.6328304860591887, 3.644551205396652, 3.645131004810333, 3.662799038171768, 3.707046409368515, 3.399345034763543, 3.4146838200092318, 3.42980473613739, 3.444021119117737, 3.4838025910854338, 3.4844834179878235, 3.5134334399700164, 3.5272429411411284, 3.54845725107193, 3.242033117051129, 3.273796578168869, 3.28691948223114, 3.313450812101364, 3.3269090163707733, 3.3544340674877167, 3.3820889763832094, 3.4025665228366853, 3.4252790093421934, 3.129925215845444, 3.1412453763484955, 3.1792691149711607, 3.1948207347393036, 3.2189578416347504, 3.246541239976883, 3.2866171128749846, 3.3054614861011506, 3.335029572725296, 3.020186313552929, 3.0533013434410097,3.0772991843223574, 3.1019813883304597, 3.1202284171581267, 3.169564992427826, 3.184407285213470, 3.2218992977142333, 3.240302947998047, 2.935693301664547]
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(losses)
plt.title("TAZ Script Generation RNN Training Progress")
plt.xlabel("1000th Batch Results")
plt.ylabel("Model Loss")
plt.show()
The loss dropped steadily with these hyperparameters, which suggests the model was learning the patterns and structure of its training data. Because the loss had not plateaued when training stopped, further training would likely have continued to improve the model's performance. Finally, notice how the loss descends in a staircase pattern. This is because the batch-averaged loss was recorded every 1,000 batches and not once per epoch. Each large step down in training loss marks the first report of a new training epoch, after the model has completed a full pass over the data.
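One way to look past the staircase is to group the reports by epoch and average each group. The sketch below uses made-up loss values and an assumed three reports per epoch; the function name split_into_epochs is hypothetical, and the real number of reports per epoch would have to be read from the training log.

```python
def split_into_epochs(reports, reports_per_epoch):
    """Chunk a flat list of loss reports into one list per epoch."""
    return [reports[i:i + reports_per_epoch]
            for i in range(0, len(reports), reports_per_epoch)]

# illustrative values only: the loss drops at each epoch start, then drifts up
toy_losses = [4.1, 4.0, 3.9, 3.6, 3.7, 3.8, 3.3, 3.4, 3.5]
toy_epochs = split_into_epochs(toy_losses, reports_per_epoch=3)
epoch_means = [sum(epoch) / len(epoch) for epoch in toy_epochs]
print([round(m, 2) for m in epoch_means])
```

Plotting the per-epoch means instead of every report would show a single descending curve, at the cost of hiding the within-epoch drift.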
After running the above training cell, the model is saved under the filename trained_rnn. The notebook can be resumed from here by running the next cell, which will load in the word:id dictionaries and load in the saved model by name.
import torch
import helper
import numpy as np
_, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()
trained_rnn = helper.load_model('./save/trained_rnn')
With the network trained and saved, it can be used to generate a new, "fake" The Adventure Zone script.
To generate the text, the network needs to start with a priming word or phrase and repeat its predictions until it reaches a set length. The generate function is used to do this. It takes a word id to start with, prime_id, and generates a set length of text, predict_len. Also note that it uses top-k sampling to introduce some randomness in choosing the most likely next word, given an output set of word scores.
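Top-k sampling can be sketched on its own with NumPy: keep the k highest-probability entries, renormalize them, and draw one at random. The function name sample_top_k, the toy probabilities, and the seeded generator are assumptions made for this example.

```python
import numpy as np

def sample_top_k(probs, k, rng):
    """Draw one index from the k most probable entries, weighted by probability."""
    top_idx = np.argsort(probs)[-k:]   # indices of the k largest probabilities
    top_p = probs[top_idx]
    return int(rng.choice(top_idx, p=top_p / top_p.sum()))

rng = np.random.default_rng(0)
toy_probs = np.array([0.05, 0.40, 0.10, 0.30, 0.15])

# with k=2 only indices 1 and 3 (the two most likely words) can be drawn
samples = {sample_top_k(toy_probs, 2, rng) for _ in range(200)}
print(samples)
```

A small k keeps the output close to the most likely continuation, while a larger k allows more varied (and more error-prone) text.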
import torch.nn.functional as F
def generate(rnn, prime_id, int_to_vocab, token_dict, pad_value, predict_len=100):
    """
    Generate text using the neural network
    :param rnn: The PyTorch Module that holds the trained neural network
    :param prime_id: The word id to start the first prediction
    :param int_to_vocab: Dict of word id keys to word values
    :param token_dict: Dict of punctuation tokens keys to punctuation values
    :param pad_value: The value used to pad a sequence
    :param predict_len: The length of text to generate
    :return: The generated text
    """
    rnn.eval()
    # create a sequence (batch_size=1) with the prime_id
    # note: sequence_length is read from the notebook's global scope
    current_seq = np.full((1, sequence_length), pad_value)
    current_seq[-1][-1] = prime_id
    predicted = [int_to_vocab[prime_id]]
    for _ in range(predict_len):
        if train_on_gpu:
            current_seq = torch.LongTensor(current_seq).cuda()
        else:
            current_seq = torch.LongTensor(current_seq)
        # initialize the hidden state
        hidden = rnn.init_hidden(current_seq.size(0))
        # get the output of the rnn
        output, _ = rnn(current_seq, hidden)
        # get the next word probabilities
        p = F.softmax(output, dim=1).data
        if(train_on_gpu):
            p = p.cpu() # move to cpu
        # use top_k sampling to get the index of the next word
        top_k = 5
        p, top_i = p.topk(top_k)
        top_i = top_i.numpy().squeeze()
        # select the likely next word index with some element of randomness
        p = p.numpy().squeeze()
        word_i = np.random.choice(top_i, p=p/p.sum())
        # retrieve that word from the dictionary
        word = int_to_vocab[word_i]
        predicted.append(word)
        # the generated word becomes the next "current sequence" and the cycle can continue
        # (move the tensor back to the CPU first so NumPy can roll it)
        current_seq = np.roll(current_seq.cpu().numpy(), -1, 1)
        current_seq[-1][-1] = word_i
    gen_sentences = ' '.join(predicted)
    # Replace punctuation tokens
    for key, token in token_dict.items():
        gen_sentences = gen_sentences.replace(' ' + token.lower(), key)
    gen_sentences = gen_sentences.replace('\n ', '\n')
    gen_sentences = gen_sentences.replace('( ', '(')
    # return all the sentences
    return gen_sentences
Set gen_length to the length of script you want to generate and set prime_word to start the prediction.
You can set the prime word to any word in the encoding id dictionary, but it's best to start with a name from the original Balance Arc (or "previously", as in "Previously on the Adventure Zone") for generating a TAZ script.
# run the cell multiple times to get different results!
gen_length = 400 # modify the length to your preference
prime_word = 'previously' # name for starting the script
pad_word = helper.SPECIAL_WORDS['PADDING']
generated_script = generate(trained_rnn, vocab_to_int[prime_word], int_to_vocab, token_dict, vocab_to_int[pad_word], gen_length)
print(generated_script)
# save script to a text file
with open("generated_script_5.txt", "w") as f:
    f.write(generated_script)
It's alright if the script doesn't make perfect sense. It should look like alternating lines of dialogue. Here is one such example of a few generated lines.
taako.
clint: well, we got that.
griffin: okay. uh, and uh, he-
justin:[ in a high- pitched voice] hey guys.[ pause]
griffin: she says,
lup: okay, i got a— i have a question. i’m gonna need to go to the quarry, and try and deduce what happened, because we have to warn you, but i promise you.
magnus: well, i have a question for you.
hudson: i mean you guys are getting pretty close to the astral plane. you can see it.
magnus:[ whispering] oh!
merle: no, no, no, no, no! no! no. no.
justin:[ crosstalk] oh, i see.
travis:[ crosstalk] i have no idea what the fuck i do.
griffin: yeah, i mean you just chill in that.
justin: okay.
griffin: okay, so this brings a rock around its central neck and you realize that this memory is just radiating you in the ceiling.
clint: okay, and i cast zone on truth.
griffin: okay. you pop open the stair in the gachapon cell, you hear a voice say:
lydia: well, i have a— i had a— i have a +1 skill.
griffin: okay!
travis: okay, i’m gonna roll a d8.
griffin: okay.
travis: and i’m gonna do a nature check now?
griffin: no, he is alive. he’s lying in the center, he’s not especially familiar, he has a big bandage of stars and steel. and then you hear the woman’s voice say,
female elf: i don’t know what to tell you, merle.
clint: alright, let’s go.
griffin: yeah, so you guys have just won a war.
clint: well, i’m not gonna do a perception check
You can see that there are multiple characters that say (somewhat) complete sentences, but it's not going to be perfect. It takes quite a lot of training and a large amount of computing power to get good results, and it is often necessary to use a reduced vocabulary by discarding unimportant and uncommon words while also getting more data. A reduced vocabulary could be explored here, but the nature of the podcast genre results in a plethora of fantastical words and colorful phrases that are uncommon or nonexistent in English and are rarely repeated in the transcript. Additionally, with this original season ending in August of 2017, additional training data is not really available. Transcripts from live shows or even future seasons could be utilized to better learn the actors' personalities, but extreme care would be necessary to prevent the model from combining character personalities, which is well beyond the scope of this project.
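For reference, the reduced-vocabulary idea mentioned above could be sketched as follows. This was not done in this project; the function name reduce_vocabulary, the "<UNK>" placeholder, and the min_count threshold are all hypothetical choices for illustration.

```python
from collections import Counter

def reduce_vocabulary(words, min_count, unknown_token="<UNK>"):
    """Replace words seen fewer than min_count times with a shared placeholder."""
    counts = Counter(words)
    return [w if counts[w] >= min_count else unknown_token for w in words]

# toy input, not taken from the transcript
rare_demo_words = "taako casts a spell and taako casts again".split()
reduced = reduce_vocabulary(rare_demo_words, min_count=2)
print(reduced)
# → ['taako', 'casts', '<UNK>', '<UNK>', '<UNK>', 'taako', 'casts', '<UNK>']
```

The trade-off is visible even in this toy case: the vocabulary shrinks, but any rare, distinctive word collapses into the same placeholder, which is exactly the kind of word this transcript is full of.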