Monday, September 12, 2016

Machine Learning Adventure 2 - Gradient Descent and the Learning Rate.

So last post we discovered we didn't really know what was going on. Our attempt to bend the character predicting RNN to work with audio data had produced.. interesting (terrible) results, and in looking at why, we realized we should stop mucking about and make a proper attempt to find out what was actually happening.

At this point I'm going to start using 'I' rather than 'We'. The 'We' was only really interested in the 'what if we stuffed audio into it?' question, and has now gone back to building complicated DSP things in Plogue.


So. Reading around on what was causing the loss of the network to fluctuate wildly over time, basically diverging from a solution rather than converging to one, it turns out this has a lot to do with gradient descent and the learning rate. To explain what was going on, I'll take a simple case of gradient descent and show how it works on a simple problem. This isn't specifically about neural networks (it turns out gradient descent is a technique with much more general application), but it will hopefully illustrate the basic idea behind what was going wrong in our attempt to train the char generating RNN with audio data.

Simple linear regression by gradient descent.
(relevant XKCD: Linear Regression)

From Wikipedia, "linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression."

Most people know it as 'fitting a straight line to the data'.

Given a set of (x,y) data, we would like to plot a straight line through it to either see a trend, or make possible predictions about what y may be given a particular value for x, or make a decision about what y may be in the future. An example from my day job could be 'Given data on the number of tapes used by our daily backups, at what point in the future might we consider buying more tapes to ensure we don't run out'. Now, there are other ways to achieve this than using gradient descent, but they are not of interest to us here. We want to use gradient descent to hopefully throw (some) light on what (may) have been happening when our network was failing to train.

Let's assume we have been measuring the tape usage of our backup system over time, and we have a dataset that looks like this:

Week      1   2   3   4   5   6   7   8   9  10  11  12  13  14
Tape use 13  15  21  18  21  22  21  24  22  25  26  28  25  29

We can see tape use fluctuates over time - some of our data expires and tapes are freed up, but tape use does seem to increase in general - overall we are generating more and more data that needs to be backed up.

(Note, this is an entirely fictional situation - we use very different techniques to actually analyse things like tape use in the real world)

When we plot this out as a graph, we get the following (weeks along the x axis, tapes on the y axis)


It's a fairly simple plot, and it's pretty easy to see where the straight line fit should be; so we can be sure with a visual check that our gradient descent method is actually coming up with the correct fitting line.
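
If you fancy reproducing the plot yourself, a quick throwaway matplotlib sketch using the numbers from the table above would look something like this:

import matplotlib.pyplot as plt

weeks = list(range(1, 15))
tapes = [13, 15, 21, 18, 21, 22, 21, 24, 22, 25, 26, 28, 25, 29]

plt.scatter(weeks, tapes)   # one point per week of real data
plt.xlabel('Week')
plt.ylabel('Tapes used')
plt.show()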

Now, we know that the equation for a straight line is

y=c+mx


So we are looking for the values of c and m that will produce the straight line that best fits our data.

We could, if we really wanted to, start trying values of c and m by hand, and keep drawing out the lines until we find one we are happy with. That might even end up being pretty straightforward for our simple dataset. But judging the line by eye means we may not get it exactly right, and what we really want is an algorithm that will find it for us with the least amount of work possible on our part. I'm a developer, and a lazy developer is a good developer, because they are always looking for ways to get the computer to do as much of the work as possible.

If we want to do this automatically, then first we need a way of judging how good any given pair of c,m values are at generating a line that fits the data. One good way of doing this is to measure how far away all the data points are from the line - the best possible line will keep all the data points as close to it as possible. We can measure the distance of a datapoint from the line by taking the y value of our line at a given x value, and subtracting the real y (tape use) value of the datapoint at that x (week) position.

So for example, we know that the data says at an x (week) value of 10, the y (tape use) value is 25. Let's suppose we had chosen a value of 2 for c and 3 for m, and we want to see how well the line produced by y=c+mx or y=2+3x 'fits' the data at x (week) = 10.

We calculate the value of y=2+3x for x=10, and we find how far away this is from the real data value of week 10 (x) = 25 tapes used (y). We get

Our c and m values give us;
y=2+3x
y=2+(3*10)
y=32

Our real data tells us that y is 25. So, our 'prediction' using c=2 and m=3 is wrong by
32-25 = 7
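
In code that single check is just a couple of lines (a throwaway sketch, using the guessed values above):

c, m = 2, 3
x, y = 10, 25          # week 10 of our real data: 25 tapes used
predicted = c + m * x  # 32
error = predicted - y  # 7 - our guess is wrong by 7 tapes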


If we repeat that process for all of the real data points we have, and add up all the differences, we will get the total distance from the line formed by y=2+3x to all of our data points.

Or will we?
Some of the values may result in negative distances! If our line predicted a value of 20 but the real data point had a value of y=25, then we would get a distance of 20-25 = -5! Obviously negative distances don't make sense here, so we use a mathematical trick to ensure the value is always positive: we square the result for each datapoint. This ensures the distances are all positive. (There are some other reasons this is a good thing, but we won't go into that right now. Just trust me and be happy that we have decided to square the resulting distances to make sure they are all positive.)

So, we have figured out a way to calculate how badly our line fits the data for a given x value. We will call this the 'Cost function', and we can write it like this; (I'm going to use pseudo-style code, rather than real mathematical symbols. This might be confusing if you come from a mathematical background, but should be fine if you are from a coding background. Deal with it.)

Cost:
j = ((c+(m*x))-y)^2


We'll use j because we already used c. y is our real data value for any given x. (tapes used for any given week, in our example)

We can calculate the total cost for any line generated by a guessed c,m pair by adding up all the costs for all the real data points we have. So, the full cost (we'll call it J) of a given guess of our c,m values can be written as;

J = sum( ((c+(m*x))-y)^2 )
for all the x and y values in our real set of data.

This is known as the square cost function.

Ok. Now we know how to figure out how bad of a guess a particular c,m pair is. Our goal is to minimise that cost, and when we have the smallest cost, we know we have the line represented by c,m that best fits our data.

So now what? Do we just guess loads of values for c & m, calculate the costs and then pick the guess with the smallest cost?

Nope. This is where gradient descent comes in (finally!).

Now.. one more thing. The actual formula for gradient descent in the case of our square cost function (it is different for different functions) requires working out some derivatives. Calculus is beyond the scope of this blog (and me, mostly ;) ) so trust me on this next step.
Because of the calculus we don't want to get into right now, we are going to add a factor of 1/(2*n) to the front of our square cost function J. Trust me, a mathematician told me it's fine. (She whispered something about the derivative of x^2 in my ear, and when I woke up she was gone.) This makes our cost function J look like this;

J = (1/(2*n)) * sum( ((c+(m*x))-y)^2 )
for all the x and y values in our real set of data.


You might notice an 'n' has crept in there. 'n' is the total number of datapoints we have in our real data set. (14 in our example, for the 14 weeks)
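
If it helps to see that as code rather than pseudo-maths, here's a small sketch of the full cost calculation (the function and variable names are just my own choices):

def cost(c, m, xs, ys):
    # J = (1/(2*n)) * sum of squared distances between the line and the data
    n = len(xs)
    return (1.0 / (2 * n)) * sum(((c + m * x) - y) ** 2 for x, y in zip(xs, ys))

weeks = list(range(1, 15))
tapes = [13, 15, 21, 18, 21, 22, 21, 24, 22, 25, 26, 28, 25, 29]
print(cost(2, 3, weeks, tapes))   # the cost of the guess c=2, m=3 against our data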

Now, back to gradient descent. It turns out, that if we draw a graph of the full cost J for all possible c,m pairs, we get a graph that looks sort of like this;


The line J represents the full costs for all of the possible c,m values we could choose, when we compare the c,m values with our real data set. What we are interested in is the values of c,m where the line J is at its lowest value, right at the bottom of that curve.

This is the point that Gradient Descent will help us find.

Now, I'm not actually going to go into the maths of gradient descent. We don't need to, to get an intuitive feel for what it does, and besides it involves more calculus.

So what does it do?
Well. In the case of gradient descent for simple linear regression (that's what we're doing - fitting a line to the data) we start by giving Gradient Descent a guessed pair of c,m values. Gradient descent takes them and calculates the cost, and figures out whereabouts on the curve above that cost falls. Then, it looks at the slope of the curve at that point. If the slope is positive (i.e., it goes from bottom left to top right) it produces a new guess for c,m that is a little farther left along the curve. If the slope is negative (it goes from top left to bottom right) it produces a new guess for c,m a little to the right. This picture hopefully helps illustrate that:



Simple, right?.. Gradient descent will take our guess, figure out how much it costs, and then produce a new guess that will nudge the cost value along the curve towards the curve's minimum.

Then, you take the new guess produced by gradient descent and feed it right back into the gradient descent formula. You do this iteratively, and eventually it will produce a guess that is very close to the minimum point of the curve. At this point, the slope of the curve is neither negative nor positive, and gradient descent will produce a guess that is the same as the number we plugged back into it. Et voilà, we have discovered the c,m values that minimise our cost function. We have found the values that will produce a best fit for our data.
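
If you want to peek ahead at what that loop looks like in code, here's a minimal sketch. It uses the standard update rules for our square cost function (the derivatives I'm hand-waving about above), so treat it as an illustration rather than gospel:

def gradient_descent(xs, ys, c=0.0, m=0.0, learning_rate=0.01, steps=1000):
    n = len(xs)
    for _ in range(steps):
        # how far off the current guess is at every real data point
        errors = [(c + m * x) - y for x, y in zip(xs, ys)]
        # the slope of J with respect to c and m (the calculus we skipped)
        grad_c = sum(errors) / n
        grad_m = sum(e * x for e, x in zip(errors, xs)) / n
        # nudge the guess 'downhill' along the curve
        c = c - learning_rate * grad_c
        m = m - learning_rate * grad_m
    return c, m

(The learning_rate argument is the 'L' I'm about to talk about below.)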

Hooray! Let's go home!...
Not so fast!

Remember, we are going through all of this to try and figure out why our neural network didn't appear to be 'learning' when we fed it audio data - the value of the 'cost' for the network was getting bigger! not smaller!

There's one thing we haven't talked about yet: the Learning Rate. Inside the gradient descent formula there is a term called the learning rate. We will call it L for now. Don't panic, there isn't any more maths coming, it's just easier than typing Learning Rate all the time.

L is a number in the gradient descent formula that multiplies the size of the 'step' taken along the curve when gradient descent makes a new guess for the c,m values. Now, if L is too big, gradient descent might produce a new guess that is much too far along the curve, and it ends up missing the minimum point completely! We carry on feeding the guesses back into gradient descent, and it carries on producing new guesses that overshoot the minimum. If things are really bad, it will even produce guesses that start to climb right back up the curve - ignoring the minimum point altogether. Its guesses just get worse and worse and bigger and bigger until computers fall over and the sky falls down, or something. It doesn't matter. What matters is, it's broken and it isn't going to work.
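
You can see that overshooting happen without any of the line-fitting machinery at all. Here's a tiny made-up example that uses gradient descent to find the minimum of the curve y = x^2 (whose slope at any point x is 2x). With a sensible L the guesses settle down towards the minimum at 0; with L too big every new guess lands further up the other side of the curve:

def descend(x, L, steps=50):
    for _ in range(steps):
        x = x - L * (2 * x)   # slope of x^2 at x is 2x
    return x

print(descend(5.0, 0.1))   # close to 0 - converged on the minimum
print(descend(5.0, 1.1))   # tens of thousands - every step overshoots and it diverges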

And intuitively, that is essentially what was happening when our char predicting RNN was having so much trouble trying to learn our audio data. Our 'Cost' was not able to settle on a minimum, and kept hopping about all over the curve. In reality, I'm pretty sure it's much more complex than this, but it essentially captures the problem. Probably.

I say probably because.. even when we made the learning rate very very small indeed, it still had trouble, although the cost would achieve lower values than it did before. So I'm sure I'm still missing something - probably something to do with trying to make the character predicting RNN work with audio data by ramming an alphabet of 65000 letters into its inputs and hoping for the best.

Setting aside our attempts to abuse a character predicting RNN to generate audio though, this general intuition regarding gradient descent isn't a bad approximation of what happens when the cost function for a particular neural network starts to get bigger and bigger, or just fails to settle down to a minimum. So that's something!

Next Time...

Next time, we will use all of this to actually fit a straight line to our tape usage data, and we can see some real examples of Gradient Descent and the learning rate in action! - If you're curious and know a bit of python, I have an implementation of simple linear regression by gradient descent here: Simple Linear regression by Gradient Descent
Have fun!

Machine Learning Adventures 1.5 - Sidetracked by a shiny thing, and time to stop playing and figure out what's going on..

Following on from my previous post, my aim was to start digging into torch and looking at building some basic RNN's to get a better understanding of what's going on.


And then something shiny turned up and I got distracted - after showing a friend the QuickFacts output he piped up 'What happens if we do it with audio?'


Then followed a couple of late nights hacking together some way of transforming audio data into something we could feed a character-predicting RNN, and then feeding and tweaking the RNN training process just to see what would happen, and the realization that to get into this properly means more than just playing about with RNN's for fun and giggles.


Before I dig into that though, I want to go into a little more detail as to what's actually going on when we set up these toy datasets and feed them into the pre-built RNN we are using for character prediction. (Disclaimer: this is all from my current understanding, and I freely admit it might not all be quite right. Hopefully this blog will be all about eventually getting it spot on though, so bear with me..)


What is a Recurrent Neural Network?
My current explanation would be, 'It's like a normal Neural Network, but each of the neurons is capable of maintaining an element of memory, so each iteration of data flowing through the network can be influenced by the history of data that has previously flowed through the network' - Whereas an expert might say

'A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed cycle. This creates an internal state of the network which allows it to exhibit dynamic temporal behavior.'

In practice, this means that RNN's can be very good when working with structured sequences of data - like text. For example, when fed with programming code, an RNN is capable of 'understanding' that it should close a bracket it opened 100 characters ago, or de-dent an indent for a block of code. When fed with English text it is capable of producing coherent sequences of letters as words - this is the 'dynamic temporal behaviour'.


So what does a Neural Network do when it is learning?
Essentially, the network is given input and produces output; this output is compared to the correct output, and the network is adjusted in an attempt to become more accurate the next time it produces output. The process of adjusting the network's internals is called 'Back propagation'. For each set of inputs fed into the network, it produces some output. This output is then compared to the correct output, and the difference is used to work backwards through the network, adjusting its internal weights toward a configuration that will produce a more accurate output next time. This process is repeated many times until, hopefully, the network's outputs converge to the correct outputs. The function used to measure the network's accuracy based on a given output is called the Loss function, and the function used to compute the changes made to each neuron's weights is called Gradient Descent. The application of Gradient Descent back through the network is called Back Propagation.
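
As rough pseudocode, the whole learning loop has this shape (this is just the shape of the process - it isn't the API of torch or any other particular framework, and the names are stand-ins):

# pseudocode - 'network', 'loss_function' etc. are placeholders, not a real library
for epoch in range(number_of_epochs):
    for inputs, correct_outputs in training_data:
        outputs = network.forward(inputs)                 # produce some output
        loss = loss_function(outputs, correct_outputs)    # measure how wrong it was
        gradients = network.backward(loss)                # back propagation
        network.update_weights(gradients, learning_rate)  # a gradient descent step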


So, RNN's have an internal memory, and are good at working with sequences of data whose elements have some relationship to the elements that came before them. This seemed to us to be ideal when applied to audio data. Audio data is inherently made up of cycles over time - waveforms of different frequencies piled up on top of each other to produce the signal we actually hear. So we thought we would give it a try..


We took a naive approach, and decided to look at the audio data in the time domain as a sequence of raw 16-bit samples, where an audio signal in the range -1 to 1 is mapped to the range 0-65535. This meant we could map each of the 0-65535 possible values of a sample point to a Unicode code point (written out as UTF-8), enabling us to produce a single unique symbol - AKA 'a letter' - for each value a sample could take, which we could then feed into the character predicting RNN.
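
The mapping itself is tiny. This isn't the exact script we hacked together, but the idea was along these lines (assuming the samples have already been scaled to integers in the 0-65535 range):

def samples_to_text(samples):
    # each 16-bit sample value becomes one Unicode 'letter'
    return ''.join(chr(s) for s in samples)

def text_to_samples(text):
    # and back again: each character maps back to its sample value
    return [ord(ch) for ch in text]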

We recorded some human speech and exported the data from Plogue (audio / DSP software that lets you work at a very low level graphically to build up DSP networks from fundamental components) as raw integer sample values in the range 0-65535.

We set about hacking together a couple of Python scripts to go from int16 to UTF-8 and back again.

There was a problem; within Unicode there is a range of code points reserved for surrogate pairs - pairs of code units that UTF-16 combines to produce a single character - and these can't be encoded as valid UTF-8 on their own. This was causing problems for the pre-process step required by the char-level RNN: when it hit code points in the surrogate range it would fail. We got around this by 'cleaning' the generated data and forcing all the surrogate-range code points into a different range. This would mangle the audio data being represented a little, but not enough to cause any real issues. (It would be the equivalent of adding a couple of clicks and pops to the original audio signal.)
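
The 'cleaning' was basically a pass over the data that shoves anything falling in the surrogate range somewhere safe before writing the file out. Something along these lines (this sketch just clamps them to the value below the range - not exactly what our script did, but it shows the idea):

SURROGATE_LO, SURROGATE_HI = 0xD800, 0xDFFF   # the code points UTF-8 won't encode

def clean(samples):
    # nudging these values adds the odd click or pop to the audio,
    # but keeps the preprocessing step happy
    return [SURROGATE_LO - 1 if SURROGATE_LO <= s <= SURROGATE_HI else s
            for s in samples]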

So, with the data now 'cleaned' we repeated the training and sampling process used for training against text-based data.

We did get results. Nothing approaching human speech though - the output was essentially very, very noisy; we likened it to an uncovered mic in a hurricane. Now, this wasn't a total failure: the audio did have some underlying structure (it wasn't just noise) that you could imagine was the beginnings of the RNN generating some kind of speech-like structures, but really that could just be audio pareidolia.

We tried tweaking what parameters we could find, but we just seemed to be making it worse. As we did so though, we began to form some idea of what was going on. In particular there was a number we could adjust called the 'Learning rate', and a way to adjust the learning rate over time - we could make it decrease by a given factor every N 'Epochs' of training. We seemed to be getting better results by making the learning rate smaller, though with no apparent consistency. After a few hours of tweaking and staring at columns of numbers creeping up the screen, we came to the conclusion that we didn't really understand what was going on.

Which, we had to admit, beyond some fuzzy ideas gleaned from skimming the results of Google searches.. we didn't. We had some idea that essentially we were jiggling the weights of the network about in some way related to the learning rate, and maybe the higher the rate, the more they were jiggling. And we were trying to get the weights to converge on a 'correct' configuration. Maybe, we thought, if the learning rate was too high the weights were jiggling too much, and might jiggle back out of any patterns they found themselves in. But then.. if we didn't jiggle them enough, maybe they would never find the convergence we were hoping for...

It turns out, we were not so far from the truth. But also a long way from it :)



At this point, we had hit the limits of 'playing with RNNs for fun' and needed to start looking properly at what was going on in an attempt to understand how we could get better results from the RNN.

While training, the training process outputs the current 'loss' of the network. It turns out that as the network trains, we hope the network loss gets smaller - this would indicate a successful convergence of the network weights to a 'correct' pattern. But for our attempts to work with audio data, this wasn't happening.

This graph illustrates what we were seeing: black is the ideal situation, green shows how the loss moved when working with English text, and red shows what we were seeing with our audio-to-UTF data;


But why? Surely our sequence of symbols is conceptually the same as text? OK, the underlying structure was different, but for all the network knows, our string of symbols could be valid text in some strange language somewhere.
How is the loss function being calculated? Just what is it really representing? And why is it going so horribly wrong with our data?

It turns out to begin to understand this, we need to understand what the 'Loss' actually represents, how something called the 'Learning Rate' affects it, and just what 'Gradient Descent' is and how it applies to neural network learning.

And that will (all?) be in the next blog..

Saturday, September 3, 2016

Machine Learning Adventure 1 - first steps, using a char predicting LSTM RNN with torch-rnn

So, recently I've had the opportunity to start looking into some of the cognitive services being offered by IBM / Google / MS etc., and this has rekindled my interest in Neural Nets and what is being called 'Deep Learning'. I first heard about Neural Nets as a teenager around the late 80's / early 90's, and while in those days the concept of a neural net was super cool, the results were not always great and the mathematical complexity was beyond me. Things have improved a lot since then. Over the last 5 years NN's have made great strides in accuracy. Toolsets, frameworks and services have sprung up to make working with NN's and the services using them much easier to get to grips with from a developer point of view. The maths though, well, as far as I am concerned that's still over my head, but it is one of the things I hope this little adventure will help with ;)

So I've decided to go deeper into NN's and how they work in deep learning applications. I'm starting with a vague understanding of some of the simpler kinds of NN, some vague idea of terms like back-propagation, and a fair amount of experience as a developer.

Hunting around in general for information on the newfangled 'Recurrent Neural Networks', I came across this post; http://karpathy.github.io/2015/05/21/rnn-effectiveness/ (That's a good post to read to get an overview of what an RNN is compared to a 'regular' NN) and this docker container https://github.com/crisbal/docker-torch-rnn and thought I would have a stab at building my-first-rnn / a char level rnn that you can feed text, train, and then get it to spit out examples of text in a similar style. For my first steps though, I will be happy with getting the pre-built RNN in torch-rnn up and running using the GPU to train against some text.

This blog is going to be more about the technical steps I took; I would advise you to go read the pages linked above and other info on exactly what's occurring in terms of the RNN - I'm not quite qualified enough to be giving decent explanations of that yet! - take most of what I say with a pinch of salt. I'm learning.

So, first things first, I needed to setup a machine to use for all this. I built a beefy machine a while ago when I decided to setup a local hadoop cluster in VMWare redhat guests, because hadoop was all the rage at the time and I wanted to understand it. That worked, but it proved quite difficult to phrase the problem case I had in the map-reduce paradigm (I was playing with Genetic Algorithms for playing Go and wanted to use hadoop to evaluate massive populations in parallel; it turned out easier to just run as many non-parallel processes as I could squeeze onto the CPU cores + RAM I had available and merge populations periodically - but the whole hadoop setup was fun, frustrating and enlightening regardless).

My desktop was fitted with the following:

* i7-970 @ 3.2ghz / 6-Core, watercooled

* 18GB RAM

* 512GB SSD for OS

* 4TB striped volume set (risky, but I back up the important data for when one of them eventually fails). I did this to get into a similar ballpark to SSD read/write speeds without paying for 4TB of SSD.

* GeForce 780ti - this is a great gaming GFX card! and has aged fantastically. It also turns out to be great for starting out in deep learning, with great memory bandwidth (336GB/Sec - bandwidth is one of the major indications of performance in deep learning, see http://timdettmers.com/2014/08/14/which-gpu-for-deep-learning/ - but falls short a little due to its architecture, and is restricted by only having 3GB RAM, but that's OK for starting out)

For step one of this adventure, I wanted to get everything setup and working, and get the pre-built RNN in the torch-rnn container spitting out some text in the style of something other than shakespeare. It's important to understand that it really is 'text in the style of' - there is no understanding of the words and text being generated, so while it looks ok from a distance, actually reading it quickly highlights the fact the actual content is gibberish.

I gave the machine a fresh install of Ubuntu 16.04 LTS, and then installed;

* general apt-get update & upgrade to be all up to date with the basic system.

* Proprietary NVIDIA drivers (required for NVIDIA CUDA to work, this enables GPU acceleration when running the RNN) / with help from http://www.howtogeek.com/242045/how-to-get-the-latest-nvidia-amd-or-intel-graphics-drivers-on-ubuntu/

* mesa-utils - this was just to get glxgears to check NVIDIA drivers were working.

* NVIDIA CUDA drivers / with help from https://devtalk.nvidia.com/default/topic/926383/testing-ubuntu-16-04-for-cuda-development-awesome-integration-/ - tip, just try and install all the stuff in the first table on the page. It all worked first time with apt-get for me. (adjust for various newer versions if you are reading this in the future..)

* Docker engine - just followed the steps for installing on ubuntu from the docker site. Worked first time. (Almost: some packages in the documentation for docker are given with a -lts-trusty suffix. I could not install/find these until I removed the -lts-trusty suffix, so look out for that (near the end))

* nvidia-docker - from the github https://github.com/NVIDIA/nvidia-docker/wiki/Installation

I was pleasantly surprised how painless doing all of that was. The worst part was trying to use Ubuntu's partitioning tool to setup disk + bootloader for the install. I must have been missing something but I could not create more than one partition on the disk for install+swap, so ended up with an Ubuntu install with 0 swap space. I have 18GB RAM so I'll probably be ok for a while but I can see some disk shenanigans coming up in the future to add on some swap space and split up the root partition some.

I decided to use docker for a couple of reasons; it's also something I am learning at the moment, and I had seen torch-rnn, a container setup that makes it easy to play with a char-level RNN for generating text. I could go ahead without docker and setup torch and dependencies, but for ease-of-use I decided to just go with containers at first. I wanted to concentrate on doing RNN things, rather than continuing to install even more stuff + learning torch & LUA from scratch. The container gives a nice example set of LUA scripts to poke into and learn from also.



We're almost done! Everything installed fine, and using docker + torch-rnn made actually training an RNN for text generation nice and easy.

I grabbed the torch-rnn container and started it with nvidia-docker, then started a bash shell in the container.

Now, I needed some text to train with. Because of my day job, I knew of this; http://people.bu.edu/rbs/ADSM.QuickFacts This is an amazing body of text built up of useful IBM Spectrum Protect (prev. Tivoli Storage Manager / ADSM) snippets of information. I wanted to train my network to spit out an imitation Spectrum Protect QuickFacts in the style of Richard Sims. It's about 5MB of text.

The quickfacts also have an interesting property I was wondering whether the RNN would learn; while it's a plain text 80-column file, it is formatted (mostly) as two 40-char columns, where the left column is headings, and the right column is body text. This means each 80-char line usually consists of {40 spaces}{some text} and the body text can be 1 to 10's of lines long. Because of the way LSTM RNN's work, they have a memory that theoretically limits the size (number of chars) of structures they can learn, and the structuring of the quickfacts involves elements much longer than this memory (or rather, too long to train with long enough memories on my setup) - so it will be interesting to see if the RNN learns the structure in spite of this. There's a good paper here analyzing RNN's, how they perform and the errors they can produce - https://arxiv.org/pdf/1506.02078.pdf - The maths is over my head, but the text + pictures I can (mostly) follow in general.

So, once I had copied the quickfacts text file into the container, I set about applying the RNN training to it;

First the input data is preprocessed,

root@1540b0057079:~/torch-rnn# python preprocess.py --input_file data/qfacts.txt --output_h5 data/qfacts.h5 --output_json qfacts.json

Total vocabulary size: 97
Total tokens in file: 4900717
  Training size: 3920575
  Val size: 490071
  Test size: 490071
Using dtype  
Then the training occurs,

root@1540b0057079:~/torch-rnn# th train.lua -input_h5 data/qfacts.h5 -input_json data/qfacts.json -seq_length 320 -rnn_size 256 -max_epochs 350

Running with CUDA on GPU 0 
Epoch 1.00 / 350, i = 1 / 85750, loss = 4.591961 
Epoch 1.01 / 350, i = 2 / 85750, loss = 3.648895 
Epoch 1.01 / 350, i = 3 / 85750, loss = 2.825536 
Epoch 1.02 / 350, i = 4 / 85750, loss = 2.399451 
Epoch 1.02 / 350, i = 5 / 85750, loss = 2.337098 
Epoch 1.02 / 350, i = 6 / 85750, loss = 2.201095 
Epoch 1.03 / 350, i = 7 / 85750, loss = 2.175408 
Epoch 1.03 / 350, i = 8 / 85750, loss = 2.076128 
Epoch 1.04 / 350, i = 9 / 85750, loss = 2.076714 


I changed three parameters from the defaults given to train the tiny-shakespeare dataset that comes with the container.

-seq_length from 50 to 320; my current understanding is this relates to the 'memory length' the RNN should exhibit. I changed it to 320 so it covers 4 lines of 80 chars.

-rnn_size from 128 to 256; advice I read suggested that with a larger dataset for training, this change should improve generating performance. I haven't yet played enough with these or other parameters to really see how they affect the output.

-max_epochs to 350; this essentially determines how long the model will train for. You can get interesting results with fewer epochs, but training for longer improves things, though there appears to be a tradeoff where going much longer yields smaller improvements.

I did a little bit of quick testing too, with gpu enabled (default), and gpu disabled ( -gpu -1 ) to get a quick idea of the performance difference. Using the GPU seemed to give a speed improvement of ~15x , though I'm not too sure on the accuracy of that because generally speedups are quoted as being in the 5x - 10x range. This speedup let me choose 350 Epochs and have it trained in 8 hours. Using 50 Epochs on a CPU took around 16 hours. In the end, the results were not too different in quality, so I think there are other parameter tweaks I can make (related to gradients and some other stuff I'm not 100% sure about yet) to improve accuracy.



So, 8 hours later I had a trained LSTM RNN that should be able to imitate Richard Sims's QuickFacts.



Here's what I got - I'm going to put several examples, each from different stages through training.

First, here's an example from the actual quickfacts document, this is what we are aiming to imitate;
Immediate Client Actions utility        After using, stop and restart the
                                        scheduler service on the client, so it
                                        can query the server to find out it's
                                        next schedule, which in this case would
                                        the immediate action you created.
                                        Otherwise you will need to wait till the
                                        client checks for its next schedule on
                                        its own. Also affected by the server
                                        'Set RANDomize' command.
"Imperfect collocation"                 A phrase that might be used to describe
                                        the condition that occurs when
                                        collocation is enabled, but there are
                                        insufficient scratch tapes to maintain
                                        full separation of data, such that data
                                        which otherwise would be kept separate
                                        has to be mingled within remaining
                                        volume space.
                                        See:  Collocation
Import                                  To import into a TSM server the
                                        definitions and/or data from another
                                        server where an Export had been done.


The training produces several checkpoint files you can use to sample the network's output at different points in the learning process. The general form of the command to sample the RNN for output is:

th sample.lua -checkpoint {path to checkpoint file} -length {number of characters}
For the example outputs below I chose a length of 2000 characters.

Here's the sampled RNN output;


After 1000 iterations of 85750
Formatting is pretty loose. Plenty of bad words.
?_{WA        See also: Otil| -
                                           lease tape of MEXif ANDEXvia'
Set space stubstobackups                  ckpection mesmigr via through
                                        elarage suble an  <Unix client -racle).
Set SEROCTib=Persages                         Data cant spen valuable.
                                        Due a logical that roing are more
                                        when the if about access if a emrege
                                       sessions writing of 1., begin 3 - the
                                          only - which check thif justy in put.
                                         [Data
                                        Yessward Helingshom Loampp"
                                        storage.
Client server a Cindog files,         /usr/lpp//bloctif, SHow (KH2)
                                                'Query lsg or ax 4 STGPool Statu>
                                          Ref:", the format of a Query Query FIle
                                        privation data will including in seisionsing
                                         storage pools, and abood thats it thre
                                         library and dsm.- which missible as
                                          in the database in a has on each options
                                        if does not after currogs run is
                                         search volume and you doing in anaged
                                        with the node -- All tables* use the
                                         filespile (with force from when the
                                        waiting error, command: there is
                                         host particulard or diskupdate all
                                        libraries "Passic client (stalled.
                                        Storage HODS, Poll averded, the Table


After 2000 iterations
Already the formatting is tightening up, more words are actually words, there's some remembering to close open brackets.
TSM 4:2 (
                                     The tape or cells and causes hished
                                        processing ended in the Maciffites, or
                                        "ACRESTSIcatus, OVERetrieve in that
                                        logging "TCP.) storage logged).
                                        Had locks with TSM client seft. <The
                                        performance of tape. Changing examine,
                                        allows the security in *note has a givaney
                                        tracknocal offsite comes, and the TSM
                                        into a loss numerical perhaper.
dsmrl___DIFIces overb (escond (in archive       4, FAUName volume for TDP
                                          WUTLE OS15304. To four/up no thrsel,
                                         Interface, it has being reduced via:
                                         volume storage and which the password,
                                         files: for example is a file can be
                                         +rune on "Nurbether Files, third signial
                                         server option.  Respeate time, checkin
                                         afters to relieve the statistic to a
                                        the backup of during runs as etclly
                                        messages, affecting zight users it is
                                          started file system may be multiple
                                         Lix FILED furth, char, increase for
                                        to a Storage Pool, up no library, as
                                        (attribute. Activity Console, using the
                                         over *SM issuacialization).
client                                  Ref: INGEDEdec server 1591 bytes).
                                        Loggose "fixed nam 

After 10000 iterations
Very good line formatting (i.e., 40 spaces then text), even more real words, not many rubbish strings, even some convincing-sounding made-up words!
7$’gi#       dsmserv -> -F sized - but, the
                                        Firch Sizk for TSM, the tape product of
                                        invocation cannot stop as then holds
                                        be confumed by Ultriumes and then full
                                        by
/mmarb/htp, 10, VROUs(s)                TSM current nodes of the file STGpool and
                                        protection of the '__
                                        IBM Technote: 1008746
/taperate>
                                       SELEckgen Password Vunus Incremental
                                        a file parametel TSM system available
                                        a kegpormance and prompt.
                                        Client system being butted to list archived
                                        wildcards, backup send upon the timeout
                                        operating system type for veriedes or
                                        is including a LAN-free backup.
HSM_CAS,                                Unix cleaning the Cleaning Cartridge
                                        made with the TSM database releases.
                                        "HL_NAME 1000') (/ff3) functial running
                                        Media for Incremental
                                        Technotes: 144729085;
                                         Programmation to specify way if to
                                         client Access Obforematuse.
                                        - If the ASTicol of node most and files
                                          go to the network paragement.
                                        Ref: ARCHIVE3 licensed supplies:
                                        Msg (- 1--     200 - funnever themserv.
                                        Ref: IBM Textrinate 8-P Dejection

After 40000 iterations
Getting better text wise, but still a way to go

6.Bx4&2W&${17V~]^@M!A0~'K;)                   TSM Extra  Tape tempty; kmove tapes
                                        of the Problem Cartridge Last Based STG is
                                        down to the server shore first
                                        message.
                                        See also: SQL
Migrate                                See ASMObject.)
                                        The backup or "Class" (124 driver".)
                                        TSM server determines concrect.
                                       Stanta contain storage pool drives and
                                        override storage pools which changes being
                                        are expan stop as "dsmserv.diret".
Common IDSHi DELete FormISTGpool        The dealing suffix database buffers in TSM
                                        home through made tape, an inverhate
                                        manual:
                                        ANS1000E 1005, (empty, MOBSIZ Estimate
Domino Collogation)                     So modes process via the server, this parts
                                        its Strugging an Incryption.
                                        ?.3.nexclient/htm/scrpt.hame is
                                        use whether server.
Redbooks with-marge                     To except the return volume from
                                        many contained from whichrestworks are
                                        of 5, so that time over the file (via
                                        Estimatent time) from a file or doing
                                        containing a next device which you mean
                                        expire supsts www.stats.
                                        'REGin LICensage'
Management, list                        Backup hours off site and failure to
                                        field in the vendor c 

After 85750 iterations
So this is the final result. Many good words, and almost sentences. Still some odd spelling and some made-up words.
\^?ypult. NOT                          See ... Change also installed by the
                                        devices in ontinitorify the DEFineed Introg
                                        of 2087, NTK>& and guivel backup series in
                                        another has a liber updating on the
                                        address (info volumes,
                                        GPAUSEDSEOCE..* facilities, and as
                                        employed by FILE=Client".
                                        Beware away does not recover mounts in
                                        a line.  Exist systems to be full) started.
                                        Setting in this force: TCPPort
                                        TSM Backup Protection manual:
                                        'Set SERVELernaling User
Common System Status, larger           See: AIX
Policy Space Manager, (DB)           Allowed, so have one compressed among
                                        serial Read into only attached to command
                                        order: example if that the client option
                                        which is metadole.
TCPDirge                                See: Managing the Policy NodeName
Copy                                   TSM 4.2 and the status show deweck. A
                                        considerations that were a firewall
                                        assigned on the hap been done stored be
                                        library 999, ERR, DIRACTIVATE.
                                        (Backed-up DOMain table dsmenix, and
                                        Windows COMMTimeout  Syntax: Format;
                                        CANcel SETWWCEMSI, LOGBFILELOCAPE
                                        Activity Sessions, system, greater
                                        Expire Drapacions Domination.  This





And pasting that text into my editor, I came across a funny way to measure the performance of the RNN: the fewer squiggly red lines indicating bad words and spelling mistakes, the better the model is performing ;p

It's interesting to see that the RNN is remembering structure way outside of the given 320 sequence length, and to see how it moves closer and closer to the correct formatting over time. It still ends up generating nonsense words, but I find it interesting that they often feel like they should be real words!

So there we go. That was adventure number 1. Along with general reading and understanding, for the next adventure I want to look into actually creating my own RNN in torch + LUA to try achieving a similar result, and see if different RNN configurations do better or worse. Or if that line of questioning even makes sense... !



Have fun!