Saturday, September 3, 2016

Machine Learning Adventure 1 - first steps, using a char-predicting LSTM RNN with torch-rnn

So, recently I've had the opportunity to start looking into some of the cognitive services being offered by IBM / Google / MS etc., and this has rekindled my interest in neural nets and what is being called 'Deep Learning'. I first heard about neural nets as a teenager around the late 80s / early 90s, and while in those days the concept of a neural net was super cool, the results were not always great and the mathematical complexity was beyond me. Things have improved a lot since then. Over the last 5 years NNs have made great strides in accuracy, and toolsets, frameworks and services have sprung up that make working with NNs much easier to get to grips with from a developer's point of view. The maths, though - as far as I am concerned that's still over my head, but it's one of the things I hope this little adventure will help with ;)

So I've decided to go deeper into NNs and how they work in deep learning applications. I'm starting with a vague understanding of some of the simpler kinds of NN, a vague idea of terms like back-propagation, and a fair amount of experience as a developer.

Hunting around for information on the newfangled 'Recurrent Neural Networks', I came across this post: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ (that's a good one to read for an overview of what an RNN is compared to a 'regular' NN) and this Docker container: https://github.com/crisbal/docker-torch-rnn, and thought I would have a stab at building my-first-rnn: a char-level RNN that you can feed text, train, and then get to spit out examples of text in a similar style. For my first steps though, I will be happy with getting the pre-built RNN in torch-rnn up and running, using the GPU to train against some text.

This blog is going to be more about the technical steps I took; I would advise you to go read the pages linked above and other info on exactly what's occurring in terms of the RNN - I'm not quite qualified enough to be giving decent explanations of that yet! - so take most of what I say with a pinch of salt. I'm learning.

So, first things first, I needed to set up a machine to use for all this. I built a beefy machine a while ago when I decided to set up a local Hadoop cluster in VMware Red Hat guests, because Hadoop was all the rage at the time and I wanted to understand it. That worked, but it proved quite difficult to phrase the problem I had in the map-reduce paradigm. (I was playing with genetic algorithms for playing Go and wanted to use Hadoop to evaluate massive populations in parallel; it turned out easier to just run as many non-parallel processes as I could squeeze onto the CPU cores + RAM I had available and merge populations periodically - but the whole Hadoop setup was fun, frustrating and enlightening regardless.)

My desktop was fitted with the following:

* i7-970 @ 3.2GHz / 6-core, water-cooled

* 18GB RAM

* 512GB SSD for OS

* 4TB striped volume set (risky, but I back up the important data for when one of the drives eventually fails) - I did this to get into a similar ballpark to SSD read/write speeds without paying for 4TB of SSD.

* GeForce GTX 780 Ti - this is a great gaming GFX card and it has aged fantastically. It also turns out to be great for starting out in deep learning, with great memory bandwidth (336 GB/s - bandwidth is one of the major indicators of deep learning performance, see http://timdettmers.com/2014/08/14/which-gpu-for-deep-learning/), though it falls a little short due to its architecture and is restricted by only having 3GB of RAM. That's OK for starting out though.

For step one of this adventure, I wanted to get everything set up and working, and get the pre-built RNN in the torch-rnn container spitting out some text in the style of something other than Shakespeare. It's important to understand that it really is 'text in the style of' - there is no understanding of the words and text being generated, so while it looks OK from a distance, actually reading it quickly highlights that the content is gibberish.

I gave the machine a fresh install of Ubuntu 16.04 LTS, and then installed:

* general apt-get update & upgrade, to be all up to date with the basic system (a rough sketch of the whole command sequence is included below this list).

* Proprietary NVIDIA drivers (required for NVIDIA CUDA to work; this enables GPU acceleration when running the RNN) / with help from http://www.howtogeek.com/242045/how-to-get-the-latest-nvidia-amd-or-intel-graphics-drivers-on-ubuntu/

* mesa-utils - this was just to get glxgears to check NVIDIA drivers were working.

* NVIDIA CUDA drivers / with help from https://devtalk.nvidia.com/default/topic/926383/testing-ubuntu-16-04-for-cuda-development-awesome-integration-/ - tip: just try installing all the stuff in the first table on that page. It all worked first time with apt-get for me. (Adjust for newer versions if you are reading this in the future...)

* Docker engine - just followed the steps for installing on Ubuntu from the Docker site. Worked first time. (Almost: some packages in the Docker documentation are given with an -lts-trusty suffix. I could not find/install these until I removed the -lts-trusty suffix, so look out for that (near the end).)

* nvidia-docker - from the github https://github.com/NVIDIA/nvidia-docker/wiki/Installation
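
For reference, the rough shape of what I ran is sketched below. The package names are from memory and the linked guides rather than a saved shell history, so treat them as assumptions and adjust versions to whatever is current when you read this:

# bring the base system up to date
sudo apt-get update && sudo apt-get dist-upgrade

# proprietary NVIDIA driver (nvidia-361 was the current package at the time - an assumption,
# use whatever the guides above recommend now) plus mesa-utils for glxgears
sudo apt-get install nvidia-361 mesa-utils
glxgears    # spinning gears and a sane FPS readout = driver is working

# CUDA from the Ubuntu 16.04 repos (see the devtalk link for the full package table)
sudo apt-get install nvidia-cuda-toolkit

# Docker engine and nvidia-docker were installed by following the docker.com docs
# and the nvidia-docker GitHub wiki linked above, so I won't repeat those steps here.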

I was pleasantly surprised by how painless doing all of that was. The worst part was trying to use Ubuntu's partitioning tool to set up the disk + bootloader for the install. I must have been missing something, but I could not create more than one partition on the disk for install + swap, so I ended up with an Ubuntu install with no swap space. I have 18GB RAM so I'll probably be OK for a while, but I can see some disk shenanigans coming up in the future to add some swap space and split up the root partition a bit.
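
For what it's worth, adding swap later shouldn't need a repartition at all - a swap file will do. A minimal sketch (the 16G size is just an example):

# create and enable a 16GB swap file
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# add "/swapfile none swap sw 0 0" to /etc/fstab to keep it across reboots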

I decided to use Docker for a couple of reasons: it's something else I am learning at the moment, and I had seen torch-rnn, a container setup that makes it easy to play with a char-level RNN for generating text. I could have gone ahead without Docker and set up Torch and its dependencies myself, but for ease of use I decided to just go with containers at first - I wanted to concentrate on doing RNN things, rather than installing even more stuff and learning Torch & Lua from scratch. The container also gives a nice example set of Lua scripts to poke into and learn from.



That's the setup almost done! Everything installed fine, and using Docker + torch-rnn made actually training an RNN for text generation nice and easy.

I grabbed the torch-rnn container and started it with nvidia-docker, then started a bash shell in the container.
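
From memory the command was along these lines; the image tag is an assumption on my part, so check the repo's README for the current GPU-enabled tag:

# start the torch-rnn image with GPU access and drop straight into a shell inside it
nvidia-docker run -ti crisbal/torch-rnn:cuda7.5 bash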

Now, I needed some text to train with. Because of my day job, I knew of this: http://people.bu.edu/rbs/ADSM.QuickFacts - an amazing body of text built up of useful IBM Spectrum Protect (previously Tivoli Storage Manager / ADSM) snippets of information. I wanted to train my network to spit out an imitation Spectrum Protect QuickFacts in the style of Richard Sims. It's about 5MB of text.
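
Getting the file into the running container is a one-liner with docker cp from the host. Something like this - the container ID and paths match the shell prompts you'll see below, but treat it as a sketch rather than exactly what I typed:

# from the host: copy the downloaded QuickFacts text into the container's torch-rnn data directory
docker cp qfacts.txt 1540b0057079:/root/torch-rnn/data/qfacts.txt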

The QuickFacts also have an interesting property I was wondering whether the RNN would learn: while it's a plain-text 80-column file, it is formatted (mostly) as two 40-char columns, where the left column is headings and the right column is body text. This means each 80-char line usually consists of {40 spaces}{some text}, and a body-text entry can run from one line to tens of lines. Because of the way LSTM RNNs work, they have a memory that theoretically limits the size (number of chars) of the structures they can learn, and the structure of the QuickFacts involves elements much longer than this memory (or rather, too long to train with long enough memories on my setup) - so it will be interesting to see if the RNN learns the structure in spite of this. There's a good paper here analysing RNNs, how they perform and the errors they can produce - https://arxiv.org/pdf/1506.02078.pdf - the maths is over my head, but the text + pictures I can (mostly) follow.
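
As a quick sanity check of that formatting claim, you can count how many lines start with the 40-space indent straight from the shell (using the path the file was copied to above):

# lines that begin with at least 40 spaces, i.e. body-text-only lines
grep -c '^ \{40\}' data/qfacts.txt
# total lines, for comparison
wc -l < data/qfacts.txt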

So, once I had copied the QuickFacts text file into the container, I set about applying the RNN training to it.

First, the input data is preprocessed:

root@1540b0057079:~/torch-rnn# python preprocess.py --input_file data/qfacts.txt --output_h5 data/qfacts.h5 --output_json data/qfacts.json

Total vocabulary size: 97
Total tokens in file: 4900717
  Training size: 3920575
  Val size: 490071
  Test size: 490071
Using dtype  

Then the training occurs:

root@1540b0057079:~/torch-rnn# th train.lua -input_h5 data/qfacts.h5 -input_json data/qfacts.json -seq_length 320 -rnn_size 256 -max_epochs 350

Running with CUDA on GPU 0 
Epoch 1.00 / 350, i = 1 / 85750, loss = 4.591961 
Epoch 1.01 / 350, i = 2 / 85750, loss = 3.648895 
Epoch 1.01 / 350, i = 3 / 85750, loss = 2.825536 
Epoch 1.02 / 350, i = 4 / 85750, loss = 2.399451 
Epoch 1.02 / 350, i = 5 / 85750, loss = 2.337098 
Epoch 1.02 / 350, i = 6 / 85750, loss = 2.201095 
Epoch 1.03 / 350, i = 7 / 85750, loss = 2.175408 
Epoch 1.03 / 350, i = 8 / 85750, loss = 2.076128 
Epoch 1.04 / 350, i = 9 / 85750, loss = 2.076714 


I changed three parameters from the defaults given to train the tiny-shakespeare dataset that comes with the container.

-seq_length from 50 to 320. My current understanding is that this relates to the 'memory length' the RNN should exhibit; I changed it to 320 so that it covers 4 lines of 80 chars.

-rnn_size from 128 to 256. I'd read advice that, with a larger training dataset, this change should improve the quality of the generated text. I haven't yet played enough with these or other parameters to really see how they affect the output.

-max_epochs to 350. This essentially determines how long the model will train for. You can get interesting results with fewer epochs, but training for longer improves things, though there appears to be a point of diminishing returns where training much longer yields only small improvements.

I did a little bit of quick testing too, with the GPU enabled (the default) and with it disabled (-gpu -1), to get a quick idea of the performance difference. Using the GPU seemed to give a speed improvement of ~15x, though I'm not too sure of the accuracy of that, because speedups are generally quoted as being in the 5x - 10x range. This speedup let me choose 350 epochs and have it trained in 8 hours; 50 epochs on the CPU took around 16 hours. In the end, the results were not too different in quality, so I think there are other parameter tweaks I can make (related to gradients and some other stuff I'm not 100% sure about yet) to improve accuracy.
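
For the CPU timing test the only change was the -gpu flag; the run looked roughly like this (same data files as above, shortened to the 50 epochs mentioned):

# the same training run forced onto the CPU: -gpu -1 disables CUDA
th train.lua -input_h5 data/qfacts.h5 -input_json data/qfacts.json -seq_length 320 -rnn_size 256 -max_epochs 50 -gpu -1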



So, 8 hours later I had a trained LSTM RNN that should be able to imitate Richard Sims' QuickFacts.



Here's what I got - I'm going to put several examples, each from different stages through training.

First, here's an example from the actual QuickFacts document; this is what we are aiming to imitate:
Immediate Client Actions utility        After using, stop and restart the
                                        scheduler service on the client, so it
                                        can query the server to find out it's
                                        next schedule, which in this case would
                                        the immediate action you created.
                                        Otherwise you will need to wait till the
                                        client checks for its next schedule on
                                        its own. Also affected by the server
                                        'Set RANDomize' command.
"Imperfect collocation"                 A phrase that might be used to describe
                                        the condition that occurs when
                                        collocation is enabled, but there are
                                        insufficient scratch tapes to maintain
                                        full separation of data, such that data
                                        which otherwise would be kept separate
                                        has to be mingled within remaining
                                        volume space.
                                        See:  Collocation
Import                                  To import into a TSM server the
                                        definitions and/or data from another
                                        server where an Export had been done.


The training produces several checkpoint files that you can use to sample the network's output at different points in the learning process. The general form of the command to sample the RNN for output is:

th sample.lua -checkpoint {path to checkpoint file} -length {number of characters}
For the example outputs below I chose a length of 2000 characters.
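
For instance, the 10000-iteration sample further down came from a command along these lines - the cv/checkpoint_10000.t7 path is an assumption based on torch-rnn's default checkpoint naming, so match it to whatever checkpoint files your training run actually wrote:

# sample 2000 characters from the checkpoint written at iteration 10000
th sample.lua -checkpoint cv/checkpoint_10000.t7 -length 2000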

Here's the sampled RNN output:


After 1000 iterations of 85750
Formatting is pretty loose, and plenty of the 'words' aren't really words.
?_{WA        See also: Otil| -
                                           lease tape of MEXif ANDEXvia'
Set space stubstobackups                  ckpection mesmigr via through
                                        elarage suble an  <Unix client -racle).
Set SEROCTib=Persages                         Data cant spen valuable.
                                        Due a logical that roing are more
                                        when the if about access if a emrege
                                       sessions writing of 1., begin 3 - the
                                          only - which check thif justy in put.
                                         [Data
                                        Yessward Helingshom Loampp"
                                        storage.
Client server a Cindog files,         /usr/lpp//bloctif, SHow (KH2)
                                                'Query lsg or ax 4 STGPool Statu>
                                          Ref:", the format of a Query Query FIle
                                        privation data will including in seisionsing
                                         storage pools, and abood thats it thre
                                         library and dsm.- which missible as
                                          in the database in a has on each options
                                        if does not after currogs run is
                                         search volume and you doing in anaged
                                        with the node -- All tables* use the
                                         filespile (with force from when the
                                        waiting error, command: there is
                                         host particulard or diskupdate all
                                        libraries "Passic client (stalled.
                                        Storage HODS, Poll averded, the Table


After 2000 iterations
Already the formatting is tightening up, more of the words are actual words, and there's some remembering to close open brackets.
TSM 4:2 (
                                     The tape or cells and causes hished
                                        processing ended in the Maciffites, or
                                        "ACRESTSIcatus, OVERetrieve in that
                                        logging "TCP.) storage logged).
                                        Had locks with TSM client seft. <The
                                        performance of tape. Changing examine,
                                        allows the security in *note has a givaney
                                        tracknocal offsite comes, and the TSM
                                        into a loss numerical perhaper.
dsmrl___DIFIces overb (escond (in archive       4, FAUName volume for TDP
                                          WUTLE OS15304. To four/up no thrsel,
                                         Interface, it has being reduced via:
                                         volume storage and which the password,
                                         files: for example is a file can be
                                         +rune on "Nurbether Files, third signial
                                         server option.  Respeate time, checkin
                                         afters to relieve the statistic to a
                                        the backup of during runs as etclly
                                        messages, affecting zight users it is
                                          started file system may be multiple
                                         Lix FILED furth, char, increase for
                                        to a Storage Pool, up no library, as
                                        (attribute. Activity Console, using the
                                         over *SM issuacialization).
client                                  Ref: INGEDEdec server 1591 bytes).
                                        Loggose "fixed nam 

After 10000 iterations
Very good line formatting (i.e., 40 spaces then text), even more real words, not many rubbish strings, and even some convincing-sounding made-up words!
7$’gi#       dsmserv -> -F sized - but, the
                                        Firch Sizk for TSM, the tape product of
                                        invocation cannot stop as then holds
                                        be confumed by Ultriumes and then full
                                        by
/mmarb/htp, 10, VROUs(s)                TSM current nodes of the file STGpool and
                                        protection of the '__
                                        IBM Technote: 1008746
/taperate>
                                       SELEckgen Password Vunus Incremental
                                        a file parametel TSM system available
                                        a kegpormance and prompt.
                                        Client system being butted to list archived
                                        wildcards, backup send upon the timeout
                                        operating system type for veriedes or
                                        is including a LAN-free backup.
HSM_CAS,                                Unix cleaning the Cleaning Cartridge
                                        made with the TSM database releases.
                                        "HL_NAME 1000') (/ff3) functial running
                                        Media for Incremental
                                        Technotes: 144729085;
                                         Programmation to specify way if to
                                         client Access Obforematuse.
                                        - If the ASTicol of node most and files
                                          go to the network paragement.
                                        Ref: ARCHIVE3 licensed supplies:
                                        Msg (- 1--     200 - funnever themserv.
                                        Ref: IBM Textrinate 8-P Dejection

After 40000 iterations
Getting better text-wise, but still a way to go.

6.Bx4&2W&${17V~]^@M!A0~'K;)                   TSM Extra  Tape tempty; kmove tapes
                                        of the Problem Cartridge Last Based STG is
                                        down to the server shore first
                                        message.
                                        See also: SQL
Migrate                                See ASMObject.)
                                        The backup or "Class" (124 driver".)
                                        TSM server determines concrect.
                                       Stanta contain storage pool drives and
                                        override storage pools which changes being
                                        are expan stop as "dsmserv.diret".
Common IDSHi DELete FormISTGpool        The dealing suffix database buffers in TSM
                                        home through made tape, an inverhate
                                        manual:
                                        ANS1000E 1005, (empty, MOBSIZ Estimate
Domino Collogation)                     So modes process via the server, this parts
                                        its Strugging an Incryption.
                                        ?.3.nexclient/htm/scrpt.hame is
                                        use whether server.
Redbooks with-marge                     To except the return volume from
                                        many contained from whichrestworks are
                                        of 5, so that time over the file (via
                                        Estimatent time) from a file or doing
                                        containing a next device which you mean
                                        expire supsts www.stats.
                                        'REGin LICensage'
Management, list                        Backup hours off site and failure to
                                        field in the vendor c 

After 85750 iterations
So this is the final result. Many good words, and almost sentences. Still some odd spelling and some made-up words.
\^?ypult. NOT                          See ... Change also installed by the
                                        devices in ontinitorify the DEFineed Introg
                                        of 2087, NTK>& and guivel backup series in
                                        another has a liber updating on the
                                        address (info volumes,
                                        GPAUSEDSEOCE..* facilities, and as
                                        employed by FILE=Client".
                                        Beware away does not recover mounts in
                                        a line.  Exist systems to be full) started.
                                        Setting in this force: TCPPort
                                        TSM Backup Protection manual:
                                        'Set SERVELernaling User
Common System Status, larger           See: AIX
Policy Space Manager, (DB)           Allowed, so have one compressed among
                                        serial Read into only attached to command
                                        order: example if that the client option
                                        which is metadole.
TCPDirge                                See: Managing the Policy NodeName
Copy                                   TSM 4.2 and the status show deweck. A
                                        considerations that were a firewall
                                        assigned on the hap been done stored be
                                        library 999, ERR, DIRACTIVATE.
                                        (Backed-up DOMain table dsmenix, and
                                        Windows COMMTimeout  Syntax: Format;
                                        CANcel SETWWCEMSI, LOGBFILELOCAPE
                                        Activity Sessions, system, greater
                                        Expire Drapacions Domination.  This





Pasting that text into my editor, I came across a funny way to measure the performance of the RNN: the fewer squiggly red lines indicating misspellings and bad words, the better the model is performing ;p

It's interesting to see that the RNN is remembering structure well beyond the given 320-char sequence length, and to see how it moves closer and closer to the correct formatting over time. It still ends up generating nonsense words, but I find it interesting that they often feel like they should be real words!

So there we go. That was adventure number 1. Along with general reading and understanding, for the next adventure I want to look into actually creating my own RNN in Torch + Lua to try to achieve a similar result, and see if different RNN configurations do better or worse. Or whether that line of questioning even makes sense...!



Have fun!
