So I've decided to go deeper into NNs and how they work in deep learning applications. I'm starting with a vague understanding of some of the simpler kinds of NN, a vague idea of terms like back-propagation, and a fair amount of experience as a developer.
Hunting around for information on these newfangled 'Recurrent Neural Networks', I came across this post: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ (that's a good read for an overview of what an RNN is compared to a 'regular' NN) and this docker container: https://github.com/crisbal/docker-torch-rnn. I thought I would have a stab at building my-first-rnn: a char-level RNN that you can feed text, train, and then get to spit out examples of text in a similar style. For my first steps though, I'll be happy with getting the pre-built RNN in torch-rnn up and running, using the GPU to train against some text.
This blog is going to be more about the technical steps I took; I'd advise you to go read the pages linked above and other info on exactly what's occurring in terms of the RNN - I'm not quite qualified enough to be giving decent explanations of that yet! - so take most of what I say with a pinch of salt. I'm learning.
So, first things first, I needed to set up a machine to use for all this. I built a beefy machine a while ago when I decided to set up a local hadoop cluster in VMWare Red Hat guests, because hadoop was all the rage at the time and I wanted to understand it. That worked, but it proved quite difficult to phrase my problem in the map-reduce paradigm (I was playing with Genetic Algorithms for playing Go and wanted to use hadoop to evaluate massive populations in parallel - it turned out easier to just run as many non-parallel processes as I could squeeze onto the CPU cores and RAM I had available and merge populations periodically - but the whole hadoop setup was fun, frustrating and enlightening regardless).
My desktop was fitted with the following:
* i7-970 @ 3.2GHz / 6-core, water-cooled
* 18GB RAM
* 512GB SSD for OS
* 4TB striped volume set (risky, but I back up the important data for when one of the drives eventually fails). I did this to get into a similar ballpark to SSD read/write speeds without paying for 4TB of SSD.
* GeForce 780ti - this is a great gaming GFX card and has aged fantastically. It also turns out to be great for starting out in deep learning, with great memory bandwidth (336GB/s - bandwidth is one of the major indicators of performance in deep learning, see http://timdettmers.com/2014/08/14/which-gpu-for-deep-learning/). It falls a little short due to its architecture and is restricted by only having 3GB RAM, but that's OK for starting out.
For step one of this adventure, I wanted to get everything set up and working, and get the pre-built RNN in the torch-rnn container spitting out some text in the style of something other than Shakespeare. It's important to understand that it really is 'text in the style of' - there is no understanding of the words and text being generated, so while it looks OK from a distance, actually reading it quickly highlights that the actual content is gibberish.
I gave the machine a fresh install of Ubuntu 16.04 LTS, and then installed the following (a rough sketch of the commands follows the list);
* general apt-get update & upgrade to be all up to date with the basic system.
* Proprietary NVIDIA drivers (required for NVIDIA CUDA to work; this is what enables GPU acceleration when running the RNN) / with help from http://www.howtogeek.com/242045/how-to-get-the-latest-nvidia-amd-or-intel-graphics-drivers-on-ubuntu/
* mesa-utils - this was just to get glxgears to check NVIDIA drivers were working.
* NVIDIA CUDA drivers / with help from https://devtalk.nvidia.com/default/topic/926383/testing-ubuntu-16-04-for-cuda-development-awesome-integration-/ - tip: just try to install all the stuff in the first table on the page. It all worked first time with apt-get for me. (Adjust for newer versions if you are reading this in the future...)
* Docker engine - just followed the steps for installing on Ubuntu from the docker site. Worked first time (almost: some packages in the docker documentation are given with a -lts-trusty suffix. I could not install/find these until I removed the -lts-trusty suffix, so look out for that (near the end)).
* nvidia-docker - from the github https://github.com/NVIDIA/nvidia-docker/wiki/Installation
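Roughly, that lot boiled down to commands along these lines. Treat this as a sketch rather than a recipe - the driver version, the graphics-drivers PPA route and the nvidia-docker install method will all have moved on, so follow the pages linked above for current instructions:
sudo apt-get update && sudo apt-get upgrade
# proprietary NVIDIA driver - pick whatever the current version is
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-367
# mesa-utils for glxgears, to sanity-check the driver
sudo apt-get install mesa-utils
glxgears
# CUDA toolkit
sudo apt-get install nvidia-cuda-toolkit
# docker engine and nvidia-docker: follow the docker site and the nvidia-docker wiki respectively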
I was pleasantly surprised how painless doing all of that was. The worst part was trying to use Ubuntu's partitioning tool to set up the disk + bootloader for the install. I must have been missing something, but I could not create more than one partition on the disk for install + swap, so I ended up with an Ubuntu install with zero swap space. I have 18GB RAM so I'll probably be OK for a while, but I can see some disk shenanigans coming up in the future to add some swap space and split up the root partition a bit.
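If/when I do add swap, the plan is a swap file rather than repartitioning - the standard sort of thing (size picked arbitrarily here):
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# plus a "/swapfile none swap sw 0 0" line in /etc/fstab to make it permanent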
I decided to use docker for a couple of reasons: it's also something I'm learning at the moment, and I had seen torch-rnn, a container setup that makes it easy to play with a char-level RNN for generating text. I could have gone ahead without docker and set up torch and its dependencies myself, but for ease of use I decided to just go with containers at first. I wanted to concentrate on doing RNN things, rather than continuing to install even more stuff and learning torch & Lua from scratch. The container also gives a nice example set of Lua scripts to poke into and learn from.
We're almost done! Everything installed fine, and using docker + torch-rnn made actually training an RNN for text generation nice and easy.
I grabbed the torch-rnn container and started it with nvidia-docker, then started a bash shell in the container.
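That looked something like the following - the image tag here is the one from the docker-torch-rnn README, so check there in case it has changed:
docker pull crisbal/torch-rnn:base
nvidia-docker run --rm -ti crisbal/torch-rnn:base bash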
Now I needed some text to train with. Because of my day job, I knew of this: http://people.bu.edu/rbs/ADSM.QuickFacts - an amazing body of text built up of useful IBM Spectrum Protect (previously Tivoli Storage Manager / ADSM) snippets of information. I wanted to train my network to spit out an imitation Spectrum Protect QuickFacts in the style of Richard Sims. It's about 5MB of text.
The quickfacts also have an interesting property I was wondering whether the RNN would learn: while it's a plain-text 80-column file, it is formatted (mostly) as two 40-char columns, where the left column is headings and the right column is body text. This means each 80-char line usually consists of {40 spaces}{some text}, and the body text can be anywhere from one to tens of lines long. Because of the way LSTM RNNs work, they have a memory that theoretically limits the size (number of chars) of structures they can learn, and the structure of the quickfacts involves elements much longer than this memory (or rather, too long to train with long enough memories on my setup) - so it will be interesting to see if the RNN learns the structure in spite of this. There's a good paper here analyzing RNNs, how they perform and the errors they can produce - https://arxiv.org/pdf/1506.02078.pdf - the maths is over my head, but the text + pictures I can (mostly) follow in general.
So, once I had copied the quickfacts text file into the container, I set about applying the RNN training to it.
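One way to get the file into the running container is a docker cp from the host - the container ID and path here match the prompt in the commands below:
docker cp qfacts.txt 1540b0057079:/root/torch-rnn/data/qfacts.txt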
First, the input data is preprocessed:
root@1540b0057079:~/torch-rnn# python preprocess.py --input_file data/qfacts.txt --output_h5 data/qfacts.h5 --output_json qfacts.json
Total vocabulary size: 97
Total tokens in file: 4900717
Training size: 3920575
Val size: 490071
Test size: 490071
Using dtype
Then the training occurs:
root@1540b0057079:~/torch-rnn# th train.lua -input_h5 data/qfacts.h5 -input_json data/qfacts.json -seq_length 320 -rnn_size 256 -max_epochs 350
Running with CUDA on GPU 0
Epoch 1.00 / 350, i = 1 / 85750, loss = 4.591961
Epoch 1.01 / 350, i = 2 / 85750, loss = 3.648895
Epoch 1.01 / 350, i = 3 / 85750, loss = 2.825536
Epoch 1.02 / 350, i = 4 / 85750, loss = 2.399451
Epoch 1.02 / 350, i = 5 / 85750, loss = 2.337098
Epoch 1.02 / 350, i = 6 / 85750, loss = 2.201095
Epoch 1.03 / 350, i = 7 / 85750, loss = 2.175408
Epoch 1.03 / 350, i = 8 / 85750, loss = 2.076128
Epoch 1.04 / 350, i = 9 / 85750, loss = 2.076714
I changed three parameters from the defaults given for training the tiny-shakespeare dataset that comes with the container.
-seq_length from 50 to 320: my current understanding is that this relates to the 'memory length' the RNN should exhibit. I changed it to 320 so it covers four lines of 80 chars.
-rnn_size from 128 to 256: I read advice that with a larger training dataset, this change should improve the quality of the generated text. I haven't yet played enough with these or other parameters to really see how they affect the output.
-max_epochs to 350: this essentially determines how long the model will train for. You can get interesting results with fewer epochs, but training for longer improves things, though there appears to be a point of diminishing returns where going much longer yields smaller improvements.
I did a little bit of quick testing too, with the GPU enabled (the default) and disabled ( -gpu -1 ), to get a rough idea of the performance difference. Using the GPU seemed to give a speed improvement of ~15x, though I'm not too sure of the accuracy of that, because speedups are generally quoted as being in the 5x - 10x range. This speedup let me choose 350 epochs and have it trained in 8 hours; 50 epochs on the CPU took around 16 hours. In the end, the results were not too different in quality, so I think there are other parameter tweaks I can make (related to gradients and some other stuff I'm not 100% sure about yet) to improve accuracy.
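For the record, the CPU run was just the same train command with the GPU switched off - roughly this, though I'm not certain I kept every other parameter identical:
th train.lua -input_h5 data/qfacts.h5 -input_json data/qfacts.json -seq_length 320 -rnn_size 256 -max_epochs 50 -gpu -1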
So, 8 hours later, I had a trained LSTM RNN that should be able to imitate Richard Sims' QuickFacts.
Here's what I got - I'm going to put up several examples, each from a different stage of training.
First, here's an example from the actual quickfacts document; this is what we are aiming to imitate:
Immediate Client Actions utility After using, stop and restart the scheduler service on the client, so it can query the server to find out it's next schedule, which in this case would the immediate action you created. Otherwise you will need to wait till the client checks for its next schedule on its own. Also affected by the server 'Set RANDomize' command. "Imperfect collocation" A phrase that might be used to describe the condition that occurs when collocation is enabled, but there are insufficient scratch tapes to maintain full separation of data, such that data which otherwise would be kept separate has to be mingled within remaining volume space. See: Collocation Import To import into a TSM server the definitions and/or data from another server where an Export had been done.
The training produces several checkpoint files you can use to sample the network's output at different points in the learning process. The general form of the command to sample the RNN for output is:
th sample.lua -checkpoint {path to checkpoint file} -length {number of characters}
For the example outputs below I chose a length of 2000 characters.
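With torch-rnn's default checkpoint naming (checkpoints land in a cv/ directory, named by iteration - an assumption based on the script defaults rather than something I've double-checked), sampling the final checkpoint looks something like:
th sample.lua -checkpoint cv/checkpoint_85750.t7 -length 2000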
Here's the sampled RNN output:
After 1000 iterations of 85750
Formatting is pretty loose. Plenty of the 'words' aren't really words.
?_{WA See also: Otil| - lease tape of MEXif ANDEXvia' Set space stubstobackups ckpection mesmigr via through elarage suble an <Unix client -racle). Set SEROCTib=Persages Data cant spen valuable. Due a logical that roing are more when the if about access if a emrege sessions writing of 1., begin 3 - the only - which check thif justy in put. [Data Yessward Helingshom Loampp" storage. Client server a Cindog files, /usr/lpp//bloctif, SHow (KH2) 'Query lsg or ax 4 STGPool Statu> Ref:", the format of a Query Query FIle privation data will including in seisionsing storage pools, and abood thats it thre library and dsm.- which missible as in the database in a has on each options if does not after currogs run is search volume and you doing in anaged with the node -- All tables* use the filespile (with force from when the waiting error, command: there is host particulard or diskupdate all libraries "Passic client (stalled. Storage HODS, Poll averded, the Table
After 2000 iterations
Already the formatting is tightening up, more of the words are actual words, and there's some remembering to close open brackets.
TSM 4:2 ( The tape or cells and causes hished processing ended in the Maciffites, or "ACRESTSIcatus, OVERetrieve in that logging "TCP.) storage logged). Had locks with TSM client seft. <The performance of tape. Changing examine, allows the security in *note has a givaney tracknocal offsite comes, and the TSM into a loss numerical perhaper. dsmrl___DIFIces overb (escond (in archive 4, FAUName volume for TDP WUTLE OS15304. To four/up no thrsel, Interface, it has being reduced via: volume storage and which the password, files: for example is a file can be +rune on "Nurbether Files, third signial server option. Respeate time, checkin afters to relieve the statistic to a the backup of during runs as etclly messages, affecting zight users it is started file system may be multiple Lix FILED furth, char, increase for to a Storage Pool, up no library, as (attribute. Activity Console, using the over *SM issuacialization). client Ref: INGEDEdec server 1591 bytes). Loggose "fixed nam
After 10000 iterations
Very good line formatting (i.e., 40 spaces then text), even more real words, not many rubbish strings, and even some convincing-sounding made-up words!
7$’gi# dsmserv -> -F sized - but, the Firch Sizk for TSM, the tape product of invocation cannot stop as then holds be confumed by Ultriumes and then full by /mmarb/htp, 10, VROUs(s) TSM current nodes of the file STGpool and protection of the '__ IBM Technote: 1008746 /taperate> SELEckgen Password Vunus Incremental a file parametel TSM system available a kegpormance and prompt. Client system being butted to list archived wildcards, backup send upon the timeout operating system type for veriedes or is including a LAN-free backup. HSM_CAS, Unix cleaning the Cleaning Cartridge made with the TSM database releases. "HL_NAME 1000') (/ff3) functial running Media for Incremental Technotes: 144729085; Programmation to specify way if to client Access Obforematuse. - If the ASTicol of node most and files go to the network paragement. Ref: ARCHIVE3 licensed supplies: Msg (- 1-- 200 - funnever themserv. Ref: IBM Textrinate 8-P Dejection
After 40000 iterations
Getting better text-wise, but still a way to go.
6.Bx4&2W&${17V~]^@M!A0~'K;) TSM Extra Tape tempty; kmove tapes of the Problem Cartridge Last Based STG is down to the server shore first message. See also: SQL Migrate See ASMObject.) The backup or "Class" (124 driver".) TSM server determines concrect. Stanta contain storage pool drives and override storage pools which changes being are expan stop as "dsmserv.diret". Common IDSHi DELete FormISTGpool The dealing suffix database buffers in TSM home through made tape, an inverhate manual: ANS1000E 1005, (empty, MOBSIZ Estimate Domino Collogation) So modes process via the server, this parts its Strugging an Incryption. ?.3.nexclient/htm/scrpt.hame is use whether server. Redbooks with-marge To except the return volume from many contained from whichrestworks are of 5, so that time over the file (via Estimatent time) from a file or doing containing a next device which you mean expire supsts www.stats. 'REGin LICensage' Management, list Backup hours off site and failure to field in the vendor c
After 85750 iterations
So this is the final result. Many good words, and almost sentences. Still some odd spelling and some made-up words.
\^?ypult. NOT See ... Change also installed by the devices in ontinitorify the DEFineed Introg of 2087, NTK>& and guivel backup series in another has a liber updating on the address (info volumes, GPAUSEDSEOCE..* facilities, and as employed by FILE=Client". Beware away does not recover mounts in a line. Exist systems to be full) started. Setting in this force: TCPPort TSM Backup Protection manual: 'Set SERVELernaling User Common System Status, larger See: AIX Policy Space Manager, (DB) Allowed, so have one compressed among serial Read into only attached to command order: example if that the client option which is metadole. TCPDirge See: Managing the Policy NodeName Copy TSM 4.2 and the status show deweck. A considerations that were a firewall assigned on the hap been done stored be library 999, ERR, DIRACTIVATE. (Backed-up DOMain table dsmenix, and Windows COMMTimeout Syntax: Format; CANcel SETWWCEMSI, LOGBFILELOCAPE Activity Sessions, system, greater Expire Drapacions Domination. This
And pasting that text into my editor, I came across a funny way to measure the performance of the RNN: the fewer squiggly red lines indicating bad words and spelling mistakes, the better the model is performing ;p
It's interesting to see that the RNN is remembering structure way beyond the given 320-character sequence length, and to see how it moves closer and closer to the correct formatting over time. It still ends up generating nonsense words, but I find it interesting that they often feel like they should be real words!
So there we go. That was adventure number 1. Along with general reading and understanding, for the next adventure I want to look into actually creating my own RNN in torch + Lua to try achieving a similar result, and see if different RNN configurations do better or worse. Or find out whether that line of questioning even makes sense...!
Have fun!