Monday, September 12, 2016

Machine Learning Adventures 1.5 - Sidetracked by a shiny thing, and time to stop playing and figure out what's going on..

Following on from my previous post, my aim was to start digging into Torch and looking at building some basic RNNs to get a better understanding of what's going on.


And then something shiny turned up and I got distracted - after showing a friend the Quckfacts output, he piped up, 'What happens if we do it with audio?'


There followed a couple of late nights hacking together a way of transforming audio data into something we could feed to a character-predicting RNN, then feeding and tweaking the RNN training process just to see what would happen - and, finally, the realization that getting into this properly means more than just playing about with RNNs for fun and giggles.


Before I dig into that though, I want to go into a little more detail about what's actually going on when we set up these toy datasets and feed them into the pre-built RNN we are using for character prediction. (Disclaimer: this is all from my current understanding, and I freely admit it might not all be quite right. Hopefully this blog will be all about eventually getting it spot on though, so bear with me..)


What is a Recurrent Neural Network?
My current explanation would be: 'It's like a normal neural network, but each of the neurons is capable of maintaining an element of memory, so each iteration of data flowing through the network can be influenced by the history of data that has previously flowed through it.' An expert, on the other hand, might say:

'A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed cycle. This creates an internal state of the network which allows it to exhibit dynamic temporal behavior.'

In practice, this means that RNNs can be very good at working with structured sequences of data - like text. For example, when fed programming code, an RNN is capable of 'understanding' that it should close a bracket it opened 100 characters ago, or de-dent at the end of a block of code. When fed English text, it is capable of producing coherent sequences of letters as words - this is the 'dynamic temporal behavior'.
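As a concrete (and heavily simplified) sketch of where that memory lives - my own illustration, not the internals of the network we're actually using - a single step of a 'vanilla' RNN cell might look like this:

```python
import numpy as np

def rnn_step(x, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    # The new hidden state mixes the current input with the previous
    # hidden state - this recurrence is the network's 'memory'.
    h = np.tanh(W_xh @ x + W_hh @ h_prev + b_h)
    # Output scores (e.g. over possible next characters) from the state.
    y = W_hy @ h + b_y
    return h, y
```

Because `h` is fed back in at the next step, everything the network has seen so far can influence what it predicts next.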


So what does a Neural Network do when it is learning?
Essentially, the network is given input and produces output; this output is compared to the correct output, and the difference is used to work backwards through the network, adjusting its internal weights toward a configuration that will produce a more accurate output next time. This process is repeated many times until, hopefully, the network's outputs converge to the correct outputs. The function used to measure the network's accuracy for a given output is called the Loss function, the method used to compute the changes made to each neuron's weights is called Gradient Descent, and the application of Gradient Descent back through the network is called Back Propagation.
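As a toy illustration of a single gradient-descent step (my own sketch, assuming a one-weight model y = w * x with a squared-error loss - real networks apply the same idea to millions of weights at once):

```python
def gradient_descent_step(w, x, target, learning_rate):
    y = w * x                     # forward pass: produce an output
    loss = (y - target) ** 2      # loss: how wrong was the output?
    grad = 2 * (y - target) * x   # d(loss)/d(w), via the chain rule
    w -= learning_rate * grad     # nudge the weight 'downhill'
    return w, loss

w = 0.5
for step in range(5):
    w, loss = gradient_descent_step(w, x=2.0, target=3.0, learning_rate=0.1)
    print(step, round(w, 4), round(loss, 4))  # loss shrinks as w -> 1.5
```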


So, RNNs have an internal memory, and are good at working with sequences of data whose elements have some relationship to the elements that came before them. This seemed to us ideal for audio data. Audio is inherently made up of cycles over time - waveforms of different frequencies piled up on top of each other to produce the signal we actually hear. So we thought we would give it a try..
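As a rough illustration of that piling-up, here's a toy signal built from a fundamental and a couple of harmonics (not our actual recordings, just the idea):

```python
import numpy as np

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate          # one second of time
signal = (np.sin(2 * np.pi * 220 * t)             # 220 Hz fundamental
          + 0.5 * np.sin(2 * np.pi * 440 * t)     # plus quieter
          + 0.25 * np.sin(2 * np.pi * 880 * t))   # harmonics on top
signal /= np.abs(signal).max()                    # normalise to -1..1
```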


We took a naive approach, and decided to look at the audio data in the time domain, as a sequence of raw 16-bit samples, where an audio signal in the range -1 to 1 is mapped to the range 0-65535. This meant we could map each of the 65,536 possible values of a sample point to a Unicode code point (stored as UTF-8), enabling us to produce a single unique symbol - AKA 'a letter' - for each value a sample could take, which we could then feed into the character-predicting RNN.
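The first half of that mapping is just a linear rescale, something like this sketch:

```python
import numpy as np

def float_to_uint16(signal):
    # Linearly rescale -1..1 float samples to 0..65535 16-bit integers.
    return np.clip((signal + 1.0) * 32767.5, 0, 65535).astype(np.uint16)
```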

We recorded some human speech and exported the data from Plogue (audio/DSP software that lets you work at a very low level, graphically building up DSP networks from fundamental components) as raw integer sample values in the range 0-65535.

We set about hacking together a couple of Python scripts to go from int16 -> UTF-8 and back again.
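Something along these lines - hypothetical helper names here, our real scripts were messier:

```python
import numpy as np

def samples_to_text(samples):
    # One Unicode character per 16-bit sample value.
    return ''.join(chr(int(s)) for s in samples)

def text_to_samples(text):
    return np.array([ord(c) for c in text], dtype=np.uint16)

samples = np.array([0, 32768, 65535], dtype=np.uint16)
assert np.array_equal(text_to_samples(samples_to_text(samples)), samples)
```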

There was a problem: within Unicode there is a block of code points reserved for surrogate pairs - a UTF-16 mechanism where two code units are combined to produce a single character - and these code points cannot be encoded as valid UTF-8. This was causing problems for the pre-process step required by the char-level RNN: when it hit the surrogate code points it would fail. We got around this by 'cleaning' the generated data and forcing all the surrogate code points into a different range. This would be mangling the audio data being represented a little, but not enough to cause any real issues. (It would be the equivalent of adding a couple of clicks and pops to the original audio signal.)
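The 'cleaning' amounted to something like this (the exact remapping we used may have differed, but the idea is the same):

```python
SURROGATE_LO, SURROGATE_HI = 0xD800, 0xDFFF  # reserved; invalid as UTF-8

def clean_code_point(value):
    # Shove surrogate-range values to the nearest valid code point -
    # a tiny distortion of the audio, like a faint click or pop.
    if SURROGATE_LO <= value <= SURROGATE_HI:
        return SURROGATE_LO - 1
    return value
```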

So, with the data now 'cleaned', we repeated the training and sampling process we had used for text-based data.

We did get results. Nothing approaching human speech, though - the output was essentially very, very noisy; we likened it to an uncovered mic in a hurricane. Now, this wasn't a failure: the audio did have some underlying structure (it wasn't just noise) that you could imagine was the beginnings of the RNN generating some kind of speech-like structures - but really, that could just be audio pareidolia.

We tried tweaking what parameters we could find, but we just seemed to be making it worse. As we did so, though, we began to form some idea of what was going on. In particular, there was a number we could adjust called the 'learning rate', and a way to adjust it over time - we could make the learning rate decrease by a given factor every N 'epochs' of training (something like the sketch below). We seemed to get better results by making the learning rate smaller, though with no apparent consistency. After a few hours of tweaking and staring at columns of numbers creeping up the screen, we came to the conclusion that we didn't really understand what was going on.
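The decay schedule amounts to something like this - illustrative names and values, not the actual options of the training script we were using:

```python
def decayed_learning_rate(base_rate, epoch, decay_factor=0.97, decay_every=10):
    # Multiply the base rate by decay_factor once every decay_every epochs.
    return base_rate * decay_factor ** (epoch // decay_every)

for epoch in (0, 10, 20, 30, 40):
    print(epoch, decayed_learning_rate(0.002, epoch))
```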

And, beyond some fuzzy ideas gleaned from skimming the results of Google searches, we had to admit we really didn't. We had some idea that essentially we were jiggling the weights of the network about in some way related to the learning rate - maybe the higher the rate, the more it was jiggling - and that we were trying to get the weights to converge on a 'correct' configuration. Maybe, we thought, if the learning rate was too high, it was jiggling too much and might be jiggling back out of any patterns it found itself in. But then.. if we didn't jiggle it enough, maybe it would never find the convergence we were hoping for...

It turns out, we were not so far from the truth. But also a long way from it :)



At this point, we had hit the limits of 'playing with RNNs for fun' and needed to start looking properly at what was going on, in an attempt to understand how we could get better results from the RNN.

During training, the process outputs the current 'loss' of the network. It turns out that as the network trains, we hope the loss gets smaller - this would indicate a successful convergence of the network weights to a 'correct' pattern. But for our attempts to work with audio data, this wasn't happening.

This graph illustrates what we were seeing - black is the ideal situation, green shows how the loss moved when working with English text, and red shows what we were seeing with our audio-to-UTF-8 data:


But why? Surely our sequence of symbols is conceptually the same as text? OK, the underlying structure was different, but the network isn't to know our string of symbols isn't valid text in some strange language somewhere.
How is the loss function being calculated? Just what is it really representing? And why is it going so horribly wrong with our data?

It turns out that to begin to understand this, we need to understand what the 'Loss' actually represents, how something called the 'Learning Rate' affects it, and just what 'Gradient Descent' is and how it applies to neural network learning.

And that will (all?) be in the next blog..
