Machine Learning

Optimizing your Neural Nets

There are a lot of hyperparameter optimization tools out there, such as SMAC. But I found it exhausting, even frustrating, to set them up for a neural network’s hyperparameters.

I’ve found my favourite technique, one I feel very confident with because I have full insight into the procedure and can make my own changes. In addition, I get some extra information about my parameters. The only disadvantage is the time you need to spend on the procedure. But I think there are some cases in which it’s worth spending more time on optimization (especially if you are a newbie in hyperparameter search).

So I chose a simple example: optimizing a one-layered recurrent neural network (a GRU followed by a dense layer) from keras on a small dataset.

I iteratively used hyperband, a hyperparameter optimization tool based on successive halving, to determine the best settings, and fANOVA to assess hyperparameter importance.

So here’s an example of how I’d set the following parameter ranges in my configuration space using the ConfigSpace Python module:

import ConfigSpace as CS

cs = CS.ConfigurationSpace()
Ps = [
    CS.CategoricalHyperparameter('activation1', ['tanh', 'linear', 'sigmoid']),
    CS.CategoricalHyperparameter('dense_activation', ['tanh', 'linear', 'sigmoid']),
    CS.CategoricalHyperparameter('rnnunit1', ['8', '16', '32', '64', '128']),
    CS.UniformFloatHyperparameter('dropout', 0, 0.35),
    CS.UniformFloatHyperparameter('factor', 0.01, 0.2),
    CS.UniformFloatHyperparameter('lr', 0.0001, 0.01),
]
cs.add_hyperparameters(Ps)
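To make this concrete, here’s a minimal sketch (not my exact code) of how a sampled configuration could be turned into the one-layered GRU described above. It assumes a tf.keras backend, a regression target and a made-up input shape; build_model is a hypothetical helper name:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense
from tensorflow.keras.optimizers import Adam

def build_model(config, input_shape=(10, 1)):  # input shape is made up
    model = Sequential([
        # 'rnnunit1' is stored as a string in the config space, hence int()
        GRU(int(config['rnnunit1']),
            activation=config['activation1'],
            dropout=config['dropout'],
            input_shape=input_shape),
        Dense(1, activation=config['dense_activation']),
    ])
    # we evaluate on the validation loss, so compile with MSE
    model.compile(loss='mse', optimizer=Adam(learning_rate=config['lr']))
    return model

# quick smoke test on a random configuration from the space:
model = build_model(cs.sample_configuration())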

Then, I let hyperband perform 100 runs (since I have a small network) with a maximum budget of 40 epochs in order to search for a good setting by evaluating the validation loss of the GRU. Keep in mind that this procedure might take a while (approx. 1h for small experiments). So have something prepared to do in the meantime 😉
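For reference, here’s roughly how such a run could look with HpBandSter, one hyperband implementation (I’m not pinning this down as the exact setup used here). It reuses the hypothetical build_model() from above, and x_train/y_train/x_val/y_val stand in for your own data splits:

from hpbandster.core.worker import Worker
import hpbandster.core.nameserver as hpns
from hpbandster.optimizers import HyperBand
from tensorflow.keras.callbacks import ReduceLROnPlateau

class GRUWorker(Worker):
    def compute(self, config, budget, **kwargs):
        model = build_model(config)
        # 'factor' controls how much the learning rate is reduced
        # when the validation loss hasn't improved for 2 epochs
        reduce_lr = ReduceLROnPlateau(monitor='val_loss',
                                      factor=config['factor'], patience=2)
        hist = model.fit(x_train, y_train,
                         validation_data=(x_val, y_val),
                         epochs=int(budget), callbacks=[reduce_lr],
                         verbose=0)
        # hyperband minimizes the returned loss, i.e. the validation MSE
        return {'loss': hist.history['val_loss'][-1], 'info': {}}

NS = hpns.NameServer(run_id='gru', host='127.0.0.1', port=None)
NS.start()
GRUWorker(run_id='gru', nameserver='127.0.0.1').run(background=True)

hb = HyperBand(configspace=cs, run_id='gru', nameserver='127.0.0.1',
               min_budget=1, max_budget=40)  # max budget of 40 epochs
result = hb.run(n_iterations=10)  # pick n_iterations to get ~100 runs
hb.shutdown(shutdown_workers=True)
NS.shutdown()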

The good thing about hyperband is that you not only get the best setting it has found; it also gives you the configuration and error of each run. With that information and the configuration space, you can easily run fANOVA to get an insight into which parameters are the most important for your target algorithm on your specific task, and which ranges to set for those important parameters in order to narrow down your search space.

So hyperband’s result on this example was:

activation1, Value: ‘linear’
dense_activation, Value: ‘linear’
dropout, Value: 0.17867862209638768
factor, Value: 0.18214403580588892
lr, Value: 0.003463293128993808
rnnunit1, Value: ‘128’

Now it’s time for fANOVA to come into play.
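Here’s a rough sketch of how I’d feed the hyperband runs into fANOVA. It assumes the runs have been collected into a list `runs` of (configuration dict, validation loss) pairs, however your hyperband implementation exposes them; fANOVA wants purely numerical arrays, so categorical values are encoded by their index in the choice list:

import numpy as np
from fanova import fANOVA

# cs.get_hyperparameters() returns the parameters in sorted order --
# this ordering is what the numeric indices in fANOVA's output refer to
names = [hp.name for hp in cs.get_hyperparameters()]

def encode(name, value):
    hp = cs.get_hyperparameter(name)
    if hasattr(hp, 'choices'):   # categorical -> index of the chosen value
        return hp.choices.index(value)
    return value                 # floats stay as they are

X = np.array([[encode(n, conf[n]) for n in names] for conf, _ in runs])
Y = np.array([loss for _, loss in runs])

f = fANOVA(X, Y, config_space=cs)

# single-parameter importance for every hyperparameter
for i, n in enumerate(names):
    print(n, f.quantify_importance((i,)))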

Let’s take a look at the importance of our parameters:

activation1, importance: 0.18446806774436139
dense_activation, importance: 0.032622871065467941
dropout, importance: 0.012281772153045982
factor, importance: 0.04186128009938278
lr, importance: 0.11986707130971085
rnnunit1, importance: 0.025587716045644651

So the most important parameters, based on our hyperband results, are the activation of our GRU and the learning rate. Now it’d also be interesting to see the most important pairwise marginals, which delivers the following output:

OrderedDict([((0, 4), 0.36249670702877856), ((0, 3), 0.29377402369002936), ((0, 5), 0.2549142576466587), ((0, 1), 0.24357273146419461), ((0, 2), 0.21186123918820465), ((1, 4), 0.17756318441962193), ((3, 4), 0.17587568997781963), ((4, 5), 0.16198828370946314), ((2, 4), 0.15190021428529837), ((3, 5), 0.082464880530749921), ((1, 3), 0.080988329407209986), ((1, 5), 0.067573172848896731), ((2, 3), 0.062798863961704976), ((2, 5), 0.048585901316270177), ((1, 2), 0.048014227386887563)])
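The tuples index into the sorted hyperparameter list from above, so (0, 4) is the pair (activation1, lr) and (0, 3) is (activation1, factor). If you’re following along with the fanova package, this output would come from something like:

# computes and ranks all pairwise marginals (this is the slow part)
print(f.get_most_important_pairwise_marginals(n=15))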

So the most important combination is, unsurprisingly, activation1 and the learning rate. The second most important combination is activation1 with factor, which decides by how much the learning rate is reduced when there’s no improvement after 2 epochs.

When we look at fANOVA’s figures, we can identify some new ranges for another hyperband run. Since we chose the MSE as our measure, we have to look at the regions in which the predicted error is lowest. So in the case of dropout, its new range would be [0.15, 0.19].

By looking at the categorical plots, we can see that for the activation1 parameter a tanh activation would deliver the lowest error, and not a linear one as determined by hyperband. This can happen because hyperband returns the single best combination of all parameters together, while fANOVA marginalizes over all runs; and across all hyperband runs, a tanh activation seems to have contributed to an overall better performance.
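The figures themselves can be generated with fanova’s visualizer module; a small sketch, assuming a made-up output directory:

import os
from fanova import visualizer

plot_dir = './fanova_plots'  # hypothetical output directory
os.makedirs(plot_dir, exist_ok=True)

vis = visualizer.Visualizer(f, cs, plot_dir)
vis.create_all_plots()  # writes a marginal plot for every parameter
                        # and every pairwise combination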

So our new configuration for hyperband would be:

import ConfigSpace as CS

cs = CS.ConfigurationSpace()
Ps = [
    CS.CategoricalHyperparameter('activation1', ['tanh', 'linear', 'sigmoid']),
    CS.CategoricalHyperparameter('dense_activation', ['tanh', 'linear', 'sigmoid']),
    CS.CategoricalHyperparameter('rnnunit1', ['8', '16', '32', '64', '128']),
    CS.UniformFloatHyperparameter('dropout', 0.15, 0.19),
    CS.UniformFloatHyperparameter('factor', 0.1, 0.2),
    CS.UniformFloatHyperparameter('lr', 0.004, 0.009),
]
cs.add_hyperparameters(Ps)

And then you’d repeat the procedure until you get a stable and clear result.

Please note that the dataset is too small, which is why fANOVA’s results don’t look that meaningful. Keep in mind: the more data, the better the result 😉
