[SiLU] [dSiLU] Sigmoid-weighted linear units for neural network function approximation in reinforcement learning
Stefan Elfwing, Eiji Uchibe, Kenji Doya
Paper: https://www.sciencedirect.com/science/article/pii/S0893608017302976
In recent years, neural networks have enjoyed a renaissance as function approximators in reinforcement learning.
Two activation functions are proposed for neural network function approximation in reinforcement learning:
The sigmoid-weighted linear unit (SiLU), whose activation is computed by the sigmoid function multiplied by its input.
Its derivative function, the dSiLU.
The activation ak of the kth SiLU for input zk is computed by the sigmoid function multiplied by its input (i.e., equal to the contribution from a hidden node to the value function in an EE-RBM):
ak(zk) = zk σ(zk),
where zk is the input to hidden unit k and σ(·) is the sigmoid function.
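A minimal NumPy sketch (not from the paper) of the SiLU as defined above; the helper names sigmoid and silu are chosen here for illustration:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def silu(z):
    """SiLU activation: the input multiplied by its sigmoid, ak(zk) = zk * sigma(zk)."""
    return z * sigmoid(z)

# The SiLU approaches the ReLU for inputs of large magnitude:
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(silu(z))           # roughly [-0.0005, -0.269, 0.0, 0.731, 9.9995]
print(np.maximum(z, 0))  # ReLU, for comparison
```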
For zk-values of large magnitude, the activation of the SiLU is approximately equal to the activation of the ReLU (see the left panel in Fig. 1), i.e., the activation is approximately equal to zero for large negative zk-values and approximately equal to zk for large positive zk-values.
Unlike the ReLU (and other commonly used activation units such as sigmoid and tanh units), the activation of the SiLU is not monotonically increasing. Instead, it has a global minimum value of approximately −0.28 for zk ≈ −1.28. An attractive feature of the SiLU is that it has a self-stabilizing property, which we demonstrated experimentally in Elfwing et al. (2015). The global minimum, where the derivative is zero, functions as a ‘‘soft floor’’ on the weights that serves as an implicit regularizer that inhibits the learning of weights of large magnitudes.
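A quick numerical sanity check (not from the paper) of the global minimum quoted above, writing the SiLU directly as z * sigmoid(z):

```python
import numpy as np

# Numerically locate the SiLU's global minimum (the "soft floor" mentioned above).
z = np.linspace(-5.0, 5.0, 200_001)
a = z / (1.0 + np.exp(-z))   # SiLU: z * sigmoid(z)
i = int(np.argmin(a))
print(z[i], a[i])            # roughly -1.28 and -0.28
```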
In Elfwing et al. (2015), we discovered that the derivative function of the SiLU (i.e., the derivative of the contribution from a hidden node to the output in an EE-RBM) looks like a steeper and ‘‘overshooting’’ version of the sigmoid function.
In this study, we call this function the dSiLU and we propose it as a competitive alternative to the sigmoid function in neural network function approximation in reinforcement learning.
The activation of the dSiLU is computed by the derivative of the SiLU (see right panel in Fig. 1):
ak(zk) = σ(zk)(1 + zk(1 − σ(zk))).
The dSiLU has a maximum value of approximately 1.1 and a minimum value of approximately −0.1 for zk ≈ ±2.4, i.e., the solutions to the equation zk = − log ((zk − 2)/(zk + 2)).
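A corresponding NumPy sketch (again not from the paper, helper names illustrative) of the dSiLU, with a numerical check of the quoted extrema:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsilu(z):
    """dSiLU: derivative of z * sigmoid(z), i.e. sigmoid(z) * (1 + z * (1 - sigmoid(z)))."""
    s = sigmoid(z)
    return s * (1.0 + z * (1.0 - s))

# Check the stated extrema: maximum ~1.1 and minimum ~-0.1 near z ~ +/-2.4.
z = np.linspace(-8.0, 8.0, 320_001)
a = dsilu(z)
print(z[np.argmax(a)], a.max())   # roughly  2.4 and  1.1
print(z[np.argmin(a)], a.min())   # roughly -2.4 and -0.1
```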
Figure. Learning curves in stochastic SZ-Tetris for the four types of shallow neural network agents.
Experiments cover three domains: SZ-Tetris, 10 × 10 Tetris, and Atari 2600 games.