You have seen my blog post on groundbreaking activation functions (shameless link) and chosen the one you want to use. The problem now is calculating the gradient (the derivative) so that you can update your weights. Perhaps you are curious like me and aren't satisfied with simply being given the derivative, and want to work it out yourself. Here are the gradient calculations.

A note: models sometimes reuse the output $f(x)$ to save on computing the derivative $f'(x)$.

Reusing Calculations

If you already know $f(x)$, it can help when calculating $f'(x)$. For example, the derivative of the sigmoid function is

$f'(x) = \frac{1}{1+e^{-x}} \times \left(1- \frac{1}{1+e^{-x}}\right) = f(x) \times (1-f(x))$.

We plug $f(x)$ into the equation to find $f'(x)$ instead of computing the derivative from scratch, saving computation time.
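
As a rough sketch of what this looks like in code (the function names here are just illustrative, not from any particular library), a layer can cache $f(x)$ on the forward pass and reuse it on the backward pass:

```python
import numpy as np

def sigmoid_forward(x):
    # Forward pass: compute f(x) and keep it around as a cache.
    out = 1.0 / (1.0 + np.exp(-x))
    return out, out

def sigmoid_backward(grad_out, cached_out):
    # Backward pass: reuse the cached f(x), since f'(x) = f(x) * (1 - f(x)).
    return grad_out * cached_out * (1.0 - cached_out)

x = np.array([-2.0, 0.0, 3.0])
y, cache = sigmoid_forward(x)
dx = sigmoid_backward(np.ones_like(x), cache)
```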

Sigmoid (Logistic)

I still use this, mainly due to my limited knowledge of neural networks. It's great for classification problems.

Sigmoid Graph

$$\begin{align} f(x) & = \frac{1}{1+e^{-x}}\\ f'(x) & = \frac{d}{dx} (1+e^{-x})^{-1} \\ & = -(1+e^{-x})^{-2} \times \frac{d}{dx} (1+e^{-x})\\ & = -(1+e^{-x})^{-2} \times (-e^{-x})\\ & = (1+e^{-x})^{-2} \times e^{-x}\\ & = \frac{e^{-x}}{(1+e^{-x})^{2}}\\ & = \frac{(1+e^{-x})-1}{(1+e^{-x})^2}\\ & = \left(\frac{1}{1+e^{-x}}\right) - \left(\frac{1}{(1+e^{-x})^2}\right)\\ & = \frac{1}{1+e^{-x}} \times \left(1-\frac{1}{1+e^{-x}}\right)\\ & = f(x)\times (1-f(x)) \end{align}$$
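
If you want to sanity-check the closed form, a quick comparison against a centered finite difference works (a minimal sketch; the helper names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
analytic = sigmoid(x) * (1.0 - sigmoid(x))             # f(x) * (1 - f(x))
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # centered finite difference
print(np.max(np.abs(analytic - numeric)))              # tiny, so the closed form checks out
```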

Tanh

A scaled and shifted version of the sigmoid function, but better: its output is centered around zero.

Tanh Graph

$$\begin{align} f(x) & = \tanh(x)\\ f'(x) & = \frac{d}{dx} \left(\frac{\sinh(x)}{\cosh(x)}\right)\\ & = \left(\frac{\cosh(x)\times\frac{d}{dx}\sinh(x)-\sinh(x)\times\frac{d}{dx}\cosh(x)}{\cosh(x)^2}\right)\\ & = \left(\frac{\cosh(x)^2-\sinh(x)^2}{\cosh(x)^2}\right)\\ & = 1-\left(\frac{\sinh(x)}{\cosh(x)}\right)^2\\ & = 1-\tanh(x)^2 \end{align}$$
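
The same caching trick applies to tanh; here is a small illustrative sketch (again, the names are mine, not a framework's):

```python
import numpy as np

def tanh_forward(x):
    # Forward pass: np.tanh is the activation itself.
    out = np.tanh(x)
    return out, out

def tanh_backward(grad_out, cached_out):
    # Backward pass: f'(x) = 1 - tanh(x)^2, reusing the cached tanh(x).
    return grad_out * (1.0 - cached_out ** 2)

x = np.array([-1.0, 0.0, 2.0])
y, cache = tanh_forward(x)
dx = tanh_backward(np.ones_like(x), cache)
```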

Softplus

I didn't write about this function in my other blog post, since the opinions I read described it as just a smooth version of ReLU with a continuous derivative. But there has been some research into its properties. Interestingly, its derivative is the sigmoid function. (The sigmoid function is denoted by $\sigma(x)$.)

Softplus Graph

$$\begin{align} f(x) & = \log(1+e^{x})\\ f'(x) & = \frac{1}{1+e^x}\times\frac{d}{dx}(1+e^x)\\ & = \frac{e^x}{1+e^x}\\ & = \frac{1}{1+e^{-x}}\\ & = \sigma(x)\\ \end{align}$$
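
A small sketch to check numerically that the derivative of softplus really is the sigmoid (illustrative code, not from any library):

```python
import numpy as np

def softplus(x):
    # log(1 + e^x); log1p keeps this accurate when e^x is small
    return np.log1p(np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 9)
h = 1e-5
numeric = (softplus(x + h) - softplus(x - h)) / (2 * h)  # numerical derivative of softplus
print(np.max(np.abs(numeric - sigmoid(x))))              # tiny: softplus'(x) matches sigma(x)
```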

Swish

Google released this activation function, which has outperformed ReLU on a number of benchmarks. I suggest reading about its discovery, which is pretty fascinating. (The sigmoid function is denoted by $\sigma(x)$.)

Swish Graph

$$\begin{align} f(x) & = \frac{x}{1+e^{-x}}\\ f'(x) & = \frac{(1+e^{-x})\frac{d}{dx}x-x\times\frac{d}{dx}(1+e^{-x})}{(1+e^{-x})^2}\\ & = \frac{(1+e^{-x})-x\times(-e^{-x})}{(1+e^{-x})^2}\\ & = \frac{(1+e^{-x})+xe^{-x}}{(1+e^{-x})^2}\\ & = \frac{x(1+e^{-x})+(1+e^{-x})-x}{(1+e^{-x})^2}\\ & = \frac{x}{1+e^{-x}}+\frac{1}{1+e^{-x}}-\frac{x}{(1+e^{-x})^2}\\ & = \frac{x}{1+e^{-x}}+\frac{1}{1+e^{-x}}\left(1-\frac{x}{1+e^{-x}}\right)\\ & = f(x)+\sigma(x)\times(1-f(x))\\ \end{align}$$
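
And a sketch of swish with a backward pass that reuses both $f(x)$ and $\sigma(x)$ (illustrative, using the same caching idea as above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish_forward(x):
    s = sigmoid(x)
    f = x * s
    return f, (f, s)  # cache both f(x) and sigma(x) for the backward pass

def swish_backward(grad_out, cache):
    f, s = cache
    # f'(x) = f(x) + sigma(x) * (1 - f(x))
    return grad_out * (f + s * (1.0 - f))

x = np.array([-1.0, 0.5, 2.0])
y, cache = swish_forward(x)
dx = swish_backward(np.ones_like(x), cache)
```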