# Rectified Linear Unit (ReLU): An Important Introduction (2021)

**Introduction**

In a multilayer neural network, the activation function transforms the summed weighted input of a node into that node's activation, or output, for that input. The rectified linear unit (ReLU), also known as the rectified linear activation function, is a piecewise linear function that outputs the input directly if it is positive and outputs zero otherwise. It has become a popular default activation function because models that use it are often easier to train and tend to perform better.

1. Limitations of Sigmoid and Tanh Activation Functions
2. Rectified Linear Activation Function
3. How to Implement the Rectified Linear Activation Function
4. Tips for Using the Rectified Linear Activation
5. Extensions and Alternatives to ReLU

## 1. **Limitations of Sigmoid and Tanh Activation Functions**

Most neural networks are organized in layers of nodes that learn to map example inputs to outputs. At each node, the inputs are multiplied by weights and summed to give the node's summed activation. The activation function then transforms this summed activation into the node's output. The simplest activation function applies no transformation and is called linear activation. Networks built only from linear activations are easy to train but cannot learn complex mapping functions. Linear activation is still used in the output layer of networks that predict a quantity, as in regression problems.
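The node computation described above can be sketched in a few lines of Python. The function names `node_output` and `linear`, the example weights, and the use of a bias term are assumptions made for illustration only.

```python
def linear(z):
    """Linear activation: applies no transformation."""
    return z


def node_output(inputs, weights, bias, activation):
    """Weight and sum the inputs, add a bias, then apply the activation."""
    summed = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(summed)


# Hypothetical node with two inputs and a linear activation.
print(node_output([1.0, 2.0], [0.5, -0.25], 0.1, linear))
```

Passing a different function as `activation` changes only the last step, which is why the choice of activation function can be discussed separately from the rest of the node.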

Nonlinear activation functions are preferred because they let the nodes learn complex structure in the data. Before ReLU, the two most widely used examples were the sigmoid and tanh (hyperbolic tangent) functions. The logistic sigmoid activation function transforms its input into a value between zero and one. Large positive inputs are mapped to values close to one, and large negative inputs are mapped to values close to zero. Across all possible inputs the function traces an S-shaped curve that rises from zero, passes through 0.5 at an input of zero, and approaches one.
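As a rough illustration of the sigmoid's shape, the sketch below evaluates the standard logistic formula 1 / (1 + e^(-x)) at a few sample inputs. The sample inputs are arbitrary choices, and this plain version is not hardened against overflow for very large negative inputs.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: maps a real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Outputs rise from near zero, through 0.5 at x = 0, toward one.
for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"sigmoid({x:+.1f}) = {sigmoid(x):.5f}")
```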

The tanh function produces a similar S-shaped curve but outputs values between -1 and 1. It was often preferred over the sigmoid because models that use it are easier to train and often have better predictive performance. However, both functions saturate: they respond to changes only around the middle of their input range, where the sigmoid outputs about 0.5 and tanh outputs about 0. Once a node saturates, its gradient is close to zero, so the learning algorithm can barely adapt the weights and learning slows down.
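Saturation can be made concrete by looking at the derivatives. The sketch below uses the standard identities sigmoid'(x) = s(1 - s) and tanh'(x) = 1 - tanh(x)^2; the helper names and the sample inputs are assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid; its largest value is 0.25, at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    """Derivative of tanh; its largest value is 1.0, at x = 0."""
    return 1.0 - math.tanh(x) ** 2


# Both gradients shrink toward zero as the input moves away from the middle.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:4.1f}  sigmoid'={sigmoid_grad(x):.2e}  tanh'={tanh_grad(x):.2e}")
```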

## 2. **Rectified Linear Activation Function**

Layers deep in large networks that use sigmoid or tanh activations often fail to receive useful gradient information. During training, the error is backpropagated through the network and used to update the weights. At each layer the error is multiplied by the derivative of the chosen activation function, so the amount of error that reaches earlier layers shrinks with every additional layer. Once the gradient is effectively zero, the affected nodes stop updating and learning halts. This is the vanishing gradient problem.

ReLU avoids this problem for active units: it behaves linearly for positive inputs, so its gradient there is exactly one rather than a small fraction as with the sigmoid. Before ReLU became common, Boltzmann machines, unsupervised pre-training and layer-wise training were also used as workarounds for the vanishing gradient problem in sigmoid and tanh networks.
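A simplified way to see the vanishing gradient is to multiply the same local derivative once per layer. The sketch below deliberately ignores the weights and assumes every sigmoid unit sits at its best case (derivative 0.25), so it is an optimistic illustration rather than a model of a real network. The function name `backprop_scale` is hypothetical.

```python
def backprop_scale(local_grad, n_layers):
    """Factor applied to an error signal after n_layers identical activations.

    Assumption: weights are ignored and every layer contributes the same
    local derivative.
    """
    return local_grad ** n_layers


SIGMOID_BEST_CASE = 0.25  # largest possible sigmoid derivative
RELU_ACTIVE = 1.0         # ReLU derivative for any positive input

for n in (1, 5, 10, 20):
    print(n, backprop_scale(SIGMOID_BEST_CASE, n), backprop_scale(RELU_ACTIVE, n))
```

Even in the best case the sigmoid factor drops below one millionth after ten layers, while the factor for active ReLU units stays at one.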

## 3. **How to Implement the Rectified Linear Activation Function**

The ReLU function can be implemented easily in Python using the max() function. For zero and negative inputs the output is zero, and positive inputs are returned unchanged. The derivative of ReLU, which is required to update the node weights during error backpropagation, is also easy to calculate. Since the derivative represents the slope, it is zero for all negative values and one for all positive values. At zero, the ReLU function is not differentiable, and for machine learning tasks the derivative there can be assumed to be zero.
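A minimal sketch of this description follows. The function names `relu` and `relu_grad` are illustrative choices, and the derivative at exactly zero is set to zero by the convention mentioned above.

```python
def relu(x):
    """Rectified linear activation: zero for x <= 0, x otherwise."""
    return max(0.0, x)


def relu_grad(x):
    """Slope of ReLU: one for positive inputs, zero otherwise.

    The value at x == 0 is zero by convention, since ReLU is not
    differentiable there.
    """
    return 1.0 if x > 0.0 else 0.0


for x in (-3.0, 0.0, 2.5):
    print(f"relu({x}) = {relu(x)}, slope = {relu_grad(x)}")
```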

**Advantages:**

The rectified linear activation function is the modern day’s most popular default activation function for nearly all kinds of neural networks for the following reasons.

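- Simple computation: only a max() is needed, with no exponentials as in sigmoid and tanh
- Sparse representation: negative inputs produce a true zero output, so many units can be inactive at once
- Near-linear behavior: the function is linear for positive inputs, which makes the network easier to optimize
- It can be used for deep network training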

## 4. **Tips for Using the Rectified Linear Activation**

- **ReLU can be used as the default activation function:** Sigmoid functions were used first, then tanh functions, and ReLU is now the popular choice for the hidden layers of deep multilayer networks.
- **ReLU can be used with CNNs and MLPs, but not with RNNs by default:** ReLU works well in Convolutional Neural Networks (CNNs) and Multilayer Perceptrons (MLPs). It is not the default choice in Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) networks.
- **Use a small bias value as input:** The bias input on a node shifts the activation function. With ReLU, a small positive bias value helps keep units active at the start of training.
- **Use the "He Weight Initialization" method:** Before a neural network is trained, its weights are initialized to small random values. With ReLU, weights centered on zero leave roughly half of the units outputting zero at the start, and training may fail to get going. He initialization accounts for this by drawing the weights with a standard deviation of sqrt(2/n), where n is the number of inputs to the node.
- **Scale input data:** Before training, normalize each value to the range zero to one, or standardize each variable to have a mean of zero and unit variance. Without scaling, the network weights can become large, making the network unstable and increasing generalization error.
- **Use a weight penalty:** The ReLU output is unbounded in the positive domain. To keep the weights from growing too large, use L1 or L2 weight regularization.
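Two of these tips, He weight initialization and input scaling, can be sketched with the standard library alone. The function names, the fixed seed and the layer sizes are hypothetical. The He scheme shown draws weights from a normal distribution with standard deviation sqrt(2/n), where n is the number of inputs to the node, and deep learning libraries provide their own versions of it.

```python
import math
import random


def he_normal(n_inputs, n_units, seed=0):
    """Weights ~ Normal(0, sqrt(2 / n_inputs)), one row per unit."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    std = math.sqrt(2.0 / n_inputs)
    return [[rng.gauss(0.0, std) for _ in range(n_inputs)]
            for _ in range(n_units)]


def min_max_scale(values):
    """Normalize a list of numbers to the range zero to one."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


weights = he_normal(n_inputs=100, n_units=50)
print(len(weights), len(weights[0]))
print(min_max_scale([2.0, 4.0, 6.0]))
```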

## 5. **Extensions and Alternatives to ReLU**

This section discusses extensions of and alternatives to ReLU. A large weight update can leave the summed input to the activation function negative regardless of the input to the network. The node then always outputs an activation of zero, which is known as the "*dying ReLU*" issue. Because the gradient is zero whenever the unit is inactive, a gradient-based optimization algorithm cannot adjust the weights of that unit, which slows the learning of the ReLU network. The alternatives are:

- Leaky ReLU (LReL or LReLU) is a variant that allows a small non-zero gradient when the input is less than zero, that is, when the unit is not active.
- The Exponential Linear Unit (ELU) uses a parameterized exponential function to transition smoothly from positive values to small negative values. The negative outputs push the mean activation closer to zero, which enables faster learning.
- The Parametric ReLU (PReLU) learns the parameters that control the leakiness and shape of the rectifier function.
- Maxout is a piecewise linear alternative that outputs the maximum of several linear functions of its input. It was designed to be used with dropout regularization, making optimization with dropout easier and improving the accuracy of dropout's fast approximate model averaging.
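The first two variants can be sketched as follows. The default slope `alpha=0.01` for Leaky ReLU and `alpha=1.0` for ELU are common choices assumed here, not values taken from this article. PReLU has the same form as Leaky ReLU, except that `alpha` is learned during training rather than fixed.

```python
import math


def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: a small slope alpha replaces the flat negative side."""
    return x if x > 0.0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """The gradient is never zero, so an inactive unit can still recover."""
    return 1.0 if x > 0.0 else alpha


def elu(x, alpha=1.0):
    """ELU: exponential curve for x <= 0 that levels off at -alpha."""
    return x if x > 0.0 else alpha * (math.exp(x) - 1.0)


for x in (-5.0, -1.0, 0.0, 2.0):
    print(f"x={x:+.1f}  leaky={leaky_relu(x):+.4f}  elu={elu(x):+.4f}")
```

Unlike plain ReLU, both functions keep a non-zero response for negative inputs, which is what addresses the dying ReLU issue described above.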

**Conclusion**

In conclusion, deep multilayer networks are difficult to train with hyperbolic tangent and sigmoid activation functions because of the vanishing gradient problem. The ReLU activation function mitigates this problem, permitting models to learn faster and perform better. As a result, ReLU is currently used as the default activation in convolutional neural networks and multilayer Perceptrons.

There is no right or wrong way to learn AI and ML technologies; the more resources you use, the better. These resources can be the starting point for your journey in learning Artificial Intelligence and Machine Learning. Does pursuing AI and ML interest you? If you want to step into the world of emerging tech, you can accelerate your career with the **Machine Learning And AI Courses** by Jigsaw Academy.