Convolutional Neural Networks (CNNs) are among the most successful models for image recognition, and AlexNet was a groundbreaking architecture when it was released in 2012. Then VGG came along. In this article, we will discuss VGG, the VGG architecture, and how it compares to AlexNet.

  1. AlexNet
  2. DataSet
  3. VGG Neural Networks
  4. The Difference

1) AlexNet

AlexNet is a convolutional neural network architecture that has been around for a long time. Its basic building blocks are convolutions, max pooling, and dense layers. To fit the model across two GPUs, grouped convolutions are used. Its main features include the use of ReLU rather than the tanh function, multi-GPU training, and overlapping pooling. It used data augmentation and dropout to combat overfitting.
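To make these building blocks concrete, here is a minimal sketch (the helper name `out_size` is ours, not from any library) of the standard output-size formula for a convolution or pooling layer, applied to AlexNet's first stages:

```python
# Hypothetical helper: spatial output size of a convolution or pooling layer,
# computed as floor((n + 2p - k) / s) + 1 for input size n, kernel k,
# stride s, and padding p.
def out_size(n, k, s, p=0):
    return (n + 2 * p - k) // s + 1

# AlexNet's first convolution: 11x11 filters, stride 4, padding 2, on 224x224 input.
print(out_size(224, k=11, s=4, p=2))   # 55
# Overlapping pooling: a 3x3 window with stride 2 (window larger than stride).
print(out_size(55, k=3, s=2))          # 27
```

The "overlap" is visible in the numbers: because the 3×3 pooling window moves only 2 pixels at a time, neighboring windows share a column of inputs.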

2) DataSet

ImageNet is a database of over 15 million labeled high-resolution photographs divided into approximately 22,000 categories, and it serves as a general benchmark for image recognition. The photos were gathered from the internet and labeled by human annotators using Amazon's Mechanical Turk crowd-sourcing platform. Since 2010, an international competition called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has been held as part of the Pascal Visual Object Challenge. ILSVRC uses a subset of ImageNet, with approximately 1,000 images in each of 1,000 categories.

In total, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 test images. ImageNet is a set of photographs with varying resolutions. The VGG work showed that representation depth improves classification accuracy, and that state-of-the-art performance on the ImageNet challenge dataset can be achieved with a conventional ConvNet architecture of substantially increased depth.

3) VGG Neural Networks

VGG tackles one of CNNs' most critical design choices: depth. Let's take a look at VGG's architecture:

Input. VGG accepts a 224×224-pixel RGB image as input. To keep the input image size constant for the ImageNet competition, the authors cropped out the center 224×224 patch of each image.

Convolutional layers. VGG's convolutional layers use a very small receptive field (3×3, the smallest size that still captures up/down and left/right). There are also 1×1 convolution filters that perform a linear transformation of the input before passing it through a ReLU unit. The convolution stride is fixed at 1 pixel so that spatial resolution is preserved after convolution.
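Why 3×3? Two stacked 3×3 convolutions see the same 5×5 region as a single 5×5 convolution but with fewer weights and an extra nonlinearity in between. A quick sketch of the arithmetic (the `conv_params` helper is ours, biases ignored):

```python
# Hypothetical helper: weight count of a conv layer with k x k filters,
# c_in input channels, and c_out output channels (biases ignored).
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

C = 64  # an example channel width
two_3x3 = 2 * conv_params(3, C, C)   # two stacked 3x3 layers: 2 * 9 * C^2
one_5x5 = conv_params(5, C, C)       # one 5x5 layer:              25 * C^2
print(two_3x3, one_5x5)              # 73728 102400
```

So the stacked 3×3 pair covers the same 5×5 receptive field with about 28% fewer parameters, and three stacked 3×3 layers similarly cover a 7×7 field.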

Fully connected layers. VGG has three fully connected layers: the first two have 4,096 channels each, and the third has 1,000 channels, one for each class.
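These fully connected layers dominate the parameter budget. Assuming the standard VGG-16 configuration, the feature map entering the first fully connected layer is 7×7 with 512 channels, which gives the following weight counts (biases ignored):

```python
# Weight counts for VGG's three fully connected layers, assuming the standard
# VGG-16 configuration (a 7x7x512 feature map after the final pooling layer).
fc1 = 7 * 7 * 512 * 4096   # flattened 7x7x512 feature map -> 4096 channels
fc2 = 4096 * 4096          # 4096 -> 4096
fc3 = 4096 * 1000          # 4096 -> one output per ImageNet class
print(fc1, fc2, fc3)       # 102760448 16777216 4096000
```

The first fully connected layer alone holds over 100 million weights, the bulk of VGG-16's roughly 138 million parameters.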

Hidden layers. VGG uses ReLU in all of its hidden layers (the AlexNet innovation that cut training time dramatically). VGG does not use Local Response Normalization (LRN), because it increases memory consumption and training time while providing little improvement in accuracy.

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is an annual computer vision competition in which teams compete on two tasks. The first is object localization: detecting objects in an image drawn from 200 different classes. The second is image classification: labeling each image with one of 1,000 categories. Karen Simonyan and Andrew Zisserman of Oxford University's Visual Geometry Group proposed VGG-16 in their 2014 paper "Very Deep Convolutional Networks for Large-Scale Image Recognition". In the 2014 ILSVRC challenge, this model took first and second place in the above categories.

4) The Difference

AlexNet has eight layers: five convolutional layers and three fully connected layers, with max-pooling layers after the first, second, and fifth convolutional layers. The first convolutional layer contains 96 filters of size 11×11 with a 4-pixel stride and 2-pixel padding. All other convolutional layers use a stride and padding of 1 pixel. The second convolutional layer contains 256 filters of size 5×5. The third, fourth, and fifth convolutional layers have 384, 384, and 256 filters of size 3×3, respectively.
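The filter counts above can be turned into weight counts with a line of arithmetic. This sketch ignores biases and the two-GPU grouping for simplicity:

```python
# AlexNet's five conv layers as (kernel, in_channels, out_channels),
# taken from the filter sizes listed above; biases and the two-GPU
# grouped convolutions are ignored for simplicity.
layers = [(11, 3, 96), (5, 96, 256), (3, 256, 384), (3, 384, 384), (3, 384, 256)]
weights = [k * k * c_in * c_out for k, c_in, c_out in layers]
print(weights)        # [34848, 614400, 884736, 1327104, 884736]
print(sum(weights))   # 3745824
```

Note how small the convolutional part is, under 4 million weights; as in VGG, the fully connected layers account for most of the model's parameters.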

VGG-16 is much deeper, with 16 weight layers: thirteen convolutional layers with 3×3 filters and three fully connected layers. The fully connected layer configurations in VGG-16 and AlexNet are identical. All of VGG-16's convolutional layers use a stride and padding of 1 pixel. The convolutional layers are divided into five groups, with a max-pooling layer following each group.


VGG is a convolutional neural network architecture that has become a long-standing baseline. It grew out of a study of how to make networks deeper. The network uses tiny 3×3 filters and is otherwise distinguished by its simplicity, with only pooling layers and fully connected layers as additional components. The ImageNet dataset comprises RGB images at a fixed size of 224×224 pixels, so the input is a tensor of shape (224, 224, 3).
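The five-group layout described above can be written out as a compact configuration list, a sketch following the channel widths of the paper's configuration D (the variant usually called VGG-16):

```python
# VGG-16 conv layout: numbers are output channel widths of 3x3 conv layers,
# 'M' marks a 2x2 max-pool (stride 2) that halves the spatial size.
cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
       512, 512, 512, 'M', 512, 512, 512, 'M']

convs = [c for c in cfg if c != 'M']
print(len(convs))   # 13 conv layers; with 3 FC layers that makes 16 weight layers

size = 224          # input spatial size
for c in cfg:
    if c == 'M':
        size //= 2  # each of the 5 pools halves the spatial resolution
print(size)         # 7: the spatial size of the feature map entering the FC layers
```

Because only the pooling layers change the spatial size, 224 is halved five times down to 7, which is where the 7×7×512 input to the first fully connected layer comes from.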

This model takes an input image and turns it into a vector of 1,000 class scores. VGG-16 was one of the best-performing architectures in the 2014 ILSVRC challenge. With a top-5 classification error of 7.32 percent, it came in second place in the classification task, and it took first place in the localization task with a localization error of 25.32 percent.

If you are interested in making a career in the Data Science domain, our 11-month in-person Postgraduate Certificate Diploma in Data Science course can help you immensely in becoming a successful Data Science professional. 
