The debate over whether to prefer deep learning to classical machine learning has been going on for a while among data scientists, especially since Google released TensorFlow as open source in 2015. Deep learning algorithms outperform classical machine learning approaches in some imaging applications, such as face recognition and object detection. Deep Learning (DL) algorithms are like black boxes, whereas Machine Learning (ML) algorithms are more like an art: one needs to be inventive to handcraft features from the images before building the ML models.

Deep learning algorithms need a large amount of data (images) to learn patterns, but in many domains it is not easy to acquire more data. When there is not enough data, we can use a machine learning approach instead. In the ML approach, the programmer performs the feature extraction. For example, given tumor images, the programmer's job is to extract numeric features such as area, diameter, volume, smoothness and edge patterns from them. Extracting those features requires both programming and domain knowledge. In the deep learning approach, we simply feed more and more tumor images to the model, and the model learns the features on its own. The programmer's role is minimal in deep learning because the feature extraction process is not explicit.

In this article we take you step by step through an image processing case study implemented with the ML approach. The example is about predicting the malignancy of the nodules (tumors) present inside the lung region.

The diagnosis of lung cancer at an early stage is critical and uncertain. The lung is a sponge-like tissue. Taking a tissue sample from the lung to examine whether it is cancerous is known as a biopsy. It is a painful procedure, and accurately sampling a small abnormal tissue cluster inside a lung is a challenging task. Physicians will not recommend a biopsy without reliable evidence of lung cancer. They analyse the patient's CT scan and, if they suspect any signs of lung cancer, direct the patient to undergo another CT scan after 3, 6, 9, 12 or 18 months, depending on the patient's smoking habits and environmental conditions. Once the CT images taken at different intervals have been analysed, a biopsy is performed based on the level of disease progression.

The qualitative analysis performed by physicians may vary from one expert to another because of human factors and the huge number of images they need to analyse; the work is time consuming and requires a trained eye. Hence, it has become essential to develop an intelligent computerized model that analyses CT scans to identify any malignancy present. The objective of this case study is to segment the region of interest from the image and extract features to build a machine learning model that detects malignancy in the lung from CT scan images.

The database provided for this work consists of CT scans from 50 patients. Each CT scan contains 60 to 250 cross-sectional CT images. The small white tissue clusters inside the lung parenchyma region are known as ‘nodules’. These nodules may be potential indicators of lung cancer (but most nodules are benign!).

MACHINE LEARNING APPROACH

We know that a machine learning algorithm needs tabular numeric data as input, so we need to extract features from every suspected nodule in the lung parenchyma region. Our region of interest for diagnosing cancer is only the nodules inside the lung, not the whole lung region. Therefore, the first step is to segment this region of interest from the CT scan. The following section of the script loads the image and performs the segmentation. Many segmentation techniques are reported in the literature; in this work we used Otsu's threshold-based segmentation technique.

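import cv2
import matplotlib.pyplot as plt
import numpy as np
import time

# Load one CT slice as a grayscale image ("lung_ct.png" is a placeholder file name)
img = cv2.imread("lung_ct.png", cv2.IMREAD_GRAYSCALE)
cv2.imshow("Lung Image", img)
cv2.waitKey(0)  # keep the window open until a key is pressed

# Otsu's method chooses the threshold automatically, so the 0 passed here is ignored
th, ostu_img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)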
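To make the thresholding step less of a black box, here is a minimal NumPy-only sketch of what Otsu's method computes: it tries every grey level and keeps the one that maximizes the variance between the dark and the bright class. The function name `otsu_threshold` and the synthetic test image are illustrative assumptions, not part of the original script, and the value returned by OpenCV may differ slightly on real images.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the grey level that maximizes the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                    # weight of the dark class for each level
    mu = np.cumsum(prob * np.arange(256))      # cumulative mean up to each level
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    sigma_b = np.zeros(256)                    # between-class variance per level
    valid = denom > 0                          # skip levels where one class is empty
    sigma_b[valid] = (mu_total * omega[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b))

# Synthetic "slice": dark background (30) with one bright square (200)
demo = np.full((64, 64), 30, dtype=np.uint8)
demo[20:40, 20:40] = 200

t = otsu_threshold(demo)
# Same rule as cv2.THRESH_BINARY: pixels above the threshold become 255
binary = np.where(demo > t, 255, 0).astype(np.uint8)
```

On a real CT slice the histogram is not this clean, which is why the segmented image still contains tissue other than nodules.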

In the segmented image above, other tissue regions are segmented as white regions along with the nodules. Morphological processing is used to remove those surrounding white regions. Different morphological operations such as erosion, dilation, inversion, opening, closing and filling can be applied to isolate the region of interest in the segmented image. The appropriate operation depends on the output of the initial segmentation. In this example, we used the clear-border operation to remove all the unwanted white regions and keep only the nodules.

from skimage.segmentation import clear_border
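The idea behind border clearing can be sketched without any imaging library: every white region that touches the image edge is flood-filled to black, and interior regions are left alone. The helper `clear_border_sketch` below is a hypothetical illustration that assumes 4-connectivity; the scikit-image function may use a different connectivity and offers more options.

```python
from collections import deque
import numpy as np

def clear_border_sketch(mask):
    """Remove white regions that touch the image border (4-connectivity)."""
    out = mask.copy()
    rows, cols = out.shape
    queue = deque()
    # Seed the flood fill with every white pixel lying on the border
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and out[r, c]:
                out[r, c] = 0
                queue.append((r, c))
    # Erase everything connected to the seeds
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and out[nr, nc]:
                out[nr, nc] = 0
                queue.append((nr, nc))
    return out

demo = np.zeros((8, 8), dtype=np.uint8)
demo[0:3, 0:2] = 255   # touches the border, so it is removed
demo[4:6, 4:6] = 255   # interior blob, so it is kept
cleaned = clear_border_sketch(demo)
```

In a thresholded CT slice, the body wall and the scanner table typically touch the image edge, which is why this operation is a convenient way to discard them.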



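Also, we can perform different morphological operations to get the lung masks and the nodules as shown below:

# img4 is assumed to be the binary lung image from the previous step; se_fill and
# se_open are structuring elements (for example, built with cv2.getStructuringElement)

# Closing fills the small holes inside the lung region
img4_fill = cv2.morphologyEx(img4, cv2.MORPH_CLOSE, se_fill)

# Opening removes small specks and yields the lung mask
img4_open = cv2.morphologyEx(img4_fill, cv2.MORPH_OPEN, se_open)

# Keep only the pixels of the original image that fall inside the lung mask
paren = img & img4_open

# Bright tissue inside the parenchyma (intensity above 100) is a nodule candidate
thos, nod_th = cv2.threshold(paren, 100, 255, cv2.THRESH_BINARY)

# A final opening cleans up the candidate mask
nodules = cv2.morphologyEx(nod_th, cv2.MORPH_OPEN, se_open)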
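For readers new to morphology, closing is a dilation followed by an erosion, and opening is an erosion followed by a dilation. The NumPy sketch below shows both on a tiny boolean image, assuming a fixed 3x3 square structuring element; the helper names `dilate3` and `erode3` are hypothetical, and the structuring elements used in the script above may be larger or differently shaped.

```python
import numpy as np

def dilate3(mask):
    """Binary dilation with a 3x3 square structuring element."""
    padded = np.pad(mask, 1, constant_values=False)
    out = np.zeros_like(mask)
    for dr in range(3):
        for dc in range(3):
            out |= padded[dr:dr + mask.shape[0], dc:dc + mask.shape[1]]
    return out

def erode3(mask):
    """Binary erosion with a 3x3 square structuring element.

    Pixels outside the image are treated as foreground, so the image
    edge alone does not erode a region.
    """
    padded = np.pad(mask, 1, constant_values=True)
    out = np.ones_like(mask)
    for dr in range(3):
        for dc in range(3):
            out &= padded[dr:dr + mask.shape[0], dc:dc + mask.shape[1]]
    return out

demo = np.zeros((9, 9), dtype=bool)
demo[2:7, 2:7] = True   # 5x5 block standing in for a lung region
demo[4, 4] = False      # one-pixel hole inside the block
demo[0, 8] = True       # isolated speck (noise)

closed = erode3(dilate3(demo))     # closing fills the hole
opened = dilate3(erode3(closed))   # opening removes the speck
```

The order matters: closing first gives a solid mask, and the opening that follows removes specks without reopening the hole.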


After segmenting all the nodules from the lung CT images, different features need to be computed for each nodule. Features generally fall into three categories: 1. Shape 2. Texture and 3. Colour. In this work we computed the shape-based features.

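# Label every connected white region and collect its statistics
nlabels, labels, stats, centroids = cv2.connectedComponentsWithStats(nodules)

# The last column of stats is the area in pixels; row 0 is the background
sizes = stats[1:, -1]
min_size = 15

# Keep only the regions of at least min_size pixels
nodules1 = np.zeros(labels.shape, dtype='uint8')
for i in range(0, nlabels - 1):
    if sizes[i] >= min_size:
        nodules1[labels == i + 1] = 255

# Relabel the filtered mask so the labels run over the kept nodules only
nlabels1, labels1, stats1, centroids1 = cv2.connectedComponentsWithStats(nodules1)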
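The size filter can also be written without an explicit loop. The sketch below is one possible vectorized variant that works on any integer label image; the function name `keep_large_components` and the small hand-made label array are assumptions for illustration only.

```python
import numpy as np

def keep_large_components(labels, min_size):
    """Return a 0/255 mask with only the labelled regions of at least min_size pixels."""
    areas = np.bincount(labels.ravel())        # pixel count per label
    keep = np.flatnonzero(areas >= min_size)   # labels that are large enough
    keep = keep[keep != 0]                     # label 0 is the background
    return np.where(np.isin(labels, keep), 255, 0).astype(np.uint8)

# Hand-made label image: region 1 has 4 pixels, region 2 has 1, region 3 has 3
demo_labels = np.array([
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 0],
    [3, 3, 3, 0, 0],
])
filtered = keep_large_components(demo_labels, min_size=3)
```

With `min_size=3`, regions 1 and 3 survive and the single-pixel region 2 is dropped, which mirrors what the loop above does with `min_size = 15`.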





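# One row per nodule, nine shape features per row
feat = np.zeros((nlabels1 - 1, 9))

for i in range(1, nlabels1):
    # Binary mask holding only nodule i (assumed construction)
    nod1 = (labels1 == i).astype(np.uint8) * 255

    x, y, w, h = cv2.boundingRect(nod1)
    feat[i-1, 0] = float(w) / h  # Aspect ratio

    # OpenCV 4.x returns (contours, hierarchy); OpenCV 3.x returns three values
    cc, hierarchy = cv2.findContours(nod1.copy(), cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
    area = cv2.contourArea(cc[0])
    feat[i-1, 1] = area  # Area

    rect_area = w * h
    feat[i-1, 2] = float(area) / rect_area  # Extent

    hull = cv2.convexHull(cc[0])
    hull_area = cv2.contourArea(hull)
    feat[i-1, 3] = hull_area  # Hull area
    feat[i-1, 4] = float(area) / hull_area  # Solidity
    feat[i-1, 5] = np.sqrt(4 * area / np.pi)  # Equivalent diameter

    (ex, ey), (MA, ma), angle = cv2.fitEllipse(cc[0])
    feat[i-1, 6] = angle  # Orientation of the fitted ellipse

    # Centroid of the nodule (assumed to come from connectedComponentsWithStats)
    feat[i-1, 7] = centroids1[i, 0]  # Centroid_X
    feat[i-1, 8] = centroids1[i, 1]  # Centroid_Y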
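To see what a few of these features mean, the NumPy sketch below computes them directly from a binary mask of a single blob. It is an illustration under simplifying assumptions: the area here is a plain pixel count, whereas `cv2.contourArea` measures the polygon traced by the contour, so the numbers will not match the OpenCV values exactly. The function name `simple_shape_features` is hypothetical.

```python
import math
import numpy as np

def simple_shape_features(mask):
    """Aspect ratio, pixel area, extent, equivalent diameter and centroid of one blob."""
    rows, cols = np.nonzero(mask)
    w = int(cols.max() - cols.min() + 1)   # bounding-box width
    h = int(rows.max() - rows.min() + 1)   # bounding-box height
    area = int(rows.size)                  # number of foreground pixels
    return {
        "aspect_ratio": w / h,
        "area": area,
        "extent": area / (w * h),          # fraction of the bounding box that is filled
        "equi_diameter": math.sqrt(4 * area / math.pi),  # diameter of a circle of equal area
        "centroid": (float(cols.mean()), float(rows.mean())),
    }

demo = np.zeros((10, 10), dtype=np.uint8)
demo[2:6, 3:9] = 255   # rectangle of 4 rows by 6 columns
features = simple_shape_features(demo)
```

For this rectangle the aspect ratio is 6/4 and the extent is 1, because the blob fills its bounding box completely; an irregular nodule gives a lower extent.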



Nine shape features for 18 segmented nodules from one CT scan are extracted after executing the above script.

Nodule Aspect_Ratio Area Extent Hull_Area Solidity Equi-diameter Angle Centroid_X Centroid_Y
Nodule 1 1 30.5 0.376543 35.5 0.859155 6.23168 150.249 279.659 189.854
Nodule 2 0.348315 788.5 0.285792 1845 0.427371 31.6852 159.649 335.058 248.16
Nodule 3 0.5 11 0.34375 13.5 0.814815 3.74241 13.3688 358.421 212.526
Nodule 4 0.9375 1041.5 0.271224 2109.5 0.493719 36.4154 141.711 190.555 264.887
Nodule 5 0.7 42 0.6 43.5 0.965517 7.31273 13.31 203.741 239.519
Nodule 6 0.833333 11.5 0.383333 12 0.958333 3.82652 139.736 216.5 244.611
Nodule 7 0.833333 71.5 0.595833 74 0.966216 9.54131 156.121 323.244 247.081
Nodule 8 0.625 326.5 0.3265 614 0.531759 20.389 159.57 229.711 286.905
Nodule 9 1.5 29.5 0.546296 31 0.951613 6.12867 81.095 166.575 277.8
Nodule 10 1.28571 22 0.349206 29 0.758621 5.29257 119.901 255.094 278.969
Nodule 11 0.5 52 0.530612 55.5 0.936937 8.13686 18.0453 175.104 296.403
Nodule 12 1 11.5 0.319444 11.5 1 3.82652 45 333.722 292.722
Nodule 13 0.666667 65.5 0.303241 99 0.661616 9.13221 162.512 365.172 302.483
Nodule 14 0.702703 484.5 0.503638 576.5 0.840416 24.8372 9.18256 159.629 318.956
Nodule 15 1 21 0.259259 24 0.875 5.17088 136.903 384.129 317.387
Nodule 16 0.714286 18 0.514286 18.5 0.972973 4.78731 20.5219 320.923 357.962
Nodule 17 0.833333 14 0.466667 14.5 0.965517 4.22201 28.568 336.905 371.476
Nodule 18 1.75 11 0.392857 12 0.916667 3.74241 110.04 359 370.5


Out of these 18 nodules, nodule 14 has been detected as cancerous by the medical expert. Hence, the corresponding nodule has been labelled as ‘1’ while the other nodules have been labelled as ‘0’. 

Similarly, the nodule segmentation and feature extraction process can be carried out on other patients’ scans. By concatenating all this information we can form a structured, labelled data-set on which any machine learning classification algorithm can be applied to build a predictive model.
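The concatenation step can be sketched as follows. The three feature rows are copied from the table above (nodules 13, 14 and 15), with nodule 14 labelled 1; the helper `build_dataset` and the variable names are assumptions made for this illustration. Feature arrays and labels from other patients would simply be appended to the list as further pairs.

```python
import numpy as np

columns = ["aspect_ratio", "area", "extent", "hull_area", "solidity",
           "equi_diameter", "angle", "centroid_x", "centroid_y"]

# Rows for nodules 13, 14 and 15, taken from the table above
scan_features = np.array([
    [0.666667, 65.5, 0.303241, 99.0, 0.661616, 9.13221, 162.512, 365.172, 302.483],
    [0.702703, 484.5, 0.503638, 576.5, 0.840416, 24.8372, 9.18256, 159.629, 318.956],
    [1.0, 21.0, 0.259259, 24.0, 0.875, 5.17088, 136.903, 384.129, 317.387],
])
scan_labels = np.array([0, 1, 0])   # nodule 14 was marked cancerous by the expert

def build_dataset(per_scan):
    """Stack (features, labels) pairs from several scans into X and y."""
    X = np.vstack([feats for feats, _ in per_scan])
    y = np.concatenate([labels for _, labels in per_scan])
    return X, y

# Only one scan is available here; more pairs can be added to the list
X, y = build_dataset([(scan_features, scan_labels)])
```

`X` and `y` are then in the tabular form that a classifier expects. Note that malignant nodules are rare, so the resulting data-set is likely to be strongly imbalanced.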

Deriving structured data from unstructured data is the vital step in building a predictive model. We hope this article gives you hands-on experience of converting images into structured data.