Data scientists have debated for some time whether to prefer deep learning over classical machine learning, especially since Google open-sourced TensorFlow in 2015. Deep learning algorithms outperform classical machine learning approaches in some imaging applications, such as face recognition and object detection. Deep Learning (DL) algorithms are like black boxes, whereas Machine Learning (ML) algorithms are more like an art: one needs to be inventive to handcraft features from the images before building the ML models.

Deep learning algorithms need large amounts of data (images) to learn patterns, but in many domains it is not easy to acquire more data. When there is not enough data, a machine learning approach is a practical alternative. In the ML approach, the programmer performs feature extraction. For example, given tumor images, the programmer's job is to extract numeric features such as area, diameter, volume, smoothness and edge patterns from them. Extracting those features requires both programming and domain knowledge. In the deep learning approach, we simply feed more and more tumor images to the model, and the model learns the features on its own. The programmer's role is minimal in deep learning because the feature extraction step is not explicit.

In this article we take you step by step through implementing an image processing case study with the ML approach. The example predicts the malignancy of the nodules (tumors) present inside the lung region.

The diagnosis of lung cancer at an early stage is critical and uncertain. The lung has a sponge-like tissue anatomy. Taking a tissue sample from the lung to examine whether it is cancerous is known as a biopsy. It is a painful procedure, and accurately sampling a small abnormal tissue cluster inside a lung is a challenging task. Physicians will not recommend a biopsy without reliable evidence of lung cancer. Physicians analyse the CT scan of a patient and, if they suspect any signs of lung cancer, direct the patient to undergo another CT scan after 3, 6, 9, 12 or 18 months, depending on the patient's smoking habits and environmental conditions. After analysing the CT images taken at different intervals, a biopsy is performed based on the level of disease progression.

The qualitative analysis performed by physicians may vary from one expert to another because of human factors and the huge number of images they need to analyse, a task that is time consuming and requires a trained eye. Hence, it has become essential to develop an intelligent computerized model that analyses CT scans to identify any malignancy present. The objective of this case study is to segment the region of interest from the image and extract features to build a machine learning model that detects malignancy in the lung from CT scan images.

The database provided for this work consists of CT scans from 50 patients. Each CT scan contains 60 to 250 cross-sectional CT images. The small white tissue clusters inside the lung parenchyma region are known as ‘nodules’. These nodules may be potential indicators of lung cancer (but most nodules are benign!).

MACHINE LEARNING APPROACH

A machine learning algorithm needs tabular numeric data as input, so we need to extract features from all the suspected nodules in the lung parenchyma region. Our region of interest for diagnosing cancer is only the nodules present inside the lung, not the whole lung region. Therefore, the first step is to segment this region of interest from the CT scan. The following section of the script loads the image and performs segmentation. Many segmentation techniques are reported in the literature. In this work we used Otsu's threshold-based segmentation technique.

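import cv2
import matplotlib.pyplot as plt
import numpy as np
import time

# Load the CT slice as a single-channel 8-bit image, which Otsu thresholding requires.
# "lung_ct.png" is a placeholder file name; the original path is not given.
img = cv2.imread("lung_ct.png", cv2.IMREAD_GRAYSCALE)
cv2.imshow("Lung Image", img)

# With THRESH_OTSU the threshold is chosen automatically, so the 0 passed here is ignored.
th, ostu_img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)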
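To make concrete what the Otsu flag computes, here is a minimal NumPy sketch of the idea: try every grey level as a threshold and keep the one that maximises the between-class variance of the two resulting pixel groups. It is an illustration only, not OpenCV's implementation; the function name `otsu_threshold` and the synthetic two-level image are assumptions made for this example.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the grey level t that maximises between-class variance
    when pixels are split into (<= t) and (> t). Expects a uint8 image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)

    w0 = np.cumsum(prob)                 # weight of the dark class for each t
    w1 = 1.0 - w0                        # weight of the bright class
    cum_mean = np.cumsum(prob * levels)  # cumulative first moment
    total_mean = cum_mean[-1]

    denom = w0 * w1
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (total_mean * w0 - cum_mean) ** 2 / denom
    sigma_b[denom == 0] = 0.0            # thresholds that leave one class empty
    return int(np.argmax(sigma_b))

# Synthetic image (assumption): dark background of 40, bright region of 200.
gray = np.array([[40] * 4 + [200] * 4] * 4, dtype=np.uint8)
t = otsu_threshold(gray)
mask = (gray > t).astype(np.uint8) * 255
print("threshold:", t, "white pixels:", np.count_nonzero(mask))
```

On a real CT slice the histogram is rarely this cleanly bimodal, which is one reason the segmented output still needs the morphological clean-up described next.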

In the segmented output, other tissue regions appear as white regions along with the nodules. Morphological processing is performed to remove these surrounding white regions. Different morphological techniques, such as erosion, dilation, inversion, opening, closing and filling, can be applied to isolate the region of interest in the segmented images. The appropriate technique depends on the output of the initial segmentation. In this example, we used clear-border morphology to remove the unwanted white regions and keep only the nodules.

from skimage.segmentation import clear_border
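As a rough illustration of what border clearing does, the sketch below removes every white region connected to the image edge and leaves interior regions untouched. It is a simplified stand-in written for this explanation, not scikit-image's implementation; the name `clear_border_sketch`, the use of 8-connectivity and the toy image are all assumptions.

```python
from collections import deque
import numpy as np

def clear_border_sketch(binary):
    """Zero out every 8-connected foreground region that touches the border."""
    out = binary.copy()
    rows, cols = out.shape
    queue = deque()

    # Seed the flood fill with foreground pixels lying on the image border.
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and out[r, c]:
                out[r, c] = 0
                queue.append((r, c))

    # Spread from the seeds to every connected foreground pixel.
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and out[nr, nc]:
                    out[nr, nc] = 0
                    queue.append((nr, nc))
    return out

# Toy image (assumption): one blob touching the top-left corner, one interior blob.
demo = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
], dtype=np.uint8) * 255
cleaned = clear_border_sketch(demo)
print(cleaned)
```

Only the interior blob survives, which is the behaviour wanted here: tissue connected to the image border is discarded, while nodules inside the lung are kept.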



Also, we can perform different morphological operations to get the lung masks and the nodules as shown below:








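# Closing fills small holes and gaps in the binary image
# (img4, se_fill and se_open come from steps not shown here).
img4_fill = cv2.morphologyEx(img4, cv2.MORPH_CLOSE, se_fill)

# Opening removes small isolated specks, leaving the lung mask.
img4_open = cv2.morphologyEx(img4_fill, cv2.MORPH_OPEN, se_open)

# Keep the original intensities only inside the lung mask (the parenchyma).
paren = img & img4_open

# Pixels brighter than 100 inside the parenchyma are nodule candidates.
thos, nod_th = cv2.threshold(paren, 100, 255, cv2.THRESH_BINARY)

# A final opening cleans up the nodule candidates.
nodules = cv2.morphologyEx(nod_th, cv2.MORPH_OPEN, se_open)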
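The closing and opening steps can be understood through their two building blocks, dilation and erosion. The NumPy sketch below applies them to a toy mask: closing (dilate, then erode) fills a one-pixel hole, and opening (erode, then dilate) removes an isolated speck. The 3x3 square structuring element, the function names and the toy mask are assumptions for illustration; the structuring elements used in the article's script are not shown.

```python
import numpy as np

def dilate3(mask):
    """Binary dilation with a 3x3 square: a pixel is set if any neighbour is set."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dr in range(3):
        for dc in range(3):
            out |= padded[dr:dr + mask.shape[0], dc:dc + mask.shape[1]]
    return out

def erode3(mask):
    """Binary erosion with a 3x3 square: a pixel survives only if all neighbours are set.
    Pixels outside the image are treated as set so the border does not erode by itself."""
    padded = np.pad(mask, 1, mode="constant", constant_values=True)
    out = np.ones_like(mask)
    for dr in range(3):
        for dc in range(3):
            out &= padded[dr:dr + mask.shape[0], dc:dc + mask.shape[1]]
    return out

def closing3(mask):
    return erode3(dilate3(mask))

def opening3(mask):
    return dilate3(erode3(mask))

# Toy mask (assumption): a 3x3 block with a hole in the middle, plus a single-pixel speck.
mask = np.zeros((12, 12), dtype=bool)
mask[2:5, 2:5] = True
mask[3, 3] = False      # hole inside the block
mask[8, 8] = True       # isolated speck

closed = closing3(mask)   # hole is filled, speck is still present
opened = opening3(closed) # speck is removed, filled block remains
print(closed.sum(), opened.sum())
```

Applying closing before opening mirrors the order used in the script: gaps are filled first so that the later opening does not break the region apart.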


After segmenting all the nodules from the lung CT images, features need to be computed for each nodule. Generally, features fall into three categories: 1. Shape 2. Texture and 3. Colour. In this work we computed the shape-based features.

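# Label connected regions; stats has one row per label, with the pixel area in the last column.
nlabels, labels, stats, centroids = cv2.connectedComponentsWithStats(nodules)

# Skip label 0 (the background) and keep the area of every remaining component.
sizes = stats[1:, -1]
min_size = 15

# Keep only the components that contain at least min_size pixels.
nodules1 = np.zeros(labels.shape, dtype='uint8')
for i in range(0, nlabels - 1):
    if sizes[i] >= min_size:
        nodules1[labels == i + 1] = 255

# Relabel the filtered image so the labels run over the surviving nodules only.
nlabels1, labels1, stats1, centroids1 = cv2.connectedComponentsWithStats(nodules1)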





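# One row per nodule, nine feature columns (this allocation is assumed; it is missing from the listing).
feat = np.zeros((nlabels1 - 1, 9))

for i in range(1, nlabels1):
    # Binary mask containing nodule i alone (assumed reconstruction of a missing step).
    nod1 = (labels1 == i).astype(np.uint8) * 255

    x, y, w, h = cv2.boundingRect(nod1)
    feat[i-1, 0] = float(w) / h  # Aspect ratio

    # findContours returns two values in OpenCV 2.x and 4.x (three in 3.x).
    cc, hierarchy = cv2.findContours(nod1.copy(), cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
    area = cv2.contourArea(cc[0])
    feat[i-1, 1] = area  # Area

    rect_area = w * h
    feat[i-1, 2] = float(area) / rect_area  # Extent

    hull = cv2.convexHull(cc[0])
    hull_area = cv2.contourArea(hull)
    feat[i-1, 3] = hull_area  # Hull area
    feat[i-1, 4] = float(area) / hull_area  # Solidity
    feat[i-1, 5] = np.sqrt(4 * area / np.pi)  # Equivalent diameter

    # fitEllipse needs a contour with at least five points.
    (x, y), (MA, ma), angle = cv2.fitEllipse(cc[0])
    feat[i-1, 6] = angle  # Orientation of the fitted ellipse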



Nine shape features for 18 segmented nodules from one CT scan are extracted after executing the above script.

Nodules    Aspect Ratio  Area    Extent    Hull Area  Solidity  Equiv. Diameter  Angle    Feature 8  Feature 9
Nodule 1   1             30.5    0.376543  35.5       0.859155  6.23168          150.249  279.659    189.854
Nodule 2   0.348315      788.5   0.285792  1845       0.427371  31.6852          159.649  335.058    248.16
Nodule 3   0.5           11      0.34375   13.5       0.814815  3.74241          13.3688  358.421    212.526
Nodule 4   0.9375        1041.5  0.271224  2109.5     0.493719  36.4154          141.711  190.555    264.887
Nodule 5   0.7           42      0.6       43.5       0.965517  7.31273          13.31    203.741    239.519
Nodule 6   0.833333      11.5    0.383333  12         0.958333  3.82652          139.736  216.5      244.611
Nodule 7   0.833333      71.5    0.595833  74         0.966216  9.54131          156.121  323.244    247.081
Nodule 8   0.625         326.5   0.3265    614        0.531759  20.389           159.57   229.711    286.905
Nodule 9   1.5           29.5    0.546296  31         0.951613  6.12867          81.095   166.575    277.8
Nodule 10  1.28571       22      0.349206  29         0.758621  5.29257          119.901  255.094    278.969
Nodule 11  0.5           52      0.530612  55.5       0.936937  8.13686          18.0453  175.104    296.403
Nodule 12  1             11.5    0.319444  11.5       1         3.82652          45       333.722    292.722
Nodule 13  0.666667      65.5    0.303241  99         0.661616  9.13221          162.512  365.172    302.483
Nodule 14  0.702703      484.5   0.503638  576.5      0.840416  24.8372          9.18256  159.629    318.956
Nodule 15  1             21      0.259259  24         0.875     5.17088          136.903  384.129    317.387
Nodule 16  0.714286      18      0.514286  18.5       0.972973  4.78731          20.5219  320.923    357.962
Nodule 17  0.833333      14      0.466667  14.5       0.965517  4.22201          28.568   336.905    371.476
Nodule 18  1.75          11      0.392857  12         0.916667  3.74241          110.04   359        370.5
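The feature definitions can be checked by hand against the row for nodule 14. The short sketch below recomputes aspect ratio, extent, solidity and equivalent diameter from the area (484.5) and hull area (576.5) reported in that row. The bounding-box size of 26 x 37 pixels is an assumption: it is not stated in the article, but it is the size consistent with both the reported aspect ratio and the reported extent. The function name `shape_features` is also just illustrative.

```python
import math

def shape_features(w, h, area, hull_area):
    """Shape descriptors using the same formulas as the feature-extraction loop."""
    return {
        "aspect_ratio": w / h,                              # bounding-box width / height
        "extent": area / (w * h),                           # contour area / bounding-box area
        "solidity": area / hull_area,                       # contour area / convex-hull area
        "equivalent_diameter": math.sqrt(4 * area / math.pi),  # diameter of a circle with equal area
    }

# Nodule 14: area and hull area are taken from the table; w and h are inferred (assumption).
f = shape_features(w=26, h=37, area=484.5, hull_area=576.5)
for name, value in f.items():
    print(f"{name}: {value:.6g}")
```

A low solidity or extent indicates an irregular outline, since the nodule fills little of its convex hull or bounding box. That is the kind of shape cue these features are meant to capture.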


Out of these 18 nodules, nodule 14 was identified as cancerous by the medical expert. Hence, that nodule is labelled ‘1’ while the other nodules are labelled ‘0’.

Similarly, the nodule segmentation and feature extraction process can be carried out on other patients’ scan images. By concatenating all this information we form a structured, supervised dataset, to which any machine learning classification algorithm can be applied to build a predictive model.
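One possible way to assemble such a dataset is sketched below: each scan contributes a feature matrix with one row per nodule, plus the indices of the nodules the expert marked as malignant. The helper name `build_dataset` and the zero-filled placeholder arrays are assumptions for illustration; in practice each array would hold the features computed by the extraction loop, and the second scan here is purely hypothetical.

```python
import numpy as np

def build_dataset(scans):
    """Stack per-scan feature matrices into X and build the 0/1 label vector y.

    scans: list of (features, malignant_rows) pairs, where features has shape
    (n_nodules, n_features) and malignant_rows lists the 0-based rows labelled 1.
    """
    X_parts, y_parts = [], []
    for features, malignant_rows in scans:
        labels = np.zeros(len(features), dtype=int)
        labels[np.asarray(list(malignant_rows), dtype=int)] = 1
        X_parts.append(features)
        y_parts.append(labels)
    return np.vstack(X_parts), np.concatenate(y_parts)

# Placeholder arrays (not real measurements):
scan_a = np.zeros((18, 9))  # stands in for the 18 x 9 feature table; nodule 14 is row 13
scan_b = np.zeros((5, 9))   # hypothetical second scan with no malignant nodules

X, y = build_dataset([(scan_a, [13]), (scan_b, [])])
print(X.shape, y.sum())
```

The resulting `X` and `y` have the tabular form that standard classifiers expect. Because malignant nodules are rare, the labels will be heavily imbalanced, which is worth keeping in mind when choosing an evaluation metric.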

Deriving structured data from unstructured data is the vital step in building the predictive model. We hope this article gives you hands-on experience of converting images into structured data.

