Introduction

Semi-structured data sits between structured and unstructured data. It has some structure, but that structure does not conform to a data model, and it lacks a rigid or fixed schema. Such data does not reside in a relational database, yet it has organisational properties that make it easier to analyse. With some processing, it can also be stored in a relational database.

  1. Characteristics of semi-structured data
  2. Sources of semi-structured data
  3. Advantages of semi-structured data
  4. Disadvantages of semi-structured data
  5. Issues in storage
  6. Solutions to data storage of semi-structured data
  7. Issues in extracting information

1. Characteristics of semi-structured data

As noted above, semi-structured data lacks a fixed schema. It has the following characteristics.

  • The data has some structure which, however, does not conform to the structure of a data model.
  • Similar entities are grouped together, and the groups are organised into a hierarchy.
  • It cannot be stored directly as table rows and columns, the way data in a relational database is.
  • The data carries metadata, elements and tags that help group it and describe how it is stored.
  • The attributes of the items in a group typically differ.
  • Entities in the same group may or may not share the same properties and attributes.
  • Because its metadata is insufficient, semi-structured data is hard to manage or automate and cannot simply be put into a table of rows and columns.
  • Writing programs for such data is difficult because it lacks a sufficiently defined structure.
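
As an illustration of these characteristics, here is a minimal Python sketch. The two records and their field names are made up for the example, and JSON is used only as one common notation for semi-structured data. Both records describe the same kind of entity, yet they carry different attributes, and the keys act as self-describing tags.

```python
import json

# Hypothetical records: both describe a person, but their attributes differ.
raw = """
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ben", "phone": "555-0100", "address": {"city": "Pune"}}
]
"""

records = json.loads(raw)

# The keys describe the data; there is no shared, fixed schema to consult.
attribute_sets = [sorted(record.keys()) for record in records]

# Attributes present in every record versus attributes seen in any record.
common = set.intersection(*(set(record) for record in records))
seen = set.union(*(set(record) for record in records))

print(attribute_sets)
print("common to all records:", sorted(common))
print("seen in at least one record:", sorted(seen))
```

Only the name attribute is shared by both records, which is why such data does not drop straight into a table with a fixed set of columns.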

2. Sources of semi-structured data

Semi-structured data is obtained from a variety of sources. Some of them are

  • Web pages.
  • Markup languages like HTML, XML etc.
  • E-mails.
  • Binary executables.
  • Zipped files.
  • TCP/IP data packets.
  • Integrated data sets drawn from various sources.
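
To make one of these sources concrete, the sketch below parses an e-mail with Python's standard `email` package. The message text is invented for the example. The headers form the structured part and can be read by name, while the body is free text with no further structure.

```python
from email import message_from_string

# A made-up message: named headers, a blank line, then free-text body.
raw_message = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly report\n"
    "\n"
    "Hi Bob, the report is attached. Let me know what you think.\n"
)

message = message_from_string(raw_message)

# Structured part: headers are addressable by name.
structured = {
    "from": message["From"],
    "to": message["To"],
    "subject": message["Subject"],
}

# Unstructured part: the body is just a string.
body = message.get_payload()

print(structured)
print(body)
```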

3. Advantages of semi-structured data

Working with semi-structured data has its own advantages. Some of them are

  • Heterogeneous sources may be used for data analysis.
  • The data is not restricted by a rigid or fixed schema.
  • A semi-structured data model is portable.
  • Flexibility is a key factor as the data schema can be easily altered.
  • Users who cannot express their needs in SQL can be supported easily.
  • Structured data can also be viewed as semi-structured data.
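
The last two points in the list can be sketched in Python as follows. The small CSV table and the `tags` attribute are assumptions made for the example. Each row of a structured table is viewed as a self-describing record, and one record then gains an attribute without any change to the others.

```python
import csv
import io
import json

# A small structured table, held in memory for the example.
table = io.StringIO("id,name,city\n1,Asha,Pune\n2,Ben,Delhi\n")

# View each row as a self-describing record (column name -> value).
rows = [dict(row) for row in csv.DictReader(table)]

# Flexibility: extend one record only; no table-wide schema change is needed.
rows[0]["tags"] = ["vip"]

documents = json.dumps(rows, indent=2)
print(documents)
```

Note that `csv.DictReader` yields every value as a string, so the `id` values here are `"1"` and `"2"` rather than integers.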

4. Disadvantages of semi-structured data

With advantages come disadvantages too. Some of the disadvantages of semi-structured data, compared with structured data, are

  • Storing the data is difficult because it lacks a rigid or fixed schema.
  • Interpreting the relationships between data elements is difficult because the data and its schema are not clearly separated.
  • Queries are less efficient than they are on structured data.

5. Issues in storage

  • Storage costs are generally higher for semi-structured data than for structured data.
  • The data and its schema are tightly coupled and depend on each other, so the same query may update both, and the schema ends up being updated more frequently than is desirable.
  • Semi-structured data is only partially structured. Some of it may have a definite structure while other sets do not conform to one, which makes the relationships in the data difficult to interpret and understand.
  • Because the data and its schema are not clearly separated, designing the data structure can become complicated.
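
The coupling between data and schema can be illustrated with a short Python sketch. The `infer_schema` helper and the sample records are hypothetical and exist only for this example. Since no schema is declared separately, the only schema available is the one inferred from the records, so adding a single record changes it.

```python
def infer_schema(records):
    """Map each attribute name to the set of Python type names observed for it."""
    schema = {}
    for record in records:
        for key, value in record.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema


records = [{"id": 1, "name": "Asha"}]
before = infer_schema(records)

# One write adds a new attribute and a second type for an existing one.
records.append({"id": "B-2", "name": "Ben", "phone": "555-0100"})
after = infer_schema(records)

print("before:", before)
print("after:", after)
```

A single data update has both added an attribute (`phone`) and widened the type of `id`, which is the kind of schema churn described in the list.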

6. Solutions to data storage of semi-structured data

  • A relational DBMS can also be used for storage, once the semi-structured data has been mapped to a relational schema and then into tables.
  • A DBMS designed specifically for such data can store semi-structured data easily.
  • The Object Exchange Model (OEM) represents data as a graph and can therefore be used to exchange and store semi-structured data.
  • XML does not tightly couple data and schema. It stores data hierarchically, with tags and attributes defining the elements, which suits both storing and exchanging semi-structured data.
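
As a sketch of the first option, the Python example below maps records with differing attributes onto a single relational table using the standard `sqlite3` module. The entity-attribute-value layout, the table name `attribute` and the sample records are assumptions chosen for illustration. This is one possible mapping, not the only one.

```python
import sqlite3

# Hypothetical semi-structured records with differing attributes.
records = [
    {"name": "Asha", "email": "asha@example.com"},
    {"name": "Ben", "phone": "555-0100"},
]

conn = sqlite3.connect(":memory:")

# One row per (entity, attribute, value), so new attributes need no new columns.
conn.execute("CREATE TABLE attribute (entity_id INTEGER, name TEXT, value TEXT)")

for entity_id, record in enumerate(records, start=1):
    conn.executemany(
        "INSERT INTO attribute (entity_id, name, value) VALUES (?, ?, ?)",
        [(entity_id, key, str(value)) for key, value in record.items()],
    )
conn.commit()

# Find every entity that has a phone attribute.
phones = conn.execute(
    "SELECT entity_id, value FROM attribute WHERE name = ?", ("phone",)
).fetchall()
row_count = conn.execute("SELECT COUNT(*) FROM attribute").fetchone()[0]

print(phones)
print(row_count)
```

The trade-off of this layout is that reassembling a full record takes one lookup or join per attribute, which is consistent with the earlier point that queries become less efficient than on structured data.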

7. Issues in extracting information

Since heterogeneous sources are used, semi-structured data generally has only partial structure or no structure at all. Indexing or tagging such data and extracting information from it is therefore difficult. These issues can be addressed as follows.

  • OEM modelling techniques store such data in a graph-based model, which makes it more accessible and easier to search through the attached tags and index.
  • XML permits searching and indexing of semi-structured data because it arranges the data in a hierarchy.
  • Graph-based models such as OEM can be used to index and tag semi-structured data.
  • Data mining tools can be used to extract information from semi-structured data.
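
The XML point can be sketched with Python's standard `xml.etree.ElementTree` module. The document, its tag names and the `id` attribute are invented for the example. The hierarchy allows a search by path, and the tags allow a simple index from tag name to the records that contain it.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# A made-up document: two person elements with different child elements.
xml_text = """
<people>
  <person id="1"><name>Asha</name><email>asha@example.com</email></person>
  <person id="2"><name>Ben</name><phone>555-0100</phone></person>
</people>
""".strip()

root = ET.fromstring(xml_text)

# Search: follow the hierarchy with a path expression.
names = [element.text for element in root.findall("./person/name")]

# Index: map each child tag to the ids of the person elements containing it.
index = defaultdict(list)
for person in root.iter("person"):
    for child in person:
        index[child.tag].append(person.get("id"))

print(names)
print(dict(index))
```

With the index built, a question such as "which records have a phone element" becomes a dictionary lookup instead of a scan of the whole document.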

Conclusion

For information extraction and data analytics, it is critical to understand the differences between structured, semi-structured and unstructured data. Understanding the data also makes it easier to store and helps in choosing among the techniques used to extract, search and analyse it.
