INTRODUCTION

Welcome to this comprehensive Spark RDD tutorial. The RDD (Resilient Distributed Dataset) is the main logical data unit in Spark. It is a distributed collection of objects that is stored in memory or on disk across the machines of a cluster. A single RDD can be divided into multiple logical partitions, and these partitions can be stored and processed on different machines of the cluster.

It is crucial to note that a Spark RDD is read-only, that is, immutable. You cannot change the original RDD, but you can create new RDDs by applying coarse-grained operations, such as transformations, to an existing RDD.

It is possible to cache an RDD in Spark so that it can be reused in later transformations without being recomputed. This is a significant benefit whenever the same data is used more than once.

RDDs are lazily evaluated, which means that evaluation is delayed until a result is actually required. This saves time and improves efficiency.

  1. Features of an RDD in Spark
  2. Operations on RDDs
  3. Transformations
  4. Actions
  5. Creating an RDD
  6. Parallelizing the object Collection 

1. Features of an RDD in Spark

The main features of an RDD in Spark are as follows:

  • Resilience, also called fault tolerance: the RDD tracks data lineage information so that lost data can be recovered automatically after a failure.
  • Distributed: the data in an RDD resides on different nodes of a cluster.
  • Lazy evaluation: defining an RDD does not load any data. The transformations are computed only when an action is called, such as collect, count, or saving the output to the file system.
  • Immutability: the data stored in an RDD is read-only and cannot be edited. It is, however, possible to create a new Spark RDD by performing a transformation on the existing RDD.
  • In-memory computation: the RDD keeps the intermediate data that it generates in memory (RAM) rather than on disk, which offers fast access.
  • Partitioning: an RDD is split into logical parts that can be processed in parallel. Because the RDD is immutable, its partitioning is changed by applying a transformation, which produces a new RDD.
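
Lazy evaluation and immutability can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark code: the LazyDataset class and the double helper are hypothetical names invented for this example. It assumes only that a transformation records a step and returns a new object, and that an action replays the recorded steps.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not part of Spark)."""

    def __init__(self, source, steps=()):
        self._source = list(source)  # the source data is never modified
        self._steps = tuple(steps)   # lineage: ordered steps to replay later

    def map(self, fn):
        # Transformation: returns a NEW dataset, computes nothing yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also only recorded, not executed.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: replay the recorded steps over the source data now.
        data = self._source
        for kind, fn in self._steps:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


calls = []  # records every time double() actually runs


def double(x):
    calls.append(x)
    return x * 2


numbers = LazyDataset([1, 2, 3, 4])
doubled = numbers.map(double)           # recorded only
big = doubled.filter(lambda x: x > 4)   # recorded only
calls_before_action = len(calls)        # nothing has run so far
result = big.collect()                  # the action triggers the work
```

In this sketch, numbers is left unchanged by map and filter, and double does not run until collect is called. Spark itself additionally distributes the work across partitions, which this toy model does not attempt.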

2. Operations on RDDs

An RDD supports two basic kinds of operations: transformations and actions.

3. Transformations

Transformations are functions that take an existing RDD as input and produce one or more new RDDs as output. The data in the existing RDD does not change, because it is immutable. Transformations are not executed at the point where they are defined; they run only when an action is invoked. Each time a transformation is applied, a new RDD is created.

  • The map function returns a new RDD by applying a function to every data element.
  • The filter function returns a new RDD formed by selecting the elements of the source for which the function returns true.
  • The reduceByKey function aggregates the values of each key using the given function.
  • The groupByKey function converts (key, value) pairs into pairs of a key and an iterable of its values.
  • Union returns a new RDD that contains every element of the source RDD and of the argument RDD.
  • Intersection returns a new RDD that contains only the elements common to both datasets.
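
The meaning of each transformation listed above can be sketched on ordinary Python lists. This is an illustration of the semantics only, under the assumption that small local lists stand in for RDDs; the variable names are invented for the example, and no Spark API is called.

```python
from collections import defaultdict
from functools import reduce

words = ["spark", "rdd", "spark", "data", "rdd", "spark"]

# map: apply a function to every element (here, build (key, value) pairs)
pairs = [(word, 1) for word in words]

# filter: keep the elements for which the function returns true
long_words = [word for word in words if len(word) > 3]

# groupByKey: (key, value) pairs become key -> all values for that key
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# reduceByKey: merge the values of each key with a two-argument function
counts = {key: reduce(lambda a, b: a + b, values)
          for key, values in grouped.items()}

left, right = [1, 2, 3], [3, 4]

# union: every element of both datasets, duplicates kept
union = left + right

# intersection: only the elements common to both, without duplicates
intersection = sorted(set(left) & set(right))
```

Here counts holds the number of occurrences of each word, which is the classic word-count pattern usually written with map followed by reduceByKey.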

4. Actions

Actions in Spark are functions that return a result after the RDD computation has run. Spark uses the lineage graph to load the data into the RDD and apply the transformations in the required order. Once the transformations are done, the action returns the final result to the Spark driver. In other words, actions are operations that produce a non-RDD value.

  • The count function returns the number of data elements in the RDD.
  • The collect function returns all the data elements of the RDD in the form of an array.
  • Reduce aggregates the data elements of the RDD using a function that takes two arguments and returns one.
  • The take function fetches the first n elements of the RDD.
  • The foreach function executes an operation for every data element in the RDD.
  • The first function retrieves the first element of the RDD.
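
As a rough guide to what each action listed above returns, here is a plain-Python sketch in which a local list stands in for the elements of an RDD. The variable names are invented for the example, and the comments name the RDD action that each line imitates.

```python
from functools import reduce

data = [5, 3, 8, 1]  # stands in for the elements of an RDD

count = len(data)                          # count(): number of elements
collected = list(data)                     # collect(): all elements as a local list
total = reduce(lambda a, b: a + b, data)   # reduce(f): f takes two values, returns one
first_two = data[:2]                       # take(2): the first two elements
first = data[0]                            # first(): the first element

seen = []
for element in data:                       # foreach(f): run f once per element
    seen.append(element)
```

In Spark, collect brings every element back to the driver, so take(n) is usually the safer way to inspect a large RDD.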

5. Creating an RDD

There are three ways to create an RDD: loading an external dataset, parallelizing a collection of objects, and transforming an existing RDD.

Loading the external dataset

An external file can be loaded into an RDD. Supported file types include text (.txt), CSV, JSON, and others.
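
When a text file is loaded (in PySpark this is done with SparkContext.textFile), each line of the file becomes one element of the resulting RDD. The sketch below imitates that behaviour in plain Python so that it runs without Spark. The load_lines helper, the sample.txt file name, and the sample contents are all hypothetical.

```python
import os
import tempfile


def load_lines(path):
    """Return the file's lines as a list, one element per line (toy helper)."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    # Write a small CSV-style file to load back in.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("id,name\n1,alice\n2,bob\n")
    lines = load_lines(path)

# A map-style step then turns each raw line into structured fields.
header = lines[0]
rows = [line.split(",") for line in lines[1:]]
```

Splitting each line into fields corresponds to applying a map transformation to the loaded RDD.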

6. Parallelizing the object Collection

When Spark's parallelize method is applied to a collection of elements, it creates a new distributed dataset. This dataset is the RDD.
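
To show what distributing a collection means, the following plain-Python sketch splits a local collection into contiguous partitions and processes each one independently. The split_into_partitions function is a hypothetical helper written for this example; the exact way Spark assigns elements to partitions is an assumption here and may differ.

```python
def split_into_partitions(items, num_partitions):
    """Split items into num_partitions contiguous slices (toy helper)."""
    items = list(items)
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


partitions = split_into_partitions(range(10), 3)

# Each partition can be processed on its own, as a worker node would do.
partial_sums = [sum(part) for part in partitions]

# The partial results are then combined into the final answer.
total = sum(partial_sums)
```

In PySpark, the corresponding call is SparkContext.parallelize, which accepts the collection and, optionally, the number of partitions to create.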

Carrying out transformations in the existing RDD

You can create one or more RDDs by applying a transformation to an existing RDD; the map function is one example. However, the data in an RDD is not always organized. It is often unstructured, because it is sourced from different places.

CONCLUSION

This brings us to the end of this Spark RDD tutorial. It covered the features of an RDD, the transformations and actions it supports, and the ways to create one.

