Apache Spark is a data processing and analytics engine used in data engineering and data science for large-scale data. Spark was introduced by the Apache Foundation to speed up the Hadoop computational processing workflow. Spark is not a modified version of Hadoop; rather, it uses Hadoop in two ways: one for storage and the other for processing.
Apache Spark is a very fast cluster-computing technology. It builds on Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing. Spark's main feature is its in-memory cluster computing, which increases the processing speed of applications. Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming.
Features of Apache Spark:
Apache Spark has the following features:
- Speed: Spark helps run applications in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. It achieves this by reducing the number of read/write operations to disk.
- Supports multiple languages: Spark provides built-in APIs in Java, Scala and Python, so applications can be written in different languages. Spark also comes with 80 high-level operators for interactive querying.
- Advanced analytics: Spark supports not only 'Map' and 'Reduce' but also SQL queries, streaming data, machine learning and graph algorithms.
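The 'Map' and 'Reduce' style of computation that Spark generalizes can be sketched in plain Python. This is a conceptual illustration only, not the Spark API; in real Spark code the equivalent would be `flatMap`/`reduceByKey` calls on an RDD, and each partition would live on a different cluster node:

```python
from collections import Counter
from functools import reduce

# A toy "dataset" split across partitions, as a cluster would hold it.
partitions = [
    ["spark is fast", "spark is general"],
    ["hadoop stores data"],
]

# "Map" phase: count words within each partition independently
# (this is the part that can run in parallel across nodes).
partial_counts = [
    Counter(word for line in part for word in line.split())
    for part in partitions
]

# "Reduce" phase: merge the per-partition results into one final count.
total = reduce(lambda a, b: a + b, partial_counts)
print(total["spark"])  # "spark" appears twice across both partitions -> 2
```

Interactive queries and streaming fit the same model: the per-partition work stays the same, only the source and scheduling of the data change.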
Spark built on Hadoop:
There are three ways of Spark deployment:
- Standalone: Spark Standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System), and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
- Hadoop YARN: Hadoop YARN deployment means, simply, Spark running on YARN without any pre-installation or root access required. It helps integrate Spark into the Hadoop ecosystem, allowing other components to run on top of the stack.
- Spark in MapReduce (SIMR): Spark in MapReduce is used to launch Spark jobs in addition to standalone deployment. With SIMR, the user can start Spark and use its shell without any administrative access.
Components of Spark
Apache Spark Core
Spark Core is the underlying general execution engine for the Spark platform, upon which all other functionality is built. It provides in-memory computing and referencing of datasets in external storage systems.
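The benefit of keeping a dataset in memory rather than re-reading it from external storage can be sketched in plain Python. This is a conceptual analogy, not Spark code; the `load_from_storage` function and the 0.1-second delay are stand-ins invented for illustration:

```python
import time

def load_from_storage():
    """Stand-in for reading a dataset from slow external storage (e.g. HDFS)."""
    time.sleep(0.1)          # simulate disk/network latency
    return list(range(1000))

cache = {}                   # in-memory cache, like a cached dataset in Spark

def get_dataset():
    # First access pays the storage cost; later accesses are served from memory.
    if "data" not in cache:
        cache["data"] = load_from_storage()
    return cache["data"]

t0 = time.perf_counter()
get_dataset()                # cold: reads from "storage"
cold = time.perf_counter() - t0

t0 = time.perf_counter()
get_dataset()                # warm: served from memory
warm = time.perf_counter() - t0

print(warm < cold)           # in-memory access is much faster
```

Iterative algorithms and interactive queries benefit most from this, because they touch the same dataset many times.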
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches of data.
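The mini-batch idea can be sketched in plain Python: a continuous stream is chopped into small batches, and each batch is processed with the same logic a batch job would use. This is a conceptual illustration, not the Spark Streaming API, and the batch size of 4 is an arbitrary choice for the example:

```python
stream = iter(range(10))          # stand-in for an unbounded event stream
BATCH_SIZE = 4

def next_batch(source, size):
    """Pull up to `size` events from the stream into one mini-batch."""
    batch = []
    for _ in range(size):
        try:
            batch.append(next(source))
        except StopIteration:
            break
    return batch

results = []
while True:
    batch = next_batch(stream, BATCH_SIZE)
    if not batch:
        break
    # The "RDD transformation" applied to each mini-batch: here, a sum.
    results.append(sum(batch))

print(results)  # [0+1+2+3, 4+5+6+7, 8+9] -> [6, 22, 17]
```

Because each mini-batch is just a small dataset, the same batch-processing machinery (and the same code) handles both batch and streaming workloads.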
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework above Spark, made possible by the distributed memory-based Spark architecture. Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout.
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model user-defined graphs by using the Pregel abstraction API.
Resilient Distributed Datasets
Resilient Distributed Datasets (RDDs) are the fundamental data structure of Spark. An RDD is an immutable distributed collection of objects, divided into logical partitions that may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java or Scala objects, including user-defined classes. Formally, an RDD is a read-only, partitioned collection of records. It can be created through deterministic operations either on data in stable storage or on other RDDs. An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
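The two key RDD properties above, partitioning and immutability, can be sketched in plain Python. This is a conceptual model, not the Spark API: a "transformation" builds a new partitioned collection deterministically from its parent instead of mutating it, which is what makes lost partitions recomputable:

```python
# A tiny "RDD": a list of (partition_id, records) pairs, with immutable
# tuples for the records. Each partition could sit on a different node.
parent = [(0, (1, 2, 3)), (1, (4, 5, 6))]

def rdd_map(partitions, fn):
    """Apply fn to every record, per partition, producing a NEW collection.

    The parent is left untouched -- transformations never modify their input,
    so any partition of the child can be deterministically recomputed from
    the parent if it is lost.
    """
    return [(pid, tuple(fn(x) for x in records)) for pid, records in partitions]

child = rdd_map(parent, lambda x: x * 10)

print(child)   # [(0, (10, 20, 30)), (1, (40, 50, 60))]
print(parent)  # unchanged: [(0, (1, 2, 3)), (1, (4, 5, 6))]
```

In real Spark, `rdd.map(lambda x: x * 10)` plays the role of `rdd_map` here, and the read-only parent plus the recorded transformation is exactly what gives RDDs their fault tolerance.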
- What is Apache Spark? What are the components of Spark?
- Explain the features of Apache Spark.