Apache Spark is a data processing and analytics engine used in data engineering and data science for large-scale data. Spark was introduced by the Apache Foundation to speed up the Hadoop computational processing workflow. Spark is not a modified version of Hadoop; rather, it uses Hadoop in two ways: one for storage and the other for processing.
Apache Spark is a very fast cluster-computing technology. It builds on Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing. Spark's main feature is its in-memory cluster computing, which increases the processing speed of applications. Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming.
Features of Apache Spark:
Apache Spark has the following features:
- Speed: Spark helps run applications in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. It achieves this by reducing the number of read/write operations to disk.
- Supports multiple languages: Spark provides built-in APIs in Java, Scala and Python, so applications can be written in different languages. Spark also comes with 80 high-level operators for interactive querying.
- Advanced analytics: Spark supports not only 'Map' and 'Reduce' but also SQL queries, streaming data, machine learning and graph algorithms.
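The 'Map' and 'Reduce' style of computation that Spark generalizes can be sketched in plain Python. This is a conceptual illustration only, not the Spark API; in real Spark code the equivalent would be `flatMap`/`reduceByKey` calls on an RDD, and each partition would live on a different cluster node:

```python
from collections import Counter
from functools import reduce

# A toy "dataset" split across partitions, as a cluster would hold it.
partitions = [
    ["spark is fast", "spark is general"],
    ["hadoop stores data"],
]

# "Map" phase: count words within each partition independently
# (this is the part that can run in parallel across nodes).
partial_counts = [
    Counter(word for line in part for word in line.split())
    for part in partitions
]

# "Reduce" phase: merge the per-partition results into one final count.
total = reduce(lambda a, b: a + b, partial_counts)
print(total["spark"])  # "spark" appears twice across both partitions -> 2
```

Interactive queries and streaming fit the same model: the per-partition work stays the same, only the source and scheduling of the data change.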
Spark built on Hadoop:
There are three ways of Spark deployment:
- Standalone: Spark Standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System), and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
- Hadoop YARN: Hadoop YARN deployment means, simply, Spark running on YARN without any pre-installation or root access required. It helps integrate Spark into the Hadoop ecosystem, allowing other components to run on top of the stack.
- Spark in MapReduce (SIMR): Spark in MapReduce is used to launch Spark jobs in addition to standalone deployment. With SIMR, the user can start Spark and use its shell without any administrative access.
Components of Spark
Apache Spark Core
Spark Core is the underlying general execution engine for the Spark platform, upon which all other functionality is built. It provides in-memory computing and referencing of datasets in external storage systems.
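The benefit of keeping a dataset in memory rather than re-reading it from external storage can be sketched in plain Python. This is a conceptual analogy, not Spark code; the `load_from_storage` function and the 0.1-second delay are stand-ins invented for illustration:

```python
import time

def load_from_storage():
    """Stand-in for reading a dataset from slow external storage (e.g. HDFS)."""
    time.sleep(0.1)          # simulate disk/network latency
    return list(range(1000))

cache = {}                   # in-memory cache, like a cached dataset in Spark

def get_dataset():
    # First access pays the storage cost; later accesses are served from memory.
    if "data" not in cache:
        cache["data"] = load_from_storage()
    return cache["data"]

t0 = time.perf_counter()
get_dataset()                # cold: reads from "storage"
cold = time.perf_counter() - t0

t0 = time.perf_counter()
get_dataset()                # warm: served from memory
warm = time.perf_counter() - t0

print(warm < cold)           # in-memory access is much faster
```

Iterative algorithms and interactive queries benefit most from this, because they touch the same dataset many times.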
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches of data.
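The mini-batch idea can be sketched in plain Python: a continuous stream is chopped into small batches, and each batch is processed with the same logic a batch job would use. This is a conceptual illustration, not the Spark Streaming API, and the batch size of 4 is an arbitrary choice for the example:

```python
stream = iter(range(10))          # stand-in for an unbounded event stream
BATCH_SIZE = 4

def next_batch(source, size):
    """Pull up to `size` events from the stream into one mini-batch."""
    batch = []
    for _ in range(size):
        try:
            batch.append(next(source))
        except StopIteration:
            break
    return batch

results = []
while True:
    batch = next_batch(stream, BATCH_SIZE)
    if not batch:
        break
    # The "RDD transformation" applied to each mini-batch: here, a sum.
    results.append(sum(batch))

print(results)  # [0+1+2+3, 4+5+6+7, 8+9] -> [6, 22, 17]
```

Because each mini-batch is just a small dataset, the same batch-processing machinery (and the same code) handles both batch and streaming workloads.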
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework above Spark, made possible by the distributed memory-based Spark architecture. Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout.
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model user-defined graphs by using the Pregel abstraction API.
Resilient Distributed Datasets
Resilient Distributed Datasets (RDDs) are the fundamental data structure of Spark. An RDD is an immutable distributed collection of objects, divided into logical partitions that may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java or Scala objects, including user-defined classes. Formally, an RDD is a read-only, partitioned collection of records. It can be created through deterministic operations either on data in stable storage or on other RDDs. An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
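The two key RDD properties above, partitioning and immutability, can be sketched in plain Python. This is a conceptual model, not the Spark API: a "transformation" builds a new partitioned collection deterministically from its parent instead of mutating it, which is what makes lost partitions recomputable:

```python
# A tiny "RDD": a list of (partition_id, records) pairs, with immutable
# tuples for the records. Each partition could sit on a different node.
parent = [(0, (1, 2, 3)), (1, (4, 5, 6))]

def rdd_map(partitions, fn):
    """Apply fn to every record, per partition, producing a NEW collection.

    The parent is left untouched -- transformations never modify their input,
    so any partition of the child can be deterministically recomputed from
    the parent if it is lost.
    """
    return [(pid, tuple(fn(x) for x in records)) for pid, records in partitions]

child = rdd_map(parent, lambda x: x * 10)

print(child)   # [(0, (10, 20, 30)), (1, (40, 50, 60))]
print(parent)  # unchanged: [(0, (1, 2, 3)), (1, (4, 5, 6))]
```

In real Spark, `rdd.map(lambda x: x * 10)` plays the role of `rdd_map` here, and the read-only parent plus the recorded transformation is exactly what gives RDDs their fault tolerance.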
- What is Apache Spark? What are the components of Spark?
- Explain the features of Apache Spark.