Spark SQL provides native SQL support and streamlines the process of querying data stored both in RDDs and in external sources. It blurs the line between RDDs and relational tables. By unifying these abstractions, it makes it easy for developers to intermix SQL queries over external data with complex analytics, all within a single application. Spark SQL allows developers to:
– import relational data from Parquet files and Hive tables
– run SQL queries over imported data and existing RDDs
– easily write RDDs out to Hive tables or Parquet files
Spark SQL includes a cost-based optimizer, columnar storage, and code generation to make queries fast. It scales to thousands of nodes and multi-hour queries using the Spark engine, which provides mid-query fault tolerance.
Why is Spark SQL used?
Spark SQL originated as a way to run Apache Hive queries on Spark and is now integrated into the Spark stack. Apache Hive had a number of drawbacks, and Spark SQL was built to overcome them and, in many deployments, replace Apache Hive.
How does Spark SQL work?
Spark SQL blurs the line between RDDs and relational tables. It provides much tighter integration between relational and procedural processing through a declarative DataFrame API that integrates with procedural Spark code. It also offers extensive optimization. Applications interact with Spark SQL through the DataFrame and Dataset APIs.
Spark SQL makes Apache Spark accessible to more users and improves optimization for existing ones. It exposes DataFrame APIs that perform relational operations on both external data sources and Spark's built-in distributed collections. Its optimizer, called Catalyst, supports a wide range of data sources and big-data algorithms.
Architecture of Spark SQL
Spark SQL libraries:
Spark SQL has four libraries, which mediate between relational and procedural processing:
1. Data Source API (Application Programming Interface)
This is a universal API for loading and storing structured data.
- It has built-in support for Hive, Avro, JSON, JDBC, Parquet, and more.
- It supports third-party integration through Spark packages.
- It supports smart sources, which can push filtering down to the data source.
- It provides data abstraction and a domain-specific language for structured data.
- The DataFrame API exposes distributed collections of data in the form of named columns and rows.
- DataFrames are evaluated lazily, like other Apache Spark transformations, and can be accessed through an SQLContext or HiveContext.
2. DataFrame API
A DataFrame is a distributed collection of data organized into named columns. It is equivalent to a relational table in SQL.
3. SQL Interpreter And Optimizer
The SQL Interpreter and Optimizer is based on functional programming and is built in Scala.
- It is the newest and most technically advanced component of Spark SQL.
- It provides a framework for transforming trees, which is used to perform analysis, planning, and runtime code generation.
- It supports cost-based optimization to make queries run much faster than equivalent hand-written RDD code.
4. SQL service
The SQL service is the entry point for working with structured data in Spark. It allows the creation of DataFrame objects as well as the execution of SQL queries.
Features of Spark SQL
1. Integration with spark
Spark SQL queries are seamlessly integrated with Spark programs: structured data can be queried from within a Spark program using either SQL or the DataFrame API, in Java, Scala, Python, and R. To execute a streaming computation, developers simply write a batch computation against the DataFrame API, and Spark runs it incrementally over the streaming data.
2. Uniform data access
DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, JSON, and JDBC. Data can even be joined across these sources.