HOW TO SET UP AN APACHE SPARK STANDALONE CLUSTER ON MULTIPLE MACHINES


Scenario :- Consider a scenario where you want to give a proof of concept to your boss or team lead about why to use Apache Spark, and you also want to leverage the complete power of Apache Spark, but you don’t know how to set up a Spark cluster. Then this is the right place for you.

Another scenario can be that you want to process a large amount of data iteratively on multiple nodes (machines) but are not able to set up your Apache Spark cluster; then this is the place to start.

This document describes the steps to set up a Spark cluster on multiple machines and to submit your first job to Spark. It is aimed at newbies who don’t know how to set up a Spark cluster, because the Apache Spark documentation only tells you how to start spark-shell and submit jobs to a Spark master running locally.

In this document I try to solve this problem. It is a single document to get started on Apache Spark with basic knowledge of its internals.

 

Question 1. What is Apache Spark? Why should I consider it over Apache Hadoop?

Apache Spark is a distributed in-memory computation framework. Apache Hadoop has the following components –

  1. Hadoop Map-Reduce – Method for distributed computation.
  2. HDFS – Hadoop Distributed file system
  3. Yarn – Cluster management system.

Apache Spark does not replace Apache Hadoop; rather, it leverages Apache Hadoop in the following way –

  1. Spark  – In-memory distributed computation.
  2. HDFS – Hadoop Distributed file system
  3. Yarn –  Cluster management system.

The problem with Hadoop Map-Reduce is that the output data is written to disk on job completion. Now consider a case where there are multiple jobs: after each job completes its output is written to disk, and if the output of the first job is needed by the second job then it has to be read back from disk. This makes the time taken to complete multiple Hadoop jobs proportional to disk read/write operations.

time taken ∝ disk read/write operations

Here Apache Spark comes in: the output of one job is not written to disk but persisted in memory, so that other jobs can use the same data already present in memory.

 

STEPS TO SET UP A SPARK CLUSTER ON MULTIPLE MACHINES

Before starting, let me introduce the terminology used in Apache Spark.

Spark Worker – the cluster node which actually executes the tasks.
Spark Master – manages the cluster resources, i.e. the worker nodes.
Spark Driver – the client application which asks the Spark master for resources and executes tasks on the worker nodes.

(Figure: Spark cluster overview)

There are three different types of cluster managers in Apache Spark:

  1. Spark Standalone – Spark workers are registered with the Spark master.
  2. YARN – Spark workers are registered with the YARN cluster manager.
  3. Mesos – Spark workers are registered with Mesos.

We will be setting up Spark Standalone here.

(I will be using Spark 1.4.0 and the host operating system is Linux Mint.)

  1. Download the Spark binary from the official site.
  2. Check if Java is installed; if not, you can install it using the following command.

sudo apt-get install default-jdk (or go to this link).
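To verify that the JDK is installed and on your PATH, you can check the version from a terminal (the exact version string depends on what you installed):

java -version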

  3. Set the environment variable JAVA_HOME to where your Java is installed. It will usually be /usr/lib/jvm/java-7-openjdk-i386/ or /usr/lib/jvm/java-7-openjdk-i686/ depending on your architecture. To set the environment variable, open /home/USERNAME/.bashrc and add at the end of the file:

export JAVA_HOME=PATH_OF_JAVA
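For example, assuming OpenJDK 7 on a 32-bit machine (the path below is an assumption – point it at your own JDK installation), the end of ~/.bashrc might look like this; reload the file afterwards so the variable takes effect:

# example only – adjust to the directory your JDK actually lives in
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
export PATH=$JAVA_HOME/bin:$PATH

source ~/.bashrc
echo $JAVA_HOME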

  4. Now go to SPARK_HOME/sbin and run the following command from a terminal:

./start-master.sh
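If the master comes up cleanly, it writes a log file under SPARK_HOME/logs/ whose last lines show the master URL and web UI port; the exact file name includes your user name and host name, so the pattern below is only a sketch:

tail logs/spark-*-org.apache.spark.deploy.master.Master-*.out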

Running the above command will start your Spark master, and its web UI will be available at localhost:8080. However, in case you want to register a worker with this master from a remote computer, you may not be able to: when submitting a worker to the master you use the same URL that is shown on localhost:8080 in the Spark master UI, which looks like spark://HOST_NAME:7077, and the remote computer will not be able to resolve HOST_NAME. To mitigate this, run the following STEPS :-
  1. Go to SPARK_HOME/conf/ and create a new file with the name spark-env.sh.

There will be a spark-env.sh.template in the same folder; this file gives you details on how to declare the various environment variables.

2.    Now, in spark-env.sh, we will set SPARK_MASTER_IP=’IPADRESS_OF_YOUR_MASTER_SYSTEM’

and then start your Spark master using the same command as mentioned in step 4.

3.     Hit localhost:8080 from a browser on the Spark master machine and you will see the Spark URL to be

                                 spark://IPADRESS_OF_YOUR_MASTER_SYSTEM:7077

Here is what your configuration file in the conf/ folder looks like.

NOTE :- By default there will only be a spark-env.sh.template; please create a new file with the name spark-env.sh.
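For illustration, assuming your master machine's address is 192.168.1.10 (a made-up address – substitute your own), SPARK_HOME/conf/spark-env.sh would contain something like:

# conf/spark-env.sh – example only, replace with the IP address of your master machine
SPARK_MASTER_IP=192.168.1.10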


 

(Screenshot: Spark master web interface in the browser)

 

  5. Start Spark workers from the other nodes of the cluster.

                In the SPARK_HOME/sbin/ directory, run

               ./start-slave.sh spark://IPADRESS_OF_YOUR_MASTER_SYSTEM:7077
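By default a worker offers all of its cores and most of its memory to the cluster. If you want to cap this, one option (described in spark-env.sh.template) is to set the worker variables in SPARK_HOME/conf/spark-env.sh on the worker machine before starting it; the values below are just an example:

# conf/spark-env.sh on the worker – example values, tune to your machine
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g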


(Screenshot: Apache Spark master web interface after a worker is registered)

 

  6. The last and final step: submit a Spark application using any of the APIs such as Java, Scala or Python.
         You have to give the same Spark master URL as when registering a worker with the Spark master; see the example below.
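For instance, a quick way to exercise the cluster is to submit one of the examples bundled with the Spark binary via spark-submit. The examples jar name below is an assumption – adjust it to whatever jar sits in your download's lib/ folder:

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://IPADRESS_OF_YOUR_MASTER_SYSTEM:7077 \
  lib/spark-examples-1.4.0-hadoop2.6.0.jar \
  100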

 

 

Conclusion :- This post is for newbies who want to explore Apache Spark but have problems setting it up as a standalone cluster. I hope you were able to set up your Spark cluster and have submitted your first Apache Spark application to your Apache Spark cluster.

 

Where to go after this

  1. Basic introduction to Apache Spark
  2. Apache Spark Internals

6 thoughts on “HOW TO SET UP AN APACHE SPARK STANDALONE CLUSTER ON MULTIPLE MACHINES”

  1. Hi,
    very good tutorial!
    I’m having some issues though. I’ve done everything you said, so in my master’s conf file I have SPARK_MASTER_IP=’xx.yy.zz.tt’,
    and I start the master and everything works fine. But when I start a slave with the right IP address on the other machine, it doesn’t show up on my master’s page (ip:8080). Any idea why?

    • Hi,
      Sorry for the late reply, and thanks, I’m glad you liked it.
      Two things might be causing the issue :-
      1. When starting your slave, have you given the IP address of the Spark master?
      2. If you have the correct IP address of the Spark master, then check whether you are able to ping the master machine from the slave machine.
