Setting up a Standalone Spark Cluster

Getting some dusty laptops to work


This is a project in collaboration with SmellsLikeML.

Having AWS at your disposal to spin up a Spark cluster is great, but what happens when you don't want to pay extra to keep a cluster up in the cloud? We knocked the dust off of some old laptops lying around, revamped them with fresh Ubuntu 16.04 installs, and now we have some mean computing power at our disposal.

The Set Up

We had some old hardware lying around that needed some serious updates. We went for fresh Ubuntu 16.04 installs to make things smoother. Make a live USB and install it onto your machines. Now we have to configure our fresh installs with some basics. We use Anaconda3 for our Python library needs.


#In your terminal

wget https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86_64.sh
bash Anaconda3-4.4.0-Linux-x86_64.sh

#remove the package after the install
rm Anaconda3-4.4.0-Linux-x86_64.sh
Accept the defaults throughout the install and make sure to answer yes when the installer asks to append the Anaconda path to your bashrc.
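To confirm the install took, reload your shell and check which Python is on your PATH. A quick sanity check, assuming Anaconda3 went to the default ~/anaconda3 location:

#reload the PATH changes and sanity-check the install
source ~/.bashrc
which python    #should point at ~/anaconda3/bin/python
conda --version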

We have to set static IPs so that the master node always knows where to find the slaves. To make these changes from the terminal, we first need to stop the graphical NetworkManager from managing the interface. So open up the NetworkManager.conf file:


sudo vim /etc/NetworkManager/NetworkManager.conf
We want to comment out the dns=dnsmasq line. Your file should look something like this:

[main]
plugins=ifupdown,keyfile
#dns=dnsmasq
 
[ifupdown]
managed=false
We need a couple of pieces of information. I set up static IPs over wifi because I didn't want all of my machines tethered to ethernet, but you can configure whichever interface you prefer. First, run ifconfig to see which devices are up: eth0 is an ethernet device and wlan0 is a wireless one. Note the name of the device you want to configure. You'll also want to pick the IP address you want your machine to have. I'll use 192.168.0.13 as an example.
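Concretely, the information gathering in this and the next paragraph boils down to a few commands (wlan0 here is just an example; substitute your own interface name):

ifconfig                      #list the devices that are up; note the one you want
ip route show                 #the gateway is the first IP address shown
ifconfig wlan0 | grep Mask    #the netmask for your chosen device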

You'll also need your network gateway and your netmask. Run ip route show for the gateway; it'll be the IP address in the first line. For the netmask, run ifconfig <iface> | grep Mask. This is normally 255.255.255.0, so it's highly probable this will be yours too. Now to set up the static IP: open the interfaces file with sudo vim /etc/network/interfaces and fill in the blanks like so:


auto lo
iface lo inet loopback

auto DEVICENAME
iface DEVICENAME inet static
address 192.168.0.XXX
gateway 192.168.0.1
netmask 255.255.255.0
dns-nameservers 8.8.8.8 8.8.4.4
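One caveat: if you're configuring a wireless device like I did, the same stanza also needs your network credentials or the interface won't come up on boot. Assuming wpasupplicant is installed (it is on a stock Ubuntu 16.04 desktop), that means two more lines with placeholder values:

wpa-ssid YOUR_NETWORK_NAME
wpa-psk YOUR_WIFI_PASSPHRASE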
To apply these changes, run

sudo systemctl restart network-manager.service
sudo systemctl restart networking.service
sudo systemctl restart resolvconf.service
You should now see your static IP when you run ifconfig again. If your chosen IP address doesn't show up, reboot your machine and try again.
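A quick way to confirm the interface really is up on its new address is to ping the gateway (using the example gateway from above):

ping -c 3 192.168.0.1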

The next step is setting up passwordless SSH. Your master node needs to be able to talk to all of the workers (slaves) without being prompted for a password. Any guide on key-based SSH authentication will walk you through it; the short version is sketched below.
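Run from the master node, it looks roughly like this (the user and IP are placeholders; repeat the copy step for every slave):

#generate a key pair on the master (accept the defaults)
ssh-keygen -t rsa
#push the public key to each slave
ssh-copy-id user@192.168.0.13
#this should now log in without a password prompt
ssh user@192.168.0.13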

Installing Spark

Now for the main course: the Spark install.
Install Java 8


$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
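You can confirm the JDK is in place with (the exact build number in the output will vary):

$ java -version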
Install Maven and some linear algebra libraries

$ sudo apt-get install maven
$ sudo apt-get install libatlas3-base libopenblas-base
Download and build Spark

$ sudo mkdir /usr/local/share/spark
$ sudo su
$ curl http://d3kbcqa49mib13.cloudfront.net/spark-2.0.0.tgz | tar xvz -C /usr/local/share/spark
$ cd /usr/local/share/spark/spark-2.0.0
$ ./build/mvn -DskipTests clean package
$ exit
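A note in case the build dies on an older laptop: the Spark build documentation suggests giving Maven a larger heap before kicking off the build, along these lines:

$ export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"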
Adding environment variables to the bashrc file

vim ~/.bashrc

#At the end of the bashrc file
export SPARK_HOME=/usr/local/share/spark/spark-2.0.0
export PATH=$SPARK_HOME/bin:$PATH
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
export PATH=$JAVA_HOME/bin:$PATH
Change the ownership of the SPARK_HOME directory

$ cd $SPARK_HOME
$ sudo chown -R $USER:$USER .
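At this point a quick smoke test is worthwhile. Reload your shell and make sure the environment variables and the build line up (the reported version should be 2.0.0):

$ source ~/.bashrc
$ echo $SPARK_HOME
$ spark-submit --version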
Install Scala

$ wget https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.deb
$ sudo dpkg -i scala-2.11.8.deb
$ sudo apt-get install -f
$ sudo apt-get autoremove
$ rm scala-2.11.8.deb
Now add sbt, the Scala build tool

$ echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 642AC823
$ sudo apt-get update
$ sudo apt-get install sbt
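A quick check that Scala and sbt both landed (the first sbt run downloads its own dependencies, so it takes a while):

$ scala -version
$ sbt about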
The last few steps are declaring the master host and the slaves. For the master host, open the spark-env.sh file with vim $SPARK_HOME/conf/spark-env.sh and write

SPARK_MASTER_HOST="192.168.0.10"
Where the dummy IP address is your master node's IP. Save that file, then open a slaves file with vim $SPARK_HOME/conf/slaves and list the IP address of each slave, one per line. Finally, run ./sbin/start-all.sh from $SPARK_HOME and you'll be able to see all of your computers up on the cluster by navigating to the master's web UI at localhost:8080.
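As a concrete sketch, suppose the master sits at 192.168.0.10 and two workers at 192.168.0.13 and 192.168.0.14 (all placeholder addresses). The conf/slaves file is just one worker IP per line, and the bundled SparkPi example makes a handy smoke test once everything is started (7077 is the master's default port):

#$SPARK_HOME/conf/slaves
192.168.0.13
192.168.0.14

$ cd $SPARK_HOME
$ ./sbin/start-all.sh
$ ./bin/run-example --master spark://192.168.0.10:7077 SparkPi 100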