This is a project in collaboration with SmellsLikeML.
Having AWS at your disposal to spin up a Spark cluster is great, but what happens when you don't want to pay extra to keep a cluster up in the cloud? We knocked the dust off some old laptops lying around, revamped them with fresh Ubuntu 16.04 installs, and now we have some mean computing power at our disposal.
The Set Up
We had some old hardware lying around that needed some serious updates. We went for fresh Ubuntu 16.04 installs to make things smoother: make a live USB and install onto your machines. Now we have to configure our fresh installs with some basics. We use Anaconda3 for our Python library needs.
Accept the defaults for everything, and make sure to hit yes when it asks you to append the Anaconda PATH to your bashrc.

# In your terminal
wget https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86_64.sh
bash Anaconda3-4.4.0-Linux-x86_64.sh
# remove the installer after the install
rm Anaconda3-4.4.0-Linux-x86_64.sh
We have to set a static IP so that the master node knows where to find the slaves. First we have to disable the visual network manager so that we can make changes through the terminal. Open up the NetworkManager.conf file:

sudo vim /etc/NetworkManager/NetworkManager.conf
We want to comment out one line. Your file should look something like this:

[main]
plugins=ifupdown,keyfile
#dns=dnsmasq

[ifupdown]
managed=false

We need a couple of pieces of information. I set up static IPs over wifi because I didn't want to have all of my machines connected to ethernet; you can configure whichever device you prefer. First, look at the devices you have set up right now. Run

ifconfig

to see what devices are up. eth0 is an ethernet device and wlan0 is a wireless device. Save the name of the device you want to configure for later. You also want to save the IP address you want your machine to have. I'll use 192.168.0.13 as an example.
You'll need your network gateway and your netmask. Run ip route show for the gateway; it'll be the IP address in the first line. For the netmask, run ifconfig <iface> | grep Mask. This is normally 255.255.255.0. Now to set up the static IP: open the interfaces file with vim /etc/network/interfaces and fill in the blanks like so:
auto lo
iface lo inet loopback

auto DEVICENAME
iface DEVICENAME inet static
address 192.168.0.XXX
gateway 192.168.0.1
netmask 255.255.255.0
dns-nameservers 18.104.22.168 22.214.171.124

To update these changes, run

systemctl restart network-manager.service
systemctl restart networking.service
systemctl restart resolvconf.service

Your static IP should now show up when you hit ifconfig again. If your chosen IP address doesn't show, reboot your machine and try again.
The next step is setting up passwordless SSH. Your master node needs to be able to talk to all of the other workers (slaves) without having to input a password. There are plenty of good guides on how to do just that.
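The short version looks like this, as a sketch: the user and IP below are placeholders for your own worker machines.

```shell
# On the master: generate a key pair if you don't already have one
# (accept the default location; leave the passphrase empty)
ssh-keygen -t rsa -b 4096

# Push the public key to each worker (repeat for every worker IP)
ssh-copy-id user@192.168.0.14

# Verify: this should run on the worker without a password prompt
ssh user@192.168.0.14 hostname
```

Repeat the ssh-copy-id step for every worker so the master can reach all of them.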
Now for the main course - the spark install:
Install Java 8:

$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

Install Maven and some linear algebra libraries:

$ sudo apt-get install maven
$ sudo apt-get install libatlas3-base libopenblas-base
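Before moving on, it's worth a quick sanity check that both toolchains landed on the PATH (the exact version strings will vary by machine):

```shell
java -version   # should report a 1.8.x runtime
mvn -version    # confirms Maven is installed and shows which JDK it found
```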
Download and build Spark:

$ sudo mkdir /usr/local/share/spark
$ sudo su
$ curl http://d3kbcqa49mib13.cloudfront.net/spark-2.0.0.tgz | tar xvz -C /usr/local/share/spark
$ cd /usr/local/share/spark/spark-2.0.0
$ ./build/mvn -DskipTests clean package

Add environment variables to the bashrc file:

vim ~/.bashrc

# At the end of the bashrc file
export SPARK_HOME=/usr/local/share/spark/spark-2.0.0
export PATH=$SPARK_HOME/bin:$PATH
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
export PATH=$JAVA_HOME/bin:$PATH

Change the ownership of the SPARK_HOME directory:

$ cd $SPARK_HOME
$ sudo chown -R $USER:$USER .

Installing Scala now:

$ wget https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.deb
$ sudo dpkg -i scala-2.11.8.deb
$ sudo apt-get install -f
$ sudo apt-get autoremove
$ rm scala-2.11.8.deb

Now add the scala build tool, sbt:

$ echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 642AC823
$ sudo apt-get update
$ sudo apt-get install sbt

The last few steps are to declare the master host and the slaves. For the master host, open the spark-env.sh file with
vim $SPARK_HOME/conf/spark-env.sh and write
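For a Spark 2.0 standalone cluster, this is typically a single line (192.168.0.XXX is a dummy value):

```shell
# In $SPARK_HOME/conf/spark-env.sh -- bind the master to a fixed address
export SPARK_MASTER_HOST=192.168.0.XXX
```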
Here the dummy IP address is your master node's IP. Save that file and open a slaves file with
vim $SPARK_HOME/conf/slaves and write the IP address of each slave, one per line. Finally, run
./sbin/start-all.sh from $SPARK_HOME and you'll be able to see all your computers up on the cluster by navigating to localhost:8080 (the default master web UI port; Spark falls back to the next port up if 8080 is already taken) and see something like this:
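As a quick smoke test of the cluster, you can run one of the examples bundled with the build against the standalone master. The master URL below is an assumption: substitute your own master node's IP.

```shell
# Submit the bundled SparkPi example to the standalone master
# (replace 192.168.0.XXX with your master node's IP)
MASTER=spark://192.168.0.XXX:7077 $SPARK_HOME/bin/run-example SparkPi 10
```

If the job completes and prints an estimate of pi, the master and workers are wired up correctly, and you'll see the finished application listed in the web UI.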