In this post I will go over installing Apache Spark and the initial interactions with it from within R. I am currently using Ubuntu 20.04, so the instructions are tailored to that environment. The process should be similar for other Linux distributions, as well as macOS and Windows.
Getting Apache Spark
There are a couple of routes to getting Apache Spark: directly from the Apache website, or through the R package sparklyr. Details on each method are provided in the following subsections.
In Preparation
Java
You need to have Java installed on your machine, and the JAVA_HOME environment variable should be defined. I have both Java 8 and 11 installed; the steps in this post were executed with Java 8 (OpenJDK amd64).
To locate your Java/JVM path, use the following command in the terminal. Depending on how you installed Java, it will likely be in /usr/lib/jvm/.
locate /jvm/java-
If you don’t have Java on your machine, you can get it from the Java website or via apt.
sudo apt install openjdk-8-jre # or whichever version you need: 8, 11, 13, 14
The JDK is a superset of the JRE. You should only need the JRE, but if you want additional functionality you can go with openjdk-8-jdk. To see all of the available options, press Tab twice after typing openjdk- to list the software packages.
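If you want to double-check from R that Java is visible before installing Spark, a minimal sketch like the following works (the exact version string printed will depend on your install):

# Confirm the JAVA_HOME variable and the Java runtime that R will pick up
Sys.getenv("JAVA_HOME")
system2("java", "-version")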
Environment Variables
Next we prepare for getting Spark onto the local computer. I created a directory and assigned an environment variable; I derived this step after reviewing the materials in a couple of resources.
# Create a spark folder in the home directory
mkdir spark
cd spark
Next, we add the directory location to the PATH. To do that, we first open either the /etc/profile or the ~/.bashrc file.
nano ~/.bashrc # or /etc/profile
At the bottom of either file, insert the following commands. I have already included what the installed Spark instance directory looks like; you can also perform this part afterwards if you prefer. If you unpack the downloads from either method below directly into the top-level “spark/” directory, the additional subdirectory refinement is unnecessary.
export SPARK_HOME=/home/<user>/spark/ # spark-3.0.1-bin-hadoop2.7 # optional
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$SPARK_HOME/bin:$JAVA_HOME/bin
Now source the file to make SPARK_HOME and JAVA_HOME available. This will allow you to reference Spark using $SPARK_HOME on the command line, and other applications will also use it to launch Spark.
source ~/.bashrc
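To confirm the variables are visible to R, check them from a session started in that shell. Note that RStudio launched from the desktop may not read ~/.bashrc; in that case Sys.setenv() is a workaround (a sketch, using the paths defined above):

# Both variables should come back non-empty
Sys.getenv(c("SPARK_HOME", "JAVA_HOME"))

# Fallback if RStudio did not inherit the shell environment
# Sys.setenv(SPARK_HOME = "/home/<user>/spark", JAVA_HOME = "/usr/lib/jvm/java-8-openjdk-amd64")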
Getting Spark from Apache
To get Apache Spark directly, visit the website at https://spark.apache.org/ and navigate to the “Downloads” page (https://spark.apache.org/downloads.html).
Once on the downloads page, select the version combination (Spark and Hadoop) you would like to download. Since I did not have either, I downloaded both, but will install Spark with Hadoop version 2.7.
# Hadoop v2.7
wget https://mirror.jframeworks.com/apache/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
# Hadoop v3.2
wget https://mirror.olnevhost.net/pub/apache/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
Get the KEYS file to verify the downloads, and import it into your GPG keyring (gpg --import KEYS) so the signatures can be checked.
wget https://downloads.apache.org/spark/KEYS
# Hadoop v2.7
wget https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz.asc
# Hadoop v3.2
wget https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz.asc
Verify the downloads to ensure they are authentic.
# Hadoop v2.7
gpg --verify ./spark-3.0.1-bin-hadoop2.7.tgz.asc ./spark-3.0.1-bin-hadoop2.7.tgz
# Hadoop v3.2
gpg --verify ./spark-3.0.1-bin-hadoop3.2.tgz.asc ./spark-3.0.1-bin-hadoop3.2.tgz
To install from the CLI, run the following once the downloads have been authenticated.
# Hadoop v2.7
tar xvzf spark-3.0.1-bin-hadoop2.7.tgz
# Hadoop v3.2
tar xvzf spark-3.0.1-bin-hadoop3.2.tgz
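As an optional sanity check from R, you can confirm that SPARK_HOME points at the unpacked distribution (this sketch assumes SPARK_HOME resolves to the unpacked directory):

# List the unpacked distribution and make sure spark-submit is present
spark_home <- Sys.getenv("SPARK_HOME")
list.files(spark_home)
file.exists(file.path(spark_home, "bin", "spark-submit"))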
Getting Spark Using R
Alternatively, we can get Apache Spark from within R. R currently has two packages for working with Spark, sparkR and sparklyr. Both provide a front-end interface to Apache Spark: sparkR is produced by Apache, whereas sparklyr is produced by RStudio. The two packages have some differences, and you can use them simultaneously. Please reference the Databricks documentation for more details as well.
In this post I will use sparklyr to install Apache Spark version 3.0.1, as demonstrated in the previous section.
sparklyr
Using the R/RStudio interface, you can also download Spark with the sparklyr library. At the time of this writing, the package version was 1.5.2. To get the sparklyr library, execute the following in R/RStudio:
install.packages("sparklyr") # v1.5.2
# or for the latest version use
devtools::install_github("rstudio/sparklyr")
When the package is installed, you can run the following to install Spark.
library("sparklyr")
# Provides a list of available versions to download
spark_available_versions()
# Install the desired version
spark_install(version = "3.0.1", hadoop_version = "2.7")
# or with a different Hadoop version
spark_install(version = "3.0.1", hadoop_version = "3.2")
Once the download is complete you will see the following output:
At this point, Spark is ready to use; the package has already unpacked the tarball into SPARK_HOME.
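If you want to confirm what sparklyr has installed locally, it also provides a listing function (a quick, optional check):

# Shows each locally installed Spark/Hadoop build and where it lives
spark_installed_versions()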
sparkR
For more information on the package and its setup, please refer to the documentation at https://spark.apache.org/docs/latest/sparkr.html. For this package, you will need to install Apache Spark using one of the methods above before connecting to and operating with Spark.
install.packages("sparkR")
At the time of this writing, the package is not supported for my current R version. You can, however, get the package when you install Spark using the sparklyr method described above. Once Spark has been installed, you can look in the installation directory and copy the package into your R library directory.
Using the following command, you can copy the package directory into your R library.
cp -r $SPARK_HOME/R/lib/SparkR/ R/x86_64-pc-linux-gnu-library/4.0/
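Alternatively, rather than copying the directory, you can load SparkR straight from the Spark distribution; a minimal sketch, assuming SPARK_HOME is set as above:

# Load the bundled SparkR package and start a local session
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
sparkR.session(master = "local[*]", sparkHome = Sys.getenv("SPARK_HOME"))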
Connecting to Spark in R (local mode)
Since I am not connected to a standalone cluster or a cloud configuration, and do not have access to external resources, I will create a connection to my local Spark instance. This allows me to test and tinker with Spark on my local machine while I learn the operations, before increasing capacity. In a follow-up post, I will cover creating a standalone cluster with the other computers on my home network. See the function documentation for connecting to remote Spark servers.
sc <- spark_connect(master = "local")
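If you want to tune the local instance, spark_connect() also accepts a Spark version and a configuration object; the memory and core values below are only illustrative:

library(sparklyr)
conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "2G"  # illustrative driver memory
conf$spark.executor.cores <- 2               # illustrative core count
sc <- spark_connect(master = "local", version = "3.0.1", config = conf)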
Setting up a Spark Connection in the RStudio Connections Tab
In this section, I will introduce creating an RStudio connection to Spark. RStudio’s Connections tab allows you to connect to much more than just Spark, and I will be utilizing this feature more in future posts as well.
Follow the screenshots below to establish a connection to Spark locally. First, select the “Connections” tab and click on the “New Connection” button. This brings up a “New Connection” window; from here, select “Spark”.
This will take you to the following configuration dialog. Modify it to suit your current setup, then click “OK”.
If the configuration was successful, you should see the following in your connection pane.
From here, you can click on the connection. You should see the following:
Connecting
Click on the “Connect” button to preview the options. “R Console” will output the connection commands to the R console. Alternatively, you can select “Copy to Clipboard” and paste the commands into your desired script or elsewhere.
When you connect for the first time, there will be a lot of initial output to the console as Spark performs its configuration. The updated connection panel will reflect that you are connected.
When you select the connection, you will see the following view in the pane. From here, you can launch the web user interface by clicking the “Spark” button on the left-hand side.
The resulting web UI should look like this. Explore each of the tabs to get familiar with it. On the home screen (i.e., the “Jobs” tab) you can observe which jobs are active or completed, along with an event timeline.
Now that everything appears to be running, we can work through a simple example to demonstrate operations.
Test Example
Load some packages.
library(tidyverse)
library(magrittr)
Using the initial example from the “Learning Spark” book, which shows the Python and Scala methods, here we can see how it is performed in R.
readme_strings <-
spark_read_text(sc = sc,
path = "/home/bear/spark/spark-3.0.1-bin-hadoop2.7/README.md")
When we execute this we should see this in the connection pane.
In the “Jobs” tab of the web UI, we should see the following.
Now we will run the command to look at the data.
head(readme_strings, 10)
This will produce the following in the console.
Let’s count the number of elements in this dataset.
count(readme_strings)
We should see in the console that there are 108 objects. Now we will run a query against each line in the document to see how many contain the word “spark”. In this query we are actually running Spark SQL functions inside the filter command (instr finds a substring within a string, and lower converts its input to lower case). The final command counts how many lines were returned, which should produce 31.
filtered_spark <-
readme_strings %>%
filter(instr(lower(line), "spark") > 0) %>%
pull(line)
filtered_spark %>% length()
To learn about the Spark SQL functions available, please reference the official documentation.
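As an aside, the same filter can also be written as raw Spark SQL; a sketch using DBI, where the table name “readme” is one I register here purely for illustration:

library(DBI)
# Give the Spark DataFrame a table name that SQL can reference
sdf_register(readme_strings, "readme")
# Mirrors the dplyr pipeline above; should also return 31
dbGetQuery(sc, "SELECT count(*) AS n FROM readme WHERE instr(lower(line), 'spark') > 0")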
We can look at the event timeline in the web UI and see the jobs and when they were executed. There are additional jobs in my screenshot because I also ran a couple of other commands.
In the “Completed Jobs” section we can see the job statistics as well.
Disconnect from Spark
After we are done using Spark, it is good practice to disconnect from the instance, regardless of whether you are operating locally or in a cluster environment. To disconnect from R, just include the following command in your script or run it in the console before moving on to other tasks.
spark_disconnect(sc)
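If you have opened several connections while experimenting, sparklyr can also close them all in one call:

# Closes every Spark connection opened by sparklyr in this session
spark_disconnect_all()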
After running the command, the web UI will stop working and indicate a connection error. You will also see the connection change status in the “Connections” pane.
Conclusion
This brief example demonstrated setting up Spark on your local machine and running your first example. The next step would be to explore additional examples, as well as to start building and testing your own scripts to get them ready for a Spark cluster environment. This local configuration will also let you troubleshoot your code to ensure it works with Spark. Once ready, you can apply your skills and scripts in a cluster.
What’s Next?
In a follow-up post I will demonstrate creating a local standalone Spark cluster using my desktop and two Raspberry Pi 3s.
There is a lot you can do with Spark, so I would encourage you to explore additional resources that meet your specific needs, as well as to learn more about Spark’s functionality.
Official Resources
- Sparklyr: https://spark.rstudio.com/
- SparkR: https://spark.apache.org/docs/latest/sparkr.html
- Mastering Spark with R: https://therinspark.com/