Spark in Less Than 5 Mins with Google Colaboratory

John Paul Hernandez Alcala
6 min readMay 8, 2021

NOTE: Skip to PySpark Installation Instructions in Google Colab if you are in a rush

The Not-So-Fun Waiting Game

You are working on a project in a Jupyter Notebook. You are excited to see what comes out of this new analysis. Perhaps you read about a new modeling or data cleaning approach. You are progressing through the data lifecycle, but you find yourself sitting there waiting for that little asterisk to change back to normal, which signifies the computation is complete.

Like waiting for a stop light

During that waiting time, perhaps you think that upgrading your computer would be a worthwhile investment because this waiting is killing your progress and motivation. What if I told you there was another way? A way that leads to faster computation without having to save up and invest in upgrading your rig, or research whether that CPU sale is really the best deal. And the other great thing: you can do it right now.

Photo by Markus Spiske on Unsplash

Enter the Google Colaboratory

Google Colaboratory or Colab is a product from Google Research.

Colab allows anybody to write and execute arbitrary python code through the browser, and is especially well suited to machine learning, data analysis and education.

Advantages:

  • No system setup, because everything runs in the browser.
  • Free GPU resources (Nvidia K80s, T4s, P4s, and P100s) for faster processing.
  • Importable snippets of code to make coding faster.
  • Shareable through Google Drive or loaded from GitHub.
  • Many more!

Great, so now you do not have to give up on investing in Dogecoin or that trip that got cancelled because of COVID-19, and you can spend more time progressing instead of waiting. But this is not the end of the rabbit hole if you have a lot of data that needs analyzing.

Photo by Daniel Lim on Unsplash

Go Even Faster with Apache Spark in Colab

Apache Spark is a unified analytics engine for large-scale data processing; it achieves high performance for both batch and streaming data by using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.

Advantages:

  • DataFrames are designed for processing large collections of structured or semi-structured data.
  • Observations in a Spark DataFrame are organised under named columns, which helps Apache Spark understand the schema of a DataFrame. This helps Spark optimize the execution plan for queries.
  • A DataFrame in Apache Spark can handle petabytes of data.
  • DataFrames support a wide range of data formats and sources.
  • There is API support for different languages such as Python, R, Scala, and Java.
  • Many more!

What all this means is that if you have large datasets, Spark may be able to help you process them faster with PySpark, an interface for Spark in Python. I say may because in some cases good ol’ Pandas will be better.

PySpark Installation Instructions in Google Colab

So enough of all the talk. Let us get right down to setting everything up. First, click here and select NEW NOTEBOOK.

View of a New Notebook

Connect Colab to Google Drive

We need to connect our Google Drive to Colab so we can access all files in the Drive, either by using the code below or by clicking the folder icon in the left panel and then the folder with the Drive symbol. Select CONNECT TO GOOGLE DRIVE.

from google.colab import drive
drive.mount('/content/drive')
Mount Google Drive to Colab

Import Data into Colab

Then we need to know how to import data. This can be done by zipping the data file (if it is very large), uploading it to our Drive by selecting the file with the upward-pointing arrow or dragging it into the left panel, and then using the code below with the path to the zipped file. You can get this path by locating the file and right-clicking to copy it.

!unzip "[path to zipped data]"
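For example, if the zipped dataset were sitting in a Drive folder, the command might look like the following (the path here is purely hypothetical; copy the real one from the file browser):

!unzip "/content/drive/My Drive/data/my_dataset.zip"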

Before PySpark Set Up

To set up PySpark, we must first make sure the package information from all configured sources is updated and that the installed packages are upgraded.

  • Update — refreshes the list of available packages and their versions, but does not install or upgrade any packages.
  • Upgrade — installs newer versions of the packages you already have.
!apt update
Output
!apt upgrade
Output

PySpark Installation

Here we will install PySpark by first installing Java, because Spark is written in the Scala programming language and requires the Java Virtual Machine (JVM) to run. Then we will install the Spark version of your choice.

NOTE: copy the path to your specific Spark version at http://apache.osuosl.org/spark

# install Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# install Spark 3.1.1 (change the version number if needed @ http://apache.osuosl.org/spark/)
!wget -q http://apache.osuosl.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
# unzip the Spark file to the current folder
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
# install findspark using pip
!pip install -q findspark

Environmental Variables

After installing PySpark, we have to set our environment variables; these variables are managed outside the context of this process, and they determine how Python will behave from now on. This is like adjusting settings in a video game. In Colab, this will have to be run each session.

import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-1.8.0-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

Moment of Truth!

Now we check whether PySpark has been correctly installed by seeing if it is importable.

import findspark
findspark.init()

import pyspark
print(pyspark.__version__)
Output

No errors and the output of a version number means it has been correctly installed!

Trying Out PySpark

Here we initiate a SparkSession that will allow us to make use of all the functions in PySpark. The master("local[*]") setting tells Spark to run locally, using as many worker threads as there are cores available.

from pyspark.sql import SparkSession

# Create SparkSession, which is an entry point to Spark
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark
Output

Dataframe in PySpark

With that Spark object, we can create a DataFrame either by defining one ourselves or by reading in a CSV file.

# Create example df
df = spark.createDataFrame([{"Greeting": "Hello World!"} for x in range(5)])
df.show(3, False)

# Reading in a PySpark df
spark_df = spark.read.csv('/content/sample_data/california_housing_test.csv', header='true', inferSchema='true')
spark_df.show(3, False)
Output

With our DataFrame, we can access our data in much the same way as we have done with Pandas and work on it with packages such as MLlib.
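For instance, a few familiar operations on the housing DataFrame might look like this (the column names assume the Colab sample California housing file; adjust them to your own data):

# Inspect the schema Spark inferred from the CSV
spark_df.printSchema()
# Select a couple of columns and filter rows, much like Pandas indexing
spark_df.select("housing_median_age", "median_house_value").filter(spark_df.housing_median_age > 30).show(5)
# A simple aggregation, similar to a Pandas groupby
spark_df.groupBy("housing_median_age").count().show(5)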

After the Analysis Dust Has Settled

Once we have completed our analysis on the dataframe, we can save our progress to a CSV file.

# Spark df to Pandas df
df_pd = df.toPandas()
# Store result
df_pd.to_csv("/content/drive/My Drive/AV articles/PySpark on Colab/pandas_preprocessed_data.csv")
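Alternatively (this is a common Spark pattern rather than part of the walkthrough above), the DataFrame can be written out directly without converting to Pandas. Note that Spark writes a folder of part files rather than a single CSV, and the output path below is just an example:

# Write the Spark DataFrame straight to CSV
df.write.csv("/content/drive/My Drive/pyspark_output", header=True, mode="overwrite")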

Finally, when we are done with our SparkSession, we need to close it down (we won’t be able to create another one without doing so).

spark.stop()  # stops the SparkSession so we can create a new one

GitHub Integration Information

We can also push all our work to GitHub, so our work can be accessible to the community.

Share to GitHub

And that is it! Please remember to give this a thumbs up and check out my other articles for all things data science.

Photo by Joshua Rawson-Harris on Unsplash


John Paul Hernandez Alcala

An intraoperative neuromonitor who tinkers with data to see what interesting nuggets he can find.