Spark in Less Than 5 Mins with Google Colaboratory
NOTE: Skip to PySpark Installation Instructions in Google Colab if you are in a rush
The Not-So-Fun Waiting Game
You are working on a project in a Jupyter Notebook. You are excited to see what comes out of this new analysis. Perhaps you read about a new modeling or data cleaning approach. You are progressing through the data lifecycle, but you find yourself sitting there waiting for that little asterisk to turn back into a number, signaling that the computation is complete.
During that waiting time, perhaps you think that upgrading your computer would be a worthwhile investment because all this waiting is killing your progress and motivation. What if I told you there was another way? A way that leads to faster computation without having to save up to upgrade your rig or research whether that CPU sale is really the best deal. And the other great thing: you can do it right now.
Enter the Google Colaboratory
Google Colaboratory or Colab is a product from Google Research.
Colab allows anybody to write and execute arbitrary python code through the browser, and is especially well suited to machine learning, data analysis and education.
Advantages:
- No system setup required because everything runs in the browser.
- Free GPU resources (Nvidia K80s, T4s, P4s, and P100s) for faster processing.
- Importable snippets of code to make coding faster.
- Shareable through Google Drive or loaded from GitHub.
- Many more!
Great, so now you do not have to give up on investing in Dogecoin or that trip that got cancelled because of COVID-19, and you can spend more time progressing instead of waiting. But this is not the end of the rabbit hole if you have a lot of data that needs analyzing.
Go Even Faster with Apache Spark in Colab
Apache Spark is a unified analytics engine for large-scale data processing; it achieves high performance for both batch and streaming data by using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
Advantages:
- Dataframes are designed for processing large collections of structured or semi-structured data.
- Observations in a Spark Dataframe are organised under named columns, which helps Apache Spark understand the schema of the Dataframe. This helps Spark optimize the execution plan for queries.
- DataFrame in Apache Spark has the ability to handle petabytes of data.
- DataFrame supports a wide range of data formats and sources.
- It has API support for different languages such as Python, R, Scala, and Java.
- Many more!
What all this means is that if you have large datasets, Spark may be able to help you process them faster with PySpark, an interface for Spark in Python. I say may because in some cases good ol’ Pandas will be better.
PySpark Installation Instructions in Google Colab
So enough of all the talk. Let us get right down to setting everything up. First, open Google Colab at https://colab.research.google.com and select NEW NOTEBOOK.
Connect Colab to Google Drive
We need to connect our Google Drive to Colab so we can access all the files in the Drive, either by using the code below or by clicking the folder icon in the left panel and then the folder with the Drive symbol. Select CONNECT TO GOOGLE DRIVE.
from google.colab import drive
drive.mount('/content/drive')
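If you want to confirm the mount worked, a quick sanity check (assuming the default mount point used above) is to list the top level of your Drive:
# list the mounted Drive to confirm it is accessible
!ls "/content/drive/My Drive"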
Import Data into Colab
Then we need to know how to import data. First zip the data file (if it is very large), upload it to your Drive with the pointing-up arrow or by dragging it into the left panel, and then run the code below with the path to the zipped file. You can get the path by locating the file in the left panel and right-clicking to copy it.
!unzip "[path to zipped data]"
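If you prefer to keep the extracted files in their own folder, unzip also accepts a target directory. The paths below are just placeholders for your own file:
# hypothetical paths; replace with the path to your own zipped file
!unzip -q "/content/drive/My Drive/my_data.zip" -d /content/data
# confirm what was extracted
!ls /content/data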
Before PySpark Set Up
To set up PySpark, we must first make sure to update and/or download package information from all configured sources.
- Update — updates the list of available packages and their versions, but does not install or upgrade any packages.
- Upgrade — installs newer versions of the packages you already have.
!apt update
!apt upgrade
PySpark Installation
Here we will install PySpark by first installing Java, because Spark is written in the Scala programming language and requires the Java Virtual Machine (JVM) to run. Then we will download the Spark version of our choice.
NOTE: copy the path to your specific Spark version at http://apache.osuosl.org/spark
# install Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# install Spark 3.1.1 (change the version number if needed @ http://apache.osuosl.org/spark/)
!wget -q http://apache.osuosl.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
# unzip the spark file to the current folder
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
# install findspark using pip
!pip install -q findspark
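To confirm the download and extraction worked, you can list the working directory (assuming Colab's default /content folder) and look for the Spark folder:
# the spark-3.1.1-bin-hadoop3.2 folder should appear in the current directory
!ls /content | grep spark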
Environmental Variables
After installing PySpark, we have to set up our environment variables. These variables are managed outside the context of this process, and they determine how Python will behave from here on. This is like adjusting settings in a video game. In Colab, this step has to be run each session.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-1.8.0-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"
Moment of Truth!
Now we check if PySpark has been correctly installed by seeing if it is importable.
import findspark
findspark.init()
import pyspark
print(pyspark.__version__)
No errors and the output of a version number means it has been correctly installed!
Trying Out PySpark
Here we initiate a SparkSession, which will allow us to make use of all the functions in PySpark.
from pyspark.sql import SparkSession
# Create SparkSession, which is the entry point to Spark
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark
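The builder also accepts extra configuration if you want more control. Here is a sketch with an illustrative application name and driver-memory setting; note that getOrCreate() returns the already-running session if one exists, so configuration should be set before the first session is created.
from pyspark.sql import SparkSession

# illustrative settings; the app name and memory value are just examples
spark = (SparkSession.builder
         .master("local[*]")                   # use all available cores
         .appName("colab-pyspark-demo")        # hypothetical app name
         .config("spark.driver.memory", "2g")  # adjust to your workload
         .getOrCreate())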
Dataframe in PySpark
With that Spark object, we can create a dataframe by defining it ourselves or by reading from a CSV file.
# Create example df
df = spark.createDataFrame([{"Greeting": "Hello World!"} for x in range(5)])
df.show(3, False)
# Reading a CSV file into a PySpark df
spark_df = spark.read.csv('/content/sample_data/california_housing_test.csv', header='true', inferSchema='true')
spark_df.show(3, False)
With our dataframe, we can access our data in much the same way as we do with Pandas and work on it with packages such as MLlib.
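As a quick illustration of that Pandas-like feel, here is a small sketch of common operations; the column names assume the Colab sample california_housing_test.csv loaded above.
# select a couple of columns and peek at them
spark_df.select("median_income", "median_house_value").show(3)
# count rows matching a condition
print(spark_df.filter(spark_df["median_house_value"] > 200000).count())
# compute an aggregate across the whole dataframe
spark_df.groupBy().avg("median_income").show()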
After the Analysis Dust Has Settled
Once we have completed our analysis on the dataframe, we can save our progress to a CSV file.
# Spark df to Pandas df
df_pd = df.toPandas()
# Store result
df_pd.to_csv("/content/drive/My Drive/AV articles/PySpark on Colab/pandas_preprocessed_data.csv")
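Converting to Pandas first is convenient for small results, but it pulls all the data onto the driver. For larger dataframes, Spark can write the CSV itself; the output path below is just an example, and the result is a folder of part files rather than a single CSV.
# write directly from Spark; example output path, replace with your own
df.write.csv("/content/drive/My Drive/spark_output", header=True, mode="overwrite")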
Finally, when we are done with our SparkSession, we should stop it to release its resources (until we do, getOrCreate() will simply keep returning the existing session rather than creating a new one).
spark.stop()  # stops the SparkSession (and underlying SparkContext) so we can start a fresh one later
GitHub Integration Information
We can also push all our work to GitHub, making it accessible to the community.
And that is it! Please remember to give this a thumbs up and check out my other articles for all things data science.