Photo by Jessica Pamp on Unsplash

Hello, Macbeth! Your First Data Science Project

John Paul Hernandez Alcala
4 min read · Apr 19, 2021

While going through my GitHub profile, I ran across my first-ever project from Flatiron School’s data science program. It was a rudimentary analysis of the tragedy Macbeth, but I remember the feeling of accomplishment when I completed it and how exciting it was to see data science applied to something unconventional. Below, I revisit the project and take you on the journey I embarked on more than a year ago.

NOTE: I will use lists, dictionaries, and conditionals to process the text of the play and matplotlib to visualize the results, so get excited and be prepared to see all that!

As I discussed in my previous blog, we will follow the data science process, or lifecycle, with the exception of creating a model.

“Business” Understanding

The desired outcome for this project is a list of the 25 most common words in Macbeth.

Data Mining

We import the requests library and download the text of Macbeth from Project Gutenberg. Calling requests.get() sends a GET request to the URL below and returns a response object, which carries the status code along with the body of the page. We extract the body as text by appending .text at the end. Finally, we display the data type of the macbeth variable, the number of characters in the play, and the first 500 characters:

#Library to send GET requests to websites
import requests
#assign text to variable
macbeth = requests.get('http://www.gutenberg.org/cache/epub/2264/pg2264.txt').text
print('''This is the datatype of the variable, macbeth: {} \n
This is how many characters including spaces are in the play: {} \n
This is what the first 500 characters are: {}
'''.format(type(macbeth), len(macbeth), macbeth[:500]))
Jupyter Notebook Output
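
Slicing a string with [:500] counts characters, not lines. Here is a minimal sketch of the difference, using a short made-up sample string (an assumption for illustration, not the downloaded file) in place of the macbeth variable:

```python
# Hypothetical sample standing in for the downloaded play text.
sample = "ACT I\nSCENE I. A desert place.\nThunder and lightning.\nEnter three Witches.\n"

# Slicing a string counts characters (line breaks included), not lines.
first_chars = sample[:10]

# To take whole lines instead, split on line breaks first, then slice the list.
first_lines = sample.splitlines()[:2]

print(repr(first_chars))
print(first_lines)
```

The same idea applies to the full play: macbeth[:500] gives 500 characters, while macbeth.splitlines()[:500] would give 500 lines.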

Data Cleaning/“Feature Engineering”

We make the data consistent by fixing values that don’t fit the data as a whole. For example, we remove punctuation and make sure ‘the’ and ‘The’ are not counted as different words. Then we count every unique word and sort the counts from most to least frequent, which makes the data more meaningful than it was in its raw form. Finally, we isolate just the 25 most common words. I used the regular expression library, re, and string.punctuation to find and strip the characters !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
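
To see what this cleaning does before running it on the whole play, here is a small sketch on a single made-up sentence (the sample text is my assumption, not a line from the downloaded file). It also wraps string.punctuation in re.escape, an optional safeguard that keeps characters such as ] and \ literal inside the square brackets:

```python
import re
import string

# Hypothetical sample sentence, used only to illustrate the cleaning step.
sample = "Fair is foul, and foul is fair: Macbeth's end!"

# Build a character class that matches any single ASCII punctuation character.
pattern = "[" + re.escape(string.punctuation) + "]"

# Remove punctuation, lower-case everything, then split on whitespace.
words = re.sub(pattern, "", sample).lower().split()
print(words)
```

Notice that the possessive "Macbeth's" becomes "macbeths", because the apostrophe is removed along with the rest of the punctuation.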

NOTE: Follow along with my comments, try out pieces of my code, and see how each part works.

import re
import string
#Splitting words from string while removing ALL punctuation,
#including possession punctuation (re.escape keeps ']' and '\' literal)
string_to_words = re.sub('[' + re.escape(string.punctuation) + ']', '', macbeth).split()
#Lower casing all words
lowered_words = []
for string_to_word in string_to_words:
    lowered_words.append(string_to_word.lower())
#Getting a set of unique words
unique_words = set(lowered_words)
#Creating a dictionary of those unique words, each starting at a count of 0
unique_dict = {}
for unique_word in unique_words:
    unique_dict[unique_word] = 0
#Counting and updating frequency number of each unique word in Macbeth
for lowered_word in lowered_words:
    #A membership test avoids comparing a missing key's default to a number
    if lowered_word in unique_dict:
        unique_dict[lowered_word] += 1
    else:
        print("Error: there are words in Macbeth not in the made dictionary")
#Ordering from most frequent to least
descending_word_freq = dict(sorted(unique_dict.items(), key=lambda x: x[1], reverse=True))
#Isolating only the 25 most frequent unique words
x = list(descending_word_freq.keys())[:25]
y = list(descending_word_freq.values())[:25]
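
For comparison, the counting and sorting steps can also be sketched with collections.Counter from the standard library. This is an alternative to the loops above, not what my original project used, and the short word list here is a made-up stand-in for lowered_words:

```python
from collections import Counter

# Hypothetical stand-in for the list of lower-cased words from the play.
sample_words = ["the", "and", "the", "to", "and", "the"]

# Counter tallies how many times each word appears.
word_counts = Counter(sample_words)

# most_common(n) returns the n most frequent (word, count) pairs, highest first.
top_pairs = word_counts.most_common(2)

# Split the pairs into separate label and value lists for plotting.
top_words = [word for word, _ in top_pairs]
top_counts = [count for _, count in top_pairs]
print(top_words, top_counts)
```

With the real data, word_counts.most_common(25) would take the place of the sorted dictionary and the two slices.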

Data Exploration/Data Visualization

Last but not least, we plot our data to understand it and to form hypotheses from what we observe. We use the matplotlib library to plot and the magic command %matplotlib inline so the plot appears inside the Jupyter Notebook.

import matplotlib.pyplot as plt
%matplotlib inline
#plotting bar graph
plt.rcParams.update({'font.size': 30})
plt.figure(figsize = (30,30))
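#Reversing both lists so the most frequent word is drawn at the top
plt.barh(x[::-1], y[::-1])
#In a horizontal bar graph, frequency runs along the x-axis and words along the y-axis
plt.xlabel("Frequency")
plt.ylabel("Words")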
plt.title("Top 25 Most Common Words in Shakespeare's Macbeth")
plt.show()
Jupyter Notebook Output
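
A horizontal bar graph draws its first item at the bottom, which is why the lists are reversed before plotting. Here is a minimal sketch of that reordering with made-up counts; the helper name order_for_barh is hypothetical and is not part of matplotlib:

```python
def order_for_barh(labels, counts):
    """Reverse both lists so the first (most frequent) item ends up on the top bar."""
    return labels[::-1], counts[::-1]

# Hypothetical data, already sorted from most to least frequent.
labels = ["the", "and", "to"]
counts = [3, 2, 1]

bar_labels, bar_counts = order_for_barh(labels, counts)
print(bar_labels, bar_counts)
# Passing bar_labels and bar_counts to plt.barh would put "the" on the top bar.
```

Reversing both lists together keeps each word paired with its own count.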

From the above visualization, we see that the most common words in this old tragedy are, perhaps ironically, common words we still use today! And yes, ‘a’ is a word. This was hopefully as surprising to you as it was to me. Congratulations! You did your first data science project! Although we did not do predictive modeling or much feature engineering or data exploration, this is a great, simple, and quick overview of how a data science project is done.

Photo by Sigmund on Unsplash

John Paul Hernandez Alcala

An intraoperative neuromonitor who tinkers with data to see what interesting nuggets he can find.