Photo by Jessica Pamp on Unsplash

Hello, Macbeth! Your First Data Science Project

While going through my GitHub profile, I came across my first-ever project from Flatiron School’s data science program. It was a rudimentary analysis of the tragedy Macbeth, but I remember the sense of accomplishment I felt when I completed it and how exciting it was to see data science applied to something unconventional. Below, I revisit the project and take you on the journey I embarked on more than a year ago.

NOTE: I will be using lists, dictionaries, conditionals, and matplotlib to count and visualize the words in the play, so get excited!

As I discussed in my previous blog post, we will follow the data science process, or lifecycle, with the exception of creating a model.

“Business” Understanding

Data Mining

# Library for sending GET requests to websites
import requests
# Download the play and assign the response text (a string) to a variable
macbeth = requests.get('http://www.gutenberg.org/cache/epub/2264/pg2264.txt').text
print('''This is the datatype of the variable, macbeth: {} \n
This is how many characters including spaces are in the play: {} \n
This is what the first 500 characters are: {}
'''.format(type(macbeth), len(macbeth), macbeth[:500]))
Jupyter Notebook Output
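
One detail worth spelling out: slicing a string, as in macbeth[:500], returns characters, not lines. The sketch below shows the difference on a tiny made-up string (the sample variable is a hypothetical stand-in for the downloaded play, not part of the original project).

```python
# Hypothetical stand-in for the downloaded text
sample = "Line one\nLine two\nLine three\n"

# Slicing a string counts characters, including spaces and newlines
first_chars = sample[:10]

# To get whole lines, split the string into lines first, then slice the list
first_lines = sample.splitlines()[:2]

print(repr(first_chars))
print(first_lines)
```

The same idea should apply to the full play: macbeth.splitlines()[:500] would give the first 500 lines, if that is what you want to inspect.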

Data Cleaning/“Feature Engineering”

NOTE: Follow along with my comments, try out pieces of my code, and see how each part works.

import re
import string
# Remove ALL punctuation (including possessive apostrophes), then split the text into words
# re.escape keeps characters such as ], \ and - from being misread inside the character class
string_to_words = re.sub('[' + re.escape(string.punctuation) + ']', '', macbeth).split()
# Lowercase every word
lowered_words = []
for string_to_word in string_to_words:
    lowered_words.append(string_to_word.lower())
# Get the set of unique words
unique_words = set(lowered_words)
# Create a dictionary with a zero count for each unique word
unique_dict = {}
for unique_word in unique_words:
    unique_dict[unique_word] = 0
# Count the frequency of each unique word in Macbeth
for lowered_word in lowered_words:
    if lowered_word in unique_dict:
        unique_dict[lowered_word] += 1
    else:
        print("Error: there are words in Macbeth not in the made dictionary")
# Order from most frequent to least frequent
descending_word_freq = dict(sorted(unique_dict.items(), key=lambda x: x[1], reverse=True))
# Keep only the 25 most frequent unique words
x = list(descending_word_freq.keys())[:25]
y = list(descending_word_freq.values())[:25]
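
For comparison, the same counting steps can be sketched more compactly with collections.Counter from the standard library. This is not how the original project did it; top_words and sample are hypothetical names introduced only for illustration, and the sample sentence stands in for the full play.

```python
import re
import string
from collections import Counter


def top_words(text, n=25):
    """Return the n most frequent lowercase words in text as (word, count) pairs."""
    # Remove every punctuation character, then lowercase and split on whitespace
    cleaned = re.sub('[' + re.escape(string.punctuation) + ']', '', text)
    words = cleaned.lower().split()
    # Counter tallies each word; most_common sorts by count, highest first
    return Counter(words).most_common(n)


# Hypothetical sample text, used here instead of the full play
sample = "Double, double toil and trouble; Fire burn and caldron bubble."
top = top_words(sample, n=2)
print(top)

# Unzip the pairs into two sequences, analogous to x and y above
words, counts = zip(*top)
```

Writing the loops by hand, as in the original code, is a good way to practice lists, dictionaries, and conditionals; Counter is simply a shortcut once those ideas are familiar.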

Data Exploration/Data Visualization:

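import matplotlib.pyplot as plt
# Jupyter magic: render plots inside the notebook
%matplotlib inline
# Plot a horizontal bar graph with the most frequent word at the top
plt.rcParams.update({'font.size': 30})
plt.figure(figsize=(30, 30))
plt.barh(x[::-1], y[::-1])
# In a horizontal bar graph, the x-axis holds the counts and the y-axis holds the words
plt.xlabel("Frequency")
plt.ylabel("Words")
plt.title("Top 25 Most Common Words in Shakespeare's Macbeth")
plt.show()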
Jupyter Notebook Output

From the above visualization, we see that the most frequent words in this old tragedy are, perhaps ironically, common words we still use today! And yes, ‘a’ is a word. This was just as shocking to me as it hopefully was to you. Congratulations! You have completed your first data science project! Although we did not do predictive modeling or much feature engineering or data exploration, this is a simple, quick overview of how a data science project is done.

Photo by Sigmund on Unsplash

An intraoperative neuromonitor who tinkers with data to see what interesting nuggets he can find.
