How to Use the OMDb API in Python and Refine Data Mining Process Part 1 of 2
If you are rushed, just scroll down to OMDb API Step-by-Step
Motivation
Almost every weekend, my girlfriend and I head over to my parents’ house to have dinner, visit, and watch a movie together. This weekend, however, we had great difficulty finding a movie we all considered good, so we sifted through different recommendation lists we found online. We looked at three main features:
- Rating Scores from Rotten Tomatoes and IMDb
- Genre (because my mother cannot stomach horror and she is not the biggest fan of Sci-Fi)
- Maturity Rating (used to avoid awkward explicit scenes)
As we checked each prospective movie, I was reminded of my Film Industry Analysis & Insights project. I talked superficially about this project in a previous blog post, What an Adventure!. In that article, I only mentioned that I used the OMDb API and violin plots, and I cited the resources I used to create the project's README.md. In this part 1 article, I want to talk more about interacting with the OMDb API to mine data and how to go about formatting that data into a dataframe. Let’s get started!
OMDb API Step-by-Step
Import requests
In the first pass of my project, I used a Python wrapper for the OMDb API called omdb. An API wrapper is a library that hides the HTTP requests behind ordinary Python functions. The idea is loosely related to a Python decorator, which is a function that takes in another function and returns a modified version of it. See Understanding Python Decorators in 12 Easy Steps! However, when I tried running the wrapper more recently, it raised errors when sending certain parameters. My guess is that it has not been updated and has compatibility issues. So, to avoid these issues now and in the future, I imported the requests library to send GET requests directly to the OMDb site, which is more reliable than my original approach since it cuts out the middleman.
import requests
Get API Key for OMDb
We need this key in order to make GET requests to the OMDb API. The key is a mixture of letters and numbers, and it can be obtained by entering your email here. Keep in mind the limit of 1,000 requests per day and give yourself time to mine for data.
Create a Secret Folder and Store API Key There
One thing to keep in mind is that an API key should never be hardcoded into your project. If, for example, you upload your project to a public GitHub repository, the key is visible to everyone and vulnerable to abuse. As such, it is a good idea to create a separate secret folder that you can load the key from. Below I created one in my home directory using Git Bash.
cd ~
mkdir .secret
You can check if it has been successfully created by first checking with this while in your home directory:
ls
If it is not there (folders whose names start with a dot are hidden by default), check with this:
ls -a
If it is listed now, then you successfully created a secret folder for your API keys! Inside your secret folder, create a new .txt file and paste your key in there.
Import OMDb API Key and Set Key as Default
Here we open our .txt file with the API key in it and store it in the variable API_KEY.
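#requires valid API_KEY to run
#the with block closes the file for us; strip() removes any trailing newline from the key
with open('C:/Users/[YOURusername]/.secret/OMDb_API.txt', 'r') as f:
    API_KEY = f.read().strip()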
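A slightly more portable sketch of the same step builds the path from the home directory instead of typing out the user name. The helper name load_api_key and its default file location are assumptions for illustration, not part of the original project:

```python
from pathlib import Path


def load_api_key(key_file=None):
    """Read an API key from a text file and strip surrounding whitespace.

    The default location (~/.secret/OMDb_API.txt) mirrors the folder
    created above; the function name and default are hypothetical.
    """
    if key_file is None:
        key_file = Path.home() / '.secret' / 'OMDb_API.txt'
    # read_text() opens and closes the file for us; strip() drops the
    # trailing newline that many editors add when saving.
    return Path(key_file).read_text().strip()
```

With the file in place, API_KEY = load_api_key() would give the same result as the snippet above.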
Search for Movie with Title and Year
title = 'Avatar'
year = '2009'
movieInfo = requests.get('http://www.omdbapi.com/?apikey='+API_KEY+'&t='+title+'&y='+year).json()
movieInfo
If you get back a dictionary of details for Avatar (title, year, genre, ratings, and so on), then you are set up to make requests through the OMDb API!
There are many other ways you can search for movies: IMDb ID, year, genre, media type and more. See documentation for a full list.
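As a rough sketch of how those other searches could be built, the helper below assembles the request URL from keyword parameters and lets the standard library handle the encoding. The function name build_omdb_url and the placeholder key are assumptions; t, y, and i are OMDb's parameters for title, year, and IMDb ID:

```python
from urllib.parse import urlencode

OMDB_URL = 'http://www.omdbapi.com/'


def build_omdb_url(api_key, **search):
    """Return a full OMDb request URL with safely encoded parameters.

    `search` holds OMDb query parameters such as t (title), y (year),
    or i (IMDb ID).
    """
    params = {'apikey': api_key}
    params.update(search)
    # urlencode escapes spaces, ampersands and other special characters
    # that would otherwise break a hand-concatenated URL.
    return OMDB_URL + '?' + urlencode(params)


# 'YOUR_KEY' is a placeholder, not a real key.
by_title = build_omdb_url('YOUR_KEY', t='Fast & Furious', y='2009')
by_imdb_id = build_omdb_url('YOUR_KEY', i='tt0499549')
print(by_title)
print(by_imdb_id)
```

Either string can be passed to requests.get(url).json() just like the Avatar request. The requests library also accepts a params dictionary in requests.get, which performs the same encoding.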
Data Mining Refinement
In my specific situation, I already had a Pandas dataframe, df2, with more movie details, and my goal was to combine all of this data into one dataframe. The columns that matter here are movie (the title) and release_date. The dataframe had the following structure:
So I extracted all the titles and years from the dataframe to create two lists.
#we will make a list of our movie titles
movieTitles = list(df2.movie)
#Below looks at the release_date column of each movie and splits off the year for each movie title
movieYears = [df2.release_date.iloc[x].split()[2] for x in range(0, len(movieTitles))]
Then I combined the two lists to make a list of tuples.
#the zip() function helps match the title and year
movTY = list(zip(movieTitles, movieYears))
movTY
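To make the two steps concrete, here is a small sketch with plain lists instead of the dataframe. The sample values are made up, and the date format (month, day, year separated by spaces) is an assumption inferred from the split()[2] call above:

```python
# Hypothetical sample data; the real values come from df2.
titles = ['Avatar', 'Titanic']
release_dates = ['Dec 18, 2009', 'Dec 19, 1997']

# 'Dec 18, 2009'.split() gives ['Dec', '18,', '2009'], so index 2 is the year.
years = [date.split()[2] for date in release_dates]

# zip() pairs the n-th title with the n-th year.
pairs = list(zip(titles, years))
print(pairs)  # [('Avatar', '2009'), ('Titanic', '1997')]
```

If a release date were stored in a different format, index 2 would no longer be the year, so it is worth spot-checking a few rows first.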
I made a template for a new dictionary (and ultimately a new dataframe) to be built from the JSON response we got for the Avatar movie from OMDb. Below a copy is made to prevent data mutation.
movieDetails = movieInfo.copy()
Next, I defined a function that makes sure each key in our template dictionary, movieDetails, has a matching key in the OMDb response we obtain, which is stored in the variable movieInfo2. It then appends the values from that response to the lists held in movieDetails:
def moviesDict(movieDetails, movieInfo2, OrgMovieTitle):
    if len(movieDetails) > len(movieInfo2): #This fills in fields that are in the template but missing from the response
        for key in movieDetails.keys(): #Goes through each key in template
            if movieInfo2.get(key): #Checks if template key is present in JSON response
                #Do nothing since this key is present
                pass
            else: #create this key so our column sizes match for each movie title
                movieInfo2[key] = 'N/A' #This will also fill in movies that do not come up in the API
    #
    for key, value in movieDetails.items():
        for key2, value2 in movieInfo2.items():
            if key == key2:
                if value == value2: #this part turns the initial value into a list for development of the columns
                    if key == 'Title': #OMDb returns this key capitalized as 'Title'
                        movieDetails[key] = [OrgMovieTitle] #we keep the original title from movTY for joining other data later
                    else:
                        movieDetails[key] = [value2] #If not a title related item just turn into list
                else: #This adds the value from the next movie title to the existing list
                    if key == 'Title': #checks for titles that came from API as 'N/A'
                        if value2 != 'N/A':
                            value.append(OrgMovieTitle) #if title is not 'N/A' put original title from movTY instead
                            movieDetails[key] = value
                        else: #if it is 'N/A' leave it in there
                            value.append(value2)
                            movieDetails[key] = value
                    else: #Adds values to keys other than the title key
                        value.append(value2)
                        movieDetails[key] = value
    return movieDetails
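For comparison, the same dictionary-of-lists shape can be sketched more compactly by collecting the responses first and filling gaps with dict.get. This is an alternative sketch, not the function used in the project; the name responses_to_columns and the sample responses are made up for illustration:

```python
def responses_to_columns(template_keys, responses):
    """Build a dict of equal-length lists from a list of response dicts.

    Any key missing from a response is recorded as 'N/A', so every
    column ends up with one entry per response.
    """
    columns = {key: [] for key in template_keys}
    for response in responses:
        for key in template_keys:
            columns[key].append(response.get(key, 'N/A'))
    return columns


# Made-up responses: the second one mimics a title OMDb could not find.
sample = [
    {'Title': 'Avatar', 'Year': '2009'},
    {'Response': 'False', 'Error': 'Movie not found!'},
]
columns = responses_to_columns(['Title', 'Year'], sample)
print(columns)  # {'Title': ['Avatar', 'N/A'], 'Year': ['2009', 'N/A']}
```

Because every list grows by exactly one entry per response, the columns always line up when the dictionary is later turned into a dataframe.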
Then I applied the above function to my tuple list of movies and years.
for mov in movTY:
    movieInfo2 = requests.get('http://www.omdbapi.com/?apikey='+API_KEY+'&t='+mov[0]+'&y='+mov[1]).json()
    #If an error is produced here, it is likely because making more than 1,000 requests per day requires a subscription to the OMDb API
    movieDetails = moviesDict(movieDetails, movieInfo2, mov[0])
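One way to see which titles OMDb failed to match while looping is to check the Response field, which OMDb sets to the string "False" when a lookup fails. The sketch below is an assumption-laden illustration: collect_movies and fake_fetch are hypothetical names, and fake_fetch stands in for the real requests call so the example runs without a key or network access:

```python
def collect_movies(movies, fetch):
    """Fetch every (title, year) pair and separate hits from misses.

    `fetch` is any callable taking (title, year) and returning the parsed
    JSON dictionary, e.g. a thin wrapper around requests.get(...).json().
    """
    found, missing = [], []
    for title, year in movies:
        info = fetch(title, year)
        # OMDb marks failed lookups with "Response": "False".
        if info.get('Response') == 'True':
            found.append(info)
        else:
            missing.append((title, year))
    return found, missing


# Stand-in for the real API call so the sketch runs offline.
def fake_fetch(title, year):
    if title == 'Avatar':
        return {'Title': 'Avatar', 'Year': year, 'Response': 'True'}
    return {'Response': 'False', 'Error': 'Movie not found!'}


found, missing = collect_movies(
    [('Avatar', '2009'), ('Star Wars Ep. VIII: The Last Jedi', '2017')],
    fake_fetch,
)
print(len(found), missing)
```

Keeping the misses in their own list makes it easy to count them and to retry them later with corrected titles.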
Finally, the above dictionary was converted to a Pandas Dataframe.
import pandas as pd
dfmovieDetails = pd.DataFrame.from_dict(movieDetails)
dfmovieDetails.head(5)
NOTE: I would definitely recommend storing all the above data in a CSV file so that the requests do not have to be made again.
dfmovieDetails.to_csv('OMDb API Data')
After doing the above work, I noticed that some movies were not recognized by the OMDb API; after careful review, it seems that Star Wars Ep. VIII: The Last Jedi was not recognized, along with others. A quick manual search shows that this is because the OMDb API expects the title Star Wars: Episode VIII. Overall, it appeared that 740 movies were lost due to movie titles not matching.
In my original project, I carried on without the 740 movies. In the next part of this two-part series, however, we will see how to capture more movies, fix our function above, and integrate this new OMDb data with the movie budget dataframe. Thanks for reading, and in the meantime check out other projects like this!