How to Use the OMDb API in Python and Refine Data Mining Process Part 2 of 2

May 3, 2021

Recap

Last week, in Part 1, we learned how to use the OMDb API step by step and how to automate our requests to the API to mine all the movie data we need. In Part 2, we will see how to improve the function we defined in Part 1 and capture more movies. Finally, we will integrate the new OMDb data with the movie budget dataframe.

The Messy Algorithm

In Part 1, we defined the function below:

def moviesDict(movieDetails, movieInfo2, OrgMovieTitle):
    if len(movieDetails) > len(movieInfo2): #This fills in fields that might not have been present in template
        for key in movieDetails.keys(): #Goes through each key in template
            if movieInfo2.get(key): #Checks if template key is present in JSON response
                #Do nothing since this key is present
                pass
            else:#create this key so our column sizes match for each movie title
                movieInfo2[key] = 'N/A' #This will also fill in movies that do not come up in the API   
                
    #
    for key, value in movieDetails.items():
        for key2, value2 in movieInfo2.items():
            if key == key2:
                if value == value2:#this part turns the initial value into a list for development of the columns
                    if key == 'title':
                        movieDetails[key] = [OrgMovieTitle] #we keep the original title from movTY for joining other data later
                    else:
                        movieDetails[key] = [value2] #If not a title related item just turn into list
                else: #This adds the value from the next movie title to the existing on intial value
                    if key == 'title': #checks for titles that came from API as 'N/A'
                        if value2 != 'N/A':
                            value.append(OrgMovieTitle) #if title does not have 'N/A' put original title form MovTY instead
                            movieDetails[key] = value 
                        else:#if it does have 'N/A' leave it in there
                            value.append(value2)
                            movieDetails[key] = value 
                    else: #Addes values to keys other than the title key.
                        value.append(value2)
                        movieDetails[key] = value        
    return movieDetails

If your eyes glazed over, so did mine when I reviewed this mess (lol). My first attempt did enough to produce some good data, but it was messy and in some parts unnecessary. So here is the new and improved version, and sorry if you fought through my logic in the last one:

def moviesDict(movie_dict_template, incoming_movie, OrgMovieTitle):
    '''
    movie_dict_template -- a dictionary or JSON input with keys and values already in place
    ex from OMDb response:
    {'Title': 'Avatar',
     'Year': '2009',
     'Rated': 'PG-13',
     'Released': '18 Dec 2009',
     'Runtime': '162 min',...}
    
    NOTE: must have a 'Title' key for template input or make change in code
    incoming_movie -- a dictionary or JSON input that will be appended to movie_dict_template
    NOTE: incoming_movie should be similarly formatted to movie_dict_template input
    OrgMovieTitle -- a string of the movie title
    -----------------------------------------------------------------------------------------
    '''
    
    # Make sure every column in the template is a list so that incoming
    # movie data can be appended. If the first value is still a string,
    # the template is a raw response and none of its values are lists yet.
    if isinstance(list(movie_dict_template.values())[0], str):
        for key, value in movie_dict_template.items():
            movie_dict_template[key] = [value]

    # Fill in template fields that are missing (or empty) in the incoming
    # movie so that every column stays the same length. This also covers
    # movies that do not come up in the API.
    for key in movie_dict_template:
        if not incoming_movie.get(key):
            incoming_movie[key] = None

    # Append the incoming movie to the template. Keys that appear only in
    # the incoming movie are skipped, since the template has no column for them.
    for key, column in movie_dict_template.items():
        value = incoming_movie[key]
        if key == 'Title' and value is not None:
            # The API found the movie: store the original title so we can
            # join with other data later
            column.append(OrgMovieTitle)
        else:
            # Either a non-title field or a title that came back empty (None)
            column.append(value)

    # Return the newly appended template
    return movie_dict_template
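
To see the append behavior on a small scale, here is a minimal sketch. It uses a condensed stand-in for moviesDict() (append_movie is a hypothetical name) and hand-written dictionaries that only imitate the shape of OMDb responses. The "not found" dictionary in particular is an assumption about what the API returns for an unknown title.

```python
def append_movie(template, incoming, original_title):
    """Condensed stand-in for moviesDict(): append one movie to a dict of columns."""
    # Turn a raw response (string values) into a dict of one-item lists
    if isinstance(next(iter(template.values())), str):
        for key, value in template.items():
            template[key] = [value]
    for key, column in template.items():
        value = incoming.get(key) or None  # missing or empty fields become None
        if key == 'Title' and value is not None:
            value = original_title  # keep the title we searched with
        column.append(value)
    return template


# Hand-written stand-ins for API responses (not real API output)
template = {'Title': 'Avatar', 'Year': '2009', 'Response': 'True'}
found = {'Title': 'Titanic', 'Year': '1997', 'Response': 'True'}
not_found = {'Response': 'False', 'Error': 'Movie not found!'}

movies = append_movie(template, found, 'Titanic')
movies = append_movie(movies, not_found, 'Some Unknown Movie')

print(movies['Title'])     # ['Avatar', 'Titanic', None]
print(movies['Response'])  # ['True', 'True', 'False']
```

Every column ends up with the same length, which is what lets the dictionary convert cleanly into a dataframe, and the None in the Title column is what later shows up as a 'NaN' title.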

Using moviesDict(), the function defined above, I append each incoming JSON response to an initial JSON response that I call the template movie dictionary, as described in Part 1. The difference this time around is that I want less data with ‘NaN’ values. Continuing from where we left off, we had just created a Pandas dataframe.

dfmovieDetails = pd.DataFrame.from_dict(movieDetails)
dfmovieDetails

When we check the structure of the above dataframe, we note that there are 734 movies with empty movie titles; this means the OMDb API did not recognize certain movie titles. We can see which movies specifically are not being recognized by doing the following:

NAN_indices = dfmovieDetails[dfmovieDetails.Title.isna()].index
NAN_movieTitles = [movieTitles[i] for i in NAN_indices]
NAN_movieTitles

After manually requesting a couple of these movies, we notice a discrepancy between the movie year in the budget dataframe and the year the OMDb API expects. So, we drop all the above movies from our dataframe, reset the index, and convert it back to a dictionary that we can append movies to.

dfmovieDetails.dropna(subset=['Title'], inplace=True)
dfmovieDetails.reset_index(drop=True, inplace=True)
movieDetails = dfmovieDetails.to_dict(orient='list')
dfmovieDetails

This time we request the ‘NaN’ movies by title only and pass each response through moviesDict() along with the above dictionary.

#We just use titles this time
for MOV in NAN_movieTitles:
    # Let requests build the query string so titles with spaces or '&' are escaped
    NAN_movieInfo2 = requests.get('http://www.omdbapi.com/', params={'apikey': API_KEY, 't': MOV}).json()
    movieDetails = moviesDict(movieDetails, NAN_movieInfo2, MOV)
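
A title-only request and a title-plus-year request differ only in the query string. The sketch below shows that difference without calling the API. build_omdb_url is a hypothetical helper written for illustration, 'demo' is a placeholder key, and t and y are the OMDb query parameters for title and year.

```python
from urllib.parse import urlencode

OMDB_URL = 'http://www.omdbapi.com/'


def build_omdb_url(api_key, title, year=None):
    """Build an OMDb request URL; the year is only included when given."""
    params = {'apikey': api_key, 't': title}
    if year is not None:
        params['y'] = year
    # urlencode escapes spaces, '&', and other special characters in the title
    return OMDB_URL + '?' + urlencode(params)


# 'demo' is a placeholder key, not a real one
print(build_omdb_url('demo', 'Fast & Furious', 2009))
# http://www.omdbapi.com/?apikey=demo&t=Fast+%26+Furious&y=2009
print(build_omdb_url('demo', 'Fast & Furious'))
# http://www.omdbapi.com/?apikey=demo&t=Fast+%26+Furious
```

Leaving the year out loosens the match, so a title whose year disagrees with the budget dataframe can still be found. The trade-off is that the API may return a different film with the same name.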

Again, this dictionary is converted to a dataframe, and we look for movies with ‘NaN’ titles. We extract the indices of these movies and use them on the movie list that we fed into moviesDict(). Because the retried movies were appended after the rows already in dfmovieDetails, we subtract the number of those rows (4392) so that the indices line up with our ‘NaN’ movie list.

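df_NAN_movieDetails = pd.DataFrame.from_dict(movieDetails)
# Subtract 4392 (where dfmovieDetails stopped) to align with NAN_movieTitles
NAN_indices = df_NAN_movieDetails[df_NAN_movieDetails.Title.isna()].index - 4392
NAN_indices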

WARNING: These indices must be applied to the list of movies with ‘NaN’ titles (NAN_movieTitles), not the full movie list, since that is the list we fed into moviesDict().
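
The offset is easier to see with a miniature example in plain Python. All names and values here are made up for illustration: n_kept plays the role of 4392, and retried_titles plays the role of the ‘NaN’ movie list.

```python
# Made-up miniature data: 3 movies were already kept, 4 titles were retried
n_kept = 3  # plays the role of 4392 in the article
retried_titles = ['Movie A', 'Movie B', 'Movie C', 'Movie D']

# Title column after the retry: kept titles first, then one entry per retried
# title (None where the API still found nothing)
title_column = ['Kept 1', 'Kept 2', 'Kept 3', 'Movie A', None, 'Movie C', None]

# Positions of the missing titles in the combined column
missing_positions = [i for i, title in enumerate(title_column) if title is None]

# Shift by the number of kept movies to index into the retried list
still_missing = [retried_titles[i - n_kept] for i in missing_positions]

print(missing_positions)  # [4, 6]
print(still_missing)      # ['Movie B', 'Movie D']
```

Without the shift, position 4 would point at the wrong title (or past the end of the retried list), which is why the indices only make sense against the list that was actually fed in.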

After this point, we can see that we have reduced our ‘NaN’ data from 734 to 221. We can do some further cleaning or stop at this point. I ended up stopping after I reduced the ‘NaN’ data to 183, which is less than 4% ‘NaN’ title data!
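
As a rough check on that percentage, here is the arithmetic under one assumption: the 4392 offset used earlier is the number of movies kept after the first pass, so the full title list is 4392 + 734 movies. The article does not state the total directly, so treat this as an estimate.

```python
kept_after_first_pass = 4392  # offset used earlier, read here as the rows kept
missing_after_first_pass = 734
missing_at_the_end = 183

total_titles = kept_after_first_pass + missing_after_first_pass
share_missing = missing_at_the_end / total_titles

print(total_titles)            # 5126
print(f'{share_missing:.1%}')  # 3.6%
```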

Once we are satisfied with how clean our data is, we drop any remaining ‘NaN’ title data.

df_NAN_movieDetails.dropna(subset=['Title'], inplace=True)
df_NAN_movieDetails.reset_index(drop=True, inplace=True)
df_NAN_movieDetails

And then we merge the above dataframe with the budget dataframe.

df_budget = dfmoviebudget.rename(columns={"movie":"Title"})
df_NAN_movieDetails.merge(df_budget, how='inner', on='Title')

And that is it! This was dirty work but it would not be called mining if it were not.

Written by John Paul Hernandez Alcala