How to Use the OMDb API in Python and Refine the Data Mining Process (Part 2 of 2)


Last week, in Part 1, we learned how to use the OMDb API step by step and automate our requests to mine all the movie data we need. In Part 2, we will see how to improve the function we defined in Part 1 and capture more movies. Finally, we will integrate the new OMDb data with the movie budget dataframe.

The Messy Algorithm

From Part 1, we saw the following defined function:

If your eyes glazed over, so did mine when I reviewed this mess (lol). Although my first attempt accomplished enough to yield some good data mining, it was messy and in parts unnecessary. So here is the new and improved version, and apologies if you struggled through my logic in the last one:
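The function itself appeared as an image in the original post and is not reproduced here. The sketch below is a minimal reconstruction of what the text describes: the name moviesDict comes from the post, while the parameter names, the template handling, and the NaN placeholders are my assumptions.

```python
import math

import requests

API_KEY = "YOUR_OMDB_API_KEY"  # placeholder; get a free key at omdbapi.com

def moviesDict(movie_titles, movie_years=None, template=None):
    """Request each title from the OMDb API and append the JSON
    responses to a template movie dictionary (one list per field).

    movie_years is optional; when it is omitted we search by title
    alone, and missing fields are recorded as NaN rather than skipped.
    """
    for i, title in enumerate(movie_titles):
        params = {"apikey": API_KEY, "t": title}
        if movie_years is not None:
            params["y"] = movie_years[i]
        response = requests.get("http://www.omdbapi.com/", params=params).json()
        if template is None:
            # The first response becomes the template dictionary
            template = {key: [value] for key, value in response.items()}
            continue
        for key in template:
            # Fields missing from a response become NaN placeholders
            template[key].append(response.get(key, math.nan))
    return template
```

Keeping every list the same length (padding misses with NaN) is what lets the dictionary convert cleanly to a dataframe later.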

Using moviesDict(), the function defined above, I appended incoming JSON responses to an initial response that I called the template movie dictionary, as described in Part 1; the difference this time around is that I want to end up with less ‘NaN’ data. Continuing from where we left off, we had just created a Pandas dataframe.
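That conversion is a single call; the sketch below uses a toy stand-in for the real dictionary of lists (the variable names movieDetails and dfmovieDetails follow the post, the toy values are mine):

```python
import math

import pandas as pd

# Toy stand-in for the dictionary returned by moviesDict(); each key
# becomes a column and each list becomes that column's values
movieDetails = {
    "Title": ["Avatar", math.nan, "Titanic"],
    "Year": ["2009", math.nan, "1997"],
}
dfmovieDetails = pd.DataFrame(movieDetails)
```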

When we check the structure of the dataframe above, we note that there are 734 movies with empty titles; this means the OMDb API did not recognize those movie titles. We can see exactly which movies are not being recognized by doing the following:
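One way to do that check, sketched on toy data (the real dataframe shows 734 such rows):

```python
import math

import pandas as pd

# Toy stand-in for dfmovieDetails; NaN titles mark unrecognized movies
dfmovieDetails = pd.DataFrame({
    "Title": ["Avatar", math.nan, "Titanic", math.nan],
    "Year": ["2009", math.nan, "1997", math.nan],
})

nan_mask = dfmovieDetails["Title"].isna()
print(nan_mask.sum())            # how many titles OMDb did not recognize
print(dfmovieDetails[nan_mask])  # the unrecognized rows themselves
```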

By manually requesting a couple of these movies, we notice a discrepancy between the movie year in the budget dataframe and the year the OMDb API expects. So we drop all of the above movies from our dataframe, reset the index, and convert it back to a dictionary that we can append movies to.
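Those three steps, sketched on the same toy data:

```python
import math

import pandas as pd

dfmovieDetails = pd.DataFrame({
    "Title": ["Avatar", math.nan, "Titanic"],
    "Year": ["2009", math.nan, "1997"],
})

# Drop the rows OMDb did not recognize, renumber the index, and go
# back to a dict of lists that moviesDict() can keep appending to
dfmovieDetails = dfmovieDetails.dropna(subset=["Title"]).reset_index(drop=True)
movieDetails = dfmovieDetails.to_dict(orient="list")
```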

We can pass the ‘NaN’ movies through moviesDict() using just the titles of the movies along with the dictionary above.
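A sketch of that retry pass, assuming the dict-of-lists shape used above (the helper name retry_by_title and its parameters are mine, not the post's):

```python
import math

import requests

API_KEY = "YOUR_OMDB_API_KEY"  # placeholder key

def retry_by_title(nan_titles, movieDetails):
    """Re-request the unrecognized movies by title alone (no year),
    appending each response to the existing dict of lists."""
    for title in nan_titles:
        response = requests.get(
            "http://www.omdbapi.com/",
            params={"apikey": API_KEY, "t": title},
        ).json()
        for key in movieDetails:
            # Still-unrecognized fields stay NaN
            movieDetails[key].append(response.get(key, math.nan))
    return movieDetails
```

Dropping the year parameter sidesteps the year discrepancy between the budget data and OMDb.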

Again, this dictionary is converted to a dataframe, and we look for movies with ‘NaN’ titles. We extract the indices of these movies and map them back onto the movie list we fed into moviesDict(). Since dfmovieDetails had already grown before this pass, we subtract the index where it stopped to align with our ‘NaN’ movie list.
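The alignment step looks roughly like this on toy data (the offset value and variable names are illustrative, not from the post):

```python
import math

import pandas as pd

# Rows 0-2 came from the first pass; rows 3-4 were appended by the
# retry pass, so their dataframe indices start at offset = 3
dfmovieDetails = pd.DataFrame({
    "Title": ["Avatar", "Titanic", "Up", math.nan, "Heat"],
})
nan_movie_list = ["Bad Title", "Heat"]  # what we fed into the retry pass
offset = 3                              # index where the first pass stopped

still_nan_idx = dfmovieDetails[dfmovieDetails["Title"].isna()].index
# Subtract the offset so the indices line up with nan_movie_list
still_missing = [nan_movie_list[i - offset] for i in still_nan_idx]
```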

WARNING: You must use the list of movies with ‘NaN’ titles here, since that is the list we fed into moviesDict().

After this point, we can see that we have reduced our ‘NaN’ data from 734 movies to 221. We can do some further cleaning or stop here. I ended up stopping after I reduced the ‘NaN’ data to 183 movies, which is less than 4% ‘NaN’ title data!

Once we are satisfied with how clean our data is, we drop any remaining ‘NaN’ title data.

And then we merge the above dataframe with the budget dataframe.
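A sketch of that merge on toy data (the key column 'Title' and the inner join are my assumptions; merge on whatever key your two dataframes share):

```python
import pandas as pd

dfbudget = pd.DataFrame({
    "Title": ["Avatar", "Titanic"],
    "Budget": [237_000_000, 200_000_000],
})
dfmovieDetails = pd.DataFrame({
    "Title": ["Avatar", "Titanic"],
    "Year": ["2009", "1997"],
})

# Inner merge on Title keeps only movies present in both dataframes
dfMovies = dfbudget.merge(dfmovieDetails, on="Title", how="inner")
```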

And that is it! This was dirty work, but it would not be called mining if it were not.
