Preparing for a Technical Assessment that Requires Data Merging Tasks in Python
Last week, I received an email invitation to participate in a technical assessment. In the email, I was provided some background information about the technical assessment before I had to take it. Basically, I was asked to perform a data merging task linking together different data, and the assessment would also take place on a website called HackerRank. I had the choice to use Python or any other language. In all my previous projects, I had done this before using Python; even so, I knew the smart thing to do in preparation for this assessment would be to review concat(), merge(), and join() from the Pandas package.
Resources Used for Review
The best place to review any functions from a particular package such as Pandas is to look up the official documentation. In most cases, the official documentation has many explanations and examples that will show you the ropes quickly and effectively. To review for this assessment, I looked at Merge, join, concatenate and compare. This documentation coupled with Pandas Joining and merging DataFrame : Exercises, Practice, Solution provided a great place to review and practice.
After doing all the above review, I came into the assessment very refreshed and confident about how to merge data. The assessment however had some surprises that were outside of only merging. I had to know how to import data from comma-separated values (CSV) files. I achieved this by importing the CSV files into Pandas dataframes which is achieved as shown below:
import pandas as pddataframe = pd.read_csv(‘[path to file]’)
Once the data has been imported into dataframes, it is important to check the structure of the dataframes (e.g. datatypes), and the format of the values in each column to look for any abnormalities or just get a feel on what type of data you are messing with. I used dataframe.info() which gives a concise summary of the dataframe and datarame.head() which gives the top 5 rows of the dataframe.
From doing the above, one example in my assessment is that I realized the two dataframes shared a common key column, but the values in the key columns were not exactly the same. The key column in one of the dataframes had to be cleaned; that is, there were prefix values that needed to be removed. In my particular case, the key columns were identical with the exception that one of them had numbers and letters that preceded the matching key. So key column values from dataframe 1 were formatted as ‘#*%$/1235–546-adbs’ and key column values from dataframe 2 were formatted as ‘1235–546-adbs’ I cleaned the column by using the below code that iterates through each row of the column and splits off the key from the rest of the preceding information from the ‘/’ and saves just the key value in the key_column:
dataframe1.loc[:, ‘key_column’] = [x.split(‘/’)[-1] for x in dataframe1[key_column]]
Once the dataframes had matching keys and all other cleaning was done, I merged my two dataframes how was asked, and uploaded my results to a master CSV file by doing the following:
And that is it! That is how I prepared for my technical assessment that had me perform data merging tasks. I hope this helps you as well whether it is just for edification or an upcoming technical assessment. Please check out my other blogs for more cool stuff.