I have a column that contains birth dates in my dataset, but according to GDPR rulings I am not allowed to show this. Is it ok if I transform the values from this column to only show ages?
In my dataset I have two columns that contain information about someone's name. Can I combine them in a way that it will show the last name, first name as full name?
Are Null values something I should keep in my dataset, or should I handle those in a way that I can work with it in my model?
All questions we ask ourselves when dealing with data(sets). And there are many many more. Every dataset that we ingest (retrieve/load) needs to be checked and where possible needs to be transformed to get a suitable output for our use case, either being storing the data in a data warehouse, visualizing the data in a BI tool or preparing the data for a Machine Learning Model.
Data Points
To start, let's first go back to the basics. Data consists of so called data points. A data point is a unique point of observation that contains information.
When we work with data we would normally have many data points.
So let's assume we have a row with the values [ 0 1 2 3 ]. Then this row has 4 data points. Similarly we can have a column with the values [ 0 1 2 3 ] also containing the same number of data points. When we combine the two as an example we might have something like:
[ 0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6 ]
This set of data contains a total of 4x4=16 observations or data points.
If we take any data point as point = ( Xi, Yi ) then we can start thinking of transforming this particular point in our dataset.
Data Transformation
There are endless possibilities for transforming data. But for each and every transformation we will actually follow a simple mathematical equation:
For transforming our datapoint ( Xi, Yi ), we need to apply a function that will create a new data point: ( Pi, Qi ).
The function that we will apply depends on our initial analysis of the data and how we want the output to be.
Getting to our output value might take several functions to be applied. In some cases one function will do, e.g. transforming a text value (containing numbers) to a number value. However, most cases are more complex and require more steps.
Let's go back to our three examples at the start of this post to see how that works. For each one of them we apply a function to transform the data:
Birth date to Age
In this example we want to transform a date value to show the current age. The functions that we need to apply are the following:
Add a new column with the Date of today (function to get Today's date)
Add a new column with the date difference between our original value and Today's date (function to get number of days between start and end date)
Divide the number of days by 365 to get an approximate number of years as decimal value. This will not count for leap years. (function to divide two numbers)
Round the number to get an age value (function round number)
Creating a full name column
Based on different columns we want to create a column that combines the values into one value for full name:
Merge the columns First Name and Last Name with a space in between (function concatenate or merge values)
Deal with Null values
Dealing with null values is a complex topic and those values can be handled in different ways. If the data points in the dataset contain null values, but not all values in a row or column contain null values then many transformations can be done. For this post we only look at one example:
Get the mean (or median) value of the set if they are all in the same format (function get mean value)
Replace the null values with the mean (or median) value (function to replace the null values)
Conclusion
While doing data transformations many (complex) steps can be taken. Every use case of data transformation has its own specifics, but in many coding languages functions are available to perform those transformations. This post is an introduction to data transformations and starts with the basic concept of data points. In upcoming posts more detailed examples will be provided on the ways of doing data transformations.
Comments