
Python Fundamentals For Citizen Data Scientist 2 — Data Transformation


Data transformation is defined as the technical process of converting data from one format, standard, or structure to another, without changing the content of the datasets, in order to improve data quality (Spiceworks.com). It is one of the important tools in statistical analysis (Stats.LibreTexts.org): by transforming raw data into a more analyzable form, it paves the way for data-driven decision making (Funnel.io).

In the previous article, we saw a sample of Titanic dataset records in card and table form.

A Titanic record in card form


Titanic Records in table form

We can use Python to transform the data in a number of ways.

[0] Get the dataset

import pandas as pd
# set dataframe max column width option
pd.set_option('display.max_colwidth', None)
# set data source url
file_url='https://archive.org/download/misc-dataset/titanic.csv'
# read data
df_orig = pd.read_csv(file_url,encoding='utf-8')
# print dataframe info
print(df_orig.info())
# print dataframe head (top 5 records)
df_orig.head()

output:


pandas dataframe information

The pandas dataframe information above tells us that some columns, namely Age, Cabin and Embarked, have missing values.


pandas dataframe — the first 5 records

The pandas dataframe sample rows above indicate that some columns, namely Age, Sex and Embarked, could be converted into numerical indexes for better data processing. For example …

  • The Sex values i.e. male or female, could be represented by 0 for male and 1 for female.

  • The Embarked values i.e. S (Southampton), C (Cherbourg) and Q (Queenstown), could be represented by 0 for Southampton, 1 for Cherbourg and 2 for Queenstown.

  • The Age values could be represented by Age Group (that differentiates between a child and an adult, assuming that the child age is below 13) e.g. 0 for Age<13 and 1 for Age≥13.

Index numbers here merely represent categorical values; you cannot meaningfully compare these numbers or subtract them from each other (Developers.Google.Com).
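The mappings above can be sketched with pandas, using a small made-up sample (not the actual Titanic data):

```python
import pandas as pd

# a tiny, hypothetical sample to illustrate the proposed mappings
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
    'Age': [22.0, 8.0, 35.0],
})
# map categorical values to index numbers
sample['Sex'] = sample['Sex'].map({'male': 0, 'female': 1})
sample['Embarked'] = sample['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
# age group: 0 for Age < 13, 1 for Age >= 13
sample['AgeGroup'] = (sample['Age'] >= 13).astype(int)
print(sample)
```

We will see more systematic ways to do this (replace, dummies and binning) in section [2] below.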

[1] Drop or Impute missing values

In the above example, only 714 of the 891 records have valid Age values.

# print the record count of missing age values
print(len(df_orig[df_orig['Age'].isna()]))

# print the record containing missing age values
df_orig[df_orig['Age'].isna()]

output:


177 records contain missing Age values.

To handle these records, we may drop the records or impute their values (DataCamp.com).

[1.1] Drop the records

Filter the original dataframe by dropping records that contain missing Age values.

# filter the original dataframe by dropping records that contain missing Age values
df_filtered = df_orig.dropna(subset=['Age']).copy()
# print dataframe info
df_filtered.info()
# print dataframe head (top 5 records)
df_filtered.head()

Or, alternatively, apply the filter to the original dataframe in place. Bear in mind that by applying the changes to the original dataframe, we lose data that might be useful at later stages.

# alternatively, apply the filter to the original dataframe itself
# but we will lose the original data
df_orig.dropna(subset=['Age'], inplace=True)
# print dataframe info
print(df_orig.info())
# print dataframe head (top 5 records)
df_orig.head()

[1.2] Impute the values

Use the rounded mean of the Age for the imputed values.

# impute using mean values
# get a rounded mean value for Age
mean_value = df_orig['Age'].mean().round()
print('mean_value:',mean_value)
# create a df copy of df_orig
df_imputed_mean = df_orig.copy()
# impute the Age values for the df copy
df_imputed_mean['Age'] = df_imputed_mean['Age'].fillna(mean_value)
# print df copy info
print(df_imputed_mean.info())
# print selected df copy records for Age equal mean_value 
df_imputed_mean.loc[df_imputed_mean.Age==mean_value]
# we get 202 instead of 177 (177 missing + 25 valid values)

output:


Or, alternatively, use the rounded median. The median can be helpful because it is not sensitive to outliers (QuantHub.com).
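A quick illustration of that robustness, using a small made-up sample with one extreme value:

```python
import statistics

# hypothetical ages with one outlier (80)
ages = [5, 6, 7, 8, 80]
print(statistics.mean(ages))    # 21.2 -- pulled up by the outlier
print(statistics.median(ages))  # 7 -- unaffected by the outlier
```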

# impute using median values
# get a rounded median value for Age
median_value = df_orig['Age'].median().round()
print('median_value:',median_value)
# create a df copy of df_orig
df_imputed_median = df_orig.copy()
# impute the Age values for the df copy
df_imputed_median['Age'] = df_imputed_median['Age'].fillna(median_value)
# print df copy info
print(df_imputed_median.info())
# print selected df copy records for Age equal median_value
df_imputed_median.loc[df_imputed_median.Age==median_value]
# the count exceeds 177: the 177 imputed records plus the records whose Age already equals the median

output:


Mean and Median are applicable to numeric values only.

For categorical values (e.g. Embarked contains either S, C or Q), apply the Mode (StatCan.gc.ca).

# impute Embarked using mode values
# get the mode value for Embarked
embarked_mode_value = df_orig['Embarked'].mode()[0]
print('embarked_mode_value:',embarked_mode_value)
# create a df copy of df_orig
df_imputed_embarked_mode = df_orig.copy()
# impute the Embarked values for the df copy
df_imputed_embarked_mode['Embarked'] = df_imputed_embarked_mode['Embarked'].fillna(embarked_mode_value)
# print df copy info
print(df_imputed_embarked_mode.info())
# print selected df copy records for Embarked equal embarked_mode_value
df_imputed_embarked_mode.loc[df_imputed_embarked_mode.Embarked==embarked_mode_value]
# the count includes the 2 imputed records plus all records that already had the mode value

output:


[2] Replace, Generate Dummies or Bin the values

To use index numbers to represent categorical data values, we may (1) replace them with index numbers or (2) generate dummy values for them (Statology.org).

To group numerical values into specified ranges, we apply a technique called binning (Scaler.com). This reduces the number of unique values in a feature, which can be beneficial for encoding the data.

[2.1] Replace

Use the replace() function:

# replace categorical values with index numbers
# (assigning back to the column avoids the deprecated inplace Series.replace)
df_imputed_embarked_mode['Embarked'] = df_imputed_embarked_mode['Embarked'].replace(['S', 'C', 'Q'], [0, 1, 2])
# print df
df_imputed_embarked_mode

output:


[2.2] Generate Dummies

The idea of generating dummies is to create a new column for each category (using the category as the column name) and assign a value of 1 to the rows that belong to that category. Hence, the new columns are the “dummies” of the original column.

Use the get_dummies() function:

# generate dummies for Embarked
# (get_dummies already returns a new dataframe, so no copy is needed)
df_imputed_embarked_mode_dummies = pd.get_dummies(df_imputed_embarked_mode, columns=['Embarked'])

df_imputed_embarked_mode_dummies

output:


Be careful with the “Dummy Variable Trap” (Statology.org), i.e. when the number of dummy variables created equals the number of values the categorical variable can take on. This leads to multicollinearity, which causes incorrect calculations of regression coefficients and p-values. Tip: if a variable can take on N different values, create only N-1 dummy variables.
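A small sketch of why the trap arises, with a hypothetical sample: when all N dummies are present, the dummy columns always sum to 1, so any one column is a perfect linear combination of the others.

```python
import pandas as pd

# hypothetical Embarked values
s = pd.Series(['S', 'C', 'Q', 'S', 'C'], name='Embarked')
dummies = pd.get_dummies(s)
# each row has exactly one dummy set, so the columns sum to 1 everywhere
print(dummies.sum(axis=1).tolist())  # [1, 1, 1, 1, 1]
```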

In Python, pass the parameter drop_first=True for this purpose.

Example:

# avoiding dummy variable trap, 
# create only 2 dummy variables 
# from 3 different values of Embarked

df_imputed_embarked_mode_dummies = pd.get_dummies(df_imputed_embarked_mode, columns=['Embarked'], drop_first=True)

df_imputed_embarked_mode_dummies

output:


[2.3] Grouping data values (Data Binning)

In the Titanic dataset, Age is an example of a suitable candidate for data binning.

Use cut() function:

# define labels: 0 = kid (0 to 12 years old), 1 = adult (13 years old and above)
cut_labels = [0,1]
# define cut-off points: 0 is the starting value; 12 and 200 are the upper limits
cut_bins = [0,12,200]
df_imputed_median['Adult'] = pd.cut(df_imputed_median['Age'], bins=cut_bins, labels=cut_labels)
# check records with ages between 11 and 14 (around the cut-off)
df_imputed_median[(df_imputed_median.Age>10) & (df_imputed_median.Age<15)]

output:


(Read further on the use of cut and qcut)
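As a brief taste of the difference: while cut bins by fixed value ranges, qcut bins by quantiles, so each bin holds roughly the same number of records. A minimal sketch with made-up ages:

```python
import pandas as pd

# hypothetical ages, split into quartiles (4 roughly equal-sized bins)
ages = pd.Series([2, 5, 8, 13, 21, 30, 38, 47, 55, 70])
quartiles = pd.qcut(ages, q=4, labels=[0, 1, 2, 3])
print(quartiles.tolist())
```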

Colab Notebook:

Google Colab

Python Fundamentals For Citizen Data Scientist 2

colab.research.google.com

🤓