Python Fundamentals For Citizen Data Scientist 1 — Managing Datasets


A citizen data scientist is a person who creates or generates models that leverage predictive or prescriptive analytics, but whose primary job function is outside of the field of statistics and analytics (Gartner).

To become a data scientist, one needs to acquire skills in a programming language such as Python (DataCarpenter).

The aim of this post is to introduce novices to the fundamentals of Python programming. Specifically, we will look at the data types found in a dataset and how to handle them in Python.

[1] Python Code Editor

The simplest way of learning Python programming is through the Google Colab Platform. Click this link to start using it.

Type the following code:

print("Hello World")

And then, press the keyboard keys [CTRL]+[ENTER] or click the round-shaped play icon to run the code.

Colab will display the text:

Hello World

So easy :-) .


[2] Python Variables and their Basic Data Types

Computer programs need to store data in memory before processing it. In programming, the named containers that hold data are called “variables”.

The types of data will determine the way they will be processed.

Some basic data types are numbers (which can be further categorized into Integers, i.e. whole numbers, and Floats, i.e. numbers with a fractional part), Strings (which consist of letters, digits, punctuation, etc.), Dates and Booleans (i.e. True or False).
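As a quick illustration of these types, here is a minimal sketch. The variable names are hypothetical and chosen only for this example. Python has no built-in date literal, so the sketch assumes the `date` class from the standard `datetime` module is an acceptable way to hold a date.

```python
from datetime import date

# Hypothetical example values, one per basic data type
passenger_age = 22                           # Integer: a whole number
ticket_fare = 7.25                           # Float: a number with a fractional part
passenger_name = 'Braund, Mr. Owen Harris'   # String: text enclosed in quotes
has_survived = False                         # Boolean: True or False
sailing_date = date(1912, 4, 10)             # Date: year, month, day

# Print each value next to its data type
for value in (passenger_age, ticket_fare, passenger_name, has_survived, sailing_date):
    print(value, type(value))
```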

Let’s take the first record of the Titanic Dataset.

We declare one variable per field to store that record as follows:

PassengerId= 1
Survived= 0
Pclass= 3
Name= 'Braund, Mr. Owen Harris'
Sex= 'male'
Age= 22
SibSp= 1
Parch= 0
Ticket= 'A/5 21171'
Fare= 7.25
Cabin= ''
Embarked= 'S'

Altogether, the code above declares 12 variables that hold data of several data types.

Colab displays data values in red and green: green represents numbers (Integers or Floats) and red represents Strings. Strings must be enclosed in either a pair of single quotes ('') or a pair of double quotes (""). A String can be empty, e.g. Cabin, which holds a pair of quotes with nothing between them.
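The quoting rules can be checked with a small sketch. The names below (`name_single`, `name_double`, `cabin`, `quoted_text`) are hypothetical and used only for illustration.

```python
# Single and double quotes produce the same kind of String
name_single = 'Braund, Mr. Owen Harris'
name_double = "Braund, Mr. Owen Harris"
print(name_single == name_double)

# An empty String has a length of zero
cabin = ''
print(len(cabin))

# Double quotes are convenient when the text itself contains a single quote
quoted_text = "Passenger's ticket"
print(quoted_text)
```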

We print the variable values using the print() function.

print (PassengerId)
print (Survived)
print (Pclass)
print (Name)
print (Sex)
print (Age)
print (SibSp)
print (Parch)
print (Ticket)
print (Fare)
print (Cabin)
print (Embarked)

Output:


We can also print the variable data types using the type() function.

print (type(PassengerId))
print (type(Survived))
print (type(Pclass))
print (type(Name))
print (type(Sex))
print (type(Age))
print (type(SibSp))
print (type(Parch))
print (type(Ticket))
print (type(Fare))
print (type(Cabin))
print (type(Embarked))

Output:

Identifying the data types that will be used in processing is important because each data type supports a different set of operations.
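To illustrate, the `+` operator adds numbers but joins Strings, and mixing the two raises an error. This is a minimal sketch with hypothetical lowercase variable names; `family_size` is an illustrative calculation, not part of the original dataset.

```python
sibsp = 1
parch = 0

# With Integers, + performs arithmetic addition
family_size = sibsp + parch + 1
print(family_size)

# With Strings, + joins (concatenates) the text
pclass_label = '3'
print(pclass_label + pclass_label)

# Mixing a String and an Integer is not allowed
try:
    print(pclass_label + sibsp)
except TypeError as error:
    print('TypeError:', error)
```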

Sometimes, certain data values may need to be converted into another data type prior to processing to make them more meaningful.

For example, the Survived variable in the Titanic dataset contains only 1 or 0, where 1 means “survived” and 0 means “did not survive”. We can convert this value into a Boolean data type (i.e. True or False), where True means “survived” and False means otherwise. This conveys more meaning than the numbers 1 and 0.

# Redeclare Survived. Convert 0 to False
Survived= False
print (Survived)
print (type(Survived))

Output:


Another example is the Pclass variable (which represents the passenger class) that contains either 1, 2 or 3. These values are labels and are not meant for numeric calculations. They can be declared as Strings by enclosing the number in quotes.

# Redeclare Pclass. Convert Integer 3 to String '3'
Pclass = '3'
print(Pclass)
print(type(Pclass))

Output:


We have seen in the above examples that variables can be reassigned with new values. Each time a new value is given to a variable, its content changes, and so does its data type. Be careful with this, as it may affect data processing results at a later stage.
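Instead of retyping a value by hand, Python's built-in functions `bool()`, `str()` and `int()` can perform the conversion. The sketch below is one possible approach; it stores each result under a new, hypothetical name so that the original value and its type stay untouched.

```python
survived = 0
pclass = 3
age_text = '22'

# Convert into new variables so the originals keep their types
survived_flag = bool(survived)   # 0 becomes False, 1 would become True
pclass_label = str(pclass)       # Integer 3 becomes String '3'
age = int(age_text)              # String '22' becomes Integer 22

print(survived_flag, type(survived_flag))
print(pclass_label, type(pclass_label))
print(age, type(age))

# Pitfall: any non-empty String counts as True, even the text '0'
print(bool('0'))
```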

[3] Collection Data Types

The above example demonstrates only one record out of the total of 891 records in the Titanic dataset.

Let’s look at the first 5 records.


To store this kind of data, we need a collection data type. In Python, this is called a List.

The first record can be declared as a list as follows:

record=[1,0,3,'Braund, Mr. Owen Harris','male',22,1,0,'A/5 21171',7.25,'','S']
print(record)
print(type(record))

Output:

To store five records, we declare five lists, enclose them in another pair of square brackets, and separate them with commas as follows:

list_record=[
    [1,0,3,'Braund, Mr. Owen Harris','male',22,1,0,'A/5 21171',7.25,'','S'],
    [2,1,1,'Cumings, Mrs. John Bradley (Florence Briggs Thayer)','female',38,1,0,'PC 17599',71.2833,'C85','C'],
    [3,1,3,'Heikkinen, Miss. Laina','female',26,0,0,'STON/O2. 3101282',7.925,'','S'],
    [4,1,1,'Futrelle, Mrs. Jacques Heath (Lily May Peel)','female',35,1,0,'113803',53.1,'C123','S'],
    [5,0,3,'Allen, Mr. William Henry','male',35,0,0,'373450',8.05,'','S']
]
print(list_record)
print(type(list_record))
print(len(list_record))

Output:


Use the len() function to get the count of all records.
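Individual records and values in a list of lists can be reached by their position, counting from 0. The sketch below uses a shortened, hypothetical `rows` list (two records, six fields each) to keep the example small.

```python
# A shortened list of records: two rows with six fields each
rows = [
    [1, 0, 3, 'Braund, Mr. Owen Harris', 'male', 22],
    [2, 1, 1, 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)', 'female', 38],
]

first_row = rows[0]        # the whole first record
first_name = rows[0][3]    # fourth field of the first record
last_age = rows[-1][5]     # sixth field of the last record

# Collect the same field from every record
ages = [row[5] for row in rows]

print(first_row)
print(first_name)
print(last_age)
print(ages)
```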

We can also store the values column by column, with one list per column. For example:

list_passenger_id=[1,2,3,4,5]

list_survived=[0,1,1,1,0]

list_sex =['male','female','female','female','male']

In this way, the first item in each list represents the first record in the Titanic dataset.
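One way to see this relationship is to stitch the column lists back into records with the built-in `zip()` function. This is a minimal sketch; the names `records` and `survivor_count` are hypothetical.

```python
list_passenger_id = [1, 2, 3, 4, 5]
list_survived = [0, 1, 1, 1, 0]
list_sex = ['male', 'female', 'female', 'female', 'male']

# zip() pairs up the items that share the same position in each list
records = [list(row) for row in zip(list_passenger_id, list_survived, list_sex)]
print(records[0])

# A column list is convenient for calculations on a single field
survivor_count = sum(list_survived)
print(survivor_count)
```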

Programming is actually a creative abstraction of real world problems :-).

Besides List, there are several other collection data types such as Tuple, Set and Dictionary. (Read more about them here → Organize Data Using List, Tuple, Set and Dictionary)
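As a brief taste of these types, the sketch below stores part of the first Titanic record in a Dictionary and a Tuple, and uses a Set to find unique values. The variable names are hypothetical and chosen only for this illustration.

```python
# Dictionary: each value is stored under a descriptive key
record_dict = {
    'PassengerId': 1,
    'Survived': 0,
    'Pclass': 3,
    'Name': 'Braund, Mr. Owen Harris',
    'Sex': 'male',
}
print(record_dict['Name'])

# Tuple: like a List, but its items cannot be changed after creation
record_tuple = (1, 0, 3)
print(record_tuple[2])

# Set: keeps only the unique values
unique_sex = set(['male', 'female', 'female', 'female', 'male'])
print(len(unique_sex))
```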

The List data type is very useful for managing datasets.

The Python ecosystem offers even more powerful packages for managing datasets, such as the Python Data Analysis (Pandas) library. Pandas saves us a lot of time and effort in data manipulation work. Let’s have a look at it.

[4] Pandas DataFrame

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive (PyData.org).

Pandas organizes datasets in table-like structures: 1-dimensional tables (aka Series) and 2-dimensional tables (aka DataFrames).

Since this is an additional package, we need to import it first:

import pandas as pd

To create a Series, declare as follows:

# create a series from list_survived
ds_survived = pd.Series(list_survived)
ds_survived.info()  # info() prints the summary itself and returns None, so no print() is needed
ds_survived

Output:


To create a DataFrame, declare as follows:

# create a dataframe from list_record
df_record=pd.DataFrame(list_record)
df_record.info()  # info() prints the summary itself and returns None, so no print() is needed
df_record

Output:


The list has been converted into a 2-Dimensional table known as a DataFrame.

With DataFrames, many kinds of data manipulation tasks, such as selecting columns, filtering rows and computing summaries, become easier and more efficient.
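A few such tasks can be sketched on a small, separate DataFrame. The name `df_demo` is hypothetical, and the sketch assumes the column names are passed at creation through the `columns` argument so that columns can be referred to by name.

```python
import pandas as pd

# A small demo DataFrame with named columns (a subset of the Titanic fields)
df_demo = pd.DataFrame(
    [
        [1, 0, 3, 'male', 22, 7.25],
        [2, 1, 1, 'female', 38, 71.2833],
        [3, 1, 3, 'female', 26, 7.925],
        [4, 1, 1, 'female', 35, 53.1],
        [5, 0, 3, 'male', 35, 8.05],
    ],
    columns=['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'Fare'],
)

# Select a single column (the result is a Series)
ages = df_demo['Age']

# Keep only the rows where Survived equals 1
survivors = df_demo[df_demo['Survived'] == 1]

# Compute the average of a column
mean_age = df_demo['Age'].mean()

print(ages.tolist())
print(len(survivors))
print(mean_age)
```

Each of these steps would need an explicit loop with plain lists, which is one reason DataFrames are convenient for dataset work.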

Next, let’s rename the columns:

df_record.columns = ['PassengerId','Survived','Pclass','Name','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin','Embarked']
df_record

Output:


[5] Importing Datasets

Instead of copying and pasting data manually, we can fetch data from Internet sources directly into a Pandas DataFrame.
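The function that does this work is `pd.read_csv()`. To see what it does without a network connection, the sketch below feeds it a short piece of CSV text held in memory via `io.StringIO`. The two rows and the names `csv_text` and `df_sample` are assumptions made for this illustration only.

```python
import io

import pandas as pd

# Two sample rows in CSV form; names containing a comma are wrapped in double quotes
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age
1,0,3,"Braund, Mr. Owen Harris",male,22
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38
"""

# read_csv accepts a file-like object as well as a file path or URL
df_sample = pd.read_csv(io.StringIO(csv_text))

print(df_sample.shape)    # (number of rows, number of columns)
print(df_sample.dtypes)   # data type that Pandas inferred for each column
```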

For example, we can fetch the Titanic data set from https://archive.org/download/misc-dataset/titanic.csv as follows:

import pandas as pd
pd.set_option('display.max_colwidth', None)
file_url='https://archive.org/download/misc-dataset/titanic.csv'
df_orig = pd.read_csv(file_url,encoding='utf-8')
df_orig.info()  # info() prints the summary itself and returns None, so no print() is needed
df_orig.head()

Output:


We are done with basic dataset management. Next, we will perform some transformation tasks to make the dataset more efficient for data processing.

Colab Notebook: Python Fundamentals For Citizen Data Scientist 1 (Google Colab, colab.research.google.com)

🤓

❖❖❖❖❖❖❖❖❖❖

Python Fundamentals For Citizen Data Scientist Series

This article is a part of a series:

  1. Managing Datasets

  2. Data Transformation

  3. Descriptive Analysis

  4. Descriptive Analysis Visualization

  5. Skewness

  6. Regression

  7. Classification

❖❖❖❖❖❖❖❖❖❖