<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Lab Works]]></title><description><![CDATA[Lab Works]]></description><link>https://labworks.razzi.my</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1733468829434/dbc31e00-733c-4aa7-b141-23953a7785b8.jpeg</url><title>Lab Works</title><link>https://labworks.razzi.my</link></image><generator>RSS for Node</generator><lastBuildDate>Thu, 16 Apr 2026 08:45:18 GMT</lastBuildDate><atom:link href="https://labworks.razzi.my/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Python Fundamentals For Citizen Data Scientist 2 — Data Transformation]]></title><description><![CDATA[Data transformation is defined as the technical process of converting data from one format, standard, or structure to another — without changing the content of the datasets — to improve the data quality (Spiceworks.com). This is one of the important ...]]></description><link>https://labworks.razzi.my/python-fundamentals-for-citizen-data-scientist-2-data-transformation</link><guid isPermaLink="true">https://labworks.razzi.my/python-fundamentals-for-citizen-data-scientist-2-data-transformation</guid><category><![CDATA[data transformation]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Wed, 18 Feb 2026 03:40:05 GMT</pubDate><content:encoded><![CDATA[<p><img src="https://miro.medium.com/v2/resize:fit:875/0*IvldBDMkyL2xfrDN" alt /></p>
<p>Data transformation is defined as the technical process of converting data from one format, standard, or structure to another — without changing the content of the datasets — to improve the data quality (<a target="_blank" href="https://www.spiceworks.com/tech/big-data/articles/what-is-data-transformation/">Spiceworks.com</a>). This is one of the important tools in statistical analysis (<a target="_blank" href="https://stats.libretexts.org/Bookshelves/Applied_Statistics/Biological_Statistics_(McDonald)/04%3A_Tests_for_One_Measurement_Variable/4.06%3A_Data_Transformations">Stats.LibreTexts.org</a>). By transforming raw data into a more analyzable form, it paves the way for data-driven decision making (<a target="_blank" href="https://funnel.io/blog/what-is-data-transformation">Funnel.io</a>).</p>
<p>In the previous article, we saw a sample of Titanic dataset records in card and table form.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:474/0*L7fnfhg_266YnBpF.png" alt /></p>
<p>A Titanic record in card form</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/0*uUGWpb-LN6k2it2m.png" alt /></p>
<p>Titanic Records in table form</p>
<p>We can use Python to transform the data in a number of ways.</p>
<h2 id="heading-0-get-the-dataset"><strong><mark>[0] Get the dataset</mark></strong></h2>
<pre><code class="lang-plaintext">import pandas as pd
# set dataframe max column width option
pd.set_option('display.max_colwidth', None)
# set data source url
file_url='https://archive.org/download/misc-dataset/titanic.csv'
# read data
df_orig = pd.read_csv(file_url,encoding='utf-8')
# print dataframe info
print(df_orig.info())
# print dataframe head (top 5 records)
df_orig.head()
</code></pre>
<p>output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:530/1*h3tA4DIrJQdPa2GqLRAbnw.png" alt /></p>
<p>pandas dataframe information</p>
<p>The pandas dataframe information above tells us that some columns, i.e. <code>Age</code>, <code>Cabin</code> and <code>Embarked</code>, have missing values.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*WDzaNSv5Q6NESM4kz31H9g.png" alt /></p>
<p>pandas dataframe — the first 5 records</p>
<p>The pandas dataframe sample rows above indicate that some columns, i.e. <code>Age</code>, <code>Sex</code> and <code>Embarked</code>, could be converted into a <a target="_blank" href="https://developers.google.com/machine-learning/data-prep/transform/transform-categorical">numerical index</a> for better data processing. For example …</p>
<ul>
<li><p>The Sex values i.e. <code>male</code> or <code>female</code>, could be represented by <code>0 for male</code> and <code>1 for female</code>.</p>
</li>
<li><p>The Embarked values i.e. <code>S (Southampton)</code>, <code>C (Cherbourg)</code> and <code>Q (Queenstown)</code>, could be represented by <code>0 for Southampton</code>, <code>1 for Cherbourg</code> and <code>2 for Queenstown</code>.</p>
</li>
<li><p>The Age values could be represented by Age Group (that differentiates between a child and an adult, assuming that <a target="_blank" href="https://www.reddit.com/r/titanic/comments/15snc0r/what_age_was_a_boy_no_longer_considered_a_child/">the child age is below 13</a>) e.g. <code>0 for Age&lt;13</code> and <code>1 for Age≥13</code>.</p>
</li>
</ul>
<p>Indexed numbers are just for the sake of representing categorical values; you won’t be able to compare these numbers or subtract them from each other (<a target="_blank" href="https://developers.google.com/machine-learning/data-prep/transform/transform-categorical">Developers.Google.Com</a>).</p>
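<p>The mappings listed above can be sketched in pandas as follows. This is only an illustration: the small dataframe below stands in for the Titanic data, and only the column names follow the dataset.</p>
<pre><code class="lang-plaintext"># a sketch of categorical-to-index mapping (sample values, not the real dataset)
import pandas as pd
df = pd.DataFrame({'Sex': ['male', 'female'], 'Embarked': ['S', 'C'], 'Age': [22, 8]})
# map text categories to numerical indexes
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
# derive an age-group flag: 0 for Age below 13, 1 otherwise
df['AgeGroup'] = (df['Age'] &gt;= 13).astype(int)
print(df)
</code></pre>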
<h2 id="heading-1-drop-or-impute-missing-values"><strong><mark>[1] Drop or Impute missing values</mark></strong></h2>
<p>In the example above, only 714 of the 891 records have valid Age values.</p>
<pre><code class="lang-plaintext"># print the record count of missing age values
print(len(df_orig[df_orig['Age'].isna()]))

# print the record containing missing age values
df_orig[df_orig['Age'].isna()]
</code></pre>
<p>output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*xNOH8KTSJ9zIunePIs22YQ.png" alt /></p>
<p>177 records contain missing Age values.</p>
<p>To handle these records, we may drop the records or impute their values (<a target="_blank" href="https://www.datacamp.com/tutorial/techniques-to-handle-missing-data-values">DataCamp.com</a>).</p>
<h3 id="heading-11-drop-the-records"><strong><mark>[1.1] Drop the records</mark></strong></h3>
<p>Filter the original dataframe by dropping records that contain missing Age values.</p>
<pre><code class="lang-plaintext"># filter the original dataframe by dropping records that contain missing Age values
df_filtered = df_orig.dropna(subset=['Age']).copy()
# print dataframe info
df_filtered.info()
# print dataframe head (top 5 records)
df_filtered.head()
</code></pre>
<p>Or, alternatively, apply the filter to the original dataframe itself. Bear in mind that by applying the changes to the original dataframe, we lose data that might be useful at later stages.</p>
<pre><code class="lang-plaintext"># alternatively, apply the filter to the original dataframe itself
# but we will lose the original data
df_orig.dropna(subset=['Age'], inplace=True)
# print dataframe info
print(df_orig.info())
# print dataframe head (top 5 records)
df_orig.head()
</code></pre>
<h3 id="heading-12-impute-the-values"><strong><mark>[1.2] Impute the values</mark></strong></h3>
<p>Use the rounded mean of the Age for the imputed values.</p>
<pre><code class="lang-plaintext"># impute using mean values
# get a rounded mean value for Age
mean_value = df_orig['Age'].mean().round()
print('mean_value:',mean_value)
# create a df copy of df_orig
df_imputed_mean = df_orig.copy()
# impute the Age values for the df copy
df_imputed_mean['Age'] = df_imputed_mean['Age'].fillna(mean_value)
# print df copy info
print(df_imputed_mean.info())
# print selected df copy records for Age equal mean_value 
df_imputed_mean.loc[df_imputed_mean.Age==mean_value]
# we get 202 instead of 177 (177 missing + 25 valid values)
</code></pre>
<p>output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*UkOaycSbYKCeHU3OMRd5NQ.png" alt /></p>
<p>Or, alternatively, use the rounded median. The median can be helpful because it is not sensitive to outliers (<a target="_blank" href="https://www.quanthub.com/how-does-the-size-of-the-dataset-impact-how-sensitive-the-mean-is-to-outliers">QuantHub.com</a>).</p>
<pre><code class="lang-plaintext"># impute using median values
# get a rounded median value for Age
median_value = df_orig['Age'].median().round()
print('median_value:',median_value)
# create a df copy of df_orig
df_imputed_median = df_orig.copy()
# impute the Age values for the df copy
df_imputed_median['Age'] = df_imputed_median['Age'].fillna(median_value)
# print df copy info
print(df_imputed_median.info())
# print selected df copy records for Age equal median_value
df_imputed_median.loc[df_imputed_median.Age==median_value]
# we get 202 instead of 177 (177 missing + 25 valid values)
</code></pre>
<p><mark>output:</mark></p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*qXp3Skxy54K37BQW-bNtMA.png" alt /></p>
<p><mark>Mean and Median are applicable to numeric values only.</mark></p>
<p>For categorical values (e.g. <code>Embarked</code> contains either S, C or Q values), apply the Mode (<a target="_blank" href="https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch11/mode/5214873-eng.htm">StatCan.gc.ca</a>).</p>
<pre><code class="lang-plaintext"># impute Embarked using mode values
# get the mode value for Embarked
embarked_mode_value = df_orig['Embarked'].mode()[0]
print('embarked_mode_value:',embarked_mode_value)
# create a df copy of df_orig
df_imputed_embarked_mode = df_orig.copy()
# impute the Embarked values for the df copy
df_imputed_embarked_mode['Embarked'] = df_imputed_embarked_mode['Embarked'].fillna(embarked_mode_value)
# print df copy info
print(df_imputed_embarked_mode.info())
# print selected df copy records for Embarked equal embarked_mode_value 
df_imputed_embarked_mode.loc[df_imputed_embarked_mode.Embarked==embarked_mode_value]
# the count now also includes the records whose missing Embarked value was filled with the mode
</code></pre>
<p>output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*imFRCTRQOqxq64cnznRUuQ.png" alt /></p>
<h2 id="heading-2-replace-generate-dummies-or-binning-the-values"><strong><mark>[2] Replace, Generate Dummies or Bin the Values</mark></strong></h2>
<p>To use index numbers to represent categorical data values, we may (1) <a target="_blank" href="https://www.statology.org/pandas-sample-with-replacement/">replace</a> them with index numbers or (2) <a target="_blank" href="https://www.statology.org/pandas-get-dummies/">generate dummy</a> values for them (Statology.org).</p>
<p>To group numerical values according to certain specified ranges, we apply a technique called binning (<a target="_blank" href="https://www.scaler.com/topics/binning-in-data-mining/">Scaler.com</a>). This can help to reduce the number of unique values in the feature, which can be beneficial for encoding categorical data.</p>
<h3 id="heading-21-replace"><strong><mark>[2.1] Replace</mark></strong></h3>
<p>Use the replace() function:</p>
<pre><code class="lang-plaintext"># replace the letter codes S, C, Q with 0, 1, 2
# (assigning the result back avoids the deprecated inplace pattern)
df_imputed_embarked_mode['Embarked'] = df_imputed_embarked_mode['Embarked'].replace(['S', 'C','Q'],[0,1,2])
# print df
df_imputed_embarked_mode
</code></pre>
<p>output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*tUtcPBYNKAaUYjxTZRf7Yw.png" alt /></p>
<h3 id="heading-22-generate-dummies"><strong><mark>[2.2] Generate Dummies</mark></strong></h3>
<p>The idea of generating dummies is to create new columns for each category (using them as the column names) and then assigning a value of 1 to the rows that belong to that category. Hence, they are the “dummies” of the original column.</p>
<p>Use get_dummies() function:</p>
<pre><code class="lang-plaintext"># generate dummies for Embarked

df_imputed_embarked_mode_dummies = pd.get_dummies( df_imputed_embarked_mode, columns=['Embarked']).copy()

df_imputed_embarked_mode_dummies
</code></pre>
<p>output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*A-BonEHD7IXdQPJPWI-4qA.png" alt /></p>
<p>Be careful with the “<a target="_blank" href="https://www.statology.org/dummy-variable-trap/">Dummy Variable Trap</a>” (<a target="_blank" href="https://www.statology.org/dummy-variable-trap/">Statology.org</a>), i.e. when the number of dummy variables created is equal to the number of values the categorical variable can take on. This leads to multicollinearity, which causes incorrect calculations of regression coefficients and p-values. Tip: if a variable can take on N different values, create only N-1 dummy variables.</p>
<p>In Python, include a parameter <code>drop_first=True</code> for this purpose.</p>
<p>Example:</p>
<pre><code class="lang-plaintext"># avoiding dummy variable trap, 
# create only 2 dummy variables 
# from 3 different values of Embarked

df_imputed_embarked_mode_dummies = pd.get_dummies( df_imputed_embarked_mode, columns=['Embarked'], drop_first=True).copy()

df_imputed_embarked_mode_dummies
</code></pre>
<p>output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*tP6IV86dJkRFwPUFYs09Tg.png" alt /></p>
<h3 id="heading-23-grouping-data-values-data-binning"><strong><mark>[2.3] Grouping data values (Data Binning)</mark></strong></h3>
<p>In the Titanic dataset, Age is an example of a suitable candidate for data binning.</p>
<p>Use cut() function:</p>
<pre><code class="lang-plaintext"># define labels 0=kid ie 0 to 12 years old, 1=adult ie 13 years old and above
cut_labels = [0,1]
# define cut-off points. 0 is the starting value. 12,200 are the upper limits.
cut_bins = [0,12,200]
df_imputed_median['Adult'] = pd.cut(df_imputed_median['Age'], bins=cut_bins, labels=cut_labels)
# check for ages between 11 to 14
df_imputed_median[(df_imputed_median.Age&gt;10) &amp; (df_imputed_median.Age&lt;15)]
</code></pre>
<p>output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*RzDjnyuQK2CbKm9Hd-5MFQ.png" alt /></p>
<p>(<a target="_blank" href="https://www.statology.org/data-binning-in-python/">Read further on the use of cut and qcut</a>)</p>
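<p>As a quick illustration of the difference (a sketch on made-up values, not the Titanic data): <code>cut()</code> splits the value range into fixed intervals, while <code>qcut()</code> creates equal-frequency bins.</p>
<pre><code class="lang-plaintext">import pandas as pd
ages = pd.Series([2, 8, 15, 22, 30, 41, 55, 70])
# cut: fixed-width ranges (0-12 = kid, 13+ = adult)
fixed = pd.cut(ages, bins=[0, 12, 200], labels=[0, 1])
# qcut: quartiles, each holding roughly the same number of records
quartiles = pd.qcut(ages, q=4, labels=[0, 1, 2, 3])
print(fixed.tolist())     # [0, 0, 1, 1, 1, 1, 1, 1]
print(quartiles.tolist()) # [0, 0, 1, 1, 2, 2, 3, 3]
</code></pre>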
<h2 id="heading-colab-notebook"><strong>Colab Notebook:</strong></h2>
<h2 id="heading-google-colabhttpscolabresearchgooglecomdrive1spcpzskeogaeju8aqxjcsy9jfqnycpfsourcepostpage-493c6040a09d"><a target="_blank" href="https://colab.research.google.com/drive/1sPcpZSkeOGAEJU8AqXjcSy9_JfQNyCPF?source=post_page-----493c6040a09d---------------------------------------"><strong>Google Colab</strong></a></h2>
<h3 id="heading-python-fundamentals-for-citizen-data-scientist-2httpscolabresearchgooglecomdrive1spcpzskeogaeju8aqxjcsy9jfqnycpfsourcepostpage-493c6040a09d"><a target="_blank" href="https://colab.research.google.com/drive/1sPcpZSkeOGAEJU8AqXjcSy9_JfQNyCPF?source=post_page-----493c6040a09d---------------------------------------">Python Fundamentals For Citizen Data Scientist 2</a></h3>
<h2 id="heading-kirwn6stkio"><strong>🤓</strong></h2>
]]></content:encoded></item><item><title><![CDATA[Transforming Data From HTML Tables]]></title><description><![CDATA[Question 1: Find the source of data based on below requirement and fetch them into Power QueryQuestion 2: Find the source “public data” regarding the amount of car sales in the local market.

Question 1 answer : https://web.archive.org/web/2025112111...]]></description><link>https://labworks.razzi.my/transforming-data-from-html-tables</link><guid isPermaLink="true">https://labworks.razzi.my/transforming-data-from-html-tables</guid><category><![CDATA[Power Query]]></category><category><![CDATA[data transformation]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Mon, 16 Feb 2026 07:59:14 GMT</pubDate><content:encoded><![CDATA[<table><tbody><tr><td><p>Question 1: Find the source of data based on below requirement and fetch them into Power Query</p></td></tr><tr><td><p>Question 2: Find the source “public data” regarding the amount of car sales in the local market.</p></td></tr></tbody></table>

<p>Question 1 answer : <a target="_blank" href="https://web.archive.org/web/20251121112439/https://data.gov.my/dashboard/car-popularity">https://web.archive.org/web/20251121112439/https://data.gov.my/dashboard/car-popularity</a>  </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771227836914/c79f9978-d96c-4f1d-80ac-0987160275c3.png" alt class="image--center mx-auto" /></p>
<p>Question 2 answer: <a target="_blank" href="https://www.pcauto.com/my/sales-ranking">https://www.pcauto.com/my/sales-ranking</a></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771228396423/7fd8acde-e4ef-4697-ae85-2e6e2fe69419.png" alt class="image--center mx-auto" /></p>
<p>We managed to get the data, but it is not yet in a usable shape. We need to do some data transformation to prepare it for reporting.</p>
<p>Step 1 — Add Index Column</p>
<p>Go to Add Column</p>
<p>Click Index Column → From 1</p>
<p>Step 2 — Create GroupID</p>
<p>Go to Add Column → Custom Column</p>
<p>Name the column: GroupID</p>
<p>Enter this formula: Number.RoundUp([Index] / 2)</p>
<p>Click OK  </p>
<p>Step 3: Identify Model vs Quantity</p>
<p>Add another custom column:</p>
<p>Add Column → Custom Column</p>
<p>Formula: if Number.Mod([Index], 2) = 1 then "Model" else "Quantity"</p>
<p>Name it: Type  </p>
<p>Step 4: Pivot the Data</p>
<p>Now we reshape the table.</p>
<p>Select the Type column</p>
<p>Go to Transform → Pivot Column</p>
<p>Values column = Column1</p>
<p>Advanced options → Don't Aggregate</p>
<p>Click OK  </p>
<p>Step 5: Fill Down + Remove duplicates</p>
<p>Remove all columns other than Model and Quantity.</p>
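<p>For completeness, the same five reshaping steps can be sketched in Python with pandas; the model names and numbers below are made-up sample values, not the scraped data.</p>
<pre><code class="lang-plaintext">import pandas as pd
# scraped data arrives as a single column alternating model name and quantity
raw = pd.DataFrame({'Column1': ['Model A', '3500', 'Model B', '2100']})
raw['Index'] = range(1, len(raw) + 1)        # Step 1: add an index column from 1
raw['GroupID'] = (raw['Index'] + 1) // 2     # Step 2: Number.RoundUp([Index] / 2)
raw['Type'] = ['Model' if i % 2 == 1 else 'Quantity'
               for i in raw['Index']]        # Step 3: identify model vs quantity
tidy = raw.pivot(index='GroupID', columns='Type',
                 values='Column1')           # Step 4: pivot without aggregation
tidy = tidy[['Model', 'Quantity']].reset_index(drop=True)  # Step 5: keep Model and Quantity
print(tidy)
</code></pre>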
<p>Download example:</p>
<p><a target="_blank" href="https://archive.org/download/analytica/web-scrape-carmodel-quantity.xlsx">https://archive.org/download/analytica/web-scrape-carmodel-quantity.xlsx</a></p>
<p>You can view the applied steps in the downloaded example workbook.</p>
]]></content:encoded></item><item><title><![CDATA[Python Fundamentals For Citizen Data Scientist 1 — Managing Datasets]]></title><description><![CDATA[A citizen data scientist is a person who creates or generates models that leverage predictive or prescriptive analytics, but whose primary job function is outside of the field of statistics and analytics (Gartner).
To become a data scientist, one nee...]]></description><link>https://labworks.razzi.my/python-fundamentals-for-citizen-data-scientist-1-managing-datasets</link><guid isPermaLink="true">https://labworks.razzi.my/python-fundamentals-for-citizen-data-scientist-1-managing-datasets</guid><category><![CDATA[citizen data scientist]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Mon, 16 Feb 2026 02:13:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771207975419/3b6134f3-ffe6-4009-a001-1139e58699f2.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A citizen data scientist is a person who creates or generates models that leverage predictive or prescriptive analytics, but whose primary job function is outside of the field of statistics and analytics (<a target="_blank" href="https://www.gartner.com/smarterwithgartner/how-to-use-citizen-data-scientists-to-maximize-your-da-strategy">Gartner</a>).</p>
<p>To become a data scientist, one needs to acquire skills in programming language such as Python (<a target="_blank" href="https://www.datacarpenter.com/post/learning-plan-citizen-data-scientist">DataCarpenter</a>).</p>
<p>The aim of this post is to introduce novices to the fundamentals of Python programming. Specifically, we will look at the data types in dataset and how to handle them in Python.</p>
<h2 id="heading-1-python-code-editor"><strong>[1] Python Code Editor</strong></h2>
<p>The simplest way of learning Python programming is through the <a target="_blank" href="https://colab.research.google.com//">Google Colab Platform</a>. Click this <a target="_blank" href="https://colab.research.google.com/">link</a> to start using it.</p>
<p>Type the following code:</p>
<p><code>print("Hello World")</code></p>
<p>And then, press the keyboard keys [CTRL]+[ENTER] or click the round-shaped play icon to run the code.</p>
<p>Colab will display the text:</p>
<p><code>Hello World</code></p>
<p>So easy :-) .</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*tJa57G_HHcoK4hxwhYAw6w.png" alt /></p>
<h2 id="heading-2-python-variables-and-their-basic-data-types"><strong>[2] Python Variables and their Basic Data Types</strong></h2>
<p>Computer programs need to store data in memory before processing it. In programming, these named containers for data are called “variables”.</p>
<p>The types of data will determine the way they will be processed.</p>
<p>Some basic data types are numbers (which can be further categorized into <strong>Integers</strong> i.e. <em>whole numbers</em> or <strong>Floats</strong> i.e. <em>numbers with a fractional part</em>), <strong>Strings</strong> (which consist of <em>letters, punctuation etc.</em>), <strong>Dates</strong> and <strong>Booleans</strong> (i.e. <em>True</em> or <em>False</em>).</p>
<p>Let’s take the first record of the Titanic Dataset.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:474/1*EfcMrMBCc4YQhBXR2C-jvA.png" alt /></p>
<p>We declare variables to store the above data as follows:</p>
<pre><code class="lang-plaintext">PassengerId= 1
Survived= 0
Pclass= 3
Name= 'Braund, Mr. Owen Harris'
Sex= 'male'
Age= 22
SibSp= 1
Parch= 0
Ticket= 'A/5 21171'
Fare= 7.25
Cabin= ''
Embarked= 'S'
</code></pre>
<p><img src="https://miro.medium.com/v2/resize:fit:509/1*lk89HCAXM9z07I0ao6LEtA.png" alt /></p>
<p>Altogether there are 12 variables in the code above, holding data of several different types.</p>
<p>Colab displays data values in red and green: green represents Numbers (Integers or Floats) and red represents Strings. Strings must be enclosed in a pair of single (<code>''</code>) or double quotes (<code>""</code>). Strings can be empty, e.g. <code>Cabin</code>, which contains a pair of quotes with nothing between them.</p>
<p>We print the variable values using the <code>print()</code> function.</p>
<pre><code class="lang-plaintext">print (PassengerId)
print (Survived)
print (Pclass)
print (Name)
print (Sex)
print (Age)
print (SibSp)
print (Parch)
print (Ticket)
print (Fare)
print (Cabin)
print (Embarked)
</code></pre>
<p>Output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*33C5Z9m_yCoL2gN4qVSlfQ.png" alt /></p>
<p>We can also print the variable data types using the <code>type()</code> function.</p>
<pre><code class="lang-plaintext">print (type(PassengerId))
print (type(Survived))
print (type(Pclass))
print (type(Name))
print (type(Sex))
print (type(Age))
print (type(SibSp))
print (type(Parch))
print (type(Ticket))
print (type(Fare))
print (type(Cabin))
print (type(Embarked))
</code></pre>
<p>Output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:874/1*ONWccv3kdJyON-DmDONTTw.png" alt /></p>
<p>Identifying the data types that will be used in processing is important because each data type supports a different set of operations.</p>
<p>Sometimes, certain data values may need to be converted into another data type prior to processing to make them more meaningful.</p>
<p>For example, the <code>Survived</code> variable in the Titanic dataset contains only 1 or 0: 1 means “survived” and 0 means otherwise. We can convert this value into a Boolean data type (i.e. True or False); True means “survived” and False means otherwise. This is more meaningful than the numbers 1 or 0.</p>
<pre><code class="lang-plaintext"># Redeclare Survive. Convert 0 to False
Survived= False
print (Survived)
print (type(Survived))
</code></pre>
<p>Output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*VDZQUn_lIQB3XrBuZiR18w.png" alt /></p>
<p>Another example is the <code>Pclass</code> variable (which represents the passenger class) that contains either 1, 2 or 3. These values are labels and are not meant for numeric calculations. They can be declared as Strings by enclosing the number in quotes.</p>
<pre><code class="lang-plaintext"># Redeclare PClass. Convert Integer 3 to String '3'
Pclass = '3'
print(Pclass)
print(type(Pclass))
</code></pre>
<p>Output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*GL3wNMTUNEe23XUWxXi_HQ.png" alt /></p>
<p>We have seen in the above example that variables can be reassigned with new values. Each time a new value is given to a variable, its content changes, and so may its data type. Be careful with this, as it may affect data processing results at a later stage.</p>
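<p>A minimal sketch of this behaviour:</p>
<pre><code class="lang-plaintext"># a variable's type follows its current value
x = 3            # integer
print(type(x))   # &lt;class 'int'&gt;
x = '3'          # reassigned with a string
print(type(x))   # &lt;class 'str'&gt;
x = 3.0          # reassigned with a float
print(type(x))   # &lt;class 'float'&gt;
</code></pre>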
<h2 id="heading-3-collection-data-types"><strong>[3] Collection Data Types</strong></h2>
<p>The above example demonstrates only one record out of the total of 891 records in the Titanic dataset.</p>
<p>Let’s look at the first 5 records.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*lzkBrqyOT4QGGo4S1itEuw.png" alt /></p>
<p>To store this kind of data, we need a collection data type. In Python, this is called a List.</p>
<p>The first record can be declared as a list as follows:</p>
<pre><code class="lang-plaintext">record=[1,0,3,'Braund, Mr. Owen Harris','male',22,1,0,'A/5 21171',7.25,'','S']
print(record)
print(type(record))
</code></pre>
<p>Output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:845/1*nxDxHODkQmljiOq8FocJvw.png" alt /></p>
<p>To store five records, we declare 5 lists, enclose them in another pair of brackets, and separate them with commas as follows:</p>
<pre><code class="lang-plaintext">list_record=[
    [1,0,3,'Braund, Mr. Owen Harris','male',22,1,0,'A/5 21171',7.25,'','S'],
    [2,1,1,'Cumings, Mrs. John Bradley (Florence Briggs Thayer)','female',38,1,0,'PC 17599',71.2833,'C85','C'],
    [3,1,3,'Heikkinen, Miss. Laina','female',26,0,0,'STON/O2. 3101282',7.925,'','S'],
    [4,1,1,'Futrelle, Mrs. Jacques Heath (Lily May Peel)','female',35,1,0,'113803',53.1,'C123','S'],
    [5,0,3,'Allen, Mr. William Henry','male',35,0,0,'373450',8.05,'','S']
]
print(list_record)
print(type(list_record))
print(len(list_record))
</code></pre>
<p>Output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*sy4ktFuCMl6gbkX3wx9m5w.png" alt /></p>
<p>Use the len() function to get the count of all records.</p>
<p>We can also store the values column-wise, one list per column. For example:</p>
<pre><code class="lang-plaintext">list_passenger_id=[1,2,3,4,5]

list_survived=[0,1,1,1,0]

list_sex =['male','female','female','female','male']
</code></pre>
<p>In this way, the first item in each list represents the first record in the Titanic dataset.</p>
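<p>To see that the two layouts hold the same information, we can rebuild row-wise records from the column-wise lists with <code>zip()</code>:</p>
<pre><code class="lang-plaintext">list_passenger_id = [1, 2, 3, 4, 5]
list_survived = [0, 1, 1, 1, 0]
list_sex = ['male', 'female', 'female', 'female', 'male']
# zip pairs the i-th item of each column list to form the i-th record
rows = list(zip(list_passenger_id, list_survived, list_sex))
print(rows[0])  # (1, 0, 'male'), the first record
</code></pre>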
<p>Programming is actually a creative abstraction of real world problems :-).</p>
<p>Besides List, there are several other collection data types such as Tuple, Set and Dictionary. (Read more about them here → <a target="_blank" href="https://mohamad.razzi.my/2022/01/organize-data-using-list-tuple-set-and.html">Organize Data Using List, Tuple, Set and Dictionary</a>)</p>
<p>List data type is very useful for managing datasets.</p>
<p>Python comes with even more powerful packages for managing datasets, such as the Python Data Analysis (Pandas) library. Pandas saves a lot of time and effort in data manipulation work. Let’s have a look at it.</p>
<h2 id="heading-4-pandas-dataframe"><strong>[4] Pandas DataFrame</strong></h2>
<p>Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive (<a target="_blank" href="https://pandas.pydata.org/docs/getting_started/overview.html">PyData.org</a>).</p>
<p>Pandas organizes datasets in table-like structures: a 1-dimensional Series and a 2-dimensional DataFrame.</p>
<p>Since this is an additional package, we need to import it first:</p>
<pre><code class="lang-plaintext">import pandas as pd
</code></pre>
<p>To create a Series, declare as follows:</p>
<pre><code class="lang-plaintext"># create a series from list_survived
ds_survived = pd.Series(list_survived)
print(ds_survived.info())
ds_survived
</code></pre>
<p>Output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*1eMWWozgWNCSPQ9c8_2Vzg.png" alt /></p>
<p>To create a DataFrame, declare as follows:</p>
<pre><code class="lang-plaintext"># create a dataframe from list_record
df_record=pd.DataFrame(list_record)
print(df_record.info())
df_record
</code></pre>
<p>Output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*-zWgR7S9fKAeEyhH50hnhg.png" alt /></p>
<p>The list has been converted into a 2-dimensional table known as a DataFrame.</p>
<p>With DataFrames, many kinds of data manipulation tasks become much easier and more efficient.</p>
<p>Next, let’s rename the columns:</p>
<pre><code class="lang-plaintext">df_record.columns = ['PassengerId','Survived','Pclass','Name','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin','Embarked']
df_record
</code></pre>
<p>Output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*Y3BzabYeASl-dWmxy2VDFw.png" alt /></p>
<h2 id="heading-5-importing-datasets"><strong>[5] Importing Datasets</strong></h2>
<p>Instead of manual copy-paste jobs, we can automatically fetch data from Internet sources into Pandas dataframe.</p>
<p>For example, we can fetch the Titanic data set from <a target="_blank" href="https://archive.org/download/misc-dataset/titanic.csv">https://archive.org/download/misc-dataset/titanic.csv</a> as follows:</p>
<pre><code class="lang-plaintext">import pandas as pd
pd.set_option('display.max_colwidth', None)
file_url='https://archive.org/download/misc-dataset/titanic.csv'
df_orig = pd.read_csv(file_url,encoding='utf-8')
print(df_orig.info())
df_orig.head()
</code></pre>
<p>Output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*1p2oWWex3f-Jek4VKoocEA.png" alt /></p>
<p>We are done with basic dataset management. Next, we will perform some transformation tasks to make the dataset more efficient for data processing.</p>
<h2 id="heading-colab-notebook"><strong>Colab Notebook:</strong></h2>
<p><a target="_blank" href="https://colab.research.google.com/drive/17ca-pf36XMSKvBMjAnZTXJSXmn_vrba5?source=post_page-----4c21b8c743bb---------------------------------------"><strong>Python Fundamentals For Citizen Data Scientist 1 (Google Colab)</strong></a></p>
<p>🤓</p>
<p>❖❖❖❖❖❖❖❖❖❖</p>
<h2 id="heading-python-fundamentals-for-citizen-data-scientist-series"><strong>Python Fundamentals For Citizen Data Scientist Series</strong></h2>
<p>This article is a part of a series:</p>
<ol>
<li><p><a target="_blank" href="https://medium.com/p/4c21b8c743bb">Managing Datasets</a></p>
</li>
<li><p><a target="_blank" href="https://medium.com/p/493c6040a09d">Data Transformation</a></p>
</li>
<li><p><a target="_blank" href="https://medium.com/p/99bf23393ac1">Descriptive Analysis</a></p>
</li>
<li><p><a target="_blank" href="https://medium.com/p/9d3778116861">Descriptive Analysis Visualization</a></p>
</li>
<li><p><a target="_blank" href="https://medium.com/p/e4c92ad59ce5">Skewness</a></p>
</li>
<li><p><a target="_blank" href="https://medium.com/p/48495423fdd4">Regression</a></p>
</li>
<li><p><a target="_blank" href="https://medium.com/p/350881e37c6">Classification</a></p>
</li>
</ol>
<p>❖❖❖❖❖❖❖❖❖❖</p>
]]></content:encoded></item><item><title><![CDATA[Editing MS Access Database Model In Excel Power Pivot]]></title><description><![CDATA[[1] Download access database and view the content using excel application:
https://archive.org/download/oltp-olap/Financial_Sample_OLAP.accdb
[2] Create a new Excel blank worksheet
[3] Go to Data tab.
Select Get Data>From Database>From Microsoft Acce...]]></description><link>https://labworks.razzi.my/editing-ms-access-database-model-in-excel-power-pivot</link><guid isPermaLink="true">https://labworks.razzi.my/editing-ms-access-database-model-in-excel-power-pivot</guid><category><![CDATA[Excel data modeling for business analytics]]></category><category><![CDATA[Data Modeling in Excel]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Mon, 16 Feb 2026 01:07:59 GMT</pubDate><content:encoded><![CDATA[<p>[1] Download access database and view the content using excel application:</p>
<p><a target="_blank" href="https://archive.org/download/oltp-olap/Financial_Sample_OLAP.accdb">https://archive.org/download/oltp-olap/Financial_Sample_OLAP.accdb</a></p>
<p>[2] Create a new blank Excel workbook</p>
<p>[3] Go to the Data tab.</p>
<p>Select Get Data&gt;From Database&gt;From Microsoft Access Database</p>
<p>[4] In the Navigator window, select multiple tables.</p>
<p>[5] In the Data tab, under the Data Tools section, select Manage Data Model  </p>
<p>A Power Pivot for Excel window will display the tables.<br />Click Diagram View button.</p>
<p>The tables will be displayed in diagram form.</p>
<p>[6] Link the tables by dragging the key fields from one table to the corresponding fields in the matching tables.  </p>
<p><a target="_blank" href="https://archive.org/download/oltp-olap/Financial_Sample_OLAP.xlsx">https://archive.org/download/oltp-olap/Financial_Sample_OLAP.xlsx</a></p>
<p><a target="_blank" href="https://archive.org/download/oltp-olap/Financial_Sample_OLTP.xlsx">https://archive.org/download/oltp-olap/Financial_Sample_OLTP.xlsx</a></p>
<p><a target="_blank" href="https://archive.org/download/oltp-olap/campus_documentdb.json">https://archive.org/download/oltp-olap/campus_documentdb.json</a> </p>
<p><a target="_blank" href="https://archive.org/download/oltp-olap/campus_columndb.json">https://archive.org/download/oltp-olap/campus_columndb.json</a></p>
<p><a target="_blank" href="https://archive.org/download/oltp-olap/campus_graphdb.json">https://archive.org/download/oltp-olap/campus_graphdb.json</a></p>
]]></content:encoded></item><item><title><![CDATA[An efficient approach for textual data classification using deep learning]]></title><description><![CDATA[Abstract:Text categorization is an effective activity that can be accomplished using a variety of classification algorithms. In machine learning, the classifier is built by learning the features of categories from a set of preset training data. Simil...]]></description><link>https://labworks.razzi.my/an-efficient-approach-for-textual-data-classification-using-deep-learning</link><guid isPermaLink="true">https://labworks.razzi.my/an-efficient-approach-for-textual-data-classification-using-deep-learning</guid><category><![CDATA[labworks]]></category><category><![CDATA[lexical-analysis]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Thu, 15 Sep 2022 04:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1733565588304/2bf9b0c4-9cd0-46ca-bd57-fffa103a2a71.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>Abstract:<br />Text categorization is an effective activity that can be accomplished using a variety of classification algorithms. In machine learning, the classifier is built by learning the features of categories from a set of preset training data. Similarly, deep learning offers enormous benefits for text classification since they execute highly accurately with lower-level engineering and processing. This paper employs machine and deep learning techniques to classify textual data. Textual data contains much useless information that must be pre-processed. We clean the data, impute missing values, and eliminate the repeated columns. Next, we employ machine learning algorithms: logistic regression, random forest, K-nearest neighbors (KNN), and deep learning algorithms: long short-term memory (LSTM), artificial neural network (ANN), and gated recurrent unit (GRU) for classification. Results reveal that LSTM achieves 92% accuracy outperforming all other models and baseline studies.</p>
</blockquote>
<p>(Abdullah Alqahtani, H. Khan, Shtwai Alsubai, Mohemmed Sha, Ahmad S. Almadhor, Tayyab Iqbal, Sidra Abbas)</p>
<p><a target="_blank" href="https://www.frontiersin.org/journals/computational-neuroscience/articles/10.3389/fncom.2022.992296/full">https://www.frontiersin.org/journals/computational-neuroscience/articles/10.3389/fncom.2022.992296/full</a></p>
<p><a target="_blank" href="https://www.semanticscholar.org/paper/An-efficient-approach-for-textual-data-using-deep-Alqahtani-Khan/535455a9c44c5e783da02b49299069f6a225d647">https://www.semanticscholar.org/paper/An-efficient-approach-for-textual-data-using-deep-Alqahtani-Khan/535455a9c44c5e783da02b49299069f6a225d647</a></p>
<hr />
<h1 id="heading-discussion">Discussion:</h1>
<p>The paper <em>"An efficient approach for textual data classification using deep learning"</em> brings attention to the potential of machine and deep learning models, such as LSTM and GRU, for text classification tasks. Interestingly, the authors use the Titanic dataset for their experiments, which primarily contains structured data and limited text fields. This choice raises intriguing questions about how text-focused models can be adapted for datasets that are not traditionally text-heavy. Could this approach point to new ways of extracting or representing textual features from structured data? Or does it highlight the importance of selecting datasets that align more closely with a study's goals? This opens up a larger conversation about balancing creativity in research with ensuring methodological alignment, inviting us to reflect on how we choose and use datasets in machine learning studies.</p>
<h1 id="heading-lab-works">Lab Works:</h1>
<p><img src="https://www.frontiersin.org/files/Articles/992296/fncom-16-992296-HTML/image_m/fncom-16-992296-t002.jpg" alt="www.frontiersin.org" /></p>
<pre><code class="lang-plaintext">1. V ← LE(data)  {Label Encoding}  
2. μ ← (1/m) * Σ(i=1 to m) X^(i)  {Normalizing data}  
3. X ← X - μ  
4. σ² ← (1/m) * Σ(i=1 to m) (X^(i))²  
5. X ← X / σ²  
6. D2 ← np.array(Df)  {Convergence of Matrix}  
7. for l in range(1, len(L))  {Weight Initialization}  
   1. W[l] ← rand((m × n)) * √(2 / n[l-1])  
8. end for  
9. V ← MaxPooling(F)  {Conversion of Vector}  
10. lstm ← LSTM(V)  {LSTM layer}  
11. f_lstm ← Hidden(lstm)  {Hidden layer}  
12. PC ← PredictClass(f_lstm)  {Dense layer}  
13. for i in range(1, len(PC)) do  
    1. if PC[i] == y_test[i] then  
       1. return PC[i]  
    2. else  
       1. return y_test[i]  
    3. end if  
14. end for  
15. return Output
</code></pre>
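<p>Steps 1–5 of the listing (label encoding and normalization) can be sketched in plain NumPy. This is our own minimal reading, not the authors' code; note that the listing divides by σ² rather than σ, and the sketch follows the listing as written:</p>
<pre><code class="lang-plaintext">import numpy as np

def label_encode(labels):
    # Step 1: map each distinct label to an integer (a simple LE stand-in).
    lut = {c: i for i, c in enumerate(sorted(set(labels)))}
    return np.array([lut[c] for c in labels])

def normalize(X):
    # Steps 2-5: mean-center, then divide by the per-feature variance
    # (the listing divides by sigma squared, not sigma).
    X = X - X.mean(axis=0)
    return X / (X ** 2).mean(axis=0)

y = label_encode(['spam', 'ham', 'spam'])                         # [1, 0, 1]
Xn = normalize(np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]))
</code></pre>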
<p><a target="_blank" href="https://colab.research.google.com/drive/1NLegXg7WTNgajyU1XgX1g-Qpe9r6Fwqp">https://colab.research.google.com/drive/1NLegXg7WTNgajyU1XgX1g-Qpe9r6Fwqp</a></p>
]]></content:encoded></item><item><title><![CDATA[N-gram-based text categorization]]></title><description><![CDATA[Text categorization is a fundamental task in document processing, allowing the automated handling of enormous streams of documents in electronic form. One difficulty in handling some classes of documents is the presence of different kinds of textual ...]]></description><link>https://labworks.razzi.my/n-gram-based-text-categorization</link><guid isPermaLink="true">https://labworks.razzi.my/n-gram-based-text-categorization</guid><category><![CDATA[labworks]]></category><category><![CDATA[lexical-analysis]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Sat, 31 Dec 1994 04:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1733731683839/060f4821-12fe-4501-95aa-3574d3d13748.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>Text categorization is a fundamental task in document processing, allowing the automated handling of enormous streams of documents in electronic form. One difficulty in handling some classes of documents is the presence of different kinds of textual errors, such as spelling and grammatical errors in email, and character recognition errors in documents that come through OCR. Text categorization must work reliably on all input, and thus must tolerate some level of these kinds of problems.  </p>
<p>We describe here an N-gram-based approach to text categorization that is tolerant of textual errors. The system is small, fast and robust. This system worked very well for language classification, achieving in one test a 99.8% correct classification rate on Usenet newsgroup articles written in different languages. The system also worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject, achieving as high as an 80% correct classification rate. There are also several obvious directions for improving the system`s classification performance in those cases where it did not do as well.  </p>
<p>The system is based on calculating and comparing profiles of N-gram frequencies. First, we use the system to compute profiles on training set data that represent the various categories, e.g., language samples or newsgroup content samples. Then the system computes a profile for a particular document that is to be classified. Finally, the system computes a distance measure between the document`s profile and each of the category profiles. The system selects the category whose profile has the smallest distance to the document`s profile. The profiles involved are quite small, typically 10K bytes for a category training set, and less than 4K bytes for an individual document.  </p>
<p>Using N-gram frequency profiles provides a simple and reliable way to categorize documents in a wide range of classification tasks.</p>
</blockquote>
<p>(W. B. Cavnar, J. Trenkle)</p>
<p><a target="_blank" href="https://sdmines.sdsmt.edu/upload/directory/materials/12247_20070403135416.pdf">https://sdmines.sdsmt.edu/upload/directory/materials/12247_20070403135416.pdf</a></p>
<p><a target="_blank" href="https://www.semanticscholar.org/paper/N-gram-based-text-categorization-Cavnar-Trenkle/49af572ef8f7ea89db06d5e7b66e9369c22d7607">https://www.semanticscholar.org/paper/N-gram-based-text-categorization-Cavnar-Trenkle/49af572ef8f7ea89db06d5e7b66e9369c22d7607</a></p>
<hr />
<h3 id="heading-methodology"><strong>Methodology</strong></h3>
<ol>
<li><p><strong>N-Gram Generation</strong>:</p>
<ul>
<li><p>The algorithm extracts all possible contiguous sequences of n characters (n-grams) from a text.</p>
</li>
<li><p>These n-grams are then ranked based on their frequency of occurrence.</p>
</li>
</ul>
</li>
<li><p><strong>Profile Construction</strong>:</p>
<ul>
<li><p>Each document or category is represented as a profile containing the most frequent n-grams (e.g., top 300).</p>
</li>
<li><p>A similar profile is generated for the text being classified.</p>
</li>
</ul>
</li>
<li><p><strong>Similarity Comparison</strong>:</p>
<ul>
<li><p>The categorization task involves comparing the text's n-gram profile against the profiles of known categories.</p>
</li>
<li><p>The similarity metric used is the <strong>rank-order distance</strong>, which measures how closely the n-gram frequencies in the input text align with those in the category profiles.</p>
</li>
</ul>
</li>
<li><p><strong>Language Identification</strong>:</p>
<ul>
<li>The paper tested the method extensively for language identification, demonstrating its capability to distinguish between languages with high accuracy.</li>
</ul>
</li>
</ol>
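<p>The steps above can be sketched in a few lines of Python. This is a minimal reading of the method, not the authors' code; the profile size and the fixed out-of-place penalty are simplifications, and the toy "language profiles" below are single sentences:</p>
<pre><code class="lang-plaintext">from collections import Counter

def profile(text, n=3, top=300):
    # Rank the most frequent character n-grams of the text.
    text = ' ' + text.lower() + ' '
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def rank_order_distance(doc_profile, cat_profile):
    # Sum of rank displacements; n-grams missing from the category
    # profile receive a fixed maximum penalty.
    cat_rank = {g: r for r, g in enumerate(cat_profile)}
    penalty = len(cat_profile)
    return sum(abs(r - cat_rank[g]) if g in cat_rank else penalty
               for r, g in enumerate(doc_profile))

def classify(text, category_profiles):
    # Pick the category whose profile is nearest to the document profile.
    doc = profile(text)
    return min(category_profiles,
               key=lambda c: rank_order_distance(doc, category_profiles[c]))

profiles = {'en': profile('the cat sat on the mat and the dog ran'),
            'es': profile('el gato se sienta en la alfombra y el perro')}
print(classify('the dog and the cat', profiles))  # en
</code></pre>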
<hr />
<h2 id="heading-discussions">Discussions:</h2>
<p>While the character n-gram approach of Cavnar and Trenkle is effective and simple, it raises interesting questions about its broader use. For example, how well does it handle texts where different categories share similar patterns, or cases where semantic meaning matters?</p>
<hr />
<h2 id="heading-lab-works">Lab Works:</h2>
<p><a target="_blank" href="https://colab.research.google.com/drive/1ciPDoOmyI6tgOEt17PQczsjGaPc9uMxd">https://colab.research.google.com/drive/1ciPDoOmyI6tgOEt17PQczsjGaPc9uMxd</a></p>
]]></content:encoded></item><item><title><![CDATA[Word Association Norms, Mutual Information, and Lexicography]]></title><description><![CDATA[Abstract:
The term word association is used in a very particular sense in the psycholinguistic literature. (Generally speaking, subjects respond quicker than normal to the word nurse if it follows a highly associated word such as doctor. ) We will ex...]]></description><link>https://labworks.razzi.my/word-association-norms-mutual-information-and-lexicography</link><guid isPermaLink="true">https://labworks.razzi.my/word-association-norms-mutual-information-and-lexicography</guid><category><![CDATA[labworks]]></category><category><![CDATA[lexical-analysis]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Mon, 26 Jun 1989 04:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1733468709492/ce44bbae-1bd5-4530-a8d7-bf4718afe82c.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>Abstract:</p>
<p>The term word association is used in a very particular sense in the psycholinguistic literature. (Generally speaking, subjects respond quicker than normal to the word nurse if it follows a highly associated word such as doctor. ) We will extend the term to provide the basis for a statistical description of a variety of interesting linguistic phenomena, ranging from semantic relations of the doctor/nurse type (content word/content word) to lexico-syntactic co-occurrence constraints between verbs and prepositions (content word/function word). This paper will propose an objective measure based on the information theoretic notion of mutual information, for estimating word association norms from computer readable corpora. (The standard method of obtaining word association norms, testing a few thousand subjects on a few hundred words, is both costly and unreliable.) The proposed measure, the association ratio, estimates word association norms directly from computer readable corpora, making it possible to estimate norms for tens of thousands of words.</p>
<p>(Kenneth Ward Church, Patrick Hanks)</p>
</blockquote>
<p><a target="_blank" href="https://www.semanticscholar.org/paper/Word-Association-Norms%2C-Mutual-Information%2C-and-Church-Hanks">https://www.semanticscholar.org/paper/Word-Association-Norms%2C-Mutual-Information%2C-and-Church-Hanks</a></p>
<p><a target="_blank" href="https://aclanthology.org/P89-1010.pdf">https://aclanthology.org/P89-1010.pdf</a></p>
<h3 id="heading-key-concepts">Key Concepts</h3>
<ol>
<li><p><strong>Mutual Information</strong>:</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733407233022/bc7df284-1676-48e1-b763-f46d4cb408ee.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p>MI compares the <strong>joint probability</strong> of observing two events (or words) together, P(x,y), with the probability of observing them independently, P(x)P(y).</p>
</li>
<li><p>If P(x,y) is significantly larger than P(x)P(y), it indicates a strong association, resulting in I(x,y)&gt;0.</p>
</li>
<li><p>Conversely, if P(x,y) is similar to P(x)P(y), then I(x,y)≈0, suggesting no significant relationship.</p>
</li>
<li><p>If x and y are in <strong>complementary distribution</strong>, they do not occur together. This means that P(x,y) is very low or approaches zero.</p>
</li>
</ul>
</li>
<li><p><strong>Estimation of Probabilities</strong>:</p>
<ul>
<li><p>The probabilities P(x) and P(y) are estimated by counting occurrences in a corpus, denoted as f(x) and f(y), and normalizing by the total corpus size N.</p>
</li>
<li><p>Joint probabilities P(x,y) are estimated by counting how often x is followed by y within a specified window size w (e.g., 5 words).</p>
</li>
</ul>
</li>
<li><p><strong>Window Size</strong>:</p>
<ul>
<li><p>The choice of window size affects the type of relationships captured:</p>
<ul>
<li><p><strong>Smaller windows</strong> identify fixed expressions (like idioms).</p>
</li>
<li><p><strong>Larger windows</strong> capture broader semantic relationships.</p>
</li>
</ul>
</li>
<li><p>A window size of <strong>5 words</strong> is chosen as a compromise: wide enough to capture meaningful relationships, narrow enough to preserve contextual adjacency.</p>
</li>
</ul>
</li>
<li><p><strong>Count Threshold</strong>:</p>
<ul>
<li>The authors set a threshold, avoiding pairs with very small counts (e.g., f(x,y)&lt;5), to maintain stability in the association ratio. This avoids unreliable estimates that can arise from low counts.</li>
</ul>
</li>
<li><p><strong>Symmetry in Probabilities</strong>:</p>
<ul>
<li><p>MI is symmetric (P(x,y)=P(y,x)), meaning the relationship holds regardless of the order of the words.</p>
</li>
<li><p>The association ratio, however, is not symmetric because it captures linear precedence (the order of appearance). This asymmetry can reveal interesting biases or relationships in data, such as syntactic patterns or sociolinguistic trends.</p>
</li>
</ul>
</li>
</ol>
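<p>The association ratio described above can be estimated from a toy corpus in a few lines of Python. This is a minimal sketch with variable names of our own choosing; a realistic estimate requires a large corpus and the low-count threshold discussed above:</p>
<pre><code class="lang-plaintext">import math
from collections import Counter

def association_ratio(tokens, w=5):
    # I(x, y) = log2( f(x,y) * N / (f(x) * f(y)) ), counting y within a
    # window of w words after x, so the measure is order-sensitive.
    N = len(tokens)
    f = Counter(tokens)
    fxy = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1:i + 1 + w]:
            fxy[(x, y)] += 1
    return {(x, y): math.log2(c * N / (f[x] * f[y]))
            for (x, y), c in fxy.items()}

tokens = ['doctor', 'nurse', 'a', 'b', 'doctor', 'nurse', 'c', 'd']
scores = association_ratio(tokens, w=2)
print(scores[('doctor', 'nurse')])  # 2.0
</code></pre>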
<h1 id="heading-lab-works">Lab Works:</h1>
<p><a target="_blank" href="https://colab.research.google.com/drive/1f5yfmhAocDZ9bHeg1QKXY_086dEHIVTi">https://colab.research.google.com/drive/1f5yfmhAocDZ9bHeg1QKXY_086dEHIVTi</a></p>
]]></content:encoded></item><item><title><![CDATA[Frequency Analysis of English Usage: Lexicon and Grammar]]></title><description><![CDATA[Abstract:
This volume presents the results of a lexical and grammatical analysis of a one-million-word corpus of present-day American English, originally assembled at Brown University in 1963-64 and thus commonly referred to by researchers interested...]]></description><link>https://labworks.razzi.my/frequency-analysis-of-english-usage-lexicon-and-grammar</link><guid isPermaLink="true">https://labworks.razzi.my/frequency-analysis-of-english-usage-lexicon-and-grammar</guid><category><![CDATA[labworks]]></category><category><![CDATA[lexical-analysis]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Fri, 01 Jan 1982 04:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1733553954395/6405a75a-1ab4-4acf-8eca-84c5e3179120.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>Abstract:</p>
<p>This volume presents the results of a lexical and grammatical analysis of a one-million-word corpus of present-day American English, originally assembled at Brown University in 1963-64 and thus commonly referred to by researchers interested in text analysis as the Brown Corpus. The Brown Corpus, which was compiled with the view of making it broadly representative of current edited American English, contains selections from five hundred samples belonging to fifteen different genres of writing. The genres range from newspaper reportage to technical writing, and from philosophical essays to various kinds of fiction.</p>
</blockquote>
<p>(W. Nelson Francis (Author), Henry Kucera (Author), Andrew W. Mackie (Author))</p>
<p><a target="_blank" href="https://www.amazon.com/FREQUENCY-ANALYSIS-ENGLISH-USAGE-LEXICON/dp/0395322502">https://www.amazon.com/FREQUENCY-ANALYSIS-ENGLISH-USAGE-LEXICON/dp/0395322502</a></p>
<p><a target="_blank" href="https://journals.sagepub.com/doi/abs/10.1177/007542428501800107">https://journals.sagepub.com/doi/abs/10.1177/007542428501800107</a></p>
<hr />
<p>The introduction of the paper "Frequency Analysis of English Usage: Lexicon and Grammar" sets the stage for a comprehensive exploration of how frequency data can illuminate patterns in English language usage.</p>
<ul>
<li><p><strong>Purpose of the Study</strong>: The paper aims to analyze the frequency of words and grammatical structures in English, providing insights into their usage in various contexts. This analysis is crucial for understanding language patterns and can inform both linguistic theory and practical applications in language education and computational linguistics.</p>
</li>
<li><p><strong>Importance of Frequency Analysis</strong>: The introduction emphasizes that frequency analysis is a valuable tool for linguists. It allows researchers to identify which words and grammatical forms are most commonly used, thereby revealing trends in language evolution and usage. This can help in distinguishing between standard and non-standard forms of English.</p>
</li>
<li><p><strong>Methodological Framework</strong>: The authors outline the methodological approach they will employ, which includes the collection of large corpora of English text. By analyzing these corpora, the study seeks to quantify the frequency of various lexical items and grammatical constructions, providing a robust statistical basis for their findings.</p>
</li>
<li><p><strong>Relevance to Language Learning</strong>: The introduction also touches on the implications of frequency analysis for language teaching. Understanding which words and structures are most frequently used can guide educators in developing curricula that prioritize these elements, thus enhancing language acquisition for learners.</p>
</li>
<li><p><strong>Contribution to Linguistic Research</strong>: Finally, the authors position their work within the broader field of linguistic research, suggesting that their findings will contribute to ongoing discussions about language use, change, and the relationship between lexicon and grammar. This positions the study as a significant addition to the existing literature on English linguistics.</p>
</li>
</ul>
<p>In summary, the introduction effectively outlines the study's objectives, significance, and methodological approach, setting a clear framework for the subsequent analysis presented in the paper.</p>
<hr />
<h1 id="heading-lab-works">Lab Works</h1>
<p><a target="_blank" href="https://colab.research.google.com/drive/1e670xj2jk-NUFQW8K3VKRUeTCRxNNpfM">https://colab.research.google.com/drive/1e670xj2jk-NUFQW8K3VKRUeTCRxNNpfM</a></p>
]]></content:encoded></item></channel></rss>