<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Lab Works]]></title><description><![CDATA[Lab Works]]></description><link>https://labworks.razzi.my</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1733468829434/dbc31e00-733c-4aa7-b141-23953a7785b8.jpeg</url><title>Lab Works</title><link>https://labworks.razzi.my</link></image><generator>RSS for Node</generator><lastBuildDate>Thu, 16 Apr 2026 08:45:18 GMT</lastBuildDate><atom:link href="https://labworks.razzi.my/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Python Fundamentals For Citizen Data Scientist 2 — Data Transformation]]></title><description><![CDATA[Data transformation is defined as the technical process of converting data from one format, standard, or structure to another — without changing the content of the datasets — to improve the data quality (Spiceworks.com). This is one of the important ...]]></description><link>https://labworks.razzi.my/python-fundamentals-for-citizen-data-scientist-2-data-transformation</link><guid isPermaLink="true">https://labworks.razzi.my/python-fundamentals-for-citizen-data-scientist-2-data-transformation</guid><category><![CDATA[data transformation]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Wed, 18 Feb 2026 03:40:05 GMT</pubDate><content:encoded><![CDATA[<p><img src="https://miro.medium.com/v2/resize:fit:875/0*IvldBDMkyL2xfrDN" alt /></p>
<p>Data transformation is defined as the technical process of converting data from one format, standard, or structure to another — without changing the content of the datasets — to improve the data quality (<a target="_blank" href="https://www.spiceworks.com/tech/big-data/articles/what-is-data-transformation/">Spiceworks.com</a>). This is one of the important tools in statistical analysis (<a target="_blank" href="https://stats.libretexts.org/Bookshelves/Applied_Statistics/Biological_Statistics_(McDonald)/04%3A_Tests_for_One_Measurement_Variable/4.06%3A_Data_Transformations">Stats.LibreTexts.org</a>). By transforming raw data into a more analyzable form, it paves the way for data-driven decision making (<a target="_blank" href="https://funnel.io/blog/what-is-data-transformation">Funnel.io</a>).</p>
<p>In the previous article, we saw a sample of Titanic dataset records in card and table form.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:474/0*L7fnfhg_266YnBpF.png" alt /></p>
<p>A Titanic record in card form</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/0*uUGWpb-LN6k2it2m.png" alt /></p>
<p>Titanic Records in table form</p>
<p>We can use Python to transform the data in a number of ways.</p>
<h2 id="heading-0-get-the-dataset"><strong><mark>[0] Get the dataset</mark></strong></h2>
<pre><code class="lang-plaintext">import pandas as pd
# set dataframe max column width option
pd.set_option('display.max_colwidth', None)
# set data source url
file_url='https://archive.org/download/misc-dataset/titanic.csv'
# read data
df_orig = pd.read_csv(file_url,encoding='utf-8')
# print dataframe info
print(df_orig.info())
# print dataframe head (top 5 records)
df_orig.head()
</code></pre>
<p>output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:530/1*h3tA4DIrJQdPa2GqLRAbnw.png" alt /></p>
<p>pandas dataframe information</p>
<p>The pandas dataframe information above tells us that some columns, i.e. <code>Age</code>, <code>Cabin</code> and <code>Embarked</code>, have missing values.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*WDzaNSv5Q6NESM4kz31H9g.png" alt /></p>
<p>pandas dataframe — the first 5 records</p>
<p>The pandas dataframe sample rows above indicate that some columns, i.e. <code>Age</code>, <code>Sex</code> and <code>Embarked</code>, could be converted into a <a target="_blank" href="https://developers.google.com/machine-learning/data-prep/transform/transform-categorical">numerical index</a> for better data processing. For example …</p>
<ul>
<li><p>The Sex values i.e. <code>male</code> or <code>female</code>, could be represented by <code>0 for male</code> and <code>1 for female</code>.</p>
</li>
<li><p>The Embarked values i.e. <code>S (Southampton)</code>, <code>C (Cherbourg)</code> and <code>Q (Queenstown)</code>, could be represented by <code>0 for Southampton</code>, <code>1 for Cherbourg</code> and <code>2 for Queenstown</code>.</p>
</li>
<li><p>The Age values could be represented by Age Group (that differentiates between a child and an adult, assuming that <a target="_blank" href="https://www.reddit.com/r/titanic/comments/15snc0r/what_age_was_a_boy_no_longer_considered_a_child/">the child age is below 13</a>) e.g. <code>0 for Age&lt;13</code> and <code>1 for Age≥13</code>.</p>
</li>
</ul>
<p>Indexed numbers are just for the sake of representing categorical values; you won’t be able to compare these numbers or subtract them from each other (<a target="_blank" href="https://developers.google.com/machine-learning/data-prep/transform/transform-categorical">Developers.Google.Com</a>).</p>
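<p>The mappings listed above can be sketched in pandas as follows. This is only an illustration: the small dataframe below stands in for the Titanic data, and only the column names follow the dataset.</p>
<pre><code class="lang-plaintext"># a sketch of categorical-to-index mapping (sample values, not the real dataset)
import pandas as pd
df = pd.DataFrame({'Sex': ['male', 'female'], 'Embarked': ['S', 'C'], 'Age': [22, 8]})
# map text categories to numerical indexes
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
# derive an age-group flag: 0 for Age below 13, 1 otherwise
df['AgeGroup'] = (df['Age'] &gt;= 13).astype(int)
print(df)
</code></pre>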
<h2 id="heading-1-drop-or-impute-missing-values"><strong><mark>[1] Drop or Impute missing values</mark></strong></h2>
<p>In the example above, only 714 of the 891 records have valid Age values.</p>
<pre><code class="lang-plaintext"># print the record count of missing age values
print(len(df_orig[df_orig['Age'].isna()]))

# print the record containing missing age values
df_orig[df_orig['Age'].isna()]
</code></pre>
<p>output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*xNOH8KTSJ9zIunePIs22YQ.png" alt /></p>
<p>177 records contain missing Age values.</p>
<p>To handle these records, we may drop the records or impute their values (<a target="_blank" href="https://www.datacamp.com/tutorial/techniques-to-handle-missing-data-values">DataCamp.com</a>).</p>
<h3 id="heading-11-drop-the-records"><strong><mark>[1.1] Drop the records</mark></strong></h3>
<p>Filter the original dataframe by dropping records that contain missing Age values.</p>
<pre><code class="lang-plaintext"># filter the original dataframe by dropping records that contain missing Age values
df_filtered = df_orig.dropna(subset=['Age']).copy()
# print dataframe info
df_filtered.info()
# print dataframe head (top 5 records)
df_filtered.head()
</code></pre>
<p>Or, alternatively, apply the filter to the original dataframe itself. Bear in mind that by applying the changes to the original dataframe, we lose data that might be useful at later stages.</p>
<pre><code class="lang-plaintext"># alternatively, apply the filter to the original dataframe itself
# but we will lose the original data
df_orig.dropna(subset=['Age'], inplace=True)
# print dataframe info
print(df_orig.info())
# print dataframe head (top 5 records)
df_orig.head()
</code></pre>
<h3 id="heading-12-impute-the-values"><strong><mark>[1.2] Impute the values</mark></strong></h3>
<p>Use the rounded mean of the Age for the imputed values.</p>
<pre><code class="lang-plaintext"># impute using mean values
# get a rounded mean value for Age
mean_value = df_orig['Age'].mean().round()
print('mean_value:',mean_value)
# create a df copy of df_orig
df_imputed_mean = df_orig.copy()
# impute the Age values for the df copy
df_imputed_mean['Age'] = df_imputed_mean['Age'].fillna(mean_value)
# print df copy info
print(df_imputed_mean.info())
# print selected df copy records for Age equal mean_value 
df_imputed_mean.loc[df_imputed_mean.Age==mean_value]
# we get 202 instead of 177 (177 missing + 25 valid values)
</code></pre>
<p>output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*UkOaycSbYKCeHU3OMRd5NQ.png" alt /></p>
<p>Or, alternatively, use the rounded median. The median can be helpful because it is not sensitive to outliers (<a target="_blank" href="https://www.quanthub.com/how-does-the-size-of-the-dataset-impact-how-sensitive-the-mean-is-to-outliers">QuantHub.com</a>).</p>
<pre><code class="lang-plaintext"># impute using median values
# get a rounded median value for Age
median_value = df_orig['Age'].median().round()
print('median_value:',median_value)
# create a df copy of df_orig
df_imputed_median = df_orig.copy()
# impute the Age values for the df copy
df_imputed_median['Age'] = df_imputed_median['Age'].fillna(median_value)
# print df copy info
print(df_imputed_median.info())
# print selected df copy records for Age equal median_value
df_imputed_median.loc[df_imputed_median.Age==median_value]
# we get 202 instead of 177 (177 missing + 25 valid values)
</code></pre>
<p><mark>output:</mark></p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*qXp3Skxy54K37BQW-bNtMA.png" alt /></p>
<p><mark>Mean and Median are applicable to numeric values only.</mark></p>
<p>For categorical values (e.g. <code>Embarked</code> contains either S, C or Q values), apply the Mode (<a target="_blank" href="https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch11/mode/5214873-eng.htm">StatCan.gc.ca</a>).</p>
<pre><code class="lang-plaintext"># impute Embarked using mode values
# get the mode value for Embarked
embarked_mode_value = df_orig['Embarked'].mode()[0]
print('embarked_mode_value:',embarked_mode_value)
# create a df copy of df_orig
df_imputed_embarked_mode = df_orig.copy()
# impute the Embarked values for the df copy
df_imputed_embarked_mode['Embarked'] = df_imputed_embarked_mode['Embarked'].fillna(embarked_mode_value)
# print df copy info
print(df_imputed_embarked_mode.info())
# print selected df copy records for Embarked equal embarked_mode_value 
df_imputed_embarked_mode.loc[df_imputed_embarked_mode.Embarked==embarked_mode_value]
# the count now also includes the records whose missing Embarked value was filled with the mode
</code></pre>
<p>output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*imFRCTRQOqxq64cnznRUuQ.png" alt /></p>
<h2 id="heading-2-replace-generate-dummies-or-binning-the-values"><strong><mark>[2] Replace, Generate Dummies or Bin the Values</mark></strong></h2>
<p>To use index numbers to represent categorical data values, we may (1) <a target="_blank" href="https://www.statology.org/pandas-sample-with-replacement/">replace</a> them with index numbers or (2) <a target="_blank" href="https://www.statology.org/pandas-get-dummies/">generate dummy</a> values for them (Statology.org).</p>
<p>To group numerical values according to certain specified ranges, we apply a technique called binning (<a target="_blank" href="https://www.scaler.com/topics/binning-in-data-mining/">Scaler.com</a>). This can help to reduce the number of unique values in the feature, which can be beneficial for encoding categorical data.</p>
<h3 id="heading-21-replace"><strong><mark>[2.1] Replace</mark></strong></h3>
<p>Use the replace() function:</p>
<pre><code class="lang-plaintext"># replace the letter codes S, C, Q with 0, 1, 2
# (assigning the result back avoids the deprecated inplace pattern)
df_imputed_embarked_mode['Embarked'] = df_imputed_embarked_mode['Embarked'].replace(['S', 'C','Q'],[0,1,2])
# print df
df_imputed_embarked_mode
</code></pre>
<p>output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*tUtcPBYNKAaUYjxTZRf7Yw.png" alt /></p>
<h3 id="heading-22-generate-dummies"><strong><mark>[2.2] Generate Dummies</mark></strong></h3>
<p>The idea of generating dummies is to create new columns for each category (using them as the column names) and then assigning a value of 1 to the rows that belong to that category. Hence, they are the “dummies” of the original column.</p>
<p>Use get_dummies() function:</p>
<pre><code class="lang-plaintext"># generate dummies for Embarked

df_imputed_embarked_mode_dummies = pd.get_dummies( df_imputed_embarked_mode, columns=['Embarked']).copy()

df_imputed_embarked_mode_dummies
</code></pre>
<p>output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*A-BonEHD7IXdQPJPWI-4qA.png" alt /></p>
<p>Be careful with the “<a target="_blank" href="https://www.statology.org/dummy-variable-trap/">Dummy Variable Trap</a>” (<a target="_blank" href="https://www.statology.org/dummy-variable-trap/">Statology.org</a>), i.e. when the number of dummy variables created is equal to the number of values the categorical variable can take on. This leads to multicollinearity, which causes incorrect calculations of regression coefficients and p-values. Tip: if a variable can take on N different values, create only N-1 dummy variables.</p>
<p>In Python, include a parameter <code>drop_first=True</code> for this purpose.</p>
<p>Example:</p>
<pre><code class="lang-plaintext"># avoiding dummy variable trap, 
# create only 2 dummy variables 
# from 3 different values of Embarked

df_imputed_embarked_mode_dummies = pd.get_dummies( df_imputed_embarked_mode, columns=['Embarked'], drop_first=True).copy()

df_imputed_embarked_mode_dummies
</code></pre>
<p>output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*tP6IV86dJkRFwPUFYs09Tg.png" alt /></p>
<h3 id="heading-23-grouping-data-values-data-binning"><strong><mark>[2.3] Grouping data values (Data Binning)</mark></strong></h3>
<p>In the Titanic dataset, Age is an example of a suitable candidate for data binning.</p>
<p>Use cut() function:</p>
<pre><code class="lang-plaintext"># define labels 0=kid ie 0 to 12 years old, 1=adult ie 13 years old and above
cut_labels = [0,1]
# define cut-off points. 0 is the starting value. 12,200 are the upper limits.
cut_bins = [0,12,200]
df_imputed_median['Adult'] = pd.cut(df_imputed_median['Age'], bins=cut_bins, labels=cut_labels)
# check for ages between 11 to 14
df_imputed_median[(df_imputed_median.Age&gt;10) &amp; (df_imputed_median.Age&lt;15)]
</code></pre>
<p>output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*RzDjnyuQK2CbKm9Hd-5MFQ.png" alt /></p>
<p>(<a target="_blank" href="https://www.statology.org/data-binning-in-python/">Read further on the use of cut and qcut</a>)</p>
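<p>As a quick illustration of the difference (a sketch on made-up values, not the Titanic data): <code>cut()</code> splits the value range into fixed intervals, while <code>qcut()</code> creates equal-frequency bins.</p>
<pre><code class="lang-plaintext">import pandas as pd
ages = pd.Series([2, 8, 15, 22, 30, 41, 55, 70])
# cut: fixed-width ranges (0-12 = kid, 13+ = adult)
fixed = pd.cut(ages, bins=[0, 12, 200], labels=[0, 1])
# qcut: quartiles, each holding roughly the same number of records
quartiles = pd.qcut(ages, q=4, labels=[0, 1, 2, 3])
print(fixed.tolist())     # [0, 0, 1, 1, 1, 1, 1, 1]
print(quartiles.tolist()) # [0, 0, 1, 1, 2, 2, 3, 3]
</code></pre>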
<h2 id="heading-colab-notebook"><strong>Colab Notebook:</strong></h2>
<h2 id="heading-google-colabhttpscolabresearchgooglecomdrive1spcpzskeogaeju8aqxjcsy9jfqnycpfsourcepostpage-493c6040a09d"><a target="_blank" href="https://colab.research.google.com/drive/1sPcpZSkeOGAEJU8AqXjcSy9_JfQNyCPF?source=post_page-----493c6040a09d---------------------------------------"><strong>Google Colab</strong></a></h2>
<h3 id="heading-python-fundamentals-for-citizen-data-scientist-2httpscolabresearchgooglecomdrive1spcpzskeogaeju8aqxjcsy9jfqnycpfsourcepostpage-493c6040a09d"><a target="_blank" href="https://colab.research.google.com/drive/1sPcpZSkeOGAEJU8AqXjcSy9_JfQNyCPF?source=post_page-----493c6040a09d---------------------------------------">Python Fundamentals For Citizen Data Scientist 2</a></h3>
<h2 id="heading-kirwn6stkio"><strong>🤓</strong></h2>
]]></content:encoded></item><item><title><![CDATA[Transforming Data From HTML Tables]]></title><description><![CDATA[Question 1: Find the source of data based on below requirement and fetch them into Power QueryQuestion 2: Find the source “public data” regarding the amount of car sales in the local market.

Question 1 answer : https://web.archive.org/web/2025112111...]]></description><link>https://labworks.razzi.my/transforming-data-from-html-tables</link><guid isPermaLink="true">https://labworks.razzi.my/transforming-data-from-html-tables</guid><category><![CDATA[Power Query]]></category><category><![CDATA[data transformation]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Mon, 16 Feb 2026 07:59:14 GMT</pubDate><content:encoded><![CDATA[<table><tbody><tr><td><p>Question 1: Find the source of data based on below requirement and fetch them into Power Query</p></td></tr><tr><td><p>Question 2: Find the source “public data” regarding the amount of car sales in the local market.</p></td></tr></tbody></table>

<p>Question 1 answer : <a target="_blank" href="https://web.archive.org/web/20251121112439/https://data.gov.my/dashboard/car-popularity">https://web.archive.org/web/20251121112439/https://data.gov.my/dashboard/car-popularity</a>  </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771227836914/c79f9978-d96c-4f1d-80ac-0987160275c3.png" alt class="image--center mx-auto" /></p>
<p>Question 2 answer: <a target="_blank" href="https://www.pcauto.com/my/sales-ranking">https://www.pcauto.com/my/sales-ranking</a></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771228396423/7fd8acde-e4ef-4697-ae85-2e6e2fe69419.png" alt class="image--center mx-auto" /></p>
<p>We managed to get the data, but it is not yet in a usable shape. We need to do some data transformation to prepare it for reporting.</p>
<p>Step 1 — Add Index Column</p>
<p>Go to Add Column</p>
<p>Click Index Column → From 1</p>
<p>Step 2 — Create GroupID</p>
<p>Go to Add Column → Custom Column</p>
<p>Name the column: GroupID</p>
<p>Enter this formula: Number.RoundUp([Index] / 2)</p>
<p>Click OK  </p>
<p>Step 3: Identify Model vs Quantity</p>
<p>Add another custom column:</p>
<p>Add Column → Custom Column</p>
<p>Formula: if Number.Mod([Index], 2) = 1 then "Model" else "Quantity"</p>
<p>Name it: Type  </p>
<p>Step 4: Pivot the Data</p>
<p>Now we reshape the table.</p>
<p>Select the Type column</p>
<p>Go to Transform → Pivot Column</p>
<p>Values column = Column1</p>
<p>Advanced options → Don't Aggregate</p>
<p>Click OK  </p>
<p>Step 5: Fill Down + Remove duplicates</p>
<p>Remove all columns other than Model and Quantity.</p>
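<p>For completeness, the same five reshaping steps can be sketched in Python with pandas; the model names and numbers below are made-up sample values, not the scraped data.</p>
<pre><code class="lang-plaintext">import pandas as pd
# scraped data arrives as a single column alternating model name and quantity
raw = pd.DataFrame({'Column1': ['Model A', '3500', 'Model B', '2100']})
raw['Index'] = range(1, len(raw) + 1)        # Step 1: add an index column from 1
raw['GroupID'] = (raw['Index'] + 1) // 2     # Step 2: Number.RoundUp([Index] / 2)
raw['Type'] = ['Model' if i % 2 == 1 else 'Quantity'
               for i in raw['Index']]        # Step 3: identify model vs quantity
tidy = raw.pivot(index='GroupID', columns='Type',
                 values='Column1')           # Step 4: pivot without aggregation
tidy = tidy[['Model', 'Quantity']].reset_index(drop=True)  # Step 5: keep Model and Quantity
print(tidy)
</code></pre>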
<p>Download example:</p>
<p><a target="_blank" href="https://archive.org/download/analytica/web-scrape-carmodel-quantity.xlsx">https://archive.org/download/analytica/web-scrape-carmodel-quantity.xlsx</a></p>
<p>You can view the applied steps in the downloaded example workbook.</p>
]]></content:encoded></item><item><title><![CDATA[Python Fundamentals For Citizen Data Scientist 1 — Managing Datasets]]></title><description><![CDATA[A citizen data scientist is a person who creates or generates models that leverage predictive or prescriptive analytics, but whose primary job function is outside of the field of statistics and analytics (Gartner).
To become a data scientist, one nee...]]></description><link>https://labworks.razzi.my/python-fundamentals-for-citizen-data-scientist-1-managing-datasets</link><guid isPermaLink="true">https://labworks.razzi.my/python-fundamentals-for-citizen-data-scientist-1-managing-datasets</guid><category><![CDATA[citizen data scientist]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Mon, 16 Feb 2026 02:13:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771207975419/3b6134f3-ffe6-4009-a001-1139e58699f2.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A citizen data scientist is a person who creates or generates models that leverage predictive or prescriptive analytics, but whose primary job function is outside of the field of statistics and analytics (<a target="_blank" href="https://www.gartner.com/smarterwithgartner/how-to-use-citizen-data-scientists-to-maximize-your-da-strategy">Gartner</a>).</p>
<p>To become a data scientist, one needs to acquire skills in programming language such as Python (<a target="_blank" href="https://www.datacarpenter.com/post/learning-plan-citizen-data-scientist">DataCarpenter</a>).</p>
<p>The aim of this post is to introduce novices to the fundamentals of Python programming. Specifically, we will look at the data types in dataset and how to handle them in Python.</p>
<h2 id="heading-1-python-code-editor"><strong>[1] Python Code Editor</strong></h2>
<p>The simplest way of learning Python programming is through the <a target="_blank" href="https://colab.research.google.com//">Google Colab Platform</a>. Click this <a target="_blank" href="https://colab.research.google.com/">link</a> to start using it.</p>
<p>Type the following code:</p>
<p><code>print("Hello World")</code></p>
<p>And then, press the keyboard keys [CTRL]+[ENTER] or click the round-shaped play icon to run the code.</p>
<p>Colab will display the text:</p>
<p><code>Hello World</code></p>
<p>So easy :-) .</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*tJa57G_HHcoK4hxwhYAw6w.png" alt /></p>
<h2 id="heading-2-python-variables-and-their-basic-data-types"><strong>[2] Python Variables and their Basic Data Types</strong></h2>
<p>Computer programs need to store data in memory before processing it. In programming, these named containers for data are called “variables”.</p>
<p>The types of data will determine the way they will be processed.</p>
<p>Some basic data types are numbers (which can be further categorized into <strong>Integers</strong> i.e. <em>whole numbers</em> or <strong>Floats</strong> i.e. <em>numbers with a fractional part</em>), <strong>Strings</strong> (which consist of <em>letters, punctuation etc.</em>), <strong>Dates</strong> and <strong>Booleans</strong> (i.e. <em>True</em> or <em>False</em>).</p>
<p>Let’s take the first record of the Titanic Dataset.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:474/1*EfcMrMBCc4YQhBXR2C-jvA.png" alt /></p>
<p>We declare variables to store the above data as follows:</p>
<pre><code class="lang-plaintext">PassengerId= 1
Survived= 0
Pclass= 3
Name= 'Braund, Mr. Owen Harris'
Sex= 'male'
Age= 22
SibSp= 1
Parch= 0
Ticket= 'A/5 21171'
Fare= 7.25
Cabin= ''
Embarked= 'S'
</code></pre>
<p><img src="https://miro.medium.com/v2/resize:fit:509/1*lk89HCAXM9z07I0ao6LEtA.png" alt /></p>
<p>Altogether there are 12 variables in the code above, holding data of several different types.</p>
<p>Colab displays data values in red and green: green represents Numbers (Integers or Floats) and red represents Strings. Strings must be enclosed in a pair of single (<code>''</code>) or double quotes (<code>""</code>). Strings can be empty, e.g. <code>Cabin</code>, which contains a pair of quotes with nothing between them.</p>
<p>We print the variable values using the <code>print()</code> function.</p>
<pre><code class="lang-plaintext">print (PassengerId)
print (Survived)
print (Pclass)
print (Name)
print (Sex)
print (Age)
print (SibSp)
print (Parch)
print (Ticket)
print (Fare)
print (Cabin)
print (Embarked)
</code></pre>
<p>Output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*33C5Z9m_yCoL2gN4qVSlfQ.png" alt /></p>
<p>We can also print the variable data types using the <code>type()</code> function.</p>
<pre><code class="lang-plaintext">print (type(PassengerId))
print (type(Survived))
print (type(Pclass))
print (type(Name))
print (type(Sex))
print (type(Age))
print (type(SibSp))
print (type(Parch))
print (type(Ticket))
print (type(Fare))
print (type(Cabin))
print (type(Embarked))
</code></pre>
<p>Output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:874/1*ONWccv3kdJyON-DmDONTTw.png" alt /></p>
<p>Identifying the data types that will be used in processing is important because each data type supports a different set of operations.</p>
<p>Sometimes, certain data values may need to be converted into another data type prior to processing to make them more meaningful.</p>
<p>For example, the <code>Survived</code> variable in the Titanic dataset contains only 1 or 0: 1 means “survived” and 0 means otherwise. We can convert this value into a Boolean data type (i.e. True or False); True means “survived” and False means otherwise. This is more meaningful than the numbers 1 or 0.</p>
<pre><code class="lang-plaintext"># Redeclare Survive. Convert 0 to False
Survived= False
print (Survived)
print (type(Survived))
</code></pre>
<p>Output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*VDZQUn_lIQB3XrBuZiR18w.png" alt /></p>
<p>Another example is the <code>Pclass</code> variable (which represents the passenger class) that contains either 1, 2 or 3. These values are labels and are not meant for numeric calculations. They can be declared as Strings by enclosing the number in quotes.</p>
<pre><code class="lang-plaintext"># Redeclare PClass. Convert Integer 3 to String '3'
Pclass = '3'
print(Pclass)
print(type(Pclass))
</code></pre>
<p>Output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*GL3wNMTUNEe23XUWxXi_HQ.png" alt /></p>
<p>We have seen in the above example that variables can be reassigned with new values. Each time a new value is given to a variable, its content changes, and so may its data type. Be careful with this, as it may affect data processing results at a later stage.</p>
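<p>A minimal sketch of this behaviour:</p>
<pre><code class="lang-plaintext"># a variable's type follows its current value
x = 3            # integer
print(type(x))   # &lt;class 'int'&gt;
x = '3'          # reassigned with a string
print(type(x))   # &lt;class 'str'&gt;
x = 3.0          # reassigned with a float
print(type(x))   # &lt;class 'float'&gt;
</code></pre>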
<h2 id="heading-3-collection-data-types"><strong>[3] Collection Data Types</strong></h2>
<p>The above example demonstrates only one record out of the total of 891 records in the Titanic dataset.</p>
<p>Let’s look at the first 5 records.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*lzkBrqyOT4QGGo4S1itEuw.png" alt /></p>
<p>To store this kind of data, we need a collection data type. In Python, this is called a List.</p>
<p>The first record can be declared as a list as follows:</p>
<pre><code class="lang-plaintext">record=[1,0,3,'Braund, Mr. Owen Harris','male',22,1,0,'A/5 21171',7.25,'','S']
print(record)
print(type(record))
</code></pre>
<p>Output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:845/1*nxDxHODkQmljiOq8FocJvw.png" alt /></p>
<p>To store five records, we declare 5 lists, enclose them in another pair of brackets, and separate them with commas as follows:</p>
<pre><code class="lang-plaintext">list_record=[
    [1,0,3,'Braund, Mr. Owen Harris','male',22,1,0,'A/5 21171',7.25,'','S'],
    [2,1,1,'Cumings, Mrs. John Bradley (Florence Briggs Thayer)','female',38,1,0,'PC 17599',71.2833,'C85','C'],
    [3,1,3,'Heikkinen, Miss. Laina','female',26,0,0,'STON/O2. 3101282',7.925,'','S'],
    [4,1,1,'Futrelle, Mrs. Jacques Heath (Lily May Peel)','female',35,1,0,'113803',53.1,'C123','S'],
    [5,0,3,'Allen, Mr. William Henry','male',35,0,0,'373450',8.05,'','S']
]
print(list_record)
print(type(list_record))
print(len(list_record))
</code></pre>
<p>Output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*sy4ktFuCMl6gbkX3wx9m5w.png" alt /></p>
<p>Use the len() function to get the count of all records.</p>
<p>We can also store the values column-wise, one list per column. For example:</p>
<pre><code class="lang-plaintext">list_passenger_id=[1,2,3,4,5]

list_survived=[0,1,1,1,0]

list_sex =['male','female','female','female','male']
</code></pre>
<p>In this way, the first item in each list represents the first record in the Titanic dataset.</p>
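<p>To see that the two layouts hold the same information, we can rebuild row-wise records from the column-wise lists with <code>zip()</code>:</p>
<pre><code class="lang-plaintext">list_passenger_id = [1, 2, 3, 4, 5]
list_survived = [0, 1, 1, 1, 0]
list_sex = ['male', 'female', 'female', 'female', 'male']
# zip pairs the i-th item of each column list to form the i-th record
rows = list(zip(list_passenger_id, list_survived, list_sex))
print(rows[0])  # (1, 0, 'male'), the first record
</code></pre>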
<p>Programming is actually a creative abstraction of real world problems :-).</p>
<p>Besides List, there are several other collection data types such as Tuple, Set and Dictionary. (Read more about them here → <a target="_blank" href="https://mohamad.razzi.my/2022/01/organize-data-using-list-tuple-set-and.html">Organize Data Using List, Tuple, Set and Dictionary</a>)</p>
<p>List data type is very useful for managing datasets.</p>
<p>Python comes with even more powerful packages for managing datasets, such as the Python Data Analysis (Pandas) library. Pandas saves a lot of time and effort in data manipulation work. Let’s have a look at it.</p>
<h2 id="heading-4-pandas-dataframe"><strong>[4] Pandas DataFrame</strong></h2>
<p>Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive (<a target="_blank" href="https://pandas.pydata.org/docs/getting_started/overview.html">PyData.org</a>).</p>
<p>Pandas organizes datasets in table-like structures: a 1-dimensional Series and a 2-dimensional DataFrame.</p>
<p>Since this is an additional package, we need to import it first:</p>
<pre><code class="lang-plaintext">import pandas as pd
</code></pre>
<p>To create a Series, declare as follows:</p>
<pre><code class="lang-plaintext"># create a series from list_survived
ds_survived = pd.Series(list_survived)
print(ds_survived.info())
ds_survived
</code></pre>
<p>Output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*1eMWWozgWNCSPQ9c8_2Vzg.png" alt /></p>
<p>To create a DataFrame, declare as follows:</p>
<pre><code class="lang-plaintext"># create a dataframe from list_record
df_record=pd.DataFrame(list_record)
print(df_record.info())
df_record
</code></pre>
<p>Output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*-zWgR7S9fKAeEyhH50hnhg.png" alt /></p>
<p>The list has been converted into a 2-dimensional table known as a DataFrame.</p>
<p>With DataFrames, many kinds of data manipulation tasks become much easier and more efficient.</p>
<p>Next, let’s rename the columns:</p>
<pre><code class="lang-plaintext">df_record.columns = ['PassengerId','Survived','Pclass','Name','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin','Embarked']
df_record
</code></pre>
<p>Output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*Y3BzabYeASl-dWmxy2VDFw.png" alt /></p>
<h2 id="heading-5-importing-datasets"><strong>[5] Importing Datasets</strong></h2>
<p>Instead of manual copy-paste jobs, we can automatically fetch data from Internet sources into Pandas dataframe.</p>
<p>For example, we can fetch the Titanic data set from <a target="_blank" href="https://archive.org/download/misc-dataset/titanic.csv">https://archive.org/download/misc-dataset/titanic.csv</a> as follows:</p>
<pre><code class="lang-plaintext">import pandas as pd
pd.set_option('display.max_colwidth', None)
file_url='https://archive.org/download/misc-dataset/titanic.csv'
df_orig = pd.read_csv(file_url,encoding='utf-8')
print(df_orig.info())
df_orig.head()
</code></pre>
<p>Output:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*1p2oWWex3f-Jek4VKoocEA.png" alt /></p>
<p>We are done with basic dataset management. Next, we will perform some transformation tasks to make the dataset more efficient for data processing.</p>
<h2 id="heading-colab-notebook"><strong>Colab Notebook:</strong></h2>
<p><a target="_blank" href="https://colab.research.google.com/drive/17ca-pf36XMSKvBMjAnZTXJSXmn_vrba5?source=post_page-----4c21b8c743bb---------------------------------------"><strong>Python Fundamentals For Citizen Data Scientist 1 (Google Colab)</strong></a></p>
<p>🤓</p>
<p>❖❖❖❖❖❖❖❖❖❖</p>
<h2 id="heading-python-fundamentals-for-citizen-data-scientist-series"><strong>Python Fundamentals For Citizen Data Scientist Series</strong></h2>
<p>This article is a part of a series:</p>
<ol>
<li><p><a target="_blank" href="https://medium.com/p/4c21b8c743bb">Managing Datasets</a></p>
</li>
<li><p><a target="_blank" href="https://medium.com/p/493c6040a09d">Data Transformation</a></p>
</li>
<li><p><a target="_blank" href="https://medium.com/p/99bf23393ac1">Descriptive Analysis</a></p>
</li>
<li><p><a target="_blank" href="https://medium.com/p/9d3778116861">Descriptive Analysis Visualization</a></p>
</li>
<li><p><a target="_blank" href="https://medium.com/p/e4c92ad59ce5">Skewness</a></p>
</li>
<li><p><a target="_blank" href="https://medium.com/p/48495423fdd4">Regression</a></p>
</li>
<li><p><a target="_blank" href="https://medium.com/p/350881e37c6">Classification</a></p>
</li>
</ol>
<p>❖❖❖❖❖❖❖❖❖❖</p>
]]></content:encoded></item><item><title><![CDATA[Editing MS Access Database Model In Excel Power Pivot]]></title><description><![CDATA[[1] Download access database and view the content using excel application:
https://archive.org/download/oltp-olap/Financial_Sample_OLAP.accdb
[2] Create a new Excel blank worksheet
[3] Go to Data tab.
Select Get Data>From Database>From Microsoft Acce...]]></description><link>https://labworks.razzi.my/editing-ms-access-database-model-in-excel-power-pivot</link><guid isPermaLink="true">https://labworks.razzi.my/editing-ms-access-database-model-in-excel-power-pivot</guid><category><![CDATA[Excel data modeling for business analytics]]></category><category><![CDATA[Data Modeling in Excel]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Mon, 16 Feb 2026 01:07:59 GMT</pubDate><content:encoded><![CDATA[<p>[1] Download access database and view the content using excel application:</p>
<p><a target="_blank" href="https://archive.org/download/oltp-olap/Financial_Sample_OLAP.accdb">https://archive.org/download/oltp-olap/Financial_Sample_OLAP.accdb</a></p>
<p>[2] Create a new blank Excel workbook</p>
<p>[3] Go to the Data tab.</p>
<p>Select Get Data&gt;From Database&gt;From Microsoft Access Database</p>
<p>[4] In the Navigator window, select multiple tables.</p>
<p>[5] In the Data tab, under the Data Tools section, select Manage Data Model  </p>
<p>A Power Pivot for Excel window will display the tables.<br />Click Diagram View button.</p>
<p>The tables will be displayed in diagram form.</p>
<p>[6] Link the tables by dragging the key fields from one table to the corresponding fields in the matching tables.  </p>
<p><a target="_blank" href="https://archive.org/download/oltp-olap/Financial_Sample_OLAP.xlsx">https://archive.org/download/oltp-olap/Financial_Sample_OLAP.xlsx</a></p>
<p><a target="_blank" href="https://archive.org/download/oltp-olap/Financial_Sample_OLTP.xlsx">https://archive.org/download/oltp-olap/Financial_Sample_OLTP.xlsx</a></p>
<p><a target="_blank" href="https://archive.org/download/oltp-olap/campus_documentdb.json">https://archive.org/download/oltp-olap/campus_documentdb.json</a> </p>
<p><a target="_blank" href="https://archive.org/download/oltp-olap/campus_columndb.json">https://archive.org/download/oltp-olap/campus_columndb.json</a></p>
<p><a target="_blank" href="https://archive.org/download/oltp-olap/campus_graphdb.json">https://archive.org/download/oltp-olap/campus_graphdb.json</a></p>
]]></content:encoded></item><item><title><![CDATA[An efficient approach for textual data classification using deep learning]]></title><description><![CDATA[Abstract:Text categorization is an effective activity that can be accomplished using a variety of classification algorithms. In machine learning, the classifier is built by learning the features of categories from a set of preset training data. Simil...]]></description><link>https://labworks.razzi.my/an-efficient-approach-for-textual-data-classification-using-deep-learning</link><guid isPermaLink="true">https://labworks.razzi.my/an-efficient-approach-for-textual-data-classification-using-deep-learning</guid><category><![CDATA[labworks]]></category><category><![CDATA[lexical-analysis]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Thu, 15 Sep 2022 04:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1733565588304/2bf9b0c4-9cd0-46ca-bd57-fffa103a2a71.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>Abstract:<br />Text categorization is an effective activity that can be accomplished using a variety of classification algorithms. In machine learning, the classifier is built by learning the features of categories from a set of preset training data. Similarly, deep learning offers enormous benefits for text classification since they execute highly accurately with lower-level engineering and processing. This paper employs machine and deep learning techniques to classify textual data. Textual data contains much useless information that must be pre-processed. We clean the data, impute missing values, and eliminate the repeated columns. Next, we employ machine learning algorithms: logistic regression, random forest, K-nearest neighbors (KNN), and deep learning algorithms: long short-term memory (LSTM), artificial neural network (ANN), and gated recurrent unit (GRU) for classification. Results reveal that LSTM achieves 92% accuracy outperforming all other models and baseline studies.</p>
</blockquote>
<p>(Abdullah Alqahtani, H. Khan, Shtwai Alsubai, Mohemmed Sha, Ahmad S. Almadhor, Tayyab Iqbal, Sidra Abbas)</p>
<p><a target="_blank" href="https://www.frontiersin.org/journals/computational-neuroscience/articles/10.3389/fncom.2022.992296/full">https://www.frontiersin.org/journals/computational-neuroscience/articles/10.3389/fncom.2022.992296/full</a></p>
<p><a target="_blank" href="https://www.semanticscholar.org/paper/An-efficient-approach-for-textual-data-using-deep-Alqahtani-Khan/535455a9c44c5e783da02b49299069f6a225d647">https://www.semanticscholar.org/paper/An-efficient-approach-for-textual-data-using-deep-Alqahtani-Khan/535455a9c44c5e783da02b49299069f6a225d647</a></p>
<hr />
<h1 id="heading-discussion">Discussion:</h1>
<p>The paper <em>"An efficient approach for textual data classification using deep learning"</em> brings attention to the potential of machine and deep learning models, such as LSTM and GRU, for text classification tasks. Interestingly, the authors use the Titanic dataset for their experiments, which primarily contains structured data and limited text fields. This choice raises intriguing questions about how text-focused models can be adapted for datasets that are not traditionally text-heavy. Could this approach point to new ways of extracting or representing textual features from structured data? Or does it highlight the importance of selecting datasets that align more closely with a study's goals? This opens up a larger conversation about balancing creativity in research with ensuring methodological alignment, inviting us to reflect on how we choose and use datasets in machine learning studies.</p>
<h1 id="heading-lab-works">Lab Works:</h1>
<p><img src="https://www.frontiersin.org/files/Articles/992296/fncom-16-992296-HTML/image_m/fncom-16-992296-t002.jpg" alt="www.frontiersin.org" /></p>
<pre><code class="lang-plaintext">1. V ← LE(data)  {Label Encoding}  
2. μ ← (1/m) * Σ(i=1 to m) X^(i)  {Normalizing data}  
3. X ← X - μ  
4. σ² ← (1/m) * Σ(i=1 to m) (X^(i))²  
5. X ← X / σ²  
6. D2 ← np.array(Df)  {Convergence of Matrix}  
7. for l in range(1, len(L))  {Weight Initialization}  
   1. W[l] ← rand((m × n)) * √(2 / n[l-1])  
8. end for  
9. V ← MaxPooling(F)  {Conversion of Vector}  
10. lstm ← LSTM(V)  {LSTM layer}  
11. f_lstm ← Hidden(lstm)  {Hidden layer}  
12. PC ← PredictClass(f_lstm)  {Dense layer}  
13. for i in range(1, len(PC)) do  
    1. if PC[i] == y_test[i] then  
       1. return PC[i]  
    2. else  
       1. return y_test[i]  
    3. end if  
14. end for  
15. return Output
</code></pre>
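<p>Steps 1–5 of the listing (label encoding and normalization) can be sketched in plain NumPy. This is our own minimal reading, not the authors' code; note that the listing divides by σ² rather than σ, and the sketch follows the listing as written:</p>
<pre><code class="lang-plaintext">import numpy as np

def label_encode(labels):
    # Step 1: map each distinct label to an integer (a simple LE stand-in).
    lut = {c: i for i, c in enumerate(sorted(set(labels)))}
    return np.array([lut[c] for c in labels])

def normalize(X):
    # Steps 2-5: mean-center, then divide by the per-feature variance
    # (the listing divides by sigma squared, not sigma).
    X = X - X.mean(axis=0)
    return X / (X ** 2).mean(axis=0)

y = label_encode(['spam', 'ham', 'spam'])                         # [1, 0, 1]
Xn = normalize(np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]))
</code></pre>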
<p><a target="_blank" href="https://colab.research.google.com/drive/1NLegXg7WTNgajyU1XgX1g-Qpe9r6Fwqp">https://colab.research.google.com/drive/1NLegXg7WTNgajyU1XgX1g-Qpe9r6Fwqp</a></p>
]]></content:encoded></item><item><title><![CDATA[N-gram-based text categorization]]></title><description><![CDATA[Text categorization is a fundamental task in document processing, allowing the automated handling of enormous streams of documents in electronic form. One difficulty in handling some classes of documents is the presence of different kinds of textual ...]]></description><link>https://labworks.razzi.my/n-gram-based-text-categorization</link><guid isPermaLink="true">https://labworks.razzi.my/n-gram-based-text-categorization</guid><category><![CDATA[labworks]]></category><category><![CDATA[lexical-analysis]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Sat, 31 Dec 1994 04:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1733731683839/060f4821-12fe-4501-95aa-3574d3d13748.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>Text categorization is a fundamental task in document processing, allowing the automated handling of enormous streams of documents in electronic form. One difficulty in handling some classes of documents is the presence of different kinds of textual errors, such as spelling and grammatical errors in email, and character recognition errors in documents that come through OCR. Text categorization must work reliably on all input, and thus must tolerate some level of these kinds of problems.  </p>
<p>We describe here an N-gram-based approach to text categorization that is tolerant of textual errors. The system is small, fast and robust. This system worked very well for language classification, achieving in one test a 99.8% correct classification rate on Usenet newsgroup articles written in different languages. The system also worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject, achieving as high as an 80% correct classification rate. There are also several obvious directions for improving the system`s classification performance in those cases where it did not do as well.  </p>
<p>The system is based on calculating and comparing profiles of N-gram frequencies. First, we use the system to compute profiles on training set data that represent the various categories, e.g., language samples or newsgroup content samples. Then the system computes a profile for a particular document that is to be classified. Finally, the system computes a distance measure between the document`s profile and each of the category profiles. The system selects the category whose profile has the smallest distance to the document`s profile. The profiles involved are quite small, typically 10K bytes for a category training set, and less than 4K bytes for an individual document.  </p>
<p>Using N-gram frequency profiles provides a simple and reliable way to categorize documents in a wide range of classification tasks.</p>
</blockquote>
<p>(W. B. Cavnar, J. Trenkle)</p>
<p><a target="_blank" href="https://sdmines.sdsmt.edu/upload/directory/materials/12247_20070403135416.pdf">https://sdmines.sdsmt.edu/upload/directory/materials/12247_20070403135416.pdf</a></p>
<p><a target="_blank" href="https://www.semanticscholar.org/paper/N-gram-based-text-categorization-Cavnar-Trenkle/49af572ef8f7ea89db06d5e7b66e9369c22d7607">https://www.semanticscholar.org/paper/N-gram-based-text-categorization-Cavnar-Trenkle/49af572ef8f7ea89db06d5e7b66e9369c22d7607</a></p>
<hr />
<h3 id="heading-methodology"><strong>Methodology</strong></h3>
<ol>
<li><p><strong>N-Gram Generation</strong>:</p>
<ul>
<li><p>The algorithm extracts all possible contiguous sequences of n characters (n-grams) from a text.</p>
</li>
<li><p>These n-grams are then ranked based on their frequency of occurrence.</p>
</li>
</ul>
</li>
<li><p><strong>Profile Construction</strong>:</p>
<ul>
<li><p>Each document or category is represented as a profile containing the most frequent n-grams (e.g., top 300).</p>
</li>
<li><p>A similar profile is generated for the text being classified.</p>
</li>
</ul>
</li>
<li><p><strong>Similarity Comparison</strong>:</p>
<ul>
<li><p>The categorization task involves comparing the text's n-gram profile against the profiles of known categories.</p>
</li>
<li><p>The similarity metric used is the <strong>rank-order distance</strong>, which measures how closely the n-gram frequencies in the input text align with those in the category profiles.</p>
</li>
</ul>
</li>
<li><p><strong>Language Identification</strong>:</p>
<ul>
<li>The paper tested the method extensively for language identification, demonstrating its capability to distinguish between languages with high accuracy.</li>
</ul>
</li>
</ol>
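<p>The steps above can be sketched in a few lines of Python. This is a minimal reading of the method, not the authors' code; the profile size and the fixed out-of-place penalty are simplifications, and the toy "language profiles" below are single sentences:</p>
<pre><code class="lang-plaintext">from collections import Counter

def profile(text, n=3, top=300):
    # Rank the most frequent character n-grams of the text.
    text = ' ' + text.lower() + ' '
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def rank_order_distance(doc_profile, cat_profile):
    # Sum of rank displacements; n-grams missing from the category
    # profile receive a fixed maximum penalty.
    cat_rank = {g: r for r, g in enumerate(cat_profile)}
    penalty = len(cat_profile)
    return sum(abs(r - cat_rank[g]) if g in cat_rank else penalty
               for r, g in enumerate(doc_profile))

def classify(text, category_profiles):
    # Pick the category whose profile is nearest to the document profile.
    doc = profile(text)
    return min(category_profiles,
               key=lambda c: rank_order_distance(doc, category_profiles[c]))

profiles = {'en': profile('the cat sat on the mat and the dog ran'),
            'es': profile('el gato se sienta en la alfombra y el perro')}
print(classify('the dog and the cat', profiles))  # en
</code></pre>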
<hr />
<h2 id="heading-discussions">Discussions:</h2>
<p>While the character n-gram approach of Cavnar and Trenkle is effective and simple, it raises interesting questions about its broader use. For example, how well does it handle texts where different categories share similar patterns, or cases where semantic meaning matters?</p>
<hr />
<h2 id="heading-lab-works">Lab Works:</h2>
<p><a target="_blank" href="https://colab.research.google.com/drive/1ciPDoOmyI6tgOEt17PQczsjGaPc9uMxd">https://colab.research.google.com/drive/1ciPDoOmyI6tgOEt17PQczsjGaPc9uMxd</a></p>
]]></content:encoded></item><item><title><![CDATA[Word Association Norms, Mutual Information, and Lexicography]]></title><description><![CDATA[Abstract:
The term word association is used in a very particular sense in the psycholinguistic literature. (Generally speaking, subjects respond quicker than normal to the word nurse if it follows a highly associated word such as doctor. ) We will ex...]]></description><link>https://labworks.razzi.my/word-association-norms-mutual-information-and-lexicography</link><guid isPermaLink="true">https://labworks.razzi.my/word-association-norms-mutual-information-and-lexicography</guid><category><![CDATA[labworks]]></category><category><![CDATA[lexical-analysis]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Mon, 26 Jun 1989 04:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1733468709492/ce44bbae-1bd5-4530-a8d7-bf4718afe82c.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>Abstract:</p>
<p>The term word association is used in a very particular sense in the psycholinguistic literature. (Generally speaking, subjects respond quicker than normal to the word nurse if it follows a highly associated word such as doctor. ) We will extend the term to provide the basis for a statistical description of a variety of interesting linguistic phenomena, ranging from semantic relations of the doctor/nurse type (content word/content word) to lexico-syntactic co-occurrence constraints between verbs and prepositions (content word/function word). This paper will propose an objective measure based on the information theoretic notion of mutual information, for estimating word association norms from computer readable corpora. (The standard method of obtaining word association norms, testing a few thousand subjects on a few hundred words, is both costly and unreliable.) The proposed measure, the association ratio, estimates word association norms directly from computer readable corpora, making it possible to estimate norms for tens of thousands of words.</p>
<p>(Kenneth Ward Church, Patrick Hanks)</p>
</blockquote>
<p><a target="_blank" href="https://www.semanticscholar.org/paper/Word-Association-Norms%2C-Mutual-Information%2C-and-Church-Hanks">https://www.semanticscholar.org/paper/Word-Association-Norms%2C-Mutual-Information%2C-and-Church-Hanks</a></p>
<p><a target="_blank" href="https://aclanthology.org/P89-1010.pdf">https://aclanthology.org/P89-1010.pdf</a></p>
<h3 id="heading-key-concepts">Key Concepts</h3>
<ol>
<li><p><strong>Mutual Information</strong>:</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733407233022/bc7df284-1676-48e1-b763-f46d4cb408ee.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p>MI compares the <strong>joint probability</strong> of observing two events (or words) together, P(x,y), with the probability of observing them independently, P(x)P(y).</p>
</li>
<li><p>If P(x,y) is significantly larger than P(x)P(y), it indicates a strong association, resulting in I(x,y)&gt;0.</p>
</li>
<li><p>Conversely, if P(x,y) is similar to P(x)P(y), then I(x,y)≈0, suggesting no significant relationship.</p>
</li>
<li><p>If x and y are in <strong>complementary distribution</strong>, they do not occur together. This means that P(x,y) is very low or approaches zero.</p>
</li>
</ul>
</li>
<li><p><strong>Estimation of Probabilities</strong>:</p>
<ul>
<li><p>The probabilities P(x) and P(y) are estimated by counting occurrences in a corpus, denoted as f(x) and f(y), and normalizing by the total corpus size N.</p>
</li>
<li><p>Joint probabilities P(x,y) are estimated by counting how often x is followed by y within a specified window size w (e.g., 5 words).</p>
</li>
</ul>
</li>
<li><p><strong>Window Size</strong>:</p>
<ul>
<li><p>The choice of window size affects the type of relationships captured:</p>
<ul>
<li><p><strong>Smaller windows</strong> identify fixed expressions (like idioms).</p>
</li>
<li><p><strong>Larger windows</strong> capture broader semantic relationships.</p>
</li>
</ul>
</li>
<li><p>A window size of <strong>5 words</strong> is chosen as a compromise: wide enough to capture meaningful relationships, narrow enough to preserve contextual adjacency.</p>
</li>
</ul>
</li>
<li><p><strong>Count Threshold</strong>:</p>
<ul>
<li>The authors set a threshold, avoiding pairs with very small counts (e.g., f(x,y)&lt;5), to maintain stability in the association ratio. This avoids unreliable estimates that can arise from low counts.</li>
</ul>
</li>
<li><p><strong>Symmetry in Probabilities</strong>:</p>
<ul>
<li><p>MI is symmetric (P(x,y)=P(y,x)), meaning the relationship holds regardless of the order of the words.</p>
</li>
<li><p>The association ratio, however, is not symmetric because it captures linear precedence (the order of appearance). This asymmetry can reveal interesting biases or relationships in data, such as syntactic patterns or sociolinguistic trends.</p>
</li>
</ul>
</li>
</ol>
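<p>The association ratio described above can be estimated from a toy corpus in a few lines of Python. This is a minimal sketch with variable names of our own choosing; a realistic estimate requires a large corpus and the low-count threshold discussed above:</p>
<pre><code class="lang-plaintext">import math
from collections import Counter

def association_ratio(tokens, w=5):
    # I(x, y) = log2( f(x,y) * N / (f(x) * f(y)) ), counting y within a
    # window of w words after x, so the measure is order-sensitive.
    N = len(tokens)
    f = Counter(tokens)
    fxy = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1:i + 1 + w]:
            fxy[(x, y)] += 1
    return {(x, y): math.log2(c * N / (f[x] * f[y]))
            for (x, y), c in fxy.items()}

tokens = ['doctor', 'nurse', 'a', 'b', 'doctor', 'nurse', 'c', 'd']
scores = association_ratio(tokens, w=2)
print(scores[('doctor', 'nurse')])  # 2.0
</code></pre>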
<h1 id="heading-lab-works">Lab Works:</h1>
<p><a target="_blank" href="https://colab.research.google.com/drive/1f5yfmhAocDZ9bHeg1QKXY_086dEHIVTi">https://colab.research.google.com/drive/1f5yfmhAocDZ9bHeg1QKXY_086dEHIVTi</a></p>
]]></content:encoded></item><item><title><![CDATA[Frequency Analysis of English Usage: Lexicon and Grammar]]></title><description><![CDATA[Abstract:
This volume presents the results of a lexical and grammatical analysis of a one-million-word corpus of present-day American English, originally assembled at Brown University in 1963-64 and thus commonly referred to by researchers interested...]]></description><link>https://labworks.razzi.my/frequency-analysis-of-english-usage-lexicon-and-grammar</link><guid isPermaLink="true">https://labworks.razzi.my/frequency-analysis-of-english-usage-lexicon-and-grammar</guid><category><![CDATA[labworks]]></category><category><![CDATA[lexical-analysis]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Fri, 01 Jan 1982 04:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1733553954395/6405a75a-1ab4-4acf-8eca-84c5e3179120.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>Abstract:</p>
<p>This volume presents the results of a lexical and grammatical analysis of a one-million-word corpus of present-day American English, originally assembled at Brown University in 1963-64 and thus commonly referred to by researchers interested in text analysis as the Brown Corpus. The Brown Corpus, which was compiled with the view of making it broadly representative of current edited American English, contains selections from five hundred samples belonging to fifteen different genres of writing. The genres range from newspaper reportage to technical writing, and from philosophical essays to various kinds of fiction.</p>
</blockquote>
<p>(W. Nelson Francis (Author), Henry Kucera (Author), Andrew W. Mackie (Author))</p>
<p><a target="_blank" href="https://www.amazon.com/FREQUENCY-ANALYSIS-ENGLISH-USAGE-LEXICON/dp/0395322502">https://www.amazon.com/FREQUENCY-ANALYSIS-ENGLISH-USAGE-LEXICON/dp/0395322502</a></p>
<p><a target="_blank" href="https://journals.sagepub.com/doi/abs/10.1177/007542428501800107">https://journals.sagepub.com/doi/abs/10.1177/007542428501800107</a></p>
<hr />
<p>The introduction of the paper "Frequency Analysis of English Usage: Lexicon and Grammar" sets the stage for a comprehensive exploration of how frequency data can illuminate patterns in English language usage.</p>
<ul>
<li><p><strong>Purpose of the Study</strong>: The paper aims to analyze the frequency of words and grammatical structures in English, providing insights into their usage in various contexts. This analysis is crucial for understanding language patterns and can inform both linguistic theory and practical applications in language education and computational linguistics.</p>
</li>
<li><p><strong>Importance of Frequency Analysis</strong>: The introduction emphasizes that frequency analysis is a valuable tool for linguists. It allows researchers to identify which words and grammatical forms are most commonly used, thereby revealing trends in language evolution and usage. This can help in distinguishing between standard and non-standard forms of English.</p>
</li>
<li><p><strong>Methodological Framework</strong>: The authors outline the methodological approach they will employ, which includes the collection of large corpora of English text. By analyzing these corpora, the study seeks to quantify the frequency of various lexical items and grammatical constructions, providing a robust statistical basis for their findings.</p>
</li>
<li><p><strong>Relevance to Language Learning</strong>: The introduction also touches on the implications of frequency analysis for language teaching. Understanding which words and structures are most frequently used can guide educators in developing curricula that prioritize these elements, thus enhancing language acquisition for learners.</p>
</li>
<li><p><strong>Contribution to Linguistic Research</strong>: Finally, the authors position their work within the broader field of linguistic research, suggesting that their findings will contribute to ongoing discussions about language use, change, and the relationship between lexicon and grammar. This positions the study as a significant addition to the existing literature on English linguistics.</p>
</li>
</ul>
<p>In summary, the introduction effectively outlines the study's objectives, significance, and methodological approach, setting a clear framework for the subsequent analysis presented in the paper.</p>
<hr />
<h1 id="heading-lab-works">Lab Works</h1>
<p><a target="_blank" href="https://colab.research.google.com/drive/1e670xj2jk-NUFQW8K3VKRUeTCRxNNpfM">https://colab.research.google.com/drive/1e670xj2jk-NUFQW8K3VKRUeTCRxNNpfM</a></p>
]]></content:encoded></item></channel></rss>