N-gram-based text categorization

Text categorization is a fundamental task in document processing, allowing the automated handling of enormous streams of documents in electronic form. One difficulty in handling some classes of documents is the presence of different kinds of textual errors, such as spelling and grammatical errors in email, and character recognition errors in documents that come through OCR. Text categorization must work reliably on all input, and thus must tolerate some level of these kinds of problems.
We describe here an N-gram-based approach to text categorization that is tolerant of textual errors. The system is small, fast and robust. This system worked very well for language classification, achieving in one test a 99.8% correct classification rate on Usenet newsgroup articles written in different languages. The system also worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject, achieving as high as an 80% correct classification rate. There are also several obvious directions for improving the system's classification performance in those cases where it did not do as well.
The system is based on calculating and comparing profiles of N-gram frequencies. First, we use the system to compute profiles on training set data that represent the various categories, e.g., language samples or newsgroup content samples. Then the system computes a profile for a particular document that is to be classified. Finally, the system computes a distance measure between the document's profile and each of the category profiles. The system selects the category whose profile has the smallest distance to the document's profile. The profiles involved are quite small, typically 10K bytes for a category training set, and less than 4K bytes for an individual document.
Using N-gram frequency profiles provides a simple and reliable way to categorize documents in a wide range of classification tasks.
(W. B. Cavnar, J. Trenkle)
https://sdmines.sdsmt.edu/upload/directory/materials/12247_20070403135416.pdf
Methodology
N-Gram Generation:
The algorithm extracts all contiguous character sequences of length 1 through 5 (n-grams) from the text, padding words with blanks.
These n-grams are then ranked based on their frequency of occurrence.
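These two steps can be sketched in Python as follows (the function name is illustrative, not from the paper, and the word-padding detail is omitted for brevity):

```python
from collections import Counter

def char_ngrams(text, n_min=1, n_max=5):
    """Count all contiguous character n-grams of lengths n_min..n_max.

    A simplified sketch: the paper pads tokens with blanks before
    extraction, which is skipped here.
    """
    text = text.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts
```

The resulting `Counter` already supports frequency ranking via `most_common()`.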
Profile Construction:
Each document or category is represented as a profile containing the most frequent n-grams (e.g., top 300).
A similar profile is generated for the text being classified.
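A profile, in this scheme, is just the rank order of the most frequent n-grams. A minimal sketch (function name is illustrative; the top-300 cutoff is the value the paper reports):

```python
def build_profile(counts, top_k=300):
    """Keep the top_k most frequent n-grams as a rank map {ngram: rank}.

    Ties in frequency are broken alphabetically here; the paper does not
    specify a tie-breaking rule, so this is an assumption.
    """
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {g: rank for rank, g in enumerate(ranked)}
```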
Similarity Comparison:
The categorization task involves comparing the text's n-gram profile against the profiles of known categories.
The similarity metric is a rank-order distance the paper calls the "out-of-place" measure: for each n-gram in the document profile, it sums the difference between that n-gram's rank in the document profile and its rank in the category profile; n-grams absent from the category profile incur a fixed maximum penalty.
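The out-of-place measure can be sketched as follows (function name is illustrative; using the category profile size as the maximum penalty is one reasonable choice, since the paper only says "some maximum value"):

```python
def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences between two profiles ({ngram: rank} maps).

    N-grams missing from the category profile take a maximum penalty,
    here set to the category profile's size (an assumption).
    """
    max_penalty = len(cat_profile)
    total = 0
    for gram, rank in doc_profile.items():
        if gram in cat_profile:
            total += abs(rank - cat_profile[gram])
        else:
            total += max_penalty
    return total
```

A smaller distance means the document's n-gram statistics are closer to the category's.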
Language Identification:
- The paper tested the method extensively for language identification, demonstrating its capability to distinguish between languages with high accuracy.
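Putting the pieces together, classification is simply a nearest-profile search. A self-contained sketch (names and the missing-n-gram penalty are illustrative assumptions; profiles are rank maps as above):

```python
def classify(doc_profile, category_profiles):
    """Return the category whose profile has the smallest
    out-of-place distance to the document profile.

    Profiles are {ngram: rank} maps; n-grams absent from a category
    profile take a penalty equal to that profile's size (an assumption).
    """
    def distance(cat_profile):
        penalty = len(cat_profile)
        return sum(
            abs(rank - cat_profile[g]) if g in cat_profile else penalty
            for g, rank in doc_profile.items()
        )
    return min(category_profiles,
               key=lambda name: distance(category_profiles[name]))
```

For language identification, `category_profiles` would hold one rank profile per language, built from the training samples.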
Discussions:
While Cavnar and Trenkle's character n-gram approach is effective and simple, it raises interesting questions about broader use. For example, how well does it handle texts in which different categories share similar surface patterns, or tasks where semantic meaning matters more than character statistics?
Lab Works:
https://colab.research.google.com/drive/1ciPDoOmyI6tgOEt17PQczsjGaPc9uMxd



