
Text classification

Text classification is the process of assigning predefined categories or labels to text data based on its content. The goal is to build a model that can automatically classify new, unseen text into one of the predefined categories. This can be useful in many applications, such as spam filtering, sentiment analysis, and topic classification.

Here's an example of text classification for sentiment analysis, using the movie review dataset from the popular NLTK library in Python:

import nltk
import random

nltk.download('movie_reviews')
from nltk.corpus import movie_reviews

# Create a list of documents, where each document is a tuple containing the text and the label
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the documents so that the positive and negative reviews are evenly distributed
random.shuffle(documents)

# Use the 2000 most frequent words in the corpus as features
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [word for (word, _) in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

# Extract features for all documents
featuresets = [(document_features(d), c) for (d, c) in documents]

# Split the data into training and testing sets
train_set, test_set = featuresets[:1600], featuresets[1600:]

# Train a Naive Bayes classifier on the training data
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Test the classifier on the testing data
accuracy = nltk.classify.accuracy(classifier, test_set)
print('Accuracy:', accuracy)

In this example, we are using the movie review dataset from NLTK, which contains a collection of positive and negative movie reviews. We first create a list of documents, where each document is a tuple containing the text and the label (positive or negative). We then shuffle the documents to ensure that the positive and negative reviews are evenly distributed.

Next, we define a feature extractor function that uses the 2000 most frequent words in the corpus as features. For each document, we extract the features and create a featureset, which is a dictionary mapping feature names to boolean values indicating whether the document contains that word. We then split the data into training and testing sets, with 1600 documents in the training set and the remaining 400 in the testing set.
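
To make that structure concrete, here is a small illustrative snippet (not part of the original example) that reuses the documents list and the document_features function defined above to print a few of the boolean features for a single review:

# Inspect the featureset for one document (illustrative only)
sample_text, sample_label = documents[0]
sample_features = document_features(sample_text)

print(sample_label)             # 'pos' or 'neg'
print(len(sample_features))     # 2000 boolean features
# A few of the 'contains(word)' features that are True for this review
print([name for name, value in sample_features.items() if value][:10])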

We train a Naive Bayes classifier on the training data using the nltk.NaiveBayesClassifier class. Finally, we test the classifier on the testing data using the nltk.classify.accuracy function, which returns the classification accuracy. In this example, the Naive Bayes classifier achieves an accuracy of around 77%; the exact figure varies from run to run because the documents are shuffled randomly.
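
As a quick follow-up (assuming the classifier and document_features from the example above), NLTK can also report which word features were most useful, and the trained classifier can label new text directly:

# Show the word features that best separate positive from negative reviews
classifier.show_most_informative_features(10)

# Classify a new, unseen review using the same feature extractor
new_review = "a wonderful film with brilliant acting and a clever script".split()
print(classifier.classify(document_features(new_review)))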

This is just a simple example of text classification for sentiment analysis, but similar techniques can be used for other classification tasks as well. There are many different algorithms and techniques for text classification, including Naive Bayes, Support Vector Machines, and deep learning models, and the choice of algorithm depends on the specific task and the nature of the data.
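
For comparison, here is a minimal sketch of the same sentiment task using TF-IDF features and a linear Support Vector Machine from scikit-learn. This library and its API go beyond the original NLTK walkthrough, and the exact accuracy will again depend on the random shuffle:

import random
import nltk
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

nltk.download('movie_reviews')

# Rebuild the corpus as raw strings with their labels
texts, labels = [], []
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        texts.append(movie_reviews.raw(fileid))
        labels.append(category)

# Shuffle and split, mirroring the 1600/400 split used above
data = list(zip(texts, labels))
random.shuffle(data)
texts, labels = zip(*data)
train_texts, test_texts = texts[:1600], texts[1600:]
train_labels, test_labels = labels[:1600], labels[1600:]

# TF-IDF features plus a linear Support Vector Machine
vectorizer = TfidfVectorizer(max_features=2000)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

svm = LinearSVC()
svm.fit(X_train, train_labels)
print('Accuracy:', accuracy_score(test_labels, svm.predict(X_test)))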

