Project Description
Train a LinearSVC() classifier and two SVC() classifiers, one with the linear (‘linear’)
kernel and one with the RBF (‘rbf’) kernel, all provided by scikit-learn (package sklearn),
to predict ‘Priority’ labels for the Java Development Tools Bug dataset. (Note that
LinearSVC() and SVC() with a linear kernel are not the same thing: LinearSVC() is built on
liblinear and defaults to a squared hinge loss with a one-vs-rest scheme, whereas SVC() is
built on libsvm and uses the hinge loss with a one-vs-one scheme.) Your project
consists of the following tasks:
1. Clean the data and remove short bug reports as in TextClassification.ipynb. Organize
the remaining data into a data frame called df.
2. Generate a balanced data frame called df_balanced from df by restricting the number of
entries in the P3 category to 5000.
3. Lemmatize words and use only ‘nav’ lemmas for classification, where ‘nav’ stands for
nouns, pronouns, adjectives, adverbs, and verbs as in FeatureExtraction.
4. Use grid search with 5-fold cross-validation on LinearSVC() to determine the best
hyperparameters on the following grid, using a pipeline that combines TF-IDF
vectorization with LinearSVC():
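
As a rough illustration of tasks 2 and 4, the sketch below caps the P3 category at 5000 rows and wraps a TF-IDF plus LinearSVC() pipeline in a 5-fold grid search. It is only one possible arrangement, not the required solution. The helper names balance_p3 and build_search, the step names "tfidf" and "model", the column name "text", and the random seed are assumptions made for this example; only the ‘Priority’ label, the P3 cap, and the 5-fold setting come from the task list. The hyperparameter grid is deliberately left as an argument, because the grid for this task is not reproduced in the sketch.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def balance_p3(df, label_col="Priority", majority="P3", cap=5000, seed=42):
    """Downsample the majority category to at most `cap` rows.

    All other categories are kept unchanged. `seed` is an arbitrary
    choice that makes the sampling reproducible.
    """
    is_major = df[label_col] == majority
    major = df[is_major]
    if len(major) > cap:
        major = major.sample(n=cap, random_state=seed)
    return pd.concat([major, df[~is_major]]).reset_index(drop=True)


def build_search(param_grid, cv=5):
    """Build a grid search over a TF-IDF + LinearSVC pipeline.

    Keys in `param_grid` must be prefixed with the step names used
    here, e.g. "tfidf__<parameter>" or "model__<parameter>".
    """
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("model", LinearSVC()),
    ])
    return GridSearchCV(pipeline, param_grid=param_grid, cv=cv)
```

A possible use, assuming the lemmatized ‘nav’ text is stored in a column named "text": call df_balanced = balance_p3(df), then search = build_search(param_grid) with the grid given for this task, then search.fit(df_balanced["text"], df_balanced["Priority"]). After fitting, search.best_params_ holds the selected hyperparameters and search.best_score_ the mean cross-validated accuracy.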



