This project applies a Support Vector Machine (SVM) to classify Amazon product reviews as positive or negative. The feature extraction pipeline uses TF-IDF vectorization, and PCA (implemented from scratch) is used for dimensionality reduction. The regularization parameter C is tuned with 5-fold cross-validation.
The SVM training objective is

$$\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^2 \;+\; C\sum_{i=1}^{n}\max\bigl(0,\ 1 - y_i(w^\top x_i + b)\bigr)$$

where the first term is L2 regularization and the second term is the hinge loss; C controls the trade-off between the two.
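This regularized hinge-loss objective can be evaluated directly in NumPy. A minimal sketch (`svm_objective` is a hypothetical helper for illustration, not a function from this repo):

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """Primal SVM objective: L2 regularization plus C-weighted hinge loss.

    X: (n, d) feature matrix, y: labels in {-1, +1}, C: trade-off parameter.
    """
    margins = y * (X @ w + b)                  # y_i (w^T x_i + b)
    hinge = np.maximum(0.0, 1.0 - margins)     # per-sample hinge loss
    return 0.5 * np.dot(w, w) + C * hinge.sum()

# A point classified correctly with margin >= 1 contributes zero hinge loss,
# so only the regularization term remains:
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, 0.0])
print(svm_objective(w, 0.0, X, y, C=1.0))      # 0.5 * ||w||^2 = 0.5
```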
project/
│
├── data/
│ └── raw/
│ └── amazon_cells_labelled.txt # raw dataset (1000 reviews)
│
├── data_cleaner.py # Stage 1-2: text cleaning + train/test split
├── feature_engineering.py # Stage 3a: TF-IDF feature extraction
├── pca.py # Stage 3b: PCA from scratch (no sklearn.decomposition)
├── svm_classifier.py # Stage 4: SVM + Cross Validation
├── visualization.py # Stage 5: report figures
├── predictor.py # Stage 6: save model + predict new reviews
│
├── main.py # main entry — runs Stage 1 through 6
├── predict_standalone.py # predict without retraining
├── verify_pca.py # verify handcrafted PCA against sklearn
│
├── saved_models/ # (generated) saved model files
│ └── best_model.pkl
│
└── README.md
Stage 1 clean_text() Raw text → lowercase, remove punctuation, truncate
↓
Stage 2 split_data() 80/20 train/test split (stratified)
↓
Stage 3a tfidf_features() Text → TF-IDF sparse matrix (494 dimensions)
↓
Stage 3b apply_pca() TF-IDF → PCA reduced (50D / 200D) — from scratch
↓
Stage 4a tune_C_with_cv() 5-Fold Cross Validation to find best C
↓
Stage 4b run_all_experiments() 4 comparison experiments
↓
Stage 5 plot_*() Generate 5 report figures
↓
Stage 6 save + predict Save best model, demo predictions
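Stage 3a above maps cleaned text to a sparse TF-IDF matrix. A minimal sketch of that step using sklearn's `TfidfVectorizer` with the CONFIG values from this README (the exact settings in `feature_engineering.py` are assumed, not verified):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# max_features caps the vocabulary; ngram_range=(1, 2) adds bigrams
vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))

texts = ["great product, works well", "terrible, broke after a day"]
X = vectorizer.fit_transform(texts)   # sparse matrix: (n_docs, n_features)
print(X.shape)
```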
| # | Method | Features | Purpose |
|---|---|---|---|
| 1 | Linear SVM + TF-IDF | 494D (original) | Baseline |
| 2 | Linear SVM + PCA-50D | 50D (reduced) | Aggressive reduction |
| 3 | Linear SVM + PCA-200D | 200D (reduced) | Moderate reduction |
| 4 | RBF SVM + TF-IDF | 494D (original) | Nonlinear kernel |
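The comparison boils down to fitting different SVM variants on different feature matrices and recording test accuracy. A toy sketch of that loop (synthetic data stands in for the TF-IDF / PCA features; `run_experiments` is illustrative, not the repo's `run_all_experiments`):

```python
import numpy as np
from sklearn.svm import LinearSVC, SVC
from sklearn.metrics import accuracy_score

def run_experiments(X_train, y_train, X_test, y_test):
    """Fit each classifier variant and report its test accuracy."""
    models = {
        "Linear SVM": LinearSVC(C=1.0),
        "RBF SVM": SVC(kernel="rbf", C=1.0),
    }
    return {name: accuracy_score(y_test, m.fit(X_train, y_train).predict(X_test))
            for name, m in models.items()}

# Toy stand-in for the real feature matrices:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (X[:, 0] > 0).astype(int)
print(run_experiments(X[:80], y[:80], X[80:], y[80:]))
```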
Python 3.8+
numpy
pandas
scikit-learn
matplotlib
pip install numpy pandas scikit-learn matplotlib
python main.py
This will run Stages 1 through 6 end to end and save the best model to saved_models/best_model.pkl.

To predict with the saved model without retraining:

python predict_standalone.py
python verify_pca.py
Compares handcrafted PCA (eigendecomposition) against sklearn PCA (SVD) step by step.
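The core of such a check: the eigenvalues of the covariance matrix (handcrafted route) should match sklearn's `explained_variance_` (SVD route). A minimal sketch of the comparison, not the actual contents of verify_pca.py:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))  # correlated features

# Handcrafted route: eigendecomposition of the covariance matrix
Xc = X - X.mean(axis=0)
cov = (Xc.T @ Xc) / (len(X) - 1)
eigvals, _ = np.linalg.eigh(cov)
eigvals = eigvals[::-1]                      # eigh returns ascending order

# sklearn route: SVD; explained variances equal the covariance eigenvalues
skl = PCA(n_components=10).fit(X)
print(np.allclose(eigvals, skl.explained_variance_))
```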
All parameters are in main.py under CONFIG:
```python
CONFIG = {
    'max_chars': 100,           # truncate reviews to N characters
    'test_size': 0.2,           # 0.2 = 80/20 split
    'max_features': 1000,       # TF-IDF vocabulary size cap
    'ngram_range': (1, 2),      # unigrams + bigrams
    'pca_dims': [50, 200],      # PCA dimensions to test
    'cv_folds': 5,              # cross-validation folds
    'interactive_mode': False,  # True = interactive prediction after training
}
```
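The `cv_folds` setting drives the C search in Stage 4a: each candidate C is scored by mean cross-validation accuracy and the best one wins. A sketch of that tuning loop on toy data (the real `tune_C_with_cv` in svm_classifier.py may differ in its grid and scoring):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def tune_C_with_cv(X, y, C_grid=(0.01, 0.1, 1.0, 10.0), cv_folds=5):
    """Score each C by mean k-fold CV accuracy and return the best one."""
    scores = {C: cross_val_score(LinearSVC(C=C), X, y, cv=cv_folds).mean()
              for C in C_grid}
    return max(scores, key=scores.get), scores

# Toy stand-in for the training features and labels:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)
best_C, scores = tune_C_with_cv(X, y)
print(best_C, scores)
```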
| Figure | Filename | Report Section |
|---|---|---|
| PCA cumulative variance curve | fig0_pca_variance.png | Section IV |
| Confusion matrices (all experiments) | fig1_confusion_matrices.png | Section IV |
| Accuracy & F1 comparison | fig2_accuracy.png | Section IV |
| Training/testing time comparison | fig3_time.png | Section IV |
| CV accuracy vs C (with error bars) | fig4_cv_tuning.png | Section IV |
The PCA implementation in pca.py does not use sklearn.decomposition.PCA. It follows the mathematical formulation directly:
The eigendecomposition is computed with numpy.linalg.eigh. MyStandardScaler is also handcrafted (no sklearn.preprocessing.StandardScaler).
Verified against sklearn on real training data — eigenvalues match within 0.5% relative error (due to eigh vs SVD numerical differences, not a bug).
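A condensed sketch of that formulation: center, form the covariance matrix, eigendecompose with `np.linalg.eigh`, and project onto the top components. This mirrors the approach described for pca.py, not its exact code:

```python
import numpy as np

def pca_fit_transform(X, n_components):
    """From-scratch PCA via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)                    # 1. center the data
    cov = (Xc.T @ Xc) / (len(X) - 1)           # 2. sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # 3. eigh: ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          #    sort descending by variance
    components = eigvecs[:, order[:n_components]]
    return Xc @ components                     # 4. project onto top components

X = np.random.default_rng(0).normal(size=(100, 8))
print(pca_fit_transform(X, 3).shape)           # (100, 3)
```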
Sentiment Labelled Sentences Data Set from UCI Machine Learning Repository.
Format: `sentence \t label` (label: 0 = negative, 1 = positive)

Reference:
Kotzias et al., “From Group to Individual Labels using Deep Features,” KDD 2015.
| Report Section | Corresponding Content |
|---|---|
| I. Introduction | Background of sentiment analysis + SVM/PCA theory |
| II. Optimization Problem | SVM hinge loss formula + PCA optimization formula |
| III. Solution Methods | Pipeline description + CV tuning + 4 experiments |
| IV. Simulation Results | Figures generated by visualization.py |
| V. Conclusion | Summary from print_summary() |
| Appendix | pca.py (handcrafted PCA code) |