Project Overview

This project detects spoilers in user-generated movie reviews using machine learning models. Working from a dataset of IMDB reviews, it covers preprocessing, feature engineering, model training, evaluation, and hyperparameter tuning.
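The end-to-end flow (text features in, spoiler label out) can be sketched with scikit-learn. This is a minimal illustration, not the project's exact pipeline: the toy reviews and the TF-IDF + logistic regression combination stand in for the real data and models listed later.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the ~574k IMDB reviews; 1 = spoiler, 0 = non-spoiler.
reviews = [
    "A fun ride from start to finish, great performances.",
    "The twist is that the narrator was dead the whole time.",
    "Beautiful cinematography and a moving score.",
    "He kills the villain in the final scene, ruining the surprise.",
]
labels = [0, 1, 0, 1]

# TF-IDF features feeding one of the simpler baselines
# (comparable to base_logistic_regression in the results below).
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(reviews, labels)
preds = pipeline.predict(reviews)
print(preds)
```

The same pipeline shape applies to the other classifiers: swap the final estimator and refit.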

Dataset

  • Total Records: 573,913
  • Users: 263,407
  • Movies: 1,572
  • Spoiler Reviews: 150,924
  • Users with at Least One Spoiler Review: 79,039
  • Items with at Least One Spoiler Review: 1,570
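Counts like those above can be derived directly from the raw records. The sketch below assumes a JSON Lines layout with `user_id`, `movie_id`, `is_spoiler`, and `review_text` fields (as in the public Kaggle IMDB spoiler dataset); the three inline records are hypothetical, and the field names may need adjusting to your copy of the data.

```python
import json

# Hypothetical three-record sample in the assumed JSON Lines layout.
sample = """\
{"user_id": "ur1", "movie_id": "tt0111161", "is_spoiler": true, "review_text": "..."}
{"user_id": "ur2", "movie_id": "tt0111161", "is_spoiler": false, "review_text": "..."}
{"user_id": "ur1", "movie_id": "tt0468569", "is_spoiler": true, "review_text": "..."}
"""

records = [json.loads(line) for line in sample.splitlines()]

# Same aggregates as the dataset summary: spoiler reviews,
# and distinct users with at least one spoiler review.
n_spoilers = sum(r["is_spoiler"] for r in records)
users_with_spoiler = {r["user_id"] for r in records if r["is_spoiler"]}
print(n_spoilers, len(users_with_spoiler))  # 2 1
```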

Choices Faced When Selecting the Best Model

  1. Balancing Precision and Recall:
    • Precision measures how many selected items are relevant, while recall measures how many relevant items are selected.
    • A model with high precision but low recall (e.g., base_pytorch_lstm_best) might be good for applications where false positives are costly.
    • Conversely, a model with high recall but lower precision (e.g., smote_k-nearest_neighbors) might be suitable where missing a spoiler (false negative) is more detrimental.
  2. Considering Model Complexity and Training Time:
    • Neural networks like base_pytorch_lstm_best and base_pytorch_ff_best show good performance but at the cost of higher training and evaluation times.
    • Models like base_lightgbm_tuned and base_xgboost offer a balance between performance and efficiency, making them attractive choices for practical deployment.
  3. Weighted Metrics:
    • Weighted metrics consider the support of each class, providing a more balanced view of the model's performance across all data points.
    • For instance, base_lightgbm_tuned and base_pytorch_lstm_best score high on weighted average metrics, indicating robust overall performance.
  4. Hyperparameter Tuning:
    • Tuned models (base_lightgbm_tuned, base_xgboost_tuned) often perform better than their untuned counterparts.
    • Investing time in hyperparameter optimization can yield significant performance gains.
  5. Evaluation Time:
    • For real-time or near-real-time applications, models with lower evaluation times (e.g., base_linear_svc, base_logistic_regression) might be preferred despite slightly lower accuracy or F1-scores.
    • smote_logistic_regression has the best evaluation time, which might be crucial for high-throughput systems.
  6. Use Case Specific Metrics:
    • Depending on the end-use case, certain metrics might be prioritized over others. For instance, if avoiding spoilers at any cost is critical, models with higher 1_recall or 1_f1-score would be prioritized, despite the increased probability of false positives.
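The precision/recall and weighted-average trade-offs above can be made concrete with scikit-learn's metric functions. The labels here are toy values standing in for real model output; class 1 is the spoiler class, matching the `1_*` columns in the results tables.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy ground truth and predictions (1 = spoiler).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

# Per-class metrics for the spoiler class:
p = precision_score(y_true, y_pred)  # of predicted spoilers, how many are real
r = recall_score(y_true, y_pred)     # of real spoilers, how many were caught
f = f1_score(y_true, y_pred)

# Weighted average accounts for each class's support (point 3 above):
wf = f1_score(y_true, y_pred, average="weighted")
print(round(p, 3), round(r, 3), round(f, 3), round(wf, 3))
```

High `p` with low `r` corresponds to the `base_pytorch_lstm_best`-style trade-off (few false positives, more missed spoilers); the reverse matches the `smote_k-nearest_neighbors`-style profile.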

Conclusion

Selecting the best model for the Movie Spoiler Detector project involves balancing multiple performance metrics, considering model complexity, and evaluating practical constraints like training and evaluation times. The choice ultimately depends on the specific requirements of the deployment environment and the relative importance of precision, recall, and overall efficiency.

Models

Machine Learning Models

| Model | Description |
| --- | --- |
| adaboost_model | Combines multiple weak classifiers to form a strong classifier by adjusting weights on errors. |
| decision_tree_model | Splits data into subsets based on feature values, forming an intuitive, interpretable tree. |
| gradient_boosting_model | Sequentially builds models to correct errors, optimizing the loss function via gradient descent. |
| k-nearest_neighbors_model | Classifies instances based on the majority class among nearest neighbors in feature space. |
| lightgbm_model | Efficient gradient boosting framework for large datasets and high-dimensional data. |
| linear_svc_model | Linear Support Vector Classifier; effective in high-dimensional spaces, suitable for text classification. |
| logistic_regression_model | Models binary outcomes using the logistic function; simple and interpretable. |
| random_forest_model | Ensemble method building multiple decision trees, reducing overfitting and handling large datasets. |
| sgd_classifier_model | Uses stochastic gradient descent to minimize loss, suitable for large-scale learning problems. |
| xgboost_model | Optimized gradient boosting library, efficient and effective for structured/tabular data. |
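Tuned variants such as `base_lightgbm_tuned` and `base_xgboost_tuned` are presumably the product of a hyperparameter search over models like those above. A minimal sketch with scikit-learn's `GridSearchCV`, using a gradient-boosting classifier and synthetic features as stand-ins (the parameter grid here is illustrative, not the project's actual search space):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF feature matrix.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Small illustrative grid; real lightgbm/xgboost runs would sweep
# learning_rate, num_leaves / max_depth, n_estimators, etc.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    cv=3,
    scoring="f1",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Scoring on `f1` rather than accuracy keeps the search aligned with the spoiler-class metrics that the model comparison below emphasizes.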

Results

Top 10 Models by Accuracy

| Model | Accuracy | 0 Precision | 0 F1-score | 1 Precision | 1 F1-score | Weighted Avg Precision | Weighted Avg F1-score | Eval Time (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| base_pytorch_lstm_best | 0.7934 | 0.8170 | 0.8683 | 0.6746 | 0.5205 | 0.7793 | 0.7762 | 25.18 |
| base_lightgbm_tuned | 0.7928 | 0.8082 | 0.8699 | 0.7008 | 0.4922 | 0.7798 | 0.7699 | 0.44 |
| base_pytorch_ff_best | 0.7906 | 0.8190 | 0.8657 | 0.6573 | 0.5245 | 0.7762 | 0.7754 | 19.90 |
| base_lightgbm | 0.7884 | 0.8036 | 0.8675 | 0.6928 | 0.4740 | 0.7743 | 0.7634 | 0.11 |
| base_xgboost | 0.7882 | 0.8060 | 0.8669 | 0.6830 | 0.4824 | 0.7734 | 0.7651 | 0.12 |
| smote_xgboost | 0.7820 | 0.8050 | 0.8623 | 0.6537 | 0.4767 | 0.7649 | 0.7602 | 0.14 |
| base_linear_svc | 0.7817 | 0.7988 | 0.8636 | 0.6719 | 0.4535 | 0.7652 | 0.7550 | 0.06 |
| base_logistic_regression | 0.7815 | 0.8030 | 0.8625 | 0.6572 | 0.4696 | 0.7644 | 0.7585 | 0.06 |
| smote_lightgbm | 0.7802 | 0.8072 | 0.8604 | 0.6395 | 0.4836 | 0.7628 | 0.7606 | 22.37 |
| base_gradient_boosting | 0.7794 | 0.7894 | 0.8642 | 0.6986 | 0.4127 | 0.7654 | 0.7447 | 0.46 |
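Evaluation-time figures like those above can be reproduced by timing batch inference on a held-out set. A minimal sketch with `time.perf_counter` and a logistic-regression stand-in (synthetic data; real numbers depend on the model, feature dimensionality, and hardware):

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic held-out set standing in for the real test split.
X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Time a single batch predict over the whole evaluation set.
start = time.perf_counter()
model.predict(X)
eval_time = time.perf_counter() - start
print(f"eval time: {eval_time:.4f}s")
```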

Best Models by Metric

| Metric | Best Performing Model |
| --- | --- |
| 0_precision | smote_k-nearest_neighbors |
| 0_recall | base_random_forest |
| 0_f1-score | base_lightgbm_tuned |
| 0_support | base_adaboost... |