This project focuses on detecting spoilers in user-generated movie reviews by applying machine learning models. Using a dataset from IMDB, the project involves preprocessing, feature engineering, model training, evaluation, and hyperparameter tuning.
- Total Records: 573,913
- Users: 263,407
- Movies: 1,572
- Spoiler Reviews: 150,924
- Users with at Least One Spoiler Review: 79,039
- Items with at Least One Spoiler Review: 1,570
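Counts like those above can be derived from the raw review records with pandas. The column names (`user_id`, `movie_id`, `is_spoiler`) are assumptions about the dataset's schema, shown here on a toy frame rather than the real 573,913-row dataset:

```python
import pandas as pd

# Toy stand-in for the IMDB reviews dataset; column names are assumed.
reviews = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "movie_id": ["m1", "m2", "m1", "m1"],
    "is_spoiler": [True, False, False, True],
})

total_records = len(reviews)
n_users = reviews["user_id"].nunique()
n_movies = reviews["movie_id"].nunique()
n_spoilers = int(reviews["is_spoiler"].sum())
# Restrict to spoiler rows, then count distinct users / movies.
users_with_spoiler = reviews.loc[reviews["is_spoiler"], "user_id"].nunique()
movies_with_spoiler = reviews.loc[reviews["is_spoiler"], "movie_id"].nunique()
```

On the real data these six values correspond to the six statistics listed above.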
## Choices Faced When Selecting the Best Model
- Balancing Precision and Recall:
  - Precision measures how many selected items are relevant, while recall measures how many relevant items are selected.
  - A model with high precision but low recall (e.g., `base_pytorch_lstm_best`) might be good for applications where false positives are costly.
  - Conversely, a model with high recall but lower precision (e.g., `smote_k-nearest_neighbors`) might be suitable where missing a spoiler (a false negative) is more detrimental.
- Considering Model Complexity and Training Time:
  - Neural networks like `base_pytorch_lstm_best` and `base_pytorch_ff_best` show good performance, but at the cost of higher training and evaluation times.
  - Models like `base_lightgbm_tuned` and `base_xgboost` offer a balance between performance and efficiency, making them attractive choices for practical deployment.
- Weighted Metrics:
  - Weighted metrics account for the support of each class, providing a more balanced view of the model's performance across all data points.
  - For instance, `base_lightgbm_tuned` and `base_pytorch_lstm_best` score high on weighted-average metrics, indicating robust overall performance.
- Hyperparameter Tuning:
  - Tuned models (`base_lightgbm_tuned`, `base_xgboost_tuned`) often perform better than their untuned counterparts.
  - Investing time in hyperparameter optimization can yield significant performance gains.
- Evaluation Time:
  - For real-time or near-real-time applications, models with lower evaluation times (e.g., `base_linear_svc`, `base_logistic_regression`) might be preferred despite slightly lower accuracy or F1-scores.
  - `smote_logistic_regression` has the best evaluation time, which can be crucial for high-throughput systems.
- Use-Case-Specific Metrics:
  - Depending on the end use case, certain metrics might be prioritized over others. For instance, if avoiding spoilers at any cost is critical, models with higher `1_recall` or `1_f1-score` would be prioritized, despite the increased probability of false positives.
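The per-class and weighted metrics discussed above can be made concrete with a small pure-Python sketch. The labels here are illustrative, not taken from the project's data:

```python
# Illustrative labels for a binary spoiler classifier (1 = spoiler).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 0, 0]

def prf(positive):
    """Precision, recall, and F1 for a single class."""
    tp = sum(t == p == positive for t, p in zip(y_true, y_pred))
    fp = sum(p == positive and t != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)

p1, r1, f1_spoiler = prf(1)  # the "1 Precision" / "1_recall" / "1_f1-score" columns
p0, r0, f1_clean = prf(0)    # the "0 *" columns

# Weighted averages weight each class's score by its support (class frequency),
# which is what the "Weighted Avg" columns in the tables below report.
support = {c: y_true.count(c) for c in (0, 1)}
n = len(y_true)
weighted_precision = (support[0] * p0 + support[1] * p1) / n
```

With these toy labels, class 1 gets precision 2/3 while class 0 gets 0.8; the weighted precision of 0.75 leans toward class 0 because it has the larger support.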
## Conclusion
Selecting the best model for the Movie Spoiler Detector project involves balancing multiple performance metrics, considering model complexity, and evaluating practical constraints like training and evaluation times. The choice ultimately depends on the specific requirements of the deployment environment and the relative importance of precision, recall, and overall efficiency.
## Models
### Machine Learning Models
| Model | Description |
|---|---|
| `adaboost_model` | Combines multiple weak classifiers to form a strong classifier by adjusting weights on errors. |
| `decision_tree_model` | Splits data into subsets based on feature values, forming an intuitive, interpretable tree. |
| `gradient_boosting_model` | Sequentially builds models to correct errors, optimizing for the loss function using gradient descent. |
| `k-nearest_neighbors_model` | Classifies instances based on the majority class among nearest neighbors in feature space. |
| `lightgbm_model` | Efficient gradient boosting framework for large datasets and high-dimensional data. |
| `linear_svc_model` | Linear Support Vector Classifier effective in high-dimensional spaces, suitable for text classification. |
| `logistic_regression_model` | Models binary outcomes using the logistic function, simple and interpretable. |
| `random_forest_model` | Ensemble method building multiple decision trees, reducing overfitting and handling large datasets. |
| `sgd_classifier_model` | Uses stochastic gradient descent to minimize loss, suitable for large-scale learning problems. |
| `xgboost_model` | Optimized gradient boosting library, efficient and effective for structured/tabular data. |
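A minimal text-classification setup in the spirit of `logistic_regression_model` can be sketched as below. This is a plausible TF-IDF-plus-linear-model pipeline, not the project's exact preprocessing, and the mini-corpus is hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus; the real project trains on the full IMDB reviews.
texts = [
    "the hero dies at the end",
    "great pacing and lovely visuals",
    "the twist is that she was a ghost",
    "fun popcorn movie for the family",
]
labels = [1, 0, 1, 0]  # 1 = spoiler

# TF-IDF features feeding a linear classifier; swap the final step for any
# of the models in the table above to reproduce the comparison.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
preds = model.predict(["the villain turns out to be her father"])
```

The pipeline keeps vectorization and classification together, so the same object can be pickled and served as a single artifact.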
### Top 10 Models by Accuracy
| Model | Accuracy | 0 Precision | 0 F1-score | 1 Precision | 1 F1-score | Weighted Avg Precision | Weighted Avg F1-score | Eval Time |
|---|---|---|---|---|---|---|---|---|
| `base_pytorch_lstm_best` | 0.7934 | 0.8170 | 0.8683 | 0.6746 | 0.5205 | 0.7793 | 0.7762 | 25.18 |
| `base_lightgbm_tuned` | 0.7928 | 0.8082 | 0.8699 | 0.7008 | 0.4922 | 0.7798 | 0.7699 | 0.44 |
| `base_pytorch_ff_best` | 0.7906 | 0.8190 | 0.8657 | 0.6573 | 0.5245 | 0.7762 | 0.7754 | 19.90 |
| `base_lightgbm` | 0.7884 | 0.8036 | 0.8675 | 0.6928 | 0.4740 | 0.7743 | 0.7634 | 0.11 |
| `base_xgboost` | 0.7882 | 0.8060 | 0.8669 | 0.6830 | 0.4824 | 0.7734 | 0.7651 | 0.12 |
| `smote_xgboost` | 0.7820 | 0.8050 | 0.8623 | 0.6537 | 0.4767 | 0.7649 | 0.7602 | 0.14 |
| `base_linear_svc` | 0.7817 | 0.7988 | 0.8636 | 0.6719 | 0.4535 | 0.7652 | 0.7550 | 0.06 |
| `base_logistic_regression` | 0.7815 | 0.8030 | 0.8625 | 0.6572 | 0.4696 | 0.7644 | 0.7585 | 0.06 |
| `smote_lightgbm` | 0.7802 | 0.8072 | 0.8604 | 0.6395 | 0.4836 | 0.7628 | 0.7606 | 22.37 |
| `base_gradient_boosting` | 0.7794 | 0.7894 | 0.8642 | 0.6986 | 0.4127 | 0.7654 | 0.7447 | 0.46 |
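The `*_tuned` entries above come from hyperparameter search. The sketch below shows the general pattern with scikit-learn's `GridSearchCV`; it substitutes logistic regression and synthetic data for brevity (the project's actual grids for `base_lightgbm_tuned` / `base_xgboost_tuned` are not shown here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the engineered review features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    scoring="f1_weighted",  # matches the weighted metrics reported above
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Scoring on `f1_weighted` aligns the search objective with the "Weighted Avg F1-score" column used to compare models in the table.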
### Best Models by Metric
| Metric | Best Performing Model |
|---|---|
| 0_precision | `smote_k-nearest_neighbors` |
| 0_recall | `base_random_forest` |
| 0_f1-score | `base_lightgbm_tuned` |
| 0_support | base_adaboost... |