Custom Ensemble Learning Project

Goal

This project is primarily educational, aiming to understand various ensemble learning algorithms by implementing them from scratch using numpy for array processing and custom datasets.

Overview

This project includes implementations of custom random forest classifiers and regressors, as well as a gradient boosted decision tree regressor.

Main Python Files

dataPrep.py: Contains functions for data preparation and one-hot encoding
decisionTreeClassifier.py: Implements a decision tree classifier
decisionTreeRegressor.py: Implements a decision tree regressor
randomForestClassifier.py: Implements a custom random forest classifier
randomForestClassifierPar.py: Updated randomForestClassifier.py to implemented with multiprocessing for performance gain.
randomForestRegressor.py: Implements a custom random forest regressor
gradientBoostedRegressor.py: Implements a gradient boosted decision tree regressor
runRandomForest.py: Functions to run random forest on various datasets

Included CSV Files

Wisconsin_breast_prognostic.csv: Dataset for Wisconsin Breast Prognostic analysis
pima-indians-diabetes.csv: Dataset for Pima Indians Diabetes analysis
output_May-06-2024_cleaned.csv: Car listing data from "used_car_price_visualization" repository

Example Results

Multiprocessing Performance Gain

randomForestClassifierPar vs randomForestClassifier

Usage Example

This run function in gradientBoostedRegressor.py demonstrates how to prepare data, train the model, and evaluate its performance:


def run():
	"""
	Runs Gradient Boosted Decision Trees on the given dataset.
	"""
	# Source file location
	file_orig = "data/carsDotCom.csv"

	# Prepare and format data
	df, file_loc = dp.DataPrep.prepare_data(file_orig, label_col_index=4, cols_to_encode=[1,2,3])

	# Initialize GBDT object
	gbdtDiab = gradientBoostedRegressor(file_loc, num_trees=10, random_seed=0, max_depth=3)

	# Train GBDT model
	gbdtDiab.fit(stats=True)

	# Predict target values
	predictions = gbdtDiab.predict()

	# Get stats
	stats = gbdtDiab.get_stats(predictions)
	print(stats)
if __name__ == "__main__":
	run()

dataPrep.py

Class: DataPrep

one_hot_encode(df, cols): One-hot encodes non-numerical columns in a DataFrame
write_data(df, csv_file): Writes the DataFrame to a CSV file
prepare_data(csv_file, label_col_index, cols_to_encode=[], write_to_csv=True): Prepares data by loading a CSV file, one-hot encoding non-numerical columns, and optionally writing prepared data to a new CSV file

decisionTreeClassifier.py

Classes: Utility, DecisionTree, DecisionTreeWithInfoGain

Utility: Utility class for computing entropy, partitioning classes, and calculating information gain
DecisionTree: Decision tree for classification tasks
DecisionTreeWithInfoGain: Extends DecisionTree to use information gain for splitting

decisionTreeRegressor.py

Classes: Utility, DecisionTreeRegressor

Utility: Utility class for computing variance, partitioning classes, and calculating information gain
DecisionTreeRegressor: Decision tree for regression tasks

randomForestClassifier.py | randomForestClassifierPar.py

Classes: RandomForest, RandomForestWithInfoGain, runRandomForest

RandomForest: Custom random forest classifier with bootstrapping and voting mechanisms
RandomForestWithInfoGain: Extends RandomForest to use information gain for splitting
runRandomForest: Functions to run the random forest classifier

randomForestRegressor.py

Classes: RandomForest, runRandomForest

RandomForest: Custom random forest regressor with bootstrapping and aggregation mechanisms
runRandomForest: Functions to run the random forest regressor

gradientBoostedRegressor.py

Class: gradientBoostedRegressor

gradientBoostedRegressor: Gradient boosted decision tree regressor for regression tasks
Attributes: random_seed, num_trees, max_depth, X, y, XX, numerical_cols
Methods: __init__, reset, fit, predict, get_stats
Example Usage: run function to prepare data, train the model, and evaluate performance

runRandomForest.py

Functions to run custom random forest classifier and regressor on various datasets.