Goal

This project is primarily educational: it aims to build an understanding of ensemble learning algorithms by implementing them from scratch, using numpy for array processing and custom datasets.

Overview

This project includes implementations of custom random forest classifiers and regressors, as well as a gradient boosted decision tree regressor.

Main Python Files

Included CSV Files

  • Wisconsin_breast_prognostic.csv: Dataset for Wisconsin Breast Prognostic analysis
  • pima-indians-diabetes.csv: Dataset for Pima Indians Diabetes analysis
  • output_May-06-2024_cleaned.csv: Car listing data from "used_car_price_visualization" repository

Example Results

[Figures: randomForest Cars Regression R2; randomForest Cars Regression MSE]
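For readers unfamiliar with the two metrics named in the results above, here is a minimal numpy sketch of how R2 and MSE are commonly computed. The function names mse and r2 and the sample values are illustrative assumptions; this is not necessarily how the project's own statistics are calculated.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return float(1.0 - ss_res / ss_tot)


# Made-up values, for illustration only.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 8.0]
print(mse(y_true, y_pred), r2(y_true, y_pred))
```

An R2 of 1 means the predictions match the targets exactly, while 0 means the model does no better than always predicting the mean.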

Multiprocessing Performance Gain

[Figures: randomForestClassifierPar vs randomForestClassifier]

Usage Example

The run function in gradientBoostedRegressor.py demonstrates how to prepare data, train the model, and evaluate its performance:


import dataPrep as dp  # assumed import: dataPrep.py provides the DataPrep class


def run():
	"""
	Runs Gradient Boosted Decision Trees on the given dataset.
	"""
	# Source file location
	file_orig = "data/carsDotCom.csv"

	# Prepare and format data: column 4 is the label, columns 1-3 are one-hot encoded
	df, file_loc = dp.DataPrep.prepare_data(file_orig, label_col_index=4, cols_to_encode=[1,2,3])

	# Initialize GBDT object
	gbdt = gradientBoostedRegressor(file_loc, num_trees=10, random_seed=0, max_depth=3)

	# Train GBDT model
	gbdt.fit(stats=True)

	# Predict target values
	predictions = gbdt.predict()

	# Get stats
	stats = gbdt.get_stats(predictions)
	print(stats)


if __name__ == "__main__":
	run()

dataPrep.py

Class: DataPrep

  • one_hot_encode(df, cols): One-hot encodes non-numerical columns in a DataFrame
  • write_data(df, csv_file): Writes the DataFrame to a CSV file
  • prepare_data(csv_file, label_col_index, cols_to_encode=[], write_to_csv=True): Loads a CSV file, one-hot encodes the columns listed in cols_to_encode, and optionally writes the prepared data to a new CSV file; in the usage example above it returns the DataFrame together with the prepared file's location
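
As a rough illustration of the one-hot encoding step, the sketch below uses pandas.get_dummies on a small made-up DataFrame. It assumes columns are given as integer positions, as in the cols_to_encode=[1,2,3] argument of the usage example. The helper name one_hot_encode_sketch and the sample data are hypothetical, and the project's own one_hot_encode may work differently.

```python
import pandas as pd


def one_hot_encode_sketch(df, cols):
    """One-hot encode the columns at the given integer positions (hypothetical helper)."""
    # Translate positional indices into column names.
    col_names = [df.columns[i] for i in cols]
    # get_dummies replaces each listed column with one indicator column per category.
    return pd.get_dummies(df, columns=col_names, dtype=int)


# Made-up sample data, for illustration only.
cars = pd.DataFrame({
    "year": [2015, 2018, 2020],
    "make": ["Ford", "Toyota", "Ford"],
    "price": [9000, 15000, 21000],
})
encoded = one_hot_encode_sketch(cars, cols=[1])
print(encoded.columns.tolist())
```

Each category of the encoded column becomes its own 0/1 column, so tree splits can treat the categories as numeric features.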

decisionTreeClassifier.py

Classes: Utility, DecisionTree, DecisionTreeWithInfoGain

  • Utility: Utility class for computing entropy, partitioning classes, and calculating information gain
  • DecisionTree: Decision tree for classification tasks
  • DecisionTreeWithInfoGain: Extends DecisionTree to use information gain for splitting
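
The entropy and information gain computations mentioned above can be sketched in a few lines of numpy. This is a minimal illustration under the standard definitions (Shannon entropy in bits, gain as parent entropy minus size-weighted child entropy). The function names are assumptions and need not match the methods of the Utility class.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


labels = np.array([0, 0, 1, 1])
# A split that separates the classes perfectly recovers all of the parent's entropy.
perfect = information_gain(labels, labels[:2], labels[2:])
# A split that leaves both children as mixed as the parent gains nothing.
useless = information_gain(labels, labels[[0, 2]], labels[[1, 3]])
print(perfect, useless)
```

A tree builder would evaluate this gain for each candidate split and keep the split with the largest value.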

decisionTreeRegressor.py

Classes: Utility, DecisionTreeRegressor

  • Utility: Utility class for computing variance, partitioning classes, and calculating information gain
  • DecisionTreeRegressor: Decision tree for regression tasks
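
For regression, the usual analogue of information gain is variance reduction: a split is scored by how much it lowers the size-weighted variance of the target. The sketch below shows that idea on a single feature. It assumes this standard criterion, and the function names variance_reduction and best_threshold are hypothetical rather than taken from decisionTreeRegressor.py.

```python
import numpy as np


def variance_reduction(y, left_mask):
    """Drop in variance achieved by splitting y into y[left_mask] and y[~left_mask]."""
    left, right = y[left_mask], y[~left_mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # a split that leaves one side empty gains nothing
    n = len(y)
    weighted = (len(left) / n) * np.var(left) + (len(right) / n) * np.var(right)
    return float(np.var(y) - weighted)


def best_threshold(x, y):
    """Scan candidate thresholds on one feature and return (threshold, gain)."""
    best = (None, 0.0)
    for t in np.unique(x)[:-1]:  # the largest value would leave the right side empty
        gain = variance_reduction(y, x <= t)
        if gain > best[1]:
            best = (t, gain)
    return best


# Made-up data: the target jumps between x = 2 and x = 3.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([10.0, 10.0, 20.0, 20.0])
threshold, gain = best_threshold(x, y)
print(threshold, gain)
```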

randomForestClassifier.py | randomForestClassifierPar.py

Classes: RandomForest, RandomForestWithInfoGain, runRandomForest

  • RandomForest: Custom random forest classifier with bootstrapping and voting mechanisms
  • RandomForestWithInfoGain: Extends RandomForest to use information gain for splitting
  • runRandomForest: Functions to run the random forest classifier
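
The bootstrapping and voting mechanisms can be illustrated independently of the tree code. In this sketch, bootstrap_sample draws a training set with replacement for one tree, and majority_vote combines the label predictions of several trees. Both names, the seed, and the sample arrays are assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def bootstrap_sample(X, y, rng):
    """Draw len(X) rows with replacement, keeping features and labels aligned."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]


def majority_vote(tree_predictions):
    """Combine per-tree label predictions (shape: n_trees x n_samples) by voting."""
    votes = np.asarray(tree_predictions)
    # For each sample (column), keep the label predicted by the most trees.
    return np.array([Counter(col.tolist()).most_common(1)[0][0] for col in votes.T])


# Made-up data, for illustration only.
rng = np.random.default_rng(0)
X = np.arange(10).reshape(5, 2)
y = np.array([0, 1, 0, 1, 1])
X_boot, y_boot = bootstrap_sample(X, y, rng)

# Predictions from three hypothetical trees for three samples.
per_tree = [[0, 1, 1],
            [0, 1, 0],
            [1, 1, 0]]
voted = majority_vote(per_tree)
print(voted)
```

Because each tree sees a different bootstrap sample, the trees make partly independent errors, which the vote tends to average out.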

randomForestRegressor.py

Classes: RandomForest, runRandomForest

  • RandomForest: Custom random forest regressor with bootstrapping and aggregation mechanisms
  • runRandomForest: Functions to run the random forest regressor
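
For regression, the aggregation step is typically an average over the trees rather than a vote. A minimal sketch, assuming mean aggregation (the name aggregate_mean and the numbers are hypothetical):

```python
import numpy as np


def aggregate_mean(tree_predictions):
    """Average per-tree predictions (shape: n_trees x n_samples) per sample."""
    return np.mean(np.asarray(tree_predictions, dtype=float), axis=0)


# Predictions from three hypothetical trees for two samples.
per_tree_values = [[10.0, 20.0],
                   [12.0, 18.0],
                   [14.0, 22.0]]
averaged = aggregate_mean(per_tree_values)
print(averaged)
```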

gradientBoostedRegressor.py

Class: gradientBoostedRegressor

  • gradientBoostedRegressor: Gradient boosted decision tree regressor for regression tasks
  • Attributes: random_seed, num_trees, max_depth, X, y, XX, numerical_cols
  • Methods: __init__, reset, fit, predict, get_stats
  • Example Usage: run function to prepare data, train the model, and evaluate performance
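
To show the idea behind gradient boosting for squared error, the sketch below fits a sequence of one-split stumps, each trained on the residuals of the current prediction. It is a simplified stand-in, not the project's implementation: it uses a single feature and depth-1 stumps instead of trees with max_depth, and learning_rate is an assumed parameter that does not appear in the attribute list above. All function names are hypothetical.

```python
import numpy as np


def fit_stump(x, residuals):
    """Fit a one-split regression stump on a single feature by least squares."""
    best = None
    for t in np.unique(x)[:-1]:  # skip the largest value: the right side would be empty
        left = x <= t
        left_val, right_val = residuals[left].mean(), residuals[~left].mean()
        sse = (np.sum((residuals[left] - left_val) ** 2)
               + np.sum((residuals[~left] - right_val) ** 2))
        if best is None or sse < best[0]:
            best = (sse, t, left_val, right_val)
    if best is None:  # constant feature: fall back to predicting the mean residual
        m = residuals.mean()
        return (np.inf, m, m)
    return best[1:]  # (threshold, left value, right value)


def predict_stump(stump, x):
    t, left_val, right_val = stump
    return np.where(x <= t, left_val, right_val)


def fit_gbdt(x, y, num_trees=10, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = float(y.mean())
    pred = np.full(len(y), base, dtype=float)
    stumps = []
    for _ in range(num_trees):
        residuals = y - pred  # negative gradient of squared error
        stump = fit_stump(x, residuals)
        pred += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, learning_rate, stumps


def predict_gbdt(model, x):
    base, learning_rate, stumps = model
    pred = np.full(len(x), base, dtype=float)
    for stump in stumps:
        pred += learning_rate * predict_stump(stump, x)
    return pred


# Made-up data, for illustration only.
x = np.arange(8, dtype=float)
y = x ** 2
model = fit_gbdt(x, y, num_trees=20)
baseline_mse = float(np.mean((y - y.mean()) ** 2))
boosted_mse = float(np.mean((y - predict_gbdt(model, x)) ** 2))
print(baseline_mse, boosted_mse)
```

Each added stump corrects part of what the previous ones got wrong, so the training error falls below that of the constant mean prediction.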

runRandomForest.py

Functions to run the custom random forest classifier and regressor on various datasets.