Context and objective
We will be working on a dataset from the real estate industry in Boston (US). Your task as a data scientist is to use machine learning techniques to explore the given data and predict the median value of owner-occupied homes, in thousands of USD. The target variable in this dataset is MEDV, and you are given a new, unseen test dataset on which you will have to predict the median value.
Submissions are evaluated using Root-Mean-Squared-Error (RMSE). How do we do it?
Once you generate and submit the target variable predictions for the evaluation dataset, your submission will be compared with the true values of the target variable.
The true (actual) values of the target variable are hidden on the DPhi Practice platform so that your model's performance can be evaluated on the evaluation data. Finally, the Root-Mean-Squared-Error (RMSE) for your model is generated and displayed.
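As a rough illustration of the metric, RMSE is the square root of the mean of the squared differences between the actual and predicted values: RMSE = sqrt((1/n) * sum((y_i - yhat_i)^2)). The sketch below computes it in plain Python. The function name `rmse` and the sample numbers are made up for illustration; the platform's own scoring code is not shown in this description and may differ in detail.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one value is required")
    # Square each prediction error, average the squares, then take the root.
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up MEDV values (in thousands of USD), for illustration only.
actual = [24.0, 21.6, 34.7, 33.4]
predicted = [25.0, 20.6, 32.7, 35.4]
print(rmse(actual, predicted))  # roughly 1.58
```

A lower RMSE means the predictions are closer to the true values. Because the errors are squared before averaging, a few large misses raise the score more than many small ones.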
About the dataset
This dataset contains 14 attributes. The target variable is the median value of owner-occupied homes, in thousands of USD.
To load the training data in your Jupyter notebook, use the command below:
import pandas as pd
boston_data = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/Boston_Housing/Training_set_boston.csv")
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per 10,000 USD
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
- LSTAT: lower status of the population (%)
- MEDV: median value of owner-occupied homes in thousands of USD
Load the evaluation data (name it 'eval_data'). You can load the data using the command below.
eval_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/Boston_Housing/Testing_set_boston.csv')
The target column is deliberately absent from this file, as you need to predict it.
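One possible way to get from the training data to predictions for the evaluation data is sketched below. It fits an ordinary least-squares linear baseline with NumPy; the challenge does not prescribe any particular model, so this is only a starting point. The helper names `fit_linear_baseline` and `predict` are hypothetical, and the sketch assumes that every column is numeric, that there are no missing values, and that the evaluation file has the same feature columns as the training file. The toy frames are synthetic stand-ins so the block runs without downloading anything.

```python
import numpy as np
import pandas as pd

TARGET = "MEDV"  # target column name, as given in the attribute list

def fit_linear_baseline(train_df, target=TARGET):
    """Fit least squares with an intercept; return (feature names, coefficients)."""
    features = [c for c in train_df.columns if c != target]
    X = train_df[features].to_numpy(dtype=float)
    y = train_df[target].to_numpy(dtype=float)
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return features, coef

def predict(eval_df, features, coef):
    """Apply the fitted coefficients to a frame that has the same feature columns."""
    X = eval_df[features].to_numpy(dtype=float)
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ coef

# Synthetic stand-in data (NOT the real dataset): MEDV = 2 * RM + 10 exactly.
toy_train = pd.DataFrame({
    "RM": [5.0, 6.0, 7.0, 8.0],
    "LSTAT": [12.0, 9.0, 4.0, 7.0],
    "MEDV": [20.0, 22.0, 24.0, 26.0],
})
toy_eval = pd.DataFrame({"RM": [6.5], "LSTAT": [10.0]})

features, coef = fit_linear_baseline(toy_train)
toy_predictions = predict(toy_eval, features, coef)
print(toy_predictions)
```

With the real data loaded as described above, the same two calls would be `features, coef = fit_linear_baseline(boston_data)` followed by `predictions = predict(eval_data, features, coef)`. Holding out part of the training data and computing RMSE on it gives an estimate of the score before submitting.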
This dataset is adapted from:
Harrison, David; Rubinfeld, Daniel. Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management. Volume 5, Issue 1, March 1978, Pages 81-102. Available at Carnegie Mellon University, Statistics and Data Science: http://lib.stat.cmu.edu/datasets/boston.
To participate in this challenge, you must either create a team of at least members or join an existing team.