Boston Housing

Predicting the median value of owner-occupied homes




Context and objective

We will be working with a data set from the real estate industry in Boston (US). Your task as a data scientist is to use machine learning techniques to explore the given data and predict the median value of owner-occupied homes, in thousands of USD. The target variable in this dataset is MEDV. You are also given a new, unseen test dataset for which you will have to predict the median value.

Evaluation Criteria

Submissions are evaluated using Root-Mean-Squared-Error (RMSE). How do we do it?

Once you generate and submit the target variable predictions for the evaluation dataset, your submission is compared with the true values of the target variable.

The true (actual) values of the target variable are hidden on the DPhi Practice platform so that we can evaluate your model's performance on the evaluation data. Finally, a Root-Mean-Squared-Error (RMSE) for your model is generated and displayed.
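To make the metric concrete, here is a minimal sketch of how RMSE can be computed locally, for example on a validation split you hold out yourself. The function name `rmse` and the sample values are illustrative assumptions; the platform's own scoring code is not shown in this description.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    # Square the errors, average them, then take the square root.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values in thousands of USD, for illustration only.
actual = [24.0, 21.6, 34.7]
predicted = [25.0, 20.6, 32.7]
score = rmse(actual, predicted)
```

Lower is better: because the errors are squared before averaging, a few large misses raise the score more than many small ones.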

About the dataset

This dataset contains 14 attributes. The target variable is the median value of owner-occupied homes, in thousands of USD.

To load the training data in your Jupyter notebook, use the command below:

import pandas as pd
# The path or URL of the training CSV goes inside the quotes.
boston_data = pd.read_csv("")

Data Description
  • CRIM: per capita crime rate by town
  • ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS: proportion of non-retail business acres per town
  • CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • NOX: nitric oxides concentration (parts per 10 million)
  • RM: average number of rooms per dwelling
  • AGE: proportion of owner-occupied units built prior to 1940
  • DIS: weighted distances to five Boston employment centres
  • RAD: index of accessibility to radial highways
  • TAX: full-value property-tax rate per 10,000 USD
  • PTRATIO: pupil-teacher ratio by town
  • B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  • LSTAT: lower status of the population (%)
  • MEDV: median value of owner-occupied homes, in thousands of USD
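Since MEDV is the target and the other 13 attributes are inputs, a common first step is to separate the two. The sketch below assumes the loaded frame uses the column names listed above; the helper `split_features_target` and the tiny demo frame (three columns, made-up values) are hypothetical and only stand in for the real training data.

```python
import pandas as pd

TARGET = "MEDV"

def split_features_target(df, target=TARGET):
    """Return (X, y): every non-target column, and the target column."""
    X = df.drop(columns=[target])
    y = df[target]
    return X, y

# Tiny synthetic frame with three of the 14 columns; values are made up.
demo = pd.DataFrame({
    "RM": [6.5, 6.4, 7.1, 6.9],
    "LSTAT": [4.9, 9.1, 4.0, 2.9],
    "MEDV": [24.0, 21.6, 34.7, 33.4],
})
X, y = split_features_target(demo)
```

With the real data, the same call would be `split_features_target(boston_data)`.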

Evaluation Dataset

Load the evaluation data (name it 'eval_data'). You can load the data using the command below.

eval_data = pd.read_csv('')

The target column is deliberately omitted here, since you need to predict it.
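As a rough illustration of the overall flow (fit on the training data, then predict for the evaluation rows), here is a minimal baseline sketch. It uses ordinary least squares via NumPy purely as a simple stand-in, not as a recommended model; the helpers `fit_linear` and `predict_linear` and both demo frames are hypothetical, with made-up values.

```python
import numpy as np
import pandas as pd

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficients."""
    A = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])
    coef, *_ = np.linalg.lstsq(A, y.to_numpy(dtype=float), rcond=None)
    return coef

def predict_linear(coef, X):
    """Apply fitted coefficients to a feature frame."""
    A = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])
    return A @ coef

# Synthetic stand-ins: here MEDV = 10 + 3 * RM exactly.
train_demo = pd.DataFrame({"RM": [5.0, 6.0, 7.0, 8.0],
                           "MEDV": [25.0, 28.0, 31.0, 34.0]})
eval_demo = pd.DataFrame({"RM": [6.5, 7.5]})  # no MEDV column, as in eval_data

# The evaluation frame should carry the training features, minus the target.
feature_cols = [c for c in train_demo.columns if c != "MEDV"]
missing = [c for c in feature_cols if c not in eval_demo.columns]
if missing:
    raise ValueError(f"evaluation data lacks columns: {missing}")

coef = fit_linear(train_demo[feature_cols], train_demo["MEDV"])
predictions = predict_linear(coef, eval_demo[feature_cols])
```

Selecting `feature_cols` explicitly on both frames keeps the column order identical between fitting and prediction.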


This dataset is adapted from:

Harrison, David; Rubinfeld, Daniel. Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management. Volume 5, Issue 1, March 1978, Pages 81-102. Available at Carnegie Mellon University, Statistics and Data Science:



File Format

Your submission should be in CSV format.


The file's header row should contain the column name 'prediction'.
Please see the instructions to save a prediction file under the “Data” tab.
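One way to produce a file in this shape with pandas is sketched below. The prediction values are placeholders and the file name `submission.csv` is an assumption; the instructions under the “Data” tab take precedence if they differ.

```python
import io
import pandas as pd

# Placeholder predictions; in practice these come from your model.
predictions = [29.5, 32.5, 21.0]

# One column named 'prediction', one row per evaluation record.
submission = pd.DataFrame({"prediction": predictions})

# Write to an in-memory buffer here so the result can be inspected.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# To write an actual file instead (file name is an assumption):
# submission.to_csv("submission.csv", index=False)
```

Passing `index=False` keeps the DataFrame's row index out of the file, so 'prediction' is the only column.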

To participate in this challenge, you must either create a team that meets the minimum size or join an existing team.