The data contains industry trends, investment insights and individual company information. There are 48 columns/features. Some of the features are:

age_first_funding_year – quantitative
age_last_funding_year – quantitative
relationships – quantitative
funding_rounds – quantitative
funding_total_usd – quantitative
milestones – quantitative
age_first_milestone_year – quantitative
age_last_milestone_year – quantitative
state – categorical
industry_type – categorical
has_VC – categorical
has_angel – categorical
has_roundA – categorical
has_roundB – categorical
has_roundC – categorical
has_roundD – categorical
avg_participants – quantitative
is_top500 – categorical
status (acquired/closed) – categorical (the target variable; a startup that is ‘acquired’ by another organization is considered to have succeeded)
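
The feature list above can be written down as column groups, which helps when quantitative and categorical features need different preprocessing. The following is a minimal sketch, not the organizers' code. It assumes the target column is literally named status and that encoding ‘acquired’ as 1 and ‘closed’ as 0 is acceptable. The helper encode_target and the toy rows are hypothetical and exist only for illustration.

```python
import pandas as pd

# Column groups copied from the feature list above (a subset of the 48 columns).
QUANTITATIVE = [
    "age_first_funding_year", "age_last_funding_year", "relationships",
    "funding_rounds", "funding_total_usd", "milestones",
    "age_first_milestone_year", "age_last_milestone_year", "avg_participants",
]
CATEGORICAL = [
    "state", "industry_type", "has_VC", "has_angel", "has_roundA",
    "has_roundB", "has_roundC", "has_roundD", "is_top500",
]
TARGET = "status"  # assumption: the target column is named exactly "status"


def encode_target(df: pd.DataFrame) -> pd.Series:
    """Map 'acquired' to 1 (success) and 'closed' to 0 (assumed encoding)."""
    return df[TARGET].map({"acquired": 1, "closed": 0})


# Made-up rows, only to show the call; they are not taken from the dataset.
toy = pd.DataFrame(
    {"funding_rounds": [3, 1], "has_VC": [1, 0], "status": ["acquired", "closed"]}
)
y = encode_target(toy)
```

Any status value other than the two listed would map to a missing value, which makes unexpected labels easy to spot.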

To load the training dataset, use the command below:

import pandas as pd

To load the testing dataset (name it test_data), use the command below:

The target column is deliberately omitted from the test data because you need to predict it.
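
As a rough sketch of the loading step, the example below reads a CSV with pandas and checks whether the target column is present, as expected for training data, or absent, as expected for test data. The original load commands and file locations are not reproduced here, so the example uses small in-memory CSV strings as stand-ins. The function load_split, the column name status, and the sample rows are all assumptions made for illustration.

```python
import io

import pandas as pd


def load_split(source, target="status", expect_target=True):
    """Read a CSV and verify whether the target column is present as expected.

    `source` can be a file path, a URL, or a file-like object,
    which are all accepted by pandas.read_csv.
    """
    df = pd.read_csv(source)
    has_target = target in df.columns
    if has_target != expect_target:
        state = "present" if has_target else "missing"
        raise ValueError(f"target column {target!r} is unexpectedly {state}")
    return df


# In-memory stand-ins for the real files (made-up rows, not real data).
train_csv = io.StringIO("funding_rounds,has_VC,status\n3,1,acquired\n1,0,closed\n")
test_csv = io.StringIO("funding_rounds,has_VC\n2,1\n")

train_data = load_split(train_csv, expect_target=True)
test_data = load_split(test_csv, expect_target=False)
```

With the real files, the in-memory objects would be replaced by the paths or URLs given by the organizers.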

We would like to thank Ramkishan Panthena, a Machine Learning Engineer at GMO, for providing this dataset.

Data Sprint #5: Startup Success Prediction