A hospital in the USA has committed to using data science to reduce the burden on its doctors. It organized a data science competition on DPhi to hire some of the best data scientists. As a data science enthusiast, you have decided to participate in the competition.
The dataset provided in the competition is a real breast cancer dataset that was collected at the hospital.
About Breast Cancer:
Breast cancer is a type of cancer that starts in the breast. Cancer starts when cells begin to grow out of control. Breast cancer cells usually form a tumor that can often be seen on an x-ray or felt as a lump. Breast cancer occurs almost entirely in women, but men can get breast cancer, too.
A benign tumor is a tumor that does not invade its surrounding tissue or spread around the body. A malignant tumor is a tumor that may invade its surrounding tissue or spread around the body.
You are required to determine if the cancer is Malignant or Benign.
Submissions are evaluated on the Accuracy Score, calculated from your model's predicted values and the true values of 'diagnosis' on the evaluation dataset mentioned under the submission guidelines.
How do we do it?
Once you generate and submit the target variable predictions on the evaluation dataset, your submission will be compared with the true values of the target variable.
The True or Actual values of the target variable are hidden on the DPhi Practice platform so that we can evaluate your model's performance on unseen data. Finally, an Accuracy score for your model will be generated and displayed.
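To make the metric concrete, here is a minimal sketch of how an accuracy score can be computed: the fraction of predictions that exactly match the true labels. The function name `accuracy` and the label lists are made up for illustration; the platform's own scoring code is not shown in this description.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration only (M = malignant, B = benign).
true_labels = ["M", "B", "B", "M"]
predicted_labels = ["M", "B", "M", "M"]

score = accuracy(true_labels, predicted_labels)
print(score)  # 3 of 4 predictions match, so this prints 0.75
```

If you use scikit-learn, `sklearn.metrics.accuracy_score` computes the same quantity.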
About the dataset
Different features related to the breast are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe the characteristics of the cell nuclei present in the image.
To load the dataset in your Jupyter notebook, use the command below:
import pandas as pd
breast_cancer_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/breast_cancer/Training_set_breastcancer.csv')
- id: Id number
- diagnosis: Cancer is Malignant or Benign (M = malignant, B = benign) - target variable
- The other 20 features contain information about the following 10 real-valued features:
  a) radius (mean of distances from center to points on the perimeter)
  b) texture (standard deviation of gray-scale values)
  c) perimeter
  d) area
  e) smoothness (local variation in radius lengths)
  f) compactness (perimeter^2 / area - 1.0)
  g) concavity (severity of concave portions of the contour)
  h) concave points (number of concave portions of the contour)
  i) symmetry
  j) fractal dimension ("coastline approximation" - 1)
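Given the columns described above, one possible first step is to separate the target from the features. The sketch below assumes the `id` and `diagnosis` columns are named exactly as listed; the feature column names (`radius_mean`, `texture_mean`), the toy values, and the helper `split_features_target` are hypothetical and only stand in for the real data.

```python
import pandas as pd


def split_features_target(df):
    """Return (X, y): the feature columns and a 0/1 target (1 = malignant)."""
    # Encode the target: M (malignant) -> 1, B (benign) -> 0.
    y = df["diagnosis"].map({"M": 1, "B": 0})
    # The id is only an identifier, so it is dropped along with the target.
    X = df.drop(columns=["id", "diagnosis"])
    return X, y


# Toy frame with made-up values and assumed feature names.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.9, 11.4, 12.5],
    "texture_mean": [10.4, 20.1, 15.7],
})

X, y = split_features_target(toy)
print(list(X.columns))
print(y.tolist())
```

The same function could be applied to `breast_cancer_data` once it is loaded, provided its column names match these assumptions.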
Feel free to google anything you don't understand.
Load the evaluation dataset (name it 'breast_cancer_eval'). You can load the data using the command below.
breast_cancer_eval = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/breast_cancer/Testing_set_breastcancer.csv')
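Once both datasets are loaded, a baseline could look like the sketch below: fit a classifier on the training data and predict 'diagnosis' for the evaluation data. This is one possible approach, not a prescribed solution. Logistic regression with feature scaling is an arbitrary choice here, and the helper `fit_and_predict`, the output name `prediction`, and the toy data are assumptions. The required submission format is defined in the submission guidelines, not here.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_and_predict(train_df, eval_df, target="diagnosis"):
    """Fit a scaled logistic regression on train_df and predict for eval_df."""
    # Use only columns present in both frames, excluding the target and the id.
    feature_cols = [
        c for c in eval_df.columns
        if c in train_df.columns and c not in (target, "id")
    ]
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(train_df[feature_cols], train_df[target])
    predicted = model.predict(eval_df[feature_cols])
    # "prediction" is a hypothetical name; check the submission guidelines.
    return pd.Series(predicted, index=eval_df.index, name="prediction")


# Toy data with made-up values; replace with breast_cancer_data and
# breast_cancer_eval when working on the actual challenge.
train = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "diagnosis": ["M", "M", "B", "B"],
    "radius_mean": [20.0, 19.0, 10.0, 11.0],
})
evaluation = pd.DataFrame({"radius_mean": [21.0, 9.0]})

predictions = fit_and_predict(train, evaluation)
print(predictions.tolist())
```

Because the true labels of the evaluation set are hidden, it is worth holding out part of the training data to estimate accuracy before submitting.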
The data is sourced from the UCI Machine Learning Repository.
To participate in this challenge, you must either create a team of at least members or join an existing team.