Data Sprint #19: Classification of Microorganisms of Sukhna and Dhanas Lakes
Microscopic organisms, commonly known as microorganisms or microbes, are found all around us and even inside our bodies.
Sukhna and Dhanas Lakes are located in Chandigarh, India, and a variety of microorganisms are found in them. We have provided you with morphological features that map the body structure for four different classes of microorganisms.
You are required to build a machine learning model that predicts the class of a given microorganism on the basis of its morphological features.
Submissions are evaluated using Weighted F1 Score.
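The page does not spell out how the Weighted F1 Score is computed. A common definition, assumed here, is to compute the F1 score for each class separately and then average those scores, weighting each class by its number of true samples. The sketch below implements that definition in plain Python; the labels are hypothetical and only serve as an illustration.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (assumed definition)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)   # number of true samples per class
    total = len(y_true)
    score = 0.0
    for cls, n_true in support.items():
        # True positives: samples of this class that were predicted as this class
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1   # weight by class support
    return score

# Hypothetical labels, for illustration only
y_true = ["a", "a", "b", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c", "c"]
print(round(weighted_f1(y_true, y_pred), 4))  # prints 0.6667
```

scikit-learn's f1_score with average="weighted" follows the same definition, so it can be used instead when that library is available.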
How do we do it?
Once we release the data, anyone can download it, build a model, and make a submission. We give competitors a set of data (training data) with both the independent and dependent variables.
We also release another set of data (test dataset) with just the independent variables, and we hide the dependent variable that corresponds with this set. You submit the predicted values of the dependent variable for this set and we compare them against the actual values.
The predictions are evaluated based on the evaluation metric defined in the datathon.
About the Data
The dataset consists of morphological features that map the body structure for four different classes of microorganisms. These microorganisms are found in the lakes of Sukhna and Dhanas, Chandigarh, India. The images of microorganisms were captured by taking microscopic images of whole mounted glass slides. Following are some of the features with their descriptions.
- Solidity: It is the ratio of area of an object to the area of a convex hull of the object. Computed as Area/ConvexArea.
- Eccentricity: The eccentricity is the ratio of length of major to minor axis of an object.
- EquivDiameter: Diameter of a circle with the same area as the region.
- Extrema: Extrema points in the region. The format of the vector is [top-left top-right right-top right-bottom bottom-right bottom-left left-bottom left-top].
- Filled Area: Number of on pixels in FilledImage, returned as a scalar.
- Extent: Ratio of the pixel area of a region with respect to the bounding box area of an object.
- Orientation: The overall direction of the shape. The value ranges from -90 degrees to 90 degrees.
- Euler number: Number of objects in the region minus the number of holes in those objects.
- Bounding box: Position and size of the smallest box (rectangle) which bounds the object.
- Convex hull: Smallest convex shape/polygon that contains the object.
- Major axis: The major axis is the longest line that can be drawn through the object. Its length (in pixels) is the largest dimension of the object.
- Minor axis: The axis perpendicular to the major axis is called the minor axis. Its length (in pixels) is the length of the shortest line connecting a pair of points on the contour.
- Perimeter: Number of pixels around the border of the region.
- Centroid: Centre of mass of the region. It is a measure of the object's location in the image.
- Area: Total number of pixels in a region/shape.
- microorganism: The class of the microorganism. This is the target variable.
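Several of the features above are simple ratios or formulas, so a small worked example may help. The sketch below derives Solidity, Extent, and EquivDiameter from raw pixel measurements, following the definitions in the list. The function name and the input values are hypothetical, and the columns in the actual dataset may be scaled differently.

```python
import math

def shape_features(area, convex_area, bbox_width, bbox_height):
    """Derive a few of the listed features from raw pixel measurements."""
    return {
        # Solidity: Area / ConvexArea
        "Solidity": area / convex_area,
        # Extent: region area divided by the bounding box area
        "Extent": area / (bbox_width * bbox_height),
        # EquivDiameter: diameter of a circle with the same area as the region
        "EquivDiameter": math.sqrt(4 * area / math.pi),
    }

# Hypothetical measurements (in pixels), for illustration only
features = shape_features(area=400, convex_area=500, bbox_width=25, bbox_height=20)
for name, value in features.items():
    print(name, round(value, 4))
```

A region that fills its convex hull completely has a Solidity of 1, and one that fills its bounding box completely has an Extent of 1, so both values lie between 0 and 1 when measured this way.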
To load the training data in your Jupyter notebook, use the command below:
import pandas as pd
mo_data = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/sukhna_dhanas/train_set_label.csv")
Saving Prediction File & Sample Submission
You can find more details on how to save a prediction file here: https://discuss.dphi.tech/t/how-to-submit-predictions/548
Sample submission: You should submit a CSV file with a header row.
Note that the header name must be prediction; otherwise, the submission will throw an evaluation error. A sample submission file can be found here
Load the test data and name it test_data. You can load it using the command below:
test_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/sukhna_dhanas/test_set_label.csv')
The target column is deliberately absent from this file, since it is what you need to predict.
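Once you have one prediction per row of test_data, the file to submit needs a single column with the header prediction. The following is a minimal sketch using only the standard library. The function name, the file name submission.csv, and the label values are assumptions for illustration, so check the linked guide for the exact requirements.

```python
import csv

def write_submission(predictions, path="submission.csv"):
    """Write one predicted class per row under a single 'prediction' header."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["prediction"])   # header name required by the evaluator
        for label in predictions:
            writer.writerow([label])      # one prediction per test row, in order

# Hypothetical predictions, for illustration only
write_submission([1, 3, 2, 2], "submission.csv")
```

The predictions must stay in the same order as the rows of test_data, because the file carries no identifier column to match them up. With pandas, pd.DataFrame({"prediction": predictions}).to_csv("submission.csv", index=False) produces an equivalent file.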
The dataset is sourced from Mendeley Data.
Dhindsa, Anaahat; Bhatia, Sanjay; Agrawal, Sunil; Sohi, BS (2020), “Classification of Microorganisms of Sukhna and Dhanas Lakes”, Mendeley Data, V2, doi: 10.17632/bcnv3n43wg.2
To participate in this challenge, you have to either create a team or join an existing team.