Data Sprint #21: Classification of Malware with PE headers

Classify the malware

Malware, or a malicious program, is software written intentionally to carry out malicious activities, ranging from stealing users' information to cyber-espionage. The behavior malware exhibits depends on factors such as the nature of the attack, the sophistication of the technology involved, and the rapid increase in exploitable vulnerabilities. Malware attacks have also grown along with the use of digital devices and the internet. The rapid increase in new malware over the last five years has made malware detection a challenging research problem.

Malware detection is the task of identifying malware on end devices or in networks.


Problem Statement

Malware is one of the biggest obstacles to the growth of digital adoption among users. Enterprises and individual users alike struggle to protect themselves from malware in cyberspace, which underlines the importance of developing efficient malware detection methods.


Objective

You are required to build a machine learning model that classifies a sample as benign (0) or malware (1).


Evaluation Criteria

Submissions are evaluated using F1 Score.
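To get a feel for the metric before submitting, here is a minimal sketch of how an F1 score can be computed by hand. It assumes the binary F1 with malware (1) as the positive class; the page does not state which averaging the platform uses, so treat that as an assumption. The function name `f1_score_binary` and the toy labels are made up for illustration.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall.

    Assumption: malware (1) is the positive class.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, not taken from the dataset
y_true = [1, 0, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0]
score = f1_score_binary(y_true, y_pred)
print(round(score, 3))  # 0.667 (2 true positives, 1 false positive, 1 false negative)
```

If you use scikit-learn, `sklearn.metrics.f1_score(y_true, y_pred)` computes the same quantity for binary labels with its default settings.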

How do we do it? 

Once we release the data, anyone can download it, build a model, and make a submission. We give competitors a set of data (training data) with both the independent and dependent variables. 

We also release another set of data (test dataset) with just the independent variables; the dependent variable for this set is hidden. You submit the predicted values of the dependent variable for this set, and we compare them against the actual values. 

The predictions are evaluated based on the evaluation metric defined in the datathon.


 

The baseline notebook is available here.

About the Data

The dataset contains features extracted from Portable Executable (PE) file headers for malware detection. There are 55 features (excluding the target variable): 19 from IMAGE_DOS_HEADER, 7 from FILE_HEADER, and 29 from OPTIONAL_HEADER.

IMAGE_DOS_HEADER (19)

  • "e_magic", "e_cblp", "e_cp", "e_crlc", "e_cparhdr",
  • "e_minalloc", "e_maxalloc", "e_ss", "e_sp",
  • "e_csum", "e_ip", "e_cs", "e_lfarlc", "e_ovno", "e_res",
  • "e_oemid", "e_oeminfo", "e_res2", "e_lfanew"

FILE_HEADER (7)

  • "Machine", "NumberOfSections", "CreationYear", "PointerToSymbolTable",
  • "NumberOfSymbols", "SizeOfOptionalHeader", "Characteristics"

OPTIONAL_HEADER (29)

  • "Magic", "MajorLinkerVersion", "MinorLinkerVersion", "SizeOfCode", "SizeOfInitializedData", 
  • "SizeOfUninitializedData", "AddressOfEntryPoint",
  • "BaseOfCode", "BaseOfData", "ImageBase", "SectionAlignment", "FileAlignment",
  • "MajorOperatingSystemVersion", "MinorOperatingSystemVersion",
  • "MajorImageVersion", "MinorImageVersion", "MajorSubsystemVersion",
  • "MinorSubsystemVersion", "SizeOfImage", "SizeOfHeaders", "CheckSum",
  • "Subsystem", "DllCharacteristics", "SizeOfStackReserve", "SizeOfStackCommit",
  • "SizeOfHeapReserve", "SizeOfHeapCommit", "LoaderFlags", "NumberOfRvaAndSizes"

TARGET_VARIABLE: class - 0 (benign), 1 (malware)

The first field, e_magic, is the so-called magic number. This field is used to identify an MS-DOS-compatible file type. 
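To make the header fields a little more concrete, here is a small sketch that reads two of the IMAGE_DOS_HEADER fields listed above from raw bytes. In the PE format, e_magic is a 16-bit value at offset 0 and e_lfanew is a 32-bit value at offset 0x3C; for a valid file e_magic is 0x5A4D (the ASCII bytes "MZ", or 23117 in decimal). The helper name `read_dos_header_fields` and the synthetic 64-byte header are made up for illustration; this is not how the dataset itself was extracted.

```python
import struct

DOS_MAGIC = 0x5A4D  # the bytes "MZ" read as a little-endian 16-bit integer


def read_dos_header_fields(data):
    """Return (e_magic, e_lfanew) from the first 64 bytes of a file image."""
    if len(data) < 64:
        raise ValueError("IMAGE_DOS_HEADER is 64 bytes long")
    (e_magic,) = struct.unpack_from("<H", data, 0)      # offset 0x00, 2 bytes
    (e_lfanew,) = struct.unpack_from("<i", data, 0x3C)  # offset 0x3C, 4 bytes
    return e_magic, e_lfanew


# Synthetic header built for illustration only (not a real executable)
header = bytearray(64)
header[0:2] = b"MZ"
struct.pack_into("<i", header, 0x3C, 0x80)  # pretend the PE header starts at 0x80

e_magic, e_lfanew = read_dos_header_fields(bytes(header))
print(hex(e_magic), e_magic == DOS_MAGIC, hex(e_lfanew))  # 0x5a4d True 0x80
```

To try it on a real file, read the first 64 bytes in binary mode and pass them to the function.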

Feel free to Google the terms you don’t know. 


To load the training data in your Jupyter notebook, use the following command:

import pandas as pd

train_data = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/clamp/train_set_label.csv")


Saving Prediction File & Sample Submission

You can find more details on how to save a prediction file here: https://discuss.dphi.tech/t/how-to-submit-predictions/548

Sample submission: submit a CSV file with a header row. A sample is shown below.

prediction
1
0
0
1
1
0
...

Note that the header must be named ‘prediction’; otherwise, the submission will fail with an evaluation error. A sample submission file can be found here.
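As a rough illustration of the format described above, the sketch below writes a list of 0/1 predictions under a single `prediction` header using only Python's standard `csv` module. The helper name `write_submission` and the file name in the comment are placeholders, not something the platform prescribes.

```python
import csv
import io


def write_submission(predictions, handle):
    """Write one prediction per row under the required 'prediction' header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["prediction"])
    for value in predictions:
        if value not in (0, 1):
            raise ValueError(f"expected 0 or 1, got {value!r}")
        writer.writerow([int(value)])


# Write to an in-memory buffer so the result can be inspected
buffer = io.StringIO()
write_submission([1, 0, 0, 1, 1, 0], buffer)
submission_text = buffer.getvalue()
print(submission_text)

# To write a real file instead (the file name is a placeholder):
# with open("submission.csv", "w", newline="") as f:
#     write_submission(predictions, f)
```

If your predictions are already in pandas, `pd.DataFrame({"prediction": predictions}).to_csv("submission.csv", index=False)` should produce an equivalent file; `index=False` keeps the row index out of the CSV.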


Test Dataset

Load the test data and name it test_data. You can load it using the following command:

test_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/clamp/test_set_label.csv')

The target column is deliberately absent here, since you need to predict it.
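Before predicting, it can help to confirm that the test columns are the training columns minus the target. The sketch below is one way to do that, assuming the target column is named `class` as listed under TARGET_VARIABLE. The helper name `check_test_columns` and the shortened column lists are illustrative only.

```python
def check_test_columns(train_columns, test_columns, target="class"):
    """Compare test columns with the training columns minus the target.

    Returns (missing, unexpected): features absent from the test set, and
    test columns that are not training features.
    """
    expected = [c for c in train_columns if c != target]
    missing = [c for c in expected if c not in test_columns]
    unexpected = [c for c in test_columns if c not in expected]
    return missing, unexpected


# Shortened column lists for illustration; the real data has 55 features
train_cols = ["e_magic", "e_cblp", "Machine", "class"]
test_cols = ["e_magic", "e_cblp", "Machine"]
missing, unexpected = check_test_columns(train_cols, test_cols)
print(missing, unexpected)  # both lists are empty when the columns line up
```

With the data frames loaded as shown on this page, the call would be `check_test_columns(list(train_data.columns), list(test_data.columns))`.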


Acknowledgement

The data is sourced from Mendeley data.

Kumar, Ajit (2020), “ClaMP (Classification of Malware with PE headers)”, Mendeley Data, V1, doi: 10.17632/xvyv59vwvz.1

Read the paper: "A learning model to detect maliciousness of portable executable using integrated feature set", by Ajit Kumar, K. S. Kuppusamy, and G. Aghila.


File Format

Your submission should be in CSV format.

Predictions

This file should have a header row called 'prediction'.
Please see the instructions to save a prediction file under the “Data” tab.
