About the Data

The dataset contains features extracted from the headers of Portable Executable (PE) files, for use in malware detection. There are 55 features in the dataset (excluding the target variable): 19 fields from the IMAGE_DOS_HEADER, 7 from the FILE_HEADER and 29 from the OPTIONAL_HEADER.

IMAGE_DOS_HEADER (19)

"e_magic", "e_cblp", "e_cp", "e_crlc", "e_cparhdr",
"e_minalloc", "e_maxalloc", "e_ss", "e_sp",
"e_csum", "e_ip", "e_cs", "e_lfarlc", "e_ovno", "e_res",
"e_oemid", "e_oeminfo", "e_res2", "e_lfanew"

FILE_HEADER (7)

"Machine", "NumberOfSections", "CreationYear", "PointerToSymbolTable",
"NumberOfSymbols", "SizeOfOptionalHeader", "Characteristics"

OPTIONAL_HEADER (29)

"Magic", "MajorLinkerVersion", "MinorLinkerVersion", "SizeOfCode", "SizeOfInitializedData",
"SizeOfUninitializedData", "AddressOfEntryPoint",
"BaseOfCode", "BaseOfData", "ImageBase", "SectionAlignment", "FileAlignment",
"MajorOperatingSystemVersion", "MinorOperatingSystemVersion",
"MajorImageVersion", "MinorImageVersion", "MajorSubsystemVersion",
"MinorSubsystemVersion", "SizeOfImage", "SizeOfHeaders", "CheckSum",
"Subsystem", "DllCharacteristics", "SizeOfStackReserve", "SizeOfStackCommit",
"SizeOfHeapReserve", "SizeOfHeapCommit", "LoaderFlags", "NumberOfRvaAndSizes"

TARGET_VARIABLE: class - 0 (benign), 1 (malware)

The first field, e_magic, is the so-called magic number. This field is used to identify an MS-DOS-compatible file type.
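As a small illustration of the magic number, the sketch below reads the first two bytes of a file's contents as a little-endian 16-bit integer and compares the result with 0x5A4D (the ASCII characters "MZ", 23117 in decimal), the value that marks an MS-DOS-compatible executable. The helper names read_e_magic and looks_like_mz are hypothetical; they are not part of the dataset or of any library.

```python
import struct

# "MZ" read as a little-endian unsigned 16-bit integer (23117 in decimal).
MZ_MAGIC = 0x5A4D


def read_e_magic(data: bytes) -> int:
    """Return the e_magic field, i.e. the first two bytes as a little-endian integer."""
    if len(data) < 2:
        raise ValueError("need at least 2 bytes to read e_magic")
    (e_magic,) = struct.unpack_from("<H", data, 0)
    return e_magic


def looks_like_mz(data: bytes) -> bool:
    """True if the buffer starts with the MS-DOS 'MZ' signature."""
    return len(data) >= 2 and read_e_magic(data) == MZ_MAGIC


# Hypothetical sample bytes, standing in for the start of a real file.
sample = b"MZ\x90\x00"
print(hex(read_e_magic(sample)))  # 0x5a4d
print(looks_like_mz(sample))      # True
```

To check a real file, the same helpers could be applied to the bytes returned by open(path, "rb").read(2).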

Feel free to Google the terms you don’t know.

To load the training data in your Jupyter notebook, use the following commands:

import pandas as pd

train_data = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/clamp/train_set_label.csv")
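Once the training data is loaded, a common next step is to separate the 55 feature columns from the target. The following is a minimal sketch, assuming the target column is named class as described above. The helper split_features_target is a hypothetical name, and the toy DataFrame stands in for train_data so that the example runs without network access.

```python
import pandas as pd


def split_features_target(df: pd.DataFrame, target: str = "class"):
    """Split a DataFrame into a feature matrix X and a target Series y."""
    if target not in df.columns:
        raise KeyError(f"target column {target!r} not found")
    X = df.drop(columns=[target])
    y = df[target]
    return X, y


# Toy stand-in for train_data, with only two of the feature columns.
toy = pd.DataFrame(
    {
        "e_magic": [23117, 23117],
        "NumberOfSections": [3, 5],
        "class": [0, 1],
    }
)
X, y = split_features_target(toy)
print(X.shape)                 # (2, 2)
print(y.value_counts().to_dict())
```

With the real data, the call would be X, y = split_features_target(train_data).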

Saving Prediction File & Sample Submission

You can find more details on how to save a prediction file here: https://discuss.dphi.tech/t/how-to-submit-predictions/548

Sample submission: You should submit a CSV file with a single header row, prediction, followed by one predicted label (0 or 1) per line. For example:

prediction
1
0
0
1
1
0
.
.
etc.

Note that the header name must be ‘prediction’; otherwise the submission will fail with an evaluation error. A sample submission file can be found here
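One way to produce a file in this format is sketched below, assuming the predictions are available as a sequence of 0/1 values. The helper make_submission and the file name submission.csv are hypothetical. Passing index=False keeps pandas from writing an extra index column next to the prediction column.

```python
import pandas as pd


def make_submission(predictions) -> pd.DataFrame:
    """Wrap predicted labels in a one-column DataFrame named 'prediction'."""
    return pd.DataFrame({"prediction": [int(p) for p in predictions]})


# Hypothetical predictions, matching the sample shown above.
submission = make_submission([1, 0, 0, 1, 1, 0])

# to_csv without a path returns the CSV text, which is handy for a quick check.
csv_text = submission.to_csv(index=False)
print(csv_text)

# To write the file itself (the file name is an assumption):
# submission.to_csv("submission.csv", index=False)
```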

Test Dataset

Load the test data and name it test_data. With pandas already imported as pd, you can load it using the command below.

test_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/clamp/test_set_label.csv')

The target column is deliberately omitted from the test data, since it is what you need to predict.
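Because the test data has no target column, it can be useful to confirm that it still carries every feature column present in the training data before predicting. The check below is a sketch under that assumption. The helper missing_feature_columns is a hypothetical name, and the toy DataFrames stand in for train_data and test_data.

```python
import pandas as pd


def missing_feature_columns(train_df: pd.DataFrame, test_df: pd.DataFrame,
                            target: str = "class") -> list:
    """List training feature columns (all except the target) absent from the test data."""
    expected = [col for col in train_df.columns if col != target]
    return [col for col in expected if col not in test_df.columns]


# Toy stand-ins for train_data and test_data.
toy_train = pd.DataFrame({"e_magic": [23117], "Machine": [332], "class": [1]})
toy_test = pd.DataFrame({"e_magic": [23117], "Machine": [332]})

print(missing_feature_columns(toy_train, toy_test))  # []
print("class" in toy_test.columns)                   # False
```

An empty list means the test data can be passed to a model trained on the training features, in the same column order as test_data[X.columns] would give.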

Acknowledgement

The data is sourced from Mendeley Data.

Kumar, Ajit (2020), “ClaMP (Classification of Malware with PE headers)”, Mendeley Data, V1, doi: 10.17632/xvyv59vwvz.1

Read Paper: "A learning model to detect maliciousness of portable executable using integrated feature set", authored by Ajit Kumar, K. S. Kuppusamy, and G. Aghila.

Data Sprint #21: Classification of Malware with PE headers
