About the Data

The dataset covers firms in the following sectors, with the number of firms in each given in parentheses: Irrigation (114), Public Health (77), Buildings and Roads (82), Forest (70), Corporate (47), Animal Husbandry (95), Communication (1), Electrical (4), Land (5), Science and Technology (3), Tourism (1), Fisheries (41), Industries (37), Agriculture (200).

To load the training data in your Jupyter notebook, use the command below:

import pandas as pd

audit_data = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/audit_data/training_set_label.csv")

Data Description

Many risk factors are examined from various sources, such as past records of the audit office, audit paras, environmental conditions reports, firm reputation summaries, ongoing issues reports, profit-value records, loss-value records, and follow-up reports. After in-depth interviews with the auditors, the important risk factors are evaluated and their probability of existence is calculated from present and past records.

Some of the columns/features are:

Sector_score: sector score of the firm
LOCATION_ID: location id of the firm
score_X: different types of score values
risk_X: different types of risk levels
Inherent_Risk: the risk posed by an error or omission in a financial statement due to a factor other than a failure of internal control.
CONTROL_RISK: the risk of a material misstatement in the financial statements arising from the absence or failure of the entity's relevant controls.
Detection_Risk: the risk that the auditors fail to detect a material misstatement in the financial statements.
Audit_Risk: the risk that an auditor issues an unqualified report because of a failure to detect a material misstatement caused by either error or fraud. Audit risk (AR) is composed of inherent risk (IR), control risk (CR), and detection risk (DR).

Audit risk can be calculated as:

AR = IR × CR × DR
Money_Value: Value for money of an audit
Risk: whether the firm is fraudulent or not. This is the target variable.
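
As a quick illustration of the formula AR = IR × CR × DR given above, the sketch below multiplies the three risk columns row by row. The numbers are made up for illustration and are not rows from the dataset, and the column name computed_audit_risk is hypothetical. Whether the dataset's own Audit_Risk column equals this product exactly is an assumption worth checking on the real data.

```python
import pandas as pd

# Illustrative values only; these are not rows from the actual dataset.
example = pd.DataFrame(
    {
        "Inherent_Risk": [2.0, 8.5],
        "CONTROL_RISK": [0.5, 1.2],
        "Detection_Risk": [0.5, 0.5],
    }
)

# AR = IR x CR x DR, computed row by row.
# "computed_audit_risk" is a hypothetical column name used for this sketch.
example["computed_audit_risk"] = (
    example["Inherent_Risk"]
    * example["CONTROL_RISK"]
    * example["Detection_Risk"]
)

print(example)
```

The same expression could be applied to audit_data and compared against its Audit_Risk column to see how closely the two agree.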

Feel free to use Google for some of the terms that you don't understand.

Test Dataset

Load the test data and name it test_data. You can load it using the command below.

test_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/audit_data/testing_set_label.csv')

The target column is deliberately left out of the test data because you need to predict it.
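
Because the test data has no target column, it can help to separate features from the target in one place and confirm that both sets end up with the same feature columns. The sketch below is one possible way to do that. The helper name split_features_target and the tiny stand-in frames (with made-up values) are hypothetical and only there so the example runs without downloading anything.

```python
import pandas as pd


def split_features_target(df, target="Risk"):
    """Return (features, target); target is None when the column is absent."""
    if target in df.columns:
        return df.drop(columns=[target]), df[target]
    return df, None


# Tiny stand-in frames with made-up values, so the sketch runs offline.
train_like = pd.DataFrame(
    {"Sector_score": [3.89, 2.72], "Money_Value": [0.0, 1.5], "Risk": [1, 0]}
)
test_like = train_like.drop(columns=["Risk"])

X_train, y_train = split_features_target(train_like)
X_test, y_test = split_features_target(test_like)

# The test features should line up with the training features.
print(list(X_train.columns) == list(X_test.columns))
print(y_test is None)
```

With the real data, the same helper would be called on audit_data and test_data instead of the stand-in frames.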

Saving Prediction File & Sample Submission

You can find more details on how to save a prediction file here: https://discuss.dphi.tech/t/how-to-submit-predictions/548

Sample submission: You should submit a CSV file with a header row. A sample submission is shown below.

prediction
0
1
1
0
Etc.

Note that the header name must be `prediction`; otherwise, the submission will throw an evaluation error.
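
A minimal sketch of writing predictions in this format with pandas is shown below. The helper name save_predictions and the example values are hypothetical; the key points are the single column named prediction and index=False, which keeps pandas from writing an extra index column.

```python
import io

import pandas as pd


def save_predictions(predictions, path_or_buffer):
    """Write predictions as a one-column CSV with the header 'prediction'."""
    submission = pd.DataFrame({"prediction": list(predictions)})
    # index=False avoids an unnamed index column in the output file.
    submission.to_csv(path_or_buffer, index=False)


# Written to an in-memory buffer here so the sketch leaves no file behind.
buffer = io.StringIO()
save_predictions([0, 1, 1, 0], buffer)
print(buffer.getvalue())
```

To produce a file instead, pass a path such as "submission.csv" (a hypothetical file name) in place of the buffer.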

Acknowledgement

This data has been sourced from the UCI Machine Learning Repository.

Data Sprint #8: Audit Data
