Data Sprint #8: Audit Data
What Is an Audit?
The term audit usually refers to a financial statement audit. A financial audit is an objective examination and evaluation of the financial statements of an organization to make sure that the financial records are a fair and accurate representation of the transactions they claim to represent.
The dataset contains exhaustive one-year, non-confidential data on firms for 2015 to 2016, collected from the Auditor Office of India.
The goal here is to help the auditors by building a classification model that predicts whether a firm is fraudulent on the basis of present and historical risk factors.
Submissions are evaluated using F1 Score.
How do we do it?
Once you generate and submit the target variable predictions on the testing dataset, your submissions will be compared with the true values of the target variable.
The true (actual) values of the target variable are hidden on the DPhi platform so that your model's performance can be evaluated on unseen data. Finally, an F1 score for your model is generated and displayed.
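To make the metric concrete, here is a minimal sketch of how an F1 score can be computed from true and predicted labels. The averaging mode used by the platform is not stated here, so the sketch assumes the binary F1 for the positive class, and it assumes that class is labeled 1. The function name and the sample labels are made up for illustration.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for one positive class.

    Assumption: labels are binary and `positive` marks the fraudulent class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only (not taken from the audit dataset).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(f1_score_binary(y_true, y_pred))
```

In this toy example there are 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3 and the F1 score is also 2/3.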
Start Date: 2nd October 2020, 21:00 hours IST / 17:30 hours CET
End Date: 5th October 2020, 21:00 hours IST / 17:30 hours CET
About the Data
The sectors and their firm counts are: Irrigation (114), Public Health (77), Buildings and Roads (82), Forest (70), Corporate (47), Animal Husbandry (95), Communication (1), Electrical (4), Land (5), Science and Technology (3), Tourism (1), Fisheries (41), Industries (37), Agriculture (200).
To load the training data in your Jupyter notebook, use the command below:
import pandas as pd
audit_data = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/audit_data/training_set_label.csv")
Many risk factors are examined from various sources, such as past records of the audit office, audit paras, environmental condition reports, firm reputation summaries, ongoing issue reports, profit-value records, loss-value records, and follow-up reports. After in-depth interviews with the auditors, the important risk factors are evaluated and their probability of existence is calculated from present and past records.
Some of the columns/features are:
Sector_score: sector score of the firm
LOCATION_ID: location id of the firm
score_X: different types of score values
risk_X: different types of risk levels
Inherent_Risk: the risk posed by an error or omission in a financial statement due to a factor other than a failure of internal control.
CONTROL_RISK: Control Risk is the risk of a material misstatement in the financial statements arising due to absence or failure in the operation of relevant controls of the entity.
Detection_Risk: Detection Risk is the risk that the auditors fail to detect a material misstatement in the financial statements.
Audit_Risk: Audit risk (AR) is the risk that an auditor issues an unqualified report because of a failure to detect a material misstatement caused by either error or fraud. This risk is composed of inherent risk (IR), control risk (CR), and detection risk (DR).
Audit risk can be calculated as:
AR = IR × CR × DR
Money_Value: Value for money of an audit
Risk: whether the firm is fraudulent or not. This is the target variable.
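The audit risk formula AR = IR × CR × DR from the column list above can be sketched as a small helper. The function name and the input values are hypothetical and chosen only to show the arithmetic; they are not taken from the dataset, and the sketch does not assume any particular value range for the three components.

```python
def audit_risk(inherent_risk, control_risk, detection_risk):
    """Return AR = IR * CR * DR, the product of the three risk components."""
    return inherent_risk * control_risk * detection_risk


# Made-up component values, for illustration only.
ar = audit_risk(inherent_risk=0.5, control_risk=0.4, detection_risk=0.5)
print(ar)
```

Because the components are multiplied, lowering any one of them (for example, by strengthening detection) lowers the overall audit risk proportionally.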
Feel free to use Google for some of the terms that you don't understand.
Load the test data and name it test_data. You can load it using the command below:
test_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/audit_data/testing_set_label.csv')
The target column is deliberately absent from this file because you need to predict it.
Saving Prediction File & Sample Submission
You can find more details on how to save a prediction file here: https://discuss.dphi.tech/t/how-to-submit-predictions/548
Sample submission: You should submit a CSV file with a header row. Note that the header name should be prediction; otherwise, the submission will throw an evaluation error.
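As a rough illustration of the file layout described above, the following sketch writes a single-column CSV whose header is prediction. It uses only Python's standard csv module. The function name, the output file name submission.csv, and the sample predictions are assumptions made for the example; the linked guide remains the reference for the exact submission steps.

```python
import csv


def write_submission(predictions, path="submission.csv"):
    """Write one prediction per row under a single 'prediction' header.

    Assumption: the file name is arbitrary; only the header name matters.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["prediction"])  # required header name
        for value in predictions:
            writer.writerow([value])


if __name__ == "__main__":
    # Placeholder predictions; in practice these come from your model.
    write_submission([1, 0, 0, 1])
```

If your predictions are already in pandas, pd.DataFrame({"prediction": predictions}).to_csv("submission.csv", index=False) should produce the same layout; index=False keeps the row index out of the file.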
This data has been sourced from the UCI Machine Learning Repository.
To participate in this challenge, you must either create a team or join an existing one.