Datathon

Ended

Data Sprint #8: Audit Data

Predict the fraudulent firm

Medium | 117 Submissions

Context

What Is an Audit?

The term audit usually refers to a financial statement audit. A financial audit is an objective examination and evaluation of the financial statements of an organization to make sure that the financial records are a fair and accurate representation of the transactions they claim to represent.

The dataset contains exhaustive, non-confidential data on firms for one year (2015 to 2016), collected from the Auditor Office of India.


Objective

The goal here is to help the auditors by building a classification model that can predict the fraudulent firm on the basis of present and historical risk factors.


Evaluation Criteria

Submissions are evaluated using F1 Score.

How do we do it? 

Once you generate and submit the target variable predictions on the testing dataset, your submissions will be compared with the true values of the target variable. 

The true (actual) values of the target variable are hidden on the DPhi platform so that your model's performance can be evaluated on unseen data. Once the comparison is done, the F1 score for your model is generated and displayed.
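To make the metric concrete, here is a minimal sketch of how an F1 score can be computed by hand for binary labels. It assumes that 1 marks the positive (fraudulent) class and that the score is computed on that class only; the page does not state which averaging the platform uses, so treat this as an illustration rather than the platform's exact scoring code. The labels below are made up.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Return the F1 score of the positive class for two equal-length label lists."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives

    if tp == 0:
        # No correctly predicted positives: precision or recall is zero (or undefined)
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, for illustration only
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]
score = f1_score_binary(y_true, y_pred)
print(round(score, 4))
```

F1 is the harmonic mean of precision and recall, so a model that labels every firm as non-fraudulent scores 0 even if most firms are in fact non-fraudulent.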


Timeline

Start Date: 2nd October 2020, 21:00 hours IST / 17:30 hours CET (please locate your time here)

End Date: 5th October 2020, 21:00 hours IST / 17:30 hours CET (please locate your time here)


The baseline notebook is available here.


About the Data

The dataset covers the following sectors, with the number of firms in each shown in parentheses: Irrigation (114), Public Health (77), Buildings and Roads (82), Forest (70), Corporate (47), Animal Husbandry (95), Communication (1), Electrical (4), Land (5), Science and Technology (3), Tourism (1), Fisheries (41), Industries (37), Agriculture (200).

To load the training data in your Jupyter notebook, use the commands below:

import pandas as pd

audit_data = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/audit_data/training_set_label.csv")


Data Description

Many risk factors are examined from various sources, such as the audit office's past records, audit paras, environmental conditions reports, firm reputation summaries, ongoing issues reports, profit-value records, loss-value records, and follow-up reports. After in-depth interviews with the auditors, the important risk factors are evaluated and their probability of existence is calculated from present and past records.

Some of the columns/features are:

  • Sector_score: sector score of the firm

  • LOCATION_ID: location id of the firm

  • score_X: different types of score values

  • risk_X: different types of risk levels

  • Inherent_Risk: the risk posed by an error or omission in a financial statement due to a factor other than a failure of internal control.

  • CONTROL_RISK: Control Risk is the risk of a material misstatement in the financial statements arising due to absence or failure in the operation of relevant controls of the entity.

  • Detection_Risk: Detection Risk is the risk that the auditors fail to detect a material misstatement in the financial statements.

  • Audit_Risk: Audit risk (AR) refers to the risk that an auditor may issue an unqualified report due to the auditor's failure to detect material misstatement, whether due to error or fraud. This risk is composed of inherent risk (IR), control risk (CR), and detection risk (DR).

    Audit risk can be calculated as:

    AR = IR × CR × DR

  • Money_Value: Value for money of an audit

  • Risk: the target variable; whether the firm is fraudulent or not

Feel free to use Google for some of the terms that you don't understand.
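The audit risk relationship given above (AR = IR × CR × DR) can be sketched in a few lines. The numbers below are hypothetical and only illustrate the arithmetic; whether the Audit_Risk column in the dataset matches this product exactly for every row is an assumption you may want to check yourself.

```python
def audit_risk(inherent_risk, control_risk, detection_risk):
    """Audit risk as the product of its three components: AR = IR * CR * DR."""
    return inherent_risk * control_risk * detection_risk


# Hypothetical row, using the column names from the description above
row = {"Inherent_Risk": 2.5, "CONTROL_RISK": 0.4, "Detection_Risk": 0.5}

ar = audit_risk(row["Inherent_Risk"], row["CONTROL_RISK"], row["Detection_Risk"])
print(f"Audit risk: {ar:.3f}")
```

Because the components are multiplied, lowering any one of them lowers the overall audit risk proportionally.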


Test Dataset

Load the test data and name it test_data, using the command below:

test_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/audit_data/testing_set_label.csv')

The target column is deliberately left out of the test data, since it is the value you need to predict.



Saving Prediction File & Sample Submission

You can find more details on how to save a prediction file here: https://discuss.dphi.tech/t/how-to-submit-predictions/548

Sample submission: You should submit a CSV file with a header row. A sample submission is shown below.

prediction
0
1
1
0

Etc.

Note that the header name must be prediction; otherwise the evaluation will throw an error.
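As a rough sketch of the file format described above, the snippet below writes a list of predictions to CSV with a single prediction header, using only the Python standard library. The function name write_submission and the example predictions are illustrative assumptions; the linked discussion post remains the reference for the platform's own instructions.

```python
import csv
import io


def write_submission(predictions, fileobj):
    """Write one prediction per row under a single 'prediction' header."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["prediction"])       # header name required by the evaluator
    for value in predictions:
        writer.writerow([int(value)])     # cast so labels are written as 0 or 1


# Hypothetical predictions; in practice these come from your model
predictions = [0, 1, 1, 0]

# Written to an in-memory buffer here so the result can be inspected
buffer = io.StringIO()
write_submission(predictions, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To produce a file for upload, pass a file handle instead of the buffer, for example open("submission.csv", "w", newline=""), where the file name is just an example. If your predictions already live in a pandas DataFrame with a prediction column, DataFrame.to_csv with index=False gives the same layout.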

Acknowledgement

This data has been sourced from the UCI Machine Learning Repository.


File Format

Your submission should be in CSV format.

Predictions

This file should have a header row called 'prediction'.
Please see the instructions to save a prediction file under the “Data” tab.
