Datathon

Ended

Data Sprint #11: Online News Popularity

Find the popularity of the news articles

Medium | 246 Submissions

Context

Reading the newspaper is a sensible habit. A newspaper carries information about politics, the economy, recreation, sports, business, industry, trade and commerce. The habit not only broadens your general knowledge but also improves your language skills and vocabulary.

An online newspaper is the online version of a newspaper, either as a stand-alone publication or as the online version of a printed periodical. Online news services have several uses and, for this reason, offer many benefits.


Unlike watching the news on TV or listening to it on the radio, online news services let users decide which articles they hear, watch, or read. This is helpful because individuals need not “waste their time” on articles that don’t interest them; they have the control to pick whatever interests them.


Problem Statement

With the rapid growth of online news services and social media, it would be very useful to identify readers’ hidden behavioural patterns. It is also helpful to shed light on readers’ intentions and to predict the popularity of online news, that is, whether a news article will receive a good amount of readers' attention. This is vital for giving media staff (authors, advertisers, etc.) advance information so that they can modify each article in line with its quality, with no influence from.


Objective

Imagine you are working as a Data Scientist for an online newspaper. You are required to build a Machine Learning model that predicts the number of shares (popularity) of a given news article.


Evaluation Criteria

Submissions are evaluated using Root Mean Squared Error (RMSE).

How do we do it? 

Once you generate and submit the target variable predictions on the testing dataset, your submissions will be compared with the true values of the target variable. 

The True or Actual values of the target variable are hidden on the DPhi Practice platform so that we can evaluate your model's performance on testing data. Finally, a Root Mean Squared Error (RMSE) for your model will be generated and displayed.

Update at the end of the data sprint: the evaluation metric was originally MSE rather than RMSE, which is why the reported errors were very large. The metric has since been corrected from MSE to RMSE.
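
For reference, the metric described above can be reproduced locally. The following is a minimal sketch, not the platform's own scoring code: the function name rmse and the sample share counts are assumptions made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    mse = np.mean((y_true - y_pred) ** 2)  # mean squared error
    return float(np.sqrt(mse))             # RMSE is the square root of MSE

# Hypothetical share counts, for illustration only.
actual = [100, 50, 10, 200]
predicted = [110, 45, 12, 225]
score = rmse(actual, predicted)
print(f"RMSE: {score:.2f}")
```

Because RMSE is the square root of MSE, it is expressed in the same unit as the target (shares), which is why scores reported under MSE looked so much larger.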


The baseline notebook is available here.

About the Data

This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years.


To load the training data in your Jupyter notebook, use the command below:

import pandas as pd

news_data = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/online_news_popularity/train_set_label.csv")


Data Description

     0. url: URL of the article

     1. timedelta: Days between the article publication and the dataset acquisition

     2. n_tokens_title: Number of words in the title

     3. n_tokens_content: Number of words in the content

     4. n_unique_tokens: Rate of unique words in the content

     5. n_non_stop_words: Rate of non-stop words in the content

     6. n_non_stop_unique_tokens: Rate of unique non-stop words in the content

     7. num_hrefs: Number of links

     8. num_self_hrefs: Number of links to other articles published by Mashable

     9. num_imgs: Number of images

    10. num_videos: Number of videos

    11. average_token_length: Average length of the words in the content

    12. num_keywords: Number of keywords in the metadata

    13. data_channel_is_lifestyle: Is data channel 'Lifestyle'?

    14. data_channel_is_entertainment: Is data channel 'Entertainment'?

    15. data_channel_is_bus: Is data channel 'Business'?

    16. data_channel_is_socmed: Is data channel 'Social Media'?

    17. data_channel_is_tech: Is data channel 'Tech'?

    18. data_channel_is_world: Is data channel 'World'?

    19. kw_min_min: Worst keyword (min. shares)

    20. kw_max_min: Worst keyword (max. shares)

    21. kw_avg_min: Worst keyword (avg. shares)

    22. kw_min_max: Best keyword (min. shares)

    23. kw_max_max: Best keyword (max. shares)

    24. kw_avg_max: Best keyword (avg. shares)

    25. kw_min_avg: Avg. keyword (min. shares)

    26. kw_max_avg: Avg. keyword (max. shares)

    27. kw_avg_avg: Avg. keyword (avg. shares)

    28. self_reference_min_shares: Min. shares of referenced articles in Mashable

    29. self_reference_max_shares: Max. shares of referenced articles in Mashable

    30. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable

    31. weekday_is_monday: Was the article published on a Monday?

    32. weekday_is_tuesday: Was the article published on a Tuesday?

    33. weekday_is_wednesday: Was the article published on a Wednesday?

    34. weekday_is_thursday: Was the article published on a Thursday?

    35. weekday_is_friday: Was the article published on a Friday?

    36. weekday_is_saturday: Was the article published on a Saturday?

    37. weekday_is_sunday: Was the article published on a Sunday?

    38. is_weekend: Was the article published on the weekend?

    39. LDA_00: Closeness to LDA topic 0

    40. LDA_01: Closeness to LDA topic 1

    41. LDA_02: Closeness to LDA topic 2

    42. LDA_03: Closeness to LDA topic 3

    43. LDA_04: Closeness to LDA topic 4

    44. global_subjectivity: Text subjectivity

    45. global_sentiment_polarity: Text sentiment polarity

    46. global_rate_positive_words: Rate of positive words in the content

    47. global_rate_negative_words: Rate of negative words in the content

    48. rate_positive_words: Rate of positive words among non-neutral tokens

    49. rate_negative_words: Rate of negative words among non-neutral tokens

    50. avg_positive_polarity: Avg. polarity of positive words

    51. min_positive_polarity: Min. polarity of positive words

    52. max_positive_polarity: Max. polarity of positive words

    53. avg_negative_polarity: Avg. polarity of negative words

    54. min_negative_polarity: Min. polarity of negative words

    55. max_negative_polarity: Max. polarity of negative words

    56. title_subjectivity: Title subjectivity

    57. title_sentiment_polarity: Title polarity

    58. abs_title_subjectivity: Absolute subjectivity level

    59. abs_title_sentiment_polarity: Absolute polarity level

    60. shares: Number of shares (target)
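
Given the columns listed above, a common first step is to separate the target (shares) from the features. The sketch below is one way to do this, under two assumptions that are not stated in the challenge: that url is treated as an identifier rather than a feature, and that column names may carry stray whitespace. The helper name split_features_target and the tiny demo frame are made up for illustration.

```python
import pandas as pd

TARGET = "shares"          # target column, per the data description
NON_PREDICTIVE = ["url"]   # assumption: the raw URL is an identifier, not a feature

def split_features_target(df, target=TARGET, drop=NON_PREDICTIVE):
    """Return (X, y): the feature columns and the target series."""
    # Strip whitespace from column names defensively, in case the CSV has any.
    df = df.rename(columns=lambda c: c.strip())
    y = df[target]
    X = df.drop(columns=[target] + [c for c in drop if c in df.columns])
    return X, y

# Tiny made-up frame with a few of the documented columns, for illustration only.
demo = pd.DataFrame({
    "url": ["http://example.com/a", "http://example.com/b"],
    "n_tokens_title": [12, 9],
    "num_imgs": [1, 3],
    "shares": [593, 1200],
})
X, y = split_features_target(demo)
```

On the real data, the same call would be split_features_target(news_data).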


Saving Prediction File & Sample Submission

You can find more details on how to save a prediction file here: https://discuss.dphi.tech/t/how-to-submit-predictions/548

Sample submission: submit a CSV file with a header row. A sample submission is shown below.

prediction

110

45

12

225

Etc.

Note that the header name must be prediction; otherwise the submission will throw an evaluation error.
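
A file in the format above can be written with pandas. This is a minimal sketch: the helper name save_predictions and the file name submission.csv are assumptions for illustration, and the values are the ones from the sample above.

```python
import pandas as pd

def save_predictions(predictions, path="submission.csv"):
    """Write predictions to a single-column CSV with the header 'prediction'."""
    out = pd.DataFrame({"prediction": list(predictions)})
    out.to_csv(path, index=False)  # index=False keeps the file to one column
    return out

# Values taken from the sample submission above.
submission = save_predictions([110, 45, 12, 225], path="submission.csv")
```

Passing index=False matters here: without it pandas writes the row index as an extra, unnamed first column, and the file no longer matches the expected single-column layout.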

Test Dataset

Load the test data (name it test_data) using the command below.

test_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/online_news_popularity/test_set_label.csv')

The target column is deliberately absent here, since you need to predict it.
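
Before predicting, it can help to confirm that the test data has the same feature columns as the training data, minus the target. The sketch below shows one possible check; the helper name check_test_columns and the small demo frames are hypothetical and only stand in for the real news_data and test_data.

```python
import pandas as pd

def check_test_columns(train, test, target="shares"):
    """Return the training feature columns after verifying the test set has them."""
    if target in test.columns:
        raise ValueError(f"test data unexpectedly contains the target '{target}'")
    features = [c for c in train.columns if c != target]
    missing = [c for c in features if c not in test.columns]
    if missing:
        raise ValueError(f"test data is missing columns: {missing}")
    return features

# Made-up frames standing in for news_data and test_data.
train_demo = pd.DataFrame({"num_imgs": [1, 3], "num_videos": [0, 2], "shares": [593, 1200]})
test_demo = pd.DataFrame({"num_videos": [1], "num_imgs": [4]})

features = check_test_columns(train_demo, test_demo)
aligned_test = test_demo[features]  # same column order as the training features
```

Reordering the test columns to match the training order avoids silent mismatches with models that rely on column position.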


Acknowledgement

This dataset was downloaded from UCI Machine Learning Repository.


File Format

Your submission should be in CSV format.

Predictions

This file should have a header row called 'prediction'.
Please see the instructions to save a prediction file under the “Data” tab.
