Online News Popularity Prediction

The Internet is an important tool for sharing information. A recent survey has shown that teenagers and adults read online news in their daily lives, and the share of people who do so has increased considerably in the past few years. Because more and more people read online news and editors want their articles to be popular, it would be valuable to build a system that predicts whether a news article will be popular. Such a system can help editors find ways to improve their articles and can also bring significant commercial value. We therefore chose the “online news popularity” data set for our capstone project.

Process Flow in Online News Industry

Before step 2 in the diagram below, it is important to check whether the article is going to be popular.

[Diagram: Process Flow]

Problem Statement

Due to increasing competition among online news portals, Mashable has been facing a decrease in the number of shares of its articles on social media, which in turn has affected the advertisements featured on its pages.

Objective of this Project

Dataset Description

The dataset summarizes a heterogeneous set of features about the articles published by Mashable between 2013 and 2015.

We drop url and timedelta before further analysis since they are non-predictive.
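A minimal sketch of this step, assuming the data is held in a pandas DataFrame. The toy rows and the num_keywords and shares column names are illustrative assumptions; only url and timedelta come from the description above.

```python
import pandas as pd

# Toy stand-in for the real data set; all values are made up for illustration.
df = pd.DataFrame({
    "url": ["http://example.com/a", "http://example.com/b"],
    "timedelta": [731.0, 730.0],
    "num_keywords": [5, 7],   # assumed column name
    "shares": [593, 1200],    # assumed column name
})

# Some CSV exports carry stray spaces in the header row; stripping is harmless.
df.columns = df.columns.str.strip()

# Drop the two non-predictive columns before any modelling.
features = df.drop(columns=["url", "timedelta"])
print(features.columns.tolist())
```

Dropping by name rather than by position keeps the step robust if the column order changes.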

Data Dictionary

Below is a brief description of the most important features in the data set:

Feature 1

Feature 2

Insights obtained from the DataSet

  1. Keywords tend to attract readers to an article. As the number of keywords in an article increases, the number of shares increases as well.
  2. People tend to share articles that have a reasonable number of words in the title. People don’t appreciate short titles.
  3. The number of images, as a means of visualization, plays a large role in determining an article's shares, since people often lack the patience and time to watch a video and then share it with others.
  4. People tend to read articles in the World category, since they perceive that these let them keep up with everything happening in the world.
  5. People tend to read many articles on Tuesday and Wednesday; as the weekend approaches, they prefer leisure activities and relaxing, such as watching a movie.
  6. Keyword quality tracks shares: articles with good keywords get a high number of shares, articles with average keywords get an average number of shares, and articles with the worst keywords get very few shares.
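Insights of this kind can be checked with a simple group-by and a rank correlation. The sketch below does this for the first insight (keywords versus shares). The rows are made up, and the num_keywords and shares column names are assumptions, so it shows the shape of the check rather than the project's actual analysis.

```python
import pandas as pd

# Made-up rows for illustration; column names are assumed.
df = pd.DataFrame({
    "num_keywords": [3, 3, 6, 6, 9, 9],
    "shares": [500, 700, 900, 1100, 1500, 1700],
})

# Median shares per keyword count; the median is less sensitive than the
# mean to a handful of viral articles.
median_shares = df.groupby("num_keywords")["shares"].median()

# Pearson correlation computed on ranks is the Spearman rank correlation,
# which measures a monotonic trend without assuming linearity.
rho = df["num_keywords"].rank().corr(df["shares"].rank())

print(median_shares)
print(round(rho, 3))
```

The same pattern applies to the other insights by swapping the grouping column, for example a title-length or weekday column.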

Dealing with Skewness and Outliers

Skewness
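The method used for this step is not stated here. One common option for a right-skewed count such as shares is a log transform, sketched below on made-up values; the choice of log1p is an assumption, not necessarily what was done in this project.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed share counts for illustration.
shares = pd.Series([100, 150, 200, 300, 400, 600, 800, 1200, 3000, 20000])

skew_before = shares.skew()

# log1p(x) = log(1 + x) compresses the long right tail and is defined at 0.
log_shares = np.log1p(shares)
skew_after = log_shares.skew()

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

If predictions are made on the log scale, np.expm1 maps them back to share counts.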

Outliers
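The outlier treatment is likewise not spelled out here. A minimal sketch of one widely used rule, the 1.5 × IQR fence with capping, is shown below on made-up values; both the rule and the 1.5 multiplier are assumptions for illustration.

```python
import pandas as pd

# Made-up share counts for illustration.
shares = pd.Series([100, 150, 200, 300, 400, 600, 800, 1200, 3000, 20000])

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = shares.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag points outside the fences, then cap them instead of dropping rows.
is_outlier = (shares < lower) | (shares > upper)
capped = shares.clip(lower=lower, upper=upper)

print(int(is_outlier.sum()), "outliers; upper fence =", upper)
```

Capping keeps every article in the data set, which matters when the extreme values are the very popular articles of interest.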

Splitting the dataset
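The split ratio is not given here. The sketch below assumes an 80/20 random train/test split using pandas only; the ratio, the seed, and the column names are all illustrative assumptions.

```python
import pandas as pd

# Made-up rows for illustration; column names are assumed.
df = pd.DataFrame({
    "num_keywords": range(10),
    "shares": range(100, 1100, 100),
})

# Randomly sample 80% of the rows for training; a fixed seed makes it repeatable.
train = df.sample(frac=0.8, random_state=42)

# The remaining rows form the held-out test set.
test = df.drop(train.index)

print(len(train), len(test))
```

Any transformation fitted to the data, such as outlier fences, is best estimated on the training rows only so that the test set stays unseen.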

Recommendations

References