Weblog Analysis for Predicting Correlations in Stock Price Evolutions

Project Overview

In this project, we use data extracted from a large number of weblogs to identify the underlying relations among a set of companies in the S&P 500 index. To do this, we define a pairwise similarity measure between companies based on weblog articles and then cluster the companies using this measure. We show that the resulting clusters capture some interesting relations between companies. As an application, we propose a cluster-based portfolio-selection method that combines the weblog data with historical stock prices. Finally, simulations show that our method performs better, in terms of risk measures, than methods based on company sectors or on historical stock prices alone.

Publication:

Milad Kharratzadeh, Mark Coates, "Weblog Analysis for Predicting Correlations in Stock Price Evolutions", in Proc. Int. AAAI Conf. on Weblogs and Social Media (ICWSM), Dublin, Ireland, Jun. 2012


 

Blog-based Clustering of Companies

 

Method:

  • Data collection and pre-processing: gather data from more than 130,000,000 weblogs, keeping only the contents of the articles (ads, metadata, etc. are removed)
  • Building a coappearance matrix: define the similarity between two companies as the number of blog articles in which both appear
  • Applying the GANC clustering algorithm: add up the coappearance matrices for nine days (Jan 13 - Jan 21), then apply GANC (a graph clustering algorithm that aims to minimize the normalized cut criterion)
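The first two steps and the criterion that GANC targets can be sketched as follows. This is a minimal illustration, not the project's code: the function names, the naive substring matching of company names, and the tiny example articles are all assumptions made here. The `normalized_cut` function only evaluates the criterion for a given partition; it does not implement GANC itself.

```python
from itertools import combinations

import numpy as np


def coappearance_matrix(articles, companies):
    """Count, for each pair of companies, the articles that mention both."""
    index = {name: i for i, name in enumerate(companies)}
    counts = np.zeros((len(companies), len(companies)), dtype=int)
    for text in articles:
        lowered = text.lower()
        # Naive substring matching (an assumption); real data needs entity resolution.
        present = sorted(index[name] for name in companies if name.lower() in lowered)
        for i, j in combinations(present, 2):
            counts[i, j] += 1
            counts[j, i] += 1
    return counts


def normalized_cut(weights, labels):
    """Sum over clusters of (weight leaving the cluster) / (total degree of the cluster)."""
    weights = np.asarray(weights, dtype=float)
    labels = np.asarray(labels)
    degrees = weights.sum(axis=1)
    total = 0.0
    for cluster in np.unique(labels):
        inside = labels == cluster
        volume = degrees[inside].sum()
        if volume > 0:
            total += weights[np.ix_(inside, ~inside)].sum() / volume
    return total


# Made-up example articles for two "days"; the daily matrices are added up.
companies = ["Apple", "Google", "Halliburton", "Transocean"]
day1 = ["Apple and Google announce phones", "Halliburton and Transocean blamed"]
day2 = ["Google versus Apple again", "Apple earnings",
        "Transocean rig; Halliburton cement; Google maps"]
summed = coappearance_matrix(day1, companies) + coappearance_matrix(day2, companies)

good = normalized_cut(summed, [0, 0, 1, 1])  # groups the frequently co-mentioned pairs
bad = normalized_cut(summed, [0, 1, 0, 1])   # splits them, so the criterion is higher
```

A clustering algorithm that minimizes this criterion prefers the first partition, since less co-appearance weight crosses the cluster boundaries relative to each cluster's total weight.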

 

 

   

 

Results:

  • Total number of nodes (companies): 342
  • Total number of clusters: 24 (chosen to match the number of S&P 500 subsectors)
  • Overview of the clusters shown in Fig. 1:
    • Each node represents one cluster
    • Size of each node: number of companies in that cluster
    • Thickness of each edge: sum of the weights of the edges between the companies of the two clusters
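The quantities drawn in the overview figure can be derived from the company-level weight matrix and the cluster labels. The sketch below assumes a symmetric weight matrix and uses a small made-up example; the function name `cluster_overview` is hypothetical.

```python
import numpy as np


def cluster_overview(weights, labels):
    """Return cluster ids, cluster sizes, and summed edge weights between clusters."""
    weights = np.asarray(weights, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    # membership[i, k] is 1 when company i belongs to cluster k.
    membership = (labels[:, None] == clusters[None, :]).astype(float)
    sizes = membership.sum(axis=0).astype(int)
    between = membership.T @ weights @ membership
    np.fill_diagonal(between, 0.0)  # keep only ties between different clusters
    return clusters, sizes, between


# Made-up symmetric weights for five companies in three clusters.
weights = [
    [0, 2, 0, 0, 1],
    [2, 0, 1, 1, 0],
    [0, 1, 0, 2, 0],
    [0, 1, 2, 0, 3],
    [1, 0, 0, 3, 0],
]
labels = [0, 0, 1, 1, 2]
clusters, sizes, between = cluster_overview(weights, labels)
```

Here `sizes` gives the node sizes and `between[a, b]` gives the thickness of the edge joining clusters `a` and `b`.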

 

Highlights of the clusters:

  • Cluster 22: strongest ties with other clusters; consists mainly of companies in the "Information Technology" and "Telecommunications" sectors (Amazon, Apple, Cisco, Google, Intel, Microsoft, Yahoo, etc.)
  • Clusters 18 and 19: strong ties with other clusters; consist mainly of companies in the "Consumer Discretionary" and "Consumer Staples" sectors (3M, Coach, Coca-Cola, Home Depot, Nike, Ralph Lauren, Safeway, Starbucks, Walmart, eBay, Abercrombie, DIRECTV, Kodak, Kohl's, Office Depot, Staples, etc.)
  • Cluster 1 (Allegheny, FirstEnergy): both utility companies; they merged on Feb 25th 2011 (about one month after the data was collected)
  • Cluster 4: cluster of financial companies (Bank of America Corp., Chicago Mercantile, Citigroup, Comerica, Goldman Sachs, JPMorgan, Morgan Stanley, Robert Half, State Street, T. Rowe, U.S. Bancorp, etc.)
  • Cluster 6 (Duke Energy, Progress Energy): both energy companies; their merger was announced on Jan 10th 2011 (right before data collection started)
  • Cluster 9 (Halliburton, Transocean): the two companies that were involved in the Gulf oil spill disaster along with BP, blamed by the presidential commission 
  • Cluster 13: cluster of health care companies (Allergan, Amgen, Boston Scientific, Bristol-Myers, Medtronic, Merck, Pfizer, and Stryker)

Figure 1 - Overview of the clusters

 

Portfolio-selection Method Based on Blog and Historical Price Data

 

  • Stock market prices are affected by business fundamentals, company and world events, human psychology, and much more.
  • These factors can be captured by analysing blogs as well as historical stock prices.
  • Portfolio selection: investing in a collection of companies (called a portfolio) rather than in individual companies, which reduces risk.
  • Idea: select portfolios using the weblog-based clusters together with historical stock prices.
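One way to picture the idea is to pick a single company from each weblog-based cluster, using historical prices to decide which one. The rule below (choose the member with the least volatile daily returns) is an assumption made purely for illustration and is not stated on this page as the project's actual selection rule; the function name and the price table are likewise hypothetical.

```python
import numpy as np


def select_portfolio(prices, labels):
    """Pick, from each cluster, the company with the least volatile daily returns.

    prices: array of shape (days, companies); labels: one cluster id per company.
    """
    prices = np.asarray(prices, dtype=float)
    labels = np.asarray(labels)
    returns = prices[1:] / prices[:-1] - 1.0  # simple daily returns
    volatility = returns.std(axis=0)
    chosen = []
    for cluster in np.unique(labels):
        members = np.flatnonzero(labels == cluster)
        chosen.append(int(members[np.argmin(volatility[members])]))
    return chosen


# Made-up prices for four companies over four days (not real market data).
prices = [
    [100, 50, 20, 10],
    [101, 60, 20, 11],
    [100, 45, 21, 10],
    [101, 55, 20, 12],
]
labels = [0, 0, 1, 1]
chosen = select_portfolio(prices, labels)  # column indices of the selected companies
```

Drawing one company per cluster spreads the portfolio across groups of companies that blogs tend to discuss together, which is the diversification intuition behind a cluster-based selection.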

Results:

Compared with other basic methods, our method performs better in terms of risk measures. The following figures plot the risk for a toy example of investing $100 in 24 companies over a period of 100 days, using three different methods.
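A toy experiment of this shape can be set up as below. The risk measure used in the project is not specified on this page, so the standard deviation of daily portfolio returns is used here as an assumed stand-in; the equal split of the budget, the function names, and the synthetic prices are also assumptions for illustration only.

```python
import numpy as np


def portfolio_value(prices, budget=100.0):
    """Daily value of a buy-and-hold portfolio with the budget split equally."""
    prices = np.asarray(prices, dtype=float)
    shares = (budget / prices.shape[1]) / prices[0]  # shares bought on day 0
    return prices @ shares


def daily_return_std(values):
    """Standard deviation of simple daily returns (an assumed risk measure)."""
    values = np.asarray(values, dtype=float)
    returns = values[1:] / values[:-1] - 1.0
    return float(returns.std())


rng = np.random.default_rng(0)
# Synthetic placeholder prices (100 days, 24 companies), not real market data.
synthetic = 50.0 * np.cumprod(1.0 + 0.01 * rng.standard_normal((100, 24)), axis=0)
values = portfolio_value(synthetic)
risk = daily_return_std(values)
```

Running the same computation on portfolios chosen by different selection methods gives one risk value per method, which is the kind of comparison the figures describe.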

People:

Supervisor: Prof. Mark Coates

MEng Student: Milad Kharratzadeh