Table of Contents

1 Text based network industries and endogenous product differentiation

1.1 summary:

  • product similarity industries classification.

This paper develops new time-varying industry classification using text-based analysis of firm product description filed with the SEC.

  • empirical benefits

product differentiation, competitive intensity and product offering following industry shocks.

1.2 ideas:

  • product words describing features and bundles of products of each firm.
  • how similar each firm is to every other firm by calculating firm-by-firm pairwise word similarity scores using the product words.

1.3 Objective and Methodology: From Words to Industry Classifications

1.3.1 Objective:

To capture the relatedness of firms based on their product offerings to customers using a flexible network approach(cosine similarity method), using the clustering methods to classify a industry.

  1. Fixed industry classification.

Firms are grouped together using fixed product market definitions and industry membership is constraint to be transitive.

  1. Text-based network industry classification(TNIC).

Explain differences in key characteristics such as profitability, sales growth, and market risk across industries.

It allows both within-industry and across-industry relations be to examined.

1.3.2 Methods:

computing pairwise word similarity scores for each pair of firms in a given year.

  • data:
  • get product descriptions.
  • limit attention to nouns(defined by Webster.com) and proper nouns that appear in no more than 25% of all product descriptions in order to avoid common words.
  • omit common words that are used by more than 25% of all firms, omit geographical words including country and state names, as well as top fifty cities in the US and in the world.
  • algorithm

Clustering Based On Distance Matrix

  1. Hierarchical clustering

In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:

Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

\[ V_i = \frac {P_i}{\sqrt {P_i* P_i}}\] \[\text {Product Cosine Similarity}_{i,j} = (V_i * V_j)\] \[\text {Product Cosine Distance}_{i,j} = 1- (V_i * V_j)\]

  1. industry classification
    • fixed industry classification based on 10-Ks product descriptions

    maintain consistency with other fixed classifications such as Standard Industry Classification.

    running a clustering algorithm.

    hold these industries fixed throughout samples(from 1997 to 2008).

    assign firms to these industries in later years based on their 10-K text similarity relative to the frequency-weighted list of words. clustering test

    • 10-K based TNIC

    define each firmi's industry to include all firms j with pairwise cosine similarities relative to i above a pre-specified minimum threshold.

    focusing on thresholds generating industries with the same fraction of membership pairs as SIC-3 industries in an unbiased fashion.

    \[ \begin{pmatrix} a_{11} & \cdots & a_{1w}\\ \vdots & \ddots & \vdots\\ a_{i1} & \cdots & a_{iw} \end{pmatrix} \]