1 Text-Based Network Industries and Endogenous Product Differentiation

1.1 Summary:

Industry classification based on product similarity.

This paper develops new time-varying industry classifications using text-based analysis of firm product descriptions filed with the SEC.

Empirical benefits:

The classifications make it possible to examine product differentiation, competitive intensity, and changes in product offerings following industry shocks.

1.2 Ideas:

Product words describe the features and bundles of products that each firm offers.
Firm-by-firm pairwise word similarity scores, calculated from these product words, measure how similar each firm is to every other firm.

1.3 Objective and Methodology: From Words to Industry Classifications

1.3.1 Objective:

To capture the relatedness of firms based on their product offerings to customers using a flexible network approach (the cosine similarity method), and to use clustering methods to classify industries.

Fixed industry classification.

Firms are grouped together using fixed product market definitions, and industry membership is constrained to be transitive: if firms A and B are in the same industry and so are B and C, then A and C must be as well.

Text-based network industry classification (TNIC).

Explains differences in key characteristics such as profitability, sales growth, and market risk across industries.

It allows both within-industry and across-industry relations to be examined.

1.3.2 Methods:

Compute pairwise word similarity scores for each pair of firms in a given year.

data:
Get product descriptions.
Limit attention to nouns (as defined by Webster.com) and proper nouns that appear in no more than 25% of all product descriptions, in order to avoid common words (that is, omit words used by more than 25% of all firms).
Omit geographical words, including country and state names, as well as the top fifty cities in the US and in the world.
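The filtering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `filter_vocabulary`, the toy firms, and the one-word geography list are hypothetical, and noun identification (the Webster.com step) is assumed to have been done already, so each firm arrives as a list of candidate nouns.

```python
from collections import Counter

def filter_vocabulary(docs, max_doc_share=0.25, geo_words=frozenset()):
    """Keep only words used by at most `max_doc_share` of firms and not geographic.

    docs: dict mapping firm id -> iterable of candidate nouns (assumed pre-tagged).
    Returns a dict mapping firm id -> set of retained product words.
    """
    n_docs = len(docs)
    # One set per firm, so a word counts once per product description.
    word_sets = {firm: {w.lower() for w in words} for firm, words in docs.items()}
    # Document frequency: number of firms whose description uses the word.
    doc_freq = Counter(w for ws in word_sets.values() for w in ws)
    keep = {
        w for w, count in doc_freq.items()
        if count / n_docs <= max_doc_share and w not in geo_words
    }
    return {firm: ws & keep for firm, ws in word_sets.items()}

# Hypothetical toy data: four firms, so a word used by one firm sits exactly at 25%.
docs = {
    "A": ["software", "database", "products"],
    "B": ["software", "games", "products"],
    "C": ["drilling", "pipeline", "products", "texas"],
    "D": ["aircraft", "engines", "products"],
}
product_words = filter_vocabulary(docs, geo_words=frozenset({"texas"}))
```

In this toy sample, "products" and "software" are dropped as too common and "texas" is dropped as geographic, leaving each firm with its distinctive product words.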
algorithm:

Clustering Based On Distance Matrix

Hierarchical clustering

In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:

Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
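To make the agglomerative ("bottom up") strategy concrete, here is a minimal pure-Python sketch that merges clusters directly from a precomputed distance matrix. The average-linkage rule, the function name `agglomerative`, and the 4x4 matrix are assumptions chosen for illustration; the notes do not say which linkage or stopping rule the paper uses.

```python
def agglomerative(dist, n_clusters):
    """Average-linkage agglomerative clustering on a symmetric distance matrix.

    dist: square list of lists, dist[i][j] = distance between observations i and j.
    Returns a list of clusters, each a sorted list of observation indices.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    # Bottom up: every observation starts in its own cluster.
    clusters = [[i] for i in range(len(dist))]

    def avg_dist(a, b):
        # Average linkage: mean distance over all cross-cluster pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge it.
        _, p, q = min(
            (avg_dist(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]

# Hypothetical distances: observations 0 and 1 are close, as are 2 and 3.
dist = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
two_clusters = agglomerative(dist, 2)
```

Stopping at a fixed number of clusters is one simple choice; recording every merge instead would give the full hierarchy.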

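Here \(P_i\) is firm \(i\)'s word vector, \(V_i\) is that vector normalized to unit length, and \(\cdot\) denotes the dot product.

\[ V_i = \frac{P_i}{\sqrt{P_i \cdot P_i}} \]

\[ \text{Product Cosine Similarity}_{i,j} = V_i \cdot V_j \]

\[ \text{Product Cosine Distance}_{i,j} = 1 - V_i \cdot V_j \]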
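The normalization and dot product in these formulas can be checked with a short Python sketch. The word vectors below are hypothetical 0/1 indicators over a four-word vocabulary, and the function names are illustrative only.

```python
import math

def normalize(p):
    """Scale a word vector to unit length: V = P / sqrt(P . P)."""
    norm = math.sqrt(sum(x * x for x in p))
    if norm == 0:
        raise ValueError("word vector has no nonzero entries")
    return [x / norm for x in p]

def cosine_similarity(p_i, p_j):
    """Dot product of the two unit-length vectors V_i and V_j."""
    v_i, v_j = normalize(p_i), normalize(p_j)
    return sum(a * b for a, b in zip(v_i, v_j))

def cosine_distance(p_i, p_j):
    """One minus the cosine similarity."""
    return 1.0 - cosine_similarity(p_i, p_j)

# Hypothetical firms: A and B share one of their two words, C shares none with A.
p_a = [1, 1, 0, 0]
p_b = [1, 0, 1, 0]
p_c = [0, 0, 0, 1]

sim_ab = cosine_similarity(p_a, p_b)  # 1 / (sqrt(2) * sqrt(2)) = 0.5
sim_ac = cosine_similarity(p_a, p_c)  # no shared words, so 0.0
```

Because the vectors are normalized first, the score lies between 0 and 1 for non-negative word vectors and does not grow simply because a firm's description is longer.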

industry classification
- Fixed industry classification based on 10-K product descriptions
Maintain consistency with other fixed classifications such as the Standard Industrial Classification (SIC).

Form the industries by running a clustering algorithm.

Hold these industries fixed throughout the sample (from 1997 to 2008).

Assign firms to these industries in later years based on their 10-K text similarity relative to each industry's frequency-weighted list of words.

Clustering test.
- 10-K based TNIC
Define each firm i's industry to include all firms j whose pairwise cosine similarity with i is above a pre-specified minimum threshold.

Focus on thresholds that generate industries with the same fraction of membership pairs as SIC-3 industries, so that the two classifications can be compared in an unbiased fashion.
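The TNIC construction can be sketched as two small Python functions: one picks a threshold so that a target fraction of firm pairs count as industry peers, and the other builds each firm's own industry from that threshold. The similarity matrix and the target fraction of 0.5 are hypothetical (in the notes the target is the fraction of membership pairs under SIC-3), and the function names are illustrative.

```python
def calibrate_threshold(sim, target_fraction):
    """Pick a threshold so roughly `target_fraction` of firm pairs lie strictly above it."""
    n = len(sim)
    pairs = sorted(
        (sim[i][j] for i in range(n) for j in range(i + 1, n)), reverse=True
    )
    # k pairs should be members; the (k+1)-th largest similarity is the cutoff.
    # k is capped so the cutoff is always an observed similarity value.
    k = min(max(round(target_fraction * len(pairs)), 0), len(pairs) - 1)
    return pairs[k]

def tnic_industries(sim, threshold):
    """Each firm i gets its own industry: all firms j with sim[i][j] above the threshold."""
    n = len(sim)
    return {
        i: {j for j in range(n) if j != i and sim[i][j] > threshold}
        for i in range(n)
    }

# Hypothetical pairwise cosine similarities for four firms.
sim = [
    [1.00, 0.30, 0.25, 0.02],
    [0.30, 1.00, 0.05, 0.03],
    [0.25, 0.05, 1.00, 0.28],
    [0.02, 0.03, 0.28, 1.00],
]
threshold = calibrate_threshold(sim, target_fraction=0.5)
industries = tnic_industries(sim, threshold)
```

In this toy example firms 1 and 2 both belong to firm 0's industry, yet they are not in each other's industry. Membership is therefore not transitive, which is the property that separates this network approach from a fixed classification.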

\[ \begin{pmatrix} a_{11} & \cdots & a_{1w}\\ \vdots & \ddots & \vdots\\ a_{i1} & \cdots & a_{iw} \end{pmatrix} \]