Motivating example: discovering mathematical topics

\(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bSigma}{\boldsymbol{\Sigma}}\) \(\newcommand{\bfbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bflambda}{\boldsymbol{\lambda}}\) \(\newcommand{\bgamma}{\boldsymbol{\gamma}}\) \(\newcommand{\bsigma}{{\boldsymbol{\sigma}}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\btheta}{{\boldsymbol{\theta}}}\) \(\newcommand{\bphi}{\boldsymbol{\phi}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\blambda}{\boldsymbol{\lambda}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\renewcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bfg}{\mathbf{g}}\) \(\newcommand{\bfh}{\mathbf{h}}\) \(\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}\) \(\renewcommand{\S}{\mathcal{S}}\) \(\newcommand{\X}{\mathcal{X}}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\pa}{\mathrm{pa}}\) \(\newcommand{\Z}{\mathcal{Z}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bLambda}{\boldsymbol{\Lambda}}\) \(\newcommand{\cov}{\mathrm{Cov}}\) \(\newcommand{\bfk}{\mathbf{k}}\) \(\newcommand{\idx}[1]{}\) \(\newcommand{\xdi}{}\)

7.1. Motivating example: discovering mathematical topics#

A common task in network analysis is to identify “central” vertices in a graph. Centrality is a vague concept. It can be defined in many different ways depending on the context and the type of network. Quoting from Wikipedia:

In graph theory and network analysis, indicators of centrality assign numbers or rankings to nodes within a graph corresponding to their network position. Applications include identifying the most influential person(s) in a social network, key infrastructure nodes in the Internet or urban networks, super-spreaders of disease, and brain networks. […] Centrality indices are answers to the question “What characterizes an important vertex?” The answer is given in terms of a real-valued function on the vertices of a graph, where the values produced are expected to provide a ranking which identifies the most important nodes. The word “importance” has a wide number of meanings, leading to many different definitions of centrality.

In an undirected graph, a natural approach is to look at the degree of a vertex as a measure of its importance (also referred to as degree centrality). But it is hardly the only one. One could for instance look at the average distance to all other nodes (its reciprocal is the closeness centrality) or at the number of shortest paths between pairs of vertices going through the vertex (known as betweenness centrality).

What if the graph is directed? Things are somewhat more complicated there. For instance, there is now the in-degree as well as the out-degree.

Let us look at a particular example of practical importance, the World Wide Web (from now on, the Web). In this case, the vertices are webpages and a directed edge from \(u\) to \(v\) indicates a hyperlink from page \(u\) to page \(v\). The Web is much too large to analyze here. Instead, we will consider a tiny (but still interesting!) subset of it, the pages of Wolfram’s MathWorld, a wonderful mathematics resource.

Each page of MathWorld concerns a particular mathematical concept, e.g., scale-free network. A definition and notable properties are described. Importantly for us, in a section entitled “SEE ALSO”, other related mathematical concepts are listed with a link to their MathWorld page. In the case of scale-free networks, the small world network topic is referenced, among others.

The resulting directed graph is available through the NetSet datasets and can be downloaded here. We load it now. For convenience, we have reformatted it into the files mathworld-adjacency.csv and mathworld-titles.csv, which are available on the GitHub of the book.

data_edges = pd.read_csv('mathworld-adjacency.csv')
data_edges.head()
from to
0 0 2
1 1 47
2 1 404
3 1 2721
4 2 0

It consists in a list of directed edges. For example, the first one is an edge from vertex 0 to vertex 2. The second one is from 1 to 47 and so on.

There is a total of \(49069\) edges.

data_edges.shape[0]
49069

The second file contains the titles of the pages.

data_titles = pd.read_csv('mathworld-titles.csv')
data_titles.head()
title
0 Alexander's Horned Sphere
1 Exotic Sphere
2 Antoine's Horned Sphere
3 Flat
4 Poincaré Manifold

So the first edge above is from Alexander's Horned Sphere to Antoine's Horned Sphere. That is, the latter is listed in the “SEE ALSO” section of the former.

There are \(12362\) topics.

n = data_titles.shape[0]
print(n)
12362

We construct the graph by adding the edges one by one. We first convert df_edges into a Numpy array.

edgelist = data_edges[['from','to']].to_numpy()
print(edgelist)
[[    0     2]
 [    1    47]
 [    1   404]
 ...
 [12361 12306]
 [12361 12310]
 [12361 12360]]
G = nx.empty_graph(n, create_using=nx.DiGraph)
for i in range(edgelist.shape[0]):
    G.add_edge(edgelist[i,0], edgelist[i,1])

Returning to question of centrality, we can now try to measure the importance of different nodes. For instance, the in-degree of Alexander's Horned Sphere is:

G.in_degree(0)
5

while that of Antoine's Horned Sphere is:

G.in_degree(2)
1

suggesting that the former is more central than the latter, at least in the sense that it is referenced more often.

But is that the right measure? Consider the following: Antoine's Horned Sphere receives only one reference, but it is from a seemingly relatively important vertex, Alexander's Horned Sphere. How can one take this into account in quantifying its importance in the network?

We will come back to this question later in this chapter. To hint at things to come, it will turn out that “exploring the graph at random” provides a powerful perspective on centrality.