Graduate Thesis Defense
dissertation.rtbs.dev
You work at an institution where lots of people author publications…
…sometimes together…
Co-authorship \(\rightarrow\) relationships of colleagues?
Build a network!
How good of a job did you do?
Conditions
This work:
[…] the practice of ignoring measurement error is still mainstream, and robust methods to take it into account are underdeveloped.
– Tiago Peixoto (2018) [1]
Surprisingly, the development of theory and domain-specific applications often occur in isolation, risking an effective disconnect between theoretical and methodological advances and the way network science is employed in practice.
– Peel et al. (2022) [2]
Metrology is more than just “units”. In our context, we want to:
ISO 5725-1 Accuracy (trueness & precision) [3]
SV1XV, CC BY-SA 3.0 https://creativecommons.org/licenses/by-sa/3.0, via Wikimedia Commons
Example: Measurement Error for Zachary’s Karate Club
It’s effectively a “calibration artefact” \(\rightarrow\) defined as true, with some precision uncertainty.
Edges are often measured indirectly — i.e. lists of co-occurrences (bipartite papers+authors)
What if…
So let’s do a bipartite projection…
…done?
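Concretely, the bipartite projection is just a Gram matrix on a binary papers-by-authors matrix. A minimal sketch, assuming a toy matrix `X` whose rows are papers and whose columns are authors a–j (names here are illustrative, not the original pipeline):

```python
import numpy as np

# Toy binary papers-by-authors matrix: rows are papers, columns are authors a-j.
X = np.array([
    [0, 0, 1, 0, 1, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 0, 0, 1, 0, 0],
])

cooc = X.T @ X                  # bipartite projection: co-authorship counts
np.fill_diagonal(cooc, 0)       # drop self-co-occurrence
A = (cooc > 0).astype(int)      # unweighted "co-authorship network"
```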
Have you ever had someone added to a paper by someone else?
Almost all paper authors were “asked” by one current author (not cornered by a mob).
A social scientist “on the ground” would likely measure things differently.
- Author (g) asked (e, h, and c) to co-author a paper, all of whom agreed
- Author (i) asked (f and j), but (j) wanted to add (b)’s expertise before writing one of the sections.
- etc.
This is a network of who depends on whom!
Organizing current approaches
\[ \small X^{m\times n}= %X(\{1,2,3,4,\cdots\})=\\ \begin{array}{c c} & \begin{array}{cccccccccc} a & b & c & d & e & f & g & h & i & j\\ \end{array} \\ \begin{array}{c c } x_1\\x_2\\x_3\\x_4\\ \vdots \\ x_m\end{array} & \left[ \begin{array}{c c c c c c c c c c} . & . & 1 & . & 1 & . & 1 & 1 & . & . \\ 1 & . & . & . & 1 & 1 & . & . & . & . \\ . & 1 & . & . & . & 1 & . & . & 1 & 1 \\ . & . & . & 1 & 1 & . & . & 1 & . & . \\ &&&& \vdots &&&&& \\ &&&& \vdots &&&&& \end{array} \right] \end{array} \]
\[ \small B^{\omega \times n} = \begin{array}{c c} & \begin{array}{cccccccccc} a & b & c & d & e & f & g & h & i & j\\ \end{array} \\ \begin{array}{c c } e_1\\e_2\\e_3\\e_4\\e_5\\e_6\\e_7\\e_8\\e_9\\e_{10}\\e_{11}\\ \vdots \\e_\omega\\\end{array} & \left[ \begin{array}{cccccccccc} 1 & . & . & . & . & 1 & . & . & . & . \\ 1 & . & . & . & 1 & . & . & . & . & . \\ . & . & . & . & 1 & 1 & . & . & . & . \\ . & . & . & . & . & 1 & . & . & 1 & . \\ . & . & . & . & 1 & . & 1 & . & . & . \\ . & . & . & . & 1 & . & . & 1 & . & . \\ . & 1 & . & . & . & . & . & . & . & 1 \\ . & . & . & . & . & . & . & . & 1 & 1 \\ . & . & 1 & . & . & . & 1 & . & . & . \\ . & . & . & . & . & . & 1 & 1 & . & . \\ . & . & . & 1 & . & . & . & 1 & . & . \\ &&&& \vdots &&&&& \\ &&&& \vdots &&&&& \end{array} \right] \end{array} \]
\[ \hat{B} \leftarrow f(X)\]
A start: what does “incident” mean, statistically? \(\rightarrow\) Markov Random Fields
\[ P(A\cap B |C ) = P(A|C)P(B|C) \implies (A\perp B|C) \implies (A,B)\notin G \]
Structural Constraint Assumptions
Related to partial pooling in hierarchical Bayes, sitting between unpooled and fully pooled priors.
Data Sampling Assumption: model vs. data
Data is noisy, warped, etc. …what makes it that way?
Given data space observations, we would need the inverse map \(F^{-1}(\mathbf{x})\);
uncertainty is then related to both precision and trueness.
If we have model space observations (data assumed to match the theoretical process),
uncertainty is in precision only.
Data Availability: observations vs. projection
Does our method start with the full bipartite dataset?
Or is it post-processing from marginal counts alone?
We would like a method with:
Can we use co-occurrence counts for association measures? Are we:
[…] interested primarily in whether people know one another, and collaboration on any topic is a reasonable indicator of acquaintance.
Newman & Girvan (2004)
…i.e. assuming model space observations?
If we’re not so explicit, are we:
“So, consciously or unconsciously, many ethologists studying social organization make what might be called the ‘gambit of the group’: they assume that animals which are clustered […] are interacting with one another and then use membership of the same cluster […] to define association.”
– Whitehead & Dufault (1999) [5]
“Group-based methods” later cited as possible reason for bias in assortativity in social networks [6].
The Gram matrix of co-occurrence underlies so many methods, but what is it saying?
a sum of cliques \(\rightarrow\) clique bias
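To see the clique bias directly: each observation row contributes a complete subgraph (clique) on its active nodes, and the Gram matrix is the sum of those cliques. A small hypothetical check:

```python
import numpy as np

X = np.array([[0, 0, 1, 0, 1, 0, 1, 1, 0, 0],    # paper 1: authors c, e, g, h
              [1, 0, 0, 0, 1, 1, 0, 0, 0, 0]])   # paper 2: authors a, e, f

# Each row adds the adjacency of a clique on its active nodes
# (plus self-loops on the diagonal)...
clique_sum = sum(np.outer(x, x) for x in X)
# ...so the Gram matrix is literally a sum of cliques.
assert np.array_equal(clique_sum, X.T @ X)
```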
This framing lets practitioners do a “sanity-check”
Do clique observations make sense for my community?
From a scaling standpoint:
So…
Bigger papers have quadratically more information about connectivity in our social network?
Don’t bigger papers tell us less about who “knows” who?
IDEA: why limit ourselves to cliques?
Recall
If we were a social scientist “on the ground”:
\[ \small %X(\{1,2,3,4,\cdots\})=\\ R^{n\times\omega}= \begin{array}{c c}\small & \begin{array}{ccccccccccc} e_1 & e_2 & e_3 & e_4 & e_5 & e_6 & e_7 & e_8 & e_9 & e_{10} & e_{11}\\ \end{array} \\ \begin{array}{c c } x_1\\x_2\\ \vdots \end{array} & \left[ \begin{array}{ccccccccccc} \ .\; & \;. \; & \;. \; & \;. \; & \;1 \; & \;. \; & \;. \; & \;. \; & \;1 \; & \;1 \; & . \ \\ \ .\; & \;. \; & \;. \; & \;1 \; & \;. \; & \;. \; & \;1 \; & \;1 \; & \;. \; & \;. \; & . \ \\ &&&& \vdots &&&&& \end{array} \right] \end{array} \]
Idea:
Principled way to “sum” over \(\mathbf{x}_i\)?
dankeck, CC0, via Wikimedia Commons
Complex Networks as latent models. [2], [7]
\(\rightarrow\) We can’t know if a link is “real”, because “real” is hard to define!
Instead: “would some expert reasonably answer YES vs NO?”
Alternate framing: would paving a desire path make more sense than not paving?
What prior?
Presented with two equal-length desire paths:
\[ \begin{gathered} \pi_e \sim \text{Beta}(\alpha, 1-\alpha), \quad e\in E\\ z_e \sim \text{Bernoulli}(\pi_e), \quad e \in E \end{gathered} \]
what class of subgraphs do we choose for \(\mathbf{r}_i\)?
Trees
Recovery from random walks in near-linear time
Random walk model of node activation:
Visiting a node leads to its activation
In our case two pieces of information would have been censored:
What do we know?
Assume:
The global network might not be a tree, but …
The activation dependencies for each random walk must be.
Constrain papers \(\mathbf{r}_i\) to be dependency trees
In our model, being on the same paper is being on the same tree in a forest.
\[\mathbf{r}_i \gets f(\mathbf{x}_i)\]
What are we doing here? … Sparse Approximation?
\[ \mathbf{\hat{r}} = \operatorname*{argmin}_{\mathbf{r}}{\|\mathbf{x}-B^T\mathbf{r} \|_2^2} \quad \text{s.t.} \quad \|\mathbf{r}\|_0 = \|\mathbf{x}\|_0 - 1 \qquad(1)\]
Usually solved with matching pursuit [10].
Each iteration selects the atom with the largest inner product \(\langle \mathbf{b}_{i'},\mathbf{x}\rangle\).
But this doesn’t work (in the standard plus-times ring, see tropical factorization [11]).
\(B^TR\) has the shape and sparsity of \(X\), where each entry is the node’s degree in its tree.
Instead, we will take an Empirical Bayes approach:
\[ \mathbf{\hat{r}} = \operatorname*{argmax}_{\mathbf{r}}{\mathcal{L}(\mathbf{r}|\mathbf{x})} \quad \text{s.t.}\quad \mathbf{r}\sim \mathcal{T}(G^*[\mathbf{x}]) \]
Our point estimate is the mode of the spanning tree distribution on activated nodes
…i.e. the Maximum Spanning Tree (MST)
Prim’s: each iteration selects the edge with the smallest distance to the current tree
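A minimal sketch of that point-estimate step, assuming `W` holds current edge-weight estimates (e.g. approximate edge probabilities) and `active` indexes the nodes activated in one observation; SciPy only ships a minimum spanning tree, so the weights are negated to get a maximum one. This illustrates the idea only, not the Forest Pursuit implementation itself:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_point_estimate(W: np.ndarray, active: np.ndarray) -> np.ndarray:
    """Maximum spanning tree on the subgraph induced by activated nodes."""
    sub = W[np.ix_(active, active)]          # restrict to activated nodes
    mst = minimum_spanning_tree(-sub)        # negate: min spanning tree -> max
    return (mst.toarray() != 0).astype(int)  # tree edges for this observation
```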
Does it work?
Measurement Error in Network Dependency Reconstruction
Simulation Study & Test-bench
Community tooling for community problem-solving
MENDR
Dataset
parameters | values |
---|---|
random graph kind | Tree, Block, BA \((m\in\{1,2\})\) |
network \(n\)-nodes | 10,30,100,300 |
random walks | 1 sample \(m\sim\text{NegBinomial}(2,\tfrac{1}{n})+10\) |
random walk jumps | 1 sample \(j\sim\text{Geometric}(\tfrac{1}{n})+5\) |
random walk root | 1 sample \(n_0 \sim \text{Multinomial}(\textbf{n},1)\) |
random seed | 1, 2, … , 30 |
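A sketch of that sampling scheme with numpy's random generator (illustrative only; variable names follow the table):

```python
import numpy as np

rng = np.random.default_rng(seed=1)                     # random seed
n = 100                                                 # network n-nodes

n_walks = rng.negative_binomial(2, 1 / n) + 10          # number of random walks
n_jumps = rng.geometric(1 / n) + 5                      # jumps per random walk
root = rng.multinomial(1, np.full(n, 1 / n)).argmax()   # random walk root
```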
Algorithm | abbrev. | \(\alpha\)? | class | source |
---|---|---|---|---|
Forest Pursuit | FP | Yes | local | - |
GLASSO | GL | No | global | [16], [17] |
Ochiai Coef. | CS | Yes | local | [18] |
Hyperbolic Projection | HYP | No | local | [19] |
Doubly-Stochastic | eOT | Yes | resource | [20], [21] |
High-Salience Skeleton | HSS | Yes | resource | [22] |
Resource Allocation | RP | Yes | resource | [23] |
How does “more data” impact the network estimate quality?
Because of how MENDR samples random walks, we need to de-correlate those trends
How fast is it?
Theoretically, Forest Pursuit is:
Empirically:
See full document for more detail :)
Les Misérables
Network of character co-occurrence by chapter.
Centrality for Fantine, Eponine, and Cosette much higher with FP
Why were they so low in the first place? See:
Estimating character importance despite systematic interaction-reduction bias from authors.
Even so, the occurrences of many other characters are more easily explained through these key female characters, despite their lower overall co-occurrence.
Animal Concept Map
Study participants list out members of some category as fast as possible in a time limit:
Based only on these lists, and the proximity between the words in them, let’s try to make a “map” of animals according to people’s associations in memory.
Using windowed co-occurrences to reflect working memory limitations [26]…
How efficiently can we represent these networks? (range between tree and complete-graph!)
How diverse are edge types when investigating the ways humans connect animals?
For each of the following, assume filtering of edges until the graph would otherwise be disconnected (unless specified otherwise).
Not very efficient; main connectivity/community is context/location-based.
FP is by far the most efficient in edge usage while maintaining connectivity.
An analyst would also have an easier job of creating “edge-type inventories”, making the FP backbone an excellent exploratory assistant: animals can be related because they are:
Multiple Sources & Hidden Nodes
New applications & case studies
For generality, say a practitioner records their measurements as scalar values, i.e., \(x\in\mathbb{S}\in\{\mathbb{R,Z,N},\cdots\}\). Data can be represented as a “design matrix” \(X:\mathbb{S}^{m\times n}\)
\[\begin{gathered} x=X(i,j)\qquad X : I\times J \rightarrow \mathbb{S} \\ i\in I=\{1,\cdots,m\}, \quad j\in J=\{1,\cdots,n\},\qquad I,J:\mathbb{N} \end{gathered}\]
Marginal sums give occurrence counts, matrix products give co-occurrence counts, etc.
We are interested in the feature relationships (author-author), i.e. feature/representation learning when \(\mathbb{S} \equiv \mathbb{B}\)
Unifying graphs and design matrices: edge-centric parameterizations!
Note 1: Ochiai Coefficient (Cosine)
Effectively an uncentered correlation, but for binary observations the “cosine similarity” is also called the Ochiai Coefficient between two sets \(A,B\), where binary “1” stands for an element belonging to the set.[18] \[ \frac{|A \cap B |}{\sqrt{|A||B|}}=\sqrt{p_{1\bullet}p_{\bullet 1}} \rightarrow \frac{X^TX}{\sqrt{\mathbf{s}_i\mathbf{s}_i^T}}\quad \mathbf{s}_i = \sum_i \mathbf{x}_i \]
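A direct numpy translation of that formula, assuming a binary design matrix `X` with no all-zero columns (an illustrative sketch):

```python
import numpy as np

def ochiai(X: np.ndarray) -> np.ndarray:
    """Cosine / Ochiai coefficient between columns of a binary matrix."""
    s = X.sum(axis=0).astype(float)           # occurrence count per node
    C = (X.T @ X) / np.sqrt(np.outer(s, s))   # |A ∩ B| / sqrt(|A||B|)
    np.fill_diagonal(C, 0)                    # ignore self-similarity
    return C
```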
Note 2: Hyperbolic Projection
Attempts to account for the overcounting of co-occurrences on frequently occurring nodes, vs. rarer ones.[19] \[ X^T\text{diag}(1+\mathbf{s}_j)^{-1}X \quad \mathbf{s}_j = \sum_j{\mathbf{x}_j'}\] This re-weights each observation by its degree in the original bipartite graph.
See also: Yule’s Q/Y, Mutual Information, odds ratios, etc.
Note 3: Chow-Liu Spanning Tree
Enforces tree structure globally. Approximates a joint probability \[ P\left(\bigcap_{i=1}^n v_i\right) \approx P' = \prod_{e\in T} P(u_n(e)|v_n(e)) \quad T\in \mathcal{T} \] where \(\mathcal{T}\) is the set of spanning trees for the nodes. The Chow-Liu tree minimizes the Kullback-Leibler (KL) Divergence \(\text{KL}(P \| P')\) by finding the maximum spanning tree over pairwise mutual information weights.[32]
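One way to realize that construction, assuming pairwise mutual information estimates are already collected in a matrix `MI` (a sketch; networkx does the spanning-tree work):

```python
import networkx as nx
import numpy as np

def chow_liu_tree(MI: np.ndarray) -> nx.Graph:
    """Maximum spanning tree over pairwise mutual-information weights."""
    G = nx.from_numpy_array(MI)   # nonzero MI entries become weighted edges
    return nx.maximum_spanning_tree(G, weight="weight")
```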
Note 4: GLASSO
Semidefinite program to find (regularized) maximum likelihood precision of graph-structured multivariate Normal distribution using \(\ell_1\) (“LASSO”) penalty [16]
\[ \operatorname*{argmax} \mathcal{L}(\Theta|\hat{\Sigma}) = \operatorname*{argmin}_{\Theta \succ 0}\ \text{tr}(\hat{\Sigma} \Theta) - \log |\Theta|+ \rho\|\Theta\|_1 \qquad(2)\]
In the binary case Equation 2 is still guaranteed to find the structure of the generating MRF, but only for graphs with singleton separators, as shown in [17].
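For reference, scikit-learn ships an estimator for Equation 2; a minimal usage sketch (the data and the regularization strength `alpha` here are stand-ins):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

X = np.random.default_rng(0).normal(size=(500, 10))   # stand-in observations
model = GraphicalLasso(alpha=0.1).fit(X)

Theta = model.precision_                 # estimated sparse precision matrix
A = (np.abs(Theta) > 1e-8).astype(int)   # nonzero off-diagonals = edges
np.fill_diagonal(A, 0)
```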
Note 5: Resource Allocation
Goes one step further than hyperbolic projection, by viewing each node as having some “amount” of a resource to spend, which gets re-allocated by observational unit. [23] \[ \text{diag}(\mathbf{s}_i)^{-1}X^T\text{diag}(\mathbf{s}_j)^{-1}X \] i.e. two-step random-walk normalization of the bipartite adjacency matrix.
First \(X\) is row-normalized, then column-normalized.
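In code, those two rescalings might be sketched as follows (assuming a binary `X` with nonzero row and column sums):

```python
import numpy as np

def resource_allocation(X: np.ndarray) -> np.ndarray:
    s_rows = X.sum(axis=1, keepdims=True)   # resource per observation (paper)
    s_cols = X.sum(axis=0, keepdims=True)   # resource per node (author)
    W = (X / s_cols).T @ (X / s_rows)       # diag(s_i)^-1 X^T diag(s_j)^-1 X
    np.fill_diagonal(W, 0)
    return W
```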
Note 6: Doubly Stochastic
If \(A\in \mathbb{S}^{n\times n}\) is positive, then there exists \(d_1,d_2\) such that \[W=\text{diag}(d_1)A\text{diag}(d_2)\] is doubly-stochastic, and \(W(u,v)\) is the optimal transport plan between \(u\) and \(v\) with regularized Euclidean distance between them on a graph.[21], [33]
The doubly-stochastic filter [20] removes edges from \(W\) until just before the graph would become disconnected.
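The scaling vectors \(d_1, d_2\) are usually found by Sinkhorn-Knopp iteration; a bare-bones sketch (iteration cap and tolerance are arbitrary choices here):

```python
import numpy as np

def sinkhorn(A: np.ndarray, n_iter: int = 1000, tol: float = 1e-10) -> np.ndarray:
    """Alternately rescale rows and columns of a positive matrix until it is
    (approximately) doubly stochastic."""
    W = A.astype(float).copy()
    for _ in range(n_iter):
        W /= W.sum(axis=1, keepdims=True)   # rows sum to 1
        W /= W.sum(axis=0, keepdims=True)   # columns sum to 1
        if np.allclose(W.sum(axis=1), 1, atol=tol):
            break
    return W
```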
Note 7: High-Salience Skeleton
Count the number of shortest-path trees an edge participates in, out of all the shortest-path trees (one for every node). \[ \frac{1}{n}\sum_{i=1}^n \mathcal{T}_{\text{sp}}(i) \] where \(\mathcal{T}_{\text{sp}}(i)\) is the shortest-path tree rooted at node \(i\) [22]
So, each \(\mathbf{x}_i\) will induce a subgraph \(g_i = G^*[V_i]\), where \(V_i = \{v\in \mathcal{V} | X(i,\mathcal{V})=1\}\).
\[ \begin{gathered} \mathcal{C}_i = \{(V,E,B_i) \in\mathcal{C}\mid B_i(e,v)=B^*(e,v)\mathbf{1}_{V_i}(v)\mathbf{1}_{E_i}(e)\}\\ E_i\in\{\mathcal{E}\in\mathcal{P}(E)\mid g_i[\mathcal{E}]\in\mathcal{C}\} \qquad V_i = \{v\in \mathcal{V} \mid X(i,\mathcal{V})=1\} \end{gathered} \]
In other words, edges that “caused” the node activations in a given observation
The regularized Laplacian \(Q=(I+\beta L)^{-1}\)
IDEA: our data is measuring \(\lim_{n\rightarrow \infty} X^TX \propto Q\), s.t. each \(\mathbf{x}\) comes from a tree
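Computing that kernel is one matrix inverse; a sketch assuming an adjacency matrix `A` and an illustrative `beta`:

```python
import numpy as np

def regularized_laplacian(A: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Q = (I + beta * L)^{-1} for the combinatorial graph Laplacian L."""
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.inv(np.eye(len(A)) + beta * L)
```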
Let’s fix our APS performance!
Completely closes the APS gap, while sacrificing MCC/F-M!
Figure 33: P-R curves for two experiments
Random spanning trees on an augmented graph are precisely the random spanning forests on the original.
Marking a single node can “root” a tree in the forest.
Leads to generative model for multi-output binary activations on a graph:
\[ \begin{aligned} \pi_{e\in E} &\sim \operatorname{Beta}_E(\alpha, 1-\alpha) \\ z_{e\in E} &\sim \operatorname{Bernoulli}_E(\pi_e) \\ \phi_{i\in I, j\in J} &\sim \operatorname{Categorical}_I(\theta) \\ x_{i\in I,j\in J} &\sim \operatorname{RSFm}_K(\beta, z_e, \phi_n) \\ \end{aligned} \]
\[ P(x_{ij}|z_E, \phi_{ij}) = \sum_{T\in\mathcal{T}_{+R}}P(x_{ij}| T,\phi_{ij}) P(T|z_E) \]
Is the same as:
\[ P(x_{ij}|z_E, \phi_{ij}) = \sum_{F\in\mathcal{F}}P(x_{ij}| F,\phi_{ij}) P(F|z_E) \]
\[ \begin{gathered} P(F|z_E) = \frac{1}{Z_\mathcal{F}} \mu(F)\\ Z_\mathcal{F} = \det{(I+\beta L)} \end{gathered} \]
Likelihood: \(\sum_\mathcal{F} P(x_{ij}|F,\phi_{ij})\) is the probability a node occurs on the same subtree as a “marked” one
Alternate between estimating recovered network \(B\) and representation of data in edge-space \(R\):
Small but consistent improvement in reconstruction performance across MENDR dataset.
On a per-edge basis, a (heavily CV-regularized) logistic regression study shows
Most of the runtime penalty appears to occur from network size (perhaps the \(\|\cdot\|_{\infty}\) norm?)
Walber, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons
Various goals when balancing true/false-positives and true/false-negatives:
Note 8: Precision (P)
Fraction of positive predictions that are true: \[P= \frac{TP}{TP+FP}\]
also called “positive predictive value” (PPV)
Note 9: Recall (R)
Fraction of true values that were returned as positive. \[R=\frac{TP}{TP+FN} \]
Inherent trade-off with precision.
Also called the TP-rate (TPR)
Each metric compares a (binary) prediction to the (binary) ground truth.
Let’s find the “expected value” over all thresholds!
Note 10: Matthews Correlation Coefficient (MCC)
Balances all of TP,TN,FP,FN.
\[\frac{TP\cdot TN - FP\cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}\]
Preferred for class-imbalanced problems (like sparse recovery) [35]
Note 11: Fowlkes-Mallows (F-M)
Geometric mean of Precision and Recall
\[\sqrt{P\cdot R}\]
Limit of the MCC as TN approaches infinity[36]
Note 12: Average Precision Score (APS)
Expected precision over the possible recall values \[\text{APS} = \sum_{e=1}^{\omega} P(e)(R(e)-R(e-1))\] approximates the integral under the parametric P-R curve
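All of these except F-M (which scikit-learn reserves for comparing clusterings) are available off the shelf; a minimal sketch scoring a flattened edge prediction against ground truth (the toy arrays are made up):

```python
import numpy as np
from sklearn.metrics import (average_precision_score, matthews_corrcoef,
                             precision_score, recall_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])            # ground-truth edges
scores = np.array([.9, .2, .7, .4, .3, .1, .8, .6])    # predicted edge weights
y_pred = (scores > 0.5).astype(int)                    # one possible threshold

P, R = precision_score(y_true, y_pred), recall_score(y_true, y_pred)
print("P:", P, "R:", R, "F-M:", np.sqrt(P * R))
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("APS:", average_precision_score(y_true, scores))  # threshold-free
```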
Recreating dataset from Newman & Girvan (2004) [37]
How does degree distribution and assortativity (node-neighbor-degree correlation) change?
Directed Acyclic Graphs
Even our original data, if observed by the social scientist, is a “poset”
In that case we did not ultimately get orderings from the papers, but other domains do!
Technical Language Processing
Word order in sentences, tagging, etc.
We want to organize tags/concepts using the way they occur in, say, Maintenance Work Orders (MWOs)
Other than Bag-of-Words (co-occurrence), a common tool is the order-\(n\) Markov Language Model
\[ P_{MC}(t_i|T) \approx P(t_i | t_{i-1}, \cdots,t_{i-n}) \]
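For instance, an order-1 (bigram) version of that model reduces to transition counts over tag sequences; a tiny sketch with made-up maintenance tags:

```python
from collections import Counter, defaultdict

sequences = [["replace", "hydraulic", "hose"],
             ["inspect", "hydraulic", "pump"],
             ["replace", "hydraulic", "pump"]]

# Order-1 Markov model: P(t_i | t_{i-1}) from bigram counts.
counts = defaultdict(Counter)
for seq in sequences:
    for prev, cur in zip(seq, seq[1:]):
        counts[prev][cur] += 1

p = counts["hydraulic"]["pump"] / sum(counts["hydraulic"].values())
print(p)  # 2/3
```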
However, evidence points to memory recall being more like a random walk that reports initial visits only.
In Sexton & Fuge (2020) [41] and in [42], we demonstrate the usefulness of accounting for this.
Compare label propagation on each network vs expert subsystem annotation in Excavator MWOs [43]
Figure 44: Semisupervised MWO Tag Classification with Network Label Propagation
Still, nested stochastic gradient descent is slow. Can we use Forest Pursuit in a similar case?