Introduction
A wide variety of fields show consistent interest in inferring latent network structure from observed interactions, from human cognition and social infection networks, to marketing, traffic, finance, and many others. [1] However, an increasing number of authors are noting a lack of agreement in how to approach the metrology of this problem. This includes rampant disconnects between the theoretical and methodological network analysis sub-communities[2], treatment of error as purely aleatory, rather than epistemic [3], or simply ignoring measurement error in network reconstruction entirely[4]. This thesis builds on recent methodological recommendations for increased focus on how dependencies should play a central role in network analysis [5], and facilitating a paradigm shift toward network analysis as inference of an inverse problem [2].
Ambiguous Metrology
Networks in the “wild” rarely exist of and by themselves. Rather, they are a model of interaction or relation between things that were observed. One of the most beloved examples of a network, the famed Zachary’s Karate Club[6], is in fact reported as a list of pairwise interactions: every time a club member interacted with another (outside of the club), Zachary recorded it as two integers (the IDs of the members). The final list of pairs can be interpreted as an “edge list”, which can be modeled with a network: a simple graph. This was famously used to show natural community structure (Figure 1.1) that nicely matches the group separation that eventually took place when the club split into two.[7]
Note, however, that we could have just as easily taken note of the instigating student for each interaction (i.e., which student initiated conversation, or invited the other to socialize, etc.). If that relational asymmetry is available, our “edges” are now directed, and we might be able to ask questions about the rates that certain students are asked vs. do the asking, and what that implies about group cohesion. Additionally, the time span is assumed to be “for the duration of observation” (did the students ever interact), but if observation time was significantly longer, say, multiple years, we might question the credulity of treating a social interaction 2 years ago as equally important to an interaction immediately preceding the split. This is now a “dynamic” graph; or, if we only measure relative to the time of separation, at the very least a “weighted” one.
This observation raises an interesting metrology problem: We do not know if any of these are true. “Metrology” is not limited to physical units, like “meters” and “grams”, but more generally is concerned with systematic quantification with uncertainty. Units provide a natural framework to describe what a metrologist is usually after: not just “how much”, but “how accurately and precisely that much”, as well. When we use “metrology” in the context of network analysis, we are specifically referring to the need to:
- Quantify a network
- Consider the trueness of that quantification
- Consider the precision of that quantification.
The difference between trueness and precision is a crucial, often overlooked distinction: how close a set of measurements are to a reference value vs. how repeatable/reproducible a measurement is[8].
The metrological questions we posed above are ones of _trueness_we have no way to tell if Zachary’s network model is specified correctly, because the reference network “type” is under-defined and we have no networks to compare it with. We simply have to take the network as a reference unto itself; it is a calibration artefact, much like a physical “meter rod”. However, even with an assumed perfect “trueness”, precision is often an issue as well! In fact, as illustrated in Figure 1.1, we do not know if the network being described from the original edge data even has 77 or 78 edges, due to ambiguous reporting in the original work. Lacking a precise definition of what the graph’s components (i.e., it’s edges) are, as measurable entities, means we cannot estimate the accuracy of the graph, whether for trueness or precision.
Indirect Network Measurement
While the karate club graph has unquantified edge uncertainty derived from ambiguous edge measurements, we are fortunate that we have edge measurements. Regardless of how the data was collected, it is de facto reported as a list of pairs, which lends itself to treatment as a reference artefact. In many cases, we simply do not have such luxury.
Instead, edges are often measured indirectly, and instead we are given lists of node co-occurrences. Networks connecting movies as being “similar” might be derived from data that lists sets of movies watched by each user; networks of disease spread pathways might be implied from patient infection records; famously, we might build a network of collaboration strength between academic authors by mining datasets of the papers they co-author together.
Such networks are derived from what we will call node activation data, i.e., records of what entities happened “together”, whether contemporaneously, or in some other context or artifact. For this class, precision might be easy to assess, having oft-repeated activations.
These networkx are naturally represented as “bipartite” networks, having separate entites for, say, “papers” and “authors”, and connecting them with edges (paper 1 is “connected” to its authors E,H,C, etc.). But analysts are typically seeking the collaboration network connecting authors (or papers) themselves! Networks of relationships in this situation are not directly observed, but which if recovered could provide estimates for community structure, importances of individual authors (e.g. as controlling flow of information), and the “distances” that separate authors from each other, in their respective domains. [9] Common practice assumes that co-authorship in any paper is sufficient evidence of at least some level of social “acquaintance”, where more papers shared means more “connected”.
Thus our social collaboration network in Figure 1.3 (a) is borne out of indirect measurements: author connection is implied through “occasions when co-authorship occurred”. However, authors of papers may recall times that others were added, not by their choice, but by someone else already involved. In fact, the final author list of most papers is reasonably a result of individuals choosing to invite others, not a unanimous, simultaneous decision by all members. Let’s imagine we wished to study the social network of collaboration more directly: if we had the luxury of being in situ as, say, a sociologist performing an academic ethnography, we might have been more strict with our definition of “connection”. If the goal is a meaningful social network reflecting the strength of interraction between colleagues, perhaps we prefer that our edges represent “mutual willingness to collaborate”. Edge “measurement”, then, could involve records of events that show willingness to seek or participate in collaboration event, such as:
- Author (g) asked (e), (h), and (c) to co-author a paper, all of whom agreed
- (i) asked (f) and (j), but (j) wanted to add (b)’s expertise before writing one of the sections
- etc.
Each time two colleagues had an opportunity to work together and it was seized upon we might conclude that evidence of their relationship strengthened. With data like this, we could be more confident in claiming our collaboration network can serve as “ground truth,” as far as empirically confirmed collaborations go. However, even if the underlying “activations” are identical, our new, directly measured graph looks very different.
Fundamentally, the network in Figure 3.1 (b) shows which relationships the authors depend on to accomplish their publishing activity. When causal relations between nodes are being modeled as edges, we call such a graph a dependency network. We will investigate this idea further later on, but ultimately, if a network of dependencies is desired (or implied, based on analysis needs), then the critical problem remaining is how do we recover dependency networks from node activations? What is missing, once again, is any sense of reference value to base our assessment of trueness on. This thesis is primarily concerned with a metrological need within the network analysis community to have terms and techniques for describing and dealing with this problem. What goes wrong when we use co-occurrence/activation data to estimate the dependency network? What goes wrong when we wish to use co-occurrences for metrics like centrality and assortativity, or for exploratory analyses like building relationship type inventories?
Even more practically, networks created directly from bipartite-style data are notorious for quickly becoming far too dense for useful analysis, earning them the (not-so-)loving moniker “hairballs”. Network “backboning,” as it has come to be called tries to find a subset of edges in this hairball that still captures its core topology in a way that’s easier to visualize.[10], [11] Meanwhile, underlying networks of dependencies that cause node activation patterns can provide this: they are almost always more sparse than their hairballs. Accessing the dependency backbone in a principled way is difficult, but doing so in a rapid, scalable manner is critical for practitioners to be able to make use of it to trim their hairballs.
Scope of this work
The purpose of this thesis is to provide a solid foundation for edge metrology when our data consists of binary node activations, by framing network analysis as a problem of inference, as suggested by [2]. We give special focus to binary activations that occur due to spreading processes, such as random walks or cascades on an underlying carrier graph. Recovering the carrier, or, “dependency” network from node activations is of great interest to the network backboning and causal modeling communities, but often involves either unspoken sources of epistemic and aleatory error, or high computation costs (or both). To begin addressing these issues, Part I of this thesis presents a guide and review of current practices, some of their pitfalls, and how common statistical tools apply to the network recovery problem: a Practitioner’s Guide to Network Recovery. We will cover what “measurement” means in our context, and specifically the ways we encode observations, operations, and uncertainties numerically. Clarifying what different versions of what “relation” means (whether proximity or incidence) is critical, since network structure is intended to encode such relations as mathematical objects (despite common ambiguities and confusion around what practitioners intend on communicating through them). Then we organize a literature review to present a cohesive framework for assessing network recovery techniques, based on the assumptions and compromises being made to make the network inference problem tractable.
Next, building on a gap found in the first part, Part II presents a novel method, Forest Pursuit, to extract dependency networks when we know a spreading process causes node activation (e.g. paper co-authorship caused by collaboration requests). We formalize a generic foundation for representing inferred networks as unions of observed subgraphs (Latent Graphs with Desire Paths), which we term Desire Path Density estimates. A closed form for Desire Path tree densities leads to the Forest Pursuit algorithm, (Approximate Recovery in Near-linear Time by Forest Pursuit) which scales linearly with the size of active-node sets, is trivially parallelizable and streamable over dataset size, and is agnostic to network size. We create a new reference dataset to enable community benchmarking of network recovery techniques, and use it show greatly improved accuracy over many other widely-used methods. We then extend Forest Pursuit (Modifications & Extensions) as a Bayesian probabilistic model, for which we present an Expectation Maximization scheme for posterior estimation. This improves upon the accuracy results of Forest Pursuit, at the cost of some speed and scalability, giving analysts multiple options to adapt to their needs.
Last, in Part III we apply Forest Pursuit to several qualitative case-studies. We reconstruct scientific collaboration dataset to re-assess properties of the inferred network, and again with the classic Les Miserables character co-occurrences (Qualitative Application of Relationship Recovery). Then, we investigate network dependency recovery when partial-order information for co-occurrences are available, such as with text analysis (Recovery from Working Memory & Partial Orders), and test Forest Pursuit on a classic verbal fluency “animals” network recovery problem. Finally, we discuss more broadly the future needs of network recovery with Forest Pursuit (Conclusion & Future Work), specifically in the context of human-in-the-loop relationship annotation, hyperbolic graph embeddings, and gradient-based machine learning toolkits.
Overview
In summary, the remainder of this thesis provides the following:
- Metrology with matrices and Vector representations of incidence defines common operations on binary observations, and builds a unified representation of both dependencies and co-occurrences as incidence structures and vectors.
- Roads to Network Recovery reviews current literature, and organizes it into a useful framework for needs assessment and future work.
- Latent Graphs with Desire Paths generalizes co-occurrence estimation to incorporate a priori domain information as Desire Path Density estimates of networks.
- Approximate Recovery in Near-linear Time by Forest Pursuit presents a new scalable algorithm for dependency recovery and validates it on a new testbench for network recovery problems
- Modifications & Extensions builds on Forest Pursuit to improve its performance on certain metrics, and re-formulates it as a probabilistic model.
- Qualitative Application of Relationship Recovery and Recovery from Working Memory & Partial Orders present case studies for applying forest pursuit to network analysis problems without available ground-truth networks.