Conclusion & Future Work
Practitioners have long struggled with a lack of techniques for metrological quantification and measurement error handling in network science. There is an ongoing need to specify valid network recovery models—ones that are not only assessed for precision, but designed for trueness. [1] provides a “reclassification” of measurement error in networks, focusing on the true/false positive/negative dichotomies for both edge and node reporting. This implicitly assumes that, like the karate-club graph [2], our observations are of network components (edges/nodes). When the object to be measured is not directly observable (as is the case for recovery from node activations), measurement error can also arise both from model noise sensitivity (lack of precision) and misspecification (lack of trueness). Because of this:
The practice of ignoring measurement error is still mainstream, and robust methods to take it into account are underdeveloped.
– Tiago Peixoto [3]
Discussion and limitations
Much of our discussion on Forest Pursuit and its extensions has revolved around assumptions about data availability and generation. Practitioners often face situations where networks are recovered in essentially “unsupervised” settings, and their data could reasonably be modeled as arising from a spreading process on the graph they wish to recover. However, these assumptions do not hold in general, and it’s worth discussion how they impact the application of Forest Pursuit and where future research could fill in theoretical and practical gaps.
Validation and Network Dynamics
We largely assumed that real-world network recovery is predominantly unsupervised, so that results verification is very difficult in practice. The MENDR dataset provides an initial foray into standardized reference problems, each having a “ground-truth”, but this becomes much more complicated when real-world datasets do not.
One approach to verification would be to do forecasting on dependencies. For collaboration networks, for instance, if two authors publish together, we are assuming they are conditionally dependent on each other (when there are no “hidden” node activations).
These links (the two-author papers) can be used as an incomplete “ground-truth”, so we could theoretically test each algorithm’s ability to predict dependencies when the two-author examples are either held out or predicted using all preceding papers.
One difficulty with this is the sheer number of true-negatives we expect in a sparse graph as the number of nodes increases. In general a sparse connected graph’s edge count only must grow linearly with node count, but the non-edge count grows quadratically. This makes the chance that any two authors with a possible dependency do coauthor exclusively together vanishingly small, in general. Not all conditional dependencies within a department will lead to pair-papers, since some individuals will only publish in larger groups, for instance. To add to the trouble, relationship networks over time have a good chance of being dynamic: people move to new institutions, students graduate, etc. Using predictions of future two-author collaboration will risk running into errors from network dynamics like these. It may be possible to include network dynamics into the relationship inference itself, such as with Dynamic Topic Models [4], but just as before we are left with the difficulty of verifying and validating a (now more complex) unsupervised model.
Within this train-of-thought, however, lies a possibility to create what are called metamorphic tests suitable for verification of unsupervised models [5]. Rather than look for a ground-truth the measure our network against, we can define properties of our network reconstruction that we know must hold given our modeling assumptions or domain knowledge. For the co-author network case, instead of predicting any two-author paper, instead we might construct a list of all student-advisor pairs. Because we know that their relationship is very likely to be dependent, these pairs should be recovered from an algorithm that is reconstructing author dependencies. Algorithms could be tested against each other for performance on these kinds of metamorphic conditions, ensuring correctness in cases we specify. Such metamorphic tests would be a valuable addition to the network reconstruction community, especially if individual domains could compile lists of relevant conditions that should hold in any given reconstruction attempt.
Spreading process assumption
As discussed in Generative Model for Correlated Binary Data, the marked-Random Spanning Forest model was motivated by a need to incorporate prior knowledge about the generating mechanism of our data (namely, spreading processes). Consequently, our validation through synthetic datasets provided in MENDR used random walks to construct node-activations for inferring structure. The results presented here verify Forest Pursuit’s ability to recover structure in this setting, which also helps validate our design given the domain-based constraint (spreading-process generation).
However, other methods do exist for generating (correlated) binary activation data, and adding other types of datasets to the MENDR catalog would provide a mechanism to verify model recovery capability under other generation settings. The addition of thresholded multivariate-normal samples [6] would be a reasonable next-step to increasing the scope of possible verification for thesis algorithms. Validation in these cases would need more theoretical work to provide a framework for understanding the expected behavior of Forest Pursuit under non-spreading assumptions. However, we also hope to see further development of additive/local Desire Path Density models by the community that estimate relationships under other common generative schemes. It is possible that planar graphs and path-graphs are a class that appropriately model MVL (Ising) and Markov (sequential) generation schemes, respectively. More work is needed to show the behavior of Desire Path Densities with other graph classes.
Modifications and extensions to Forest Pursuit
Desire Path Densities and Forest Pursuit are designed to be adaptable to several modalities of use1. However, there are key limitations of the model that could be addressed going forward, as well as future research directions inspired by our modeling paradigm.
1 see for instance Forest Pursuit as Preprocessing
Generalizing inner products on incidences
Speaking of hyperbolic, many authors have recently investigated the usefulness of hyperbolic space for embedding graphs (as vectors) in a way that preserves hierarchies and sparsity naturally [8], [9], [10], [11], [12]. If trees can be embedded losslessly [13] into \(\mathcal{H}^2\)2, then traversing a tree as a random walker could be represented as a trajectory in an \(\mathcal{H}^{2+1}\) de-Sitter space. Various techniques exist for finding embeddings of graphs as “causal-sets” in a Lorentzian spacetime [14], and this approach could be combined with a way to smoothly sample a discretized space with (hyperbolic) Voronoi cells [15], [16]
2 \(\mathcal{H}^2\) is the 2D hyperbolic manifold, embedded in a 3D Euclidean space.
Application areas and case studies
Lastly, a critical test of Forest Pursuit will be its application to a wider variety of domains under the scrutiny of each domain’s experts. From the network community itself, for instance, recent interest has been shown in assessing the methodological reasons for observed assortativity [17]. The conditions under which controlling for clique-bias also reduces assortativity would be a useful tool when deciding to use different network cleaning techniques.
Further afield, semantic Verbal Fluency tests discussed in Forest Pursuit Animal Network are often administered for the purposes of assisting in diagnosis of Alzheimer’s and Schizophrenia patients. While experiments using inferred network structures [18], [19] have been used to detecting early-onset neurological disease from topological differences, it could be useful to re-assess these outcomes when clique-bias has been better accounted for.
All of these applications would be further assisted by Forest Pursuit’s ability to
- Infer network estimates quickly, on streaming data, and
- Incorporate prior (and incoming) knowledge from domain experts on edges.
Together, these properties should make for an ideal human-in-the-loop analysis tool. Indeed, for any qualitative study on nodes and discovering relationships between them, decreasing the annotation load (number of edges to assess)—while increasing edge diversity—will be critical to correctly inventory the important categories of relationships or dependencies3. Building tools that enable practitioners and researchers to undertake complex grounded coding [20] tasks like this is a rich area for possible Human-Systems Integration efforts going forward [21], [22].
Summary of contributions
To develop the practice of taking measurement error into account, we have proposed a combination of problem framing, measurement aggregation techniques, and methods for bias correction when recovering network structure from observed random walk visits.
This thesis provides:
- An interpretation of network recovery as an inverse problem comparable to sparse approximation, with concise definitions of data, edge, and node vector spaces from an underlying incidence-structure formalism.
- A taxonomy of structural assumptions used in literature to make network inference tractable:
- Local, global, or resource/information-flow structural constraints;
- Inverse-problem status (direct or indirect edge observation); and
- Whether activation observations are pre-aggregated before estimation.
- A method, Forest Pursuit, to address the need for a model with:
- Local observation constraints,
- An inverse-problem assumption for indirect edge observation, and
- Dependence on the bipartite nature of node observations.
- A dataset and benchmarking toolkit (MENDR) to reproducibly compare algorithmic ability to recover network structure from random walk activations, which has been applied to demonstrate the scaling and accuracy of Forest Pursuit over other methods.
- Generalization of Forest Pursuit by developing a probabilistic model for it as a sparse dictionary learning technique, for which we provide an expectation maximization scheme to estimate.
- Application of Forest Pursuit as case studies in scientific collaboration networks, classic literature analysis, technical language processing, and semantic verbal fluency tests.
We lay this foundation in the hope of further improving the ability of practitioners to explore the structure of their data in a principled manner.