Conclusion & Future Work

Practitioners have long struggled with a lack of techniques for metrological quantification and measurement error handling in network science. There is an ongoing need to specify valid network recovery models—ones that are not only assessed for precision, but designed for trueness. [1] provides a “reclassification” of measurement error in networks, focusing on the true/false positive/negative dichotomies for both edge and node reporting. This implicitly assumes that, like the karate-club graph [2], our observations are of network components (edges/nodes). When the object to be measured is not directly observable (as is the case for recovery from node activations), measurement error can also arise both from model noise sensitivity (lack of precision) and misspecification (lack of trueness). Because of this:

[1]

D. J. Wang, X. Shi, D. A. McFarland, and J. Leskovec, “Measurement error in network data: A re-classification,” Social Networks, vol. 34, no. 4, pp. 396–409, Oct. 2012, doi: 10.1016/j.socnet.2012.01.003.

[2]

W. W. Zachary, “An information flow model for conflict and fission in small groups,” Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, Dec. 1977, doi: 10.1086/jar.33.4.3629752.

The practice of ignoring measurement error is still mainstream, and robust methods to take it into account are underdeveloped.

– Tiago Peixoto [3]

[3]

T. P. Peixoto, “Reconstructing networks with unknown and heterogeneous errors,” Phys. Rev. X, vol. 8, no. 4, p. 041011, Oct. 2018, doi: 10.1103/PhysRevX.8.041011.

Discussion and limitations

Much of our discussion on Forest Pursuit and its extensions has revolved around assumptions about data availability and generation. Practitioners often face situations where networks are recovered in essentially “unsupervised” settings, and their data could reasonably be modeled as arising from a spreading process on the graph they wish to recover. However, these assumptions do not hold in general, and it is worth discussing how they affect the application of Forest Pursuit and where future research could fill in theoretical and practical gaps.

Validation and Network Dynamics

We largely assumed that real-world network recovery is predominantly unsupervised, so that verifying results is very difficult in practice. The MENDR dataset provides an initial foray into standardized reference problems, each with a known “ground-truth”, but verification becomes much more complicated for real-world datasets that have none.

One approach to verification would be to do forecasting on dependencies. For collaboration networks, for instance, if two authors publish together, we are assuming they are conditionally dependent on each other (when there are no “hidden” node activations).
These links (the two-author papers) can be used as an incomplete “ground-truth”, so we could theoretically test each algorithm’s ability to predict dependencies when the two-author examples are either held out or predicted using all preceding papers.

One difficulty with this is the sheer number of true negatives we expect in a sparse graph as the number of nodes increases. In general, a sparse connected graph’s edge count need only grow linearly with node count, while the non-edge count grows quadratically. This makes the chance that any two authors with a possible dependency coauthor exclusively together vanishingly small. Not all conditional dependencies within a department will lead to pair-papers, since some individuals will only publish in larger groups, for instance. To add to the trouble, relationship networks have a good chance of being dynamic over time: people move to new institutions, students graduate, etc. Using predictions of future two-author collaboration risks running into errors from network dynamics like these. It may be possible to incorporate network dynamics into the relationship inference itself, such as with Dynamic Topic Models [4], but just as before we are left with the difficulty of verifying and validating a (now more complex) unsupervised model.
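To make the imbalance concrete, the following minimal sketch counts edges and non-edges for a hypothetical sparse graph whose edge count grows linearly with node count. The fixed mean degree of 4 and the function name are assumptions chosen purely for illustration.

```python
def edge_balance(n_nodes, mean_degree=4.0):
    """Count edges vs. non-edges in a simple undirected graph of fixed mean degree."""
    possible = n_nodes * (n_nodes - 1) // 2   # all unordered node pairs
    edges = int(mean_degree * n_nodes / 2)    # grows linearly with n_nodes
    non_edges = possible - edges              # grows quadratically with n_nodes
    return {
        "edges": edges,
        "non_edges": non_edges,
        "positive_rate": edges / possible,    # share of pairs that are true edges
    }


for n in (10, 100, 1000):
    print(n, edge_balance(n))
```

Under these assumptions, at 100 nodes there are 200 edges against 4750 non-edges, and the share of positive pairs keeps shrinking as the graph grows, so any hold-out evaluation is dominated by true negatives.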

[4]

D. M. Blei and J. D. Lafferty, “Dynamic topic models,” in Proceedings of the 23rd international conference on machine learning - ICML ’06, ACM Press, 2006, pp. 113–120. doi: 10.1145/1143844.1143859.

[5]

X. Xie, Z. Zhang, T. Y. Chen, Y. Liu, P.-L. Poon, and B. Xu, “METTLE: A METamorphic testing approach to assessing and validating unsupervised machine learning systems,” IEEE Transactions on Reliability, vol. 69, no. 4, pp. 1293–1322, Dec. 2020, doi: 10.1109/tr.2020.2972266.

Within this train of thought, however, lies a possibility to create what are called metamorphic tests, which are suitable for verification of unsupervised models [5]. Rather than look for a ground-truth to measure our network against, we can define properties of our network reconstruction that we know must hold given our modeling assumptions or domain knowledge. For the co-author network case, instead of predicting any two-author paper, we might construct a list of all student-advisor pairs. Because we know that their relationship is very likely to be dependent, these pairs should be recovered by an algorithm that is reconstructing author dependencies. Algorithms could be tested against each other for performance on these kinds of metamorphic conditions, ensuring correctness in cases we specify. Such metamorphic tests would be a valuable addition to the network reconstruction community, especially if individual domains could compile lists of relevant conditions that should hold in any given reconstruction attempt.
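A condition of this kind could be checked with very little machinery. The sketch below is one possible form, assuming a reconstruction is available as a list of undirected node pairs. The function names and author labels are hypothetical and not part of MENDR or Forest Pursuit.

```python
def as_edge_set(pairs):
    """Normalize (u, v) pairs to unordered edges so that (u, v) equals (v, u)."""
    return {frozenset(pair) for pair in pairs}


def required_pair_recall(recovered_edges, required_pairs):
    """Fraction of pairs that must be dependent which appear in the recovered graph."""
    required = as_edge_set(required_pairs)
    if not required:
        raise ValueError("at least one required pair is needed")
    return len(required & as_edge_set(recovered_edges)) / len(required)


# Hypothetical author labels, for illustration only.
recovered = [("ana", "bo"), ("bo", "cy"), ("cy", "dee")]
advisor_pairs = [("bo", "ana"), ("dee", "eli")]
print(required_pair_recall(recovered, advisor_pairs))  # 0.5
```

Because the score only uses pairs we are confident about, it avoids the true-negative imbalance of a full hold-out comparison, at the cost of saying nothing about edges outside the compiled list.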

Spreading process assumption

As discussed in Generative Model for Correlated Binary Data, the marked-Random Spanning Forest model was motivated by a need to incorporate prior knowledge about the generating mechanism of our data (namely, spreading processes). Consequently, our validation through synthetic datasets provided in MENDR used random walks to construct node-activations for inferring structure. The results presented here verify Forest Pursuit’s ability to recover structure in this setting, which also helps validate our design given the domain-based constraint (spreading-process generation).

However, other methods do exist for generating (correlated) binary activation data, and adding other types of datasets to the MENDR catalog would provide a mechanism to verify model recovery capability under other generation settings. The addition of thresholded multivariate-normal samples [6] would be a reasonable next step toward broadening the scope of verification for these algorithms. Validation in these cases would need more theoretical work to provide a framework for understanding the expected behavior of Forest Pursuit under non-spreading assumptions. We also hope to see further development of additive/local Desire Path Density models by the community that estimate relationships under other common generative schemes. It is possible that planar graphs and path-graphs are classes that appropriately model MVL (Ising) and Markov (sequential) generation schemes, respectively. More work is needed to show the behavior of Desire Path Densities with other graph classes.
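As a rough illustration of what a thresholded multivariate-normal generator could look like, the sketch below draws latent normals with a single shared correlation `rho` and activates a node when its latent value is positive. The equicorrelated one-factor construction, the zero threshold, and all names are simplifying assumptions. It does not attempt to match target binary correlations, which is the harder problem a full generator would need to address.

```python
import math
import random


def thresholded_normal_activations(n_obs, n_nodes, rho, seed=0):
    """Binary rows obtained by thresholding an equicorrelated normal vector at zero.

    Each latent value is sqrt(rho) * shared + sqrt(1 - rho) * own, so every
    latent has unit variance and every pair of latents has covariance rho.
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1)")
    rng = random.Random(seed)
    shared_w, own_w = math.sqrt(rho), math.sqrt(1.0 - rho)
    rows = []
    for _ in range(n_obs):
        shared = rng.gauss(0.0, 1.0)  # factor common to all nodes in this observation
        latent = [shared_w * shared + own_w * rng.gauss(0.0, 1.0) for _ in range(n_nodes)]
        rows.append([1 if z > 0.0 else 0 for z in latent])
    return rows


def agreement(rows, i, j):
    """Share of observations in which nodes i and j take the same value."""
    return sum(r[i] == r[j] for r in rows) / len(rows)


print(agreement(thresholded_normal_activations(2000, 5, rho=0.9), 0, 1))
print(agreement(thresholded_normal_activations(2000, 5, rho=0.0), 0, 1))
```

Data produced this way carries no notion of a walk along edges, which is what makes it a useful contrast to the spreading-process datasets.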

[6]

F. Leisch, A. Weingessel, and K. Hornik, “On the generation of correlated artificial binary data.” Vienna University of Economics and Business, 1998. doi: 10.57938/6884F809-93BC-4497-AB2B-FC1611198F5B.

Modifications and extensions to Forest Pursuit

Desire Path Densities and Forest Pursuit are designed to be adaptable to several modalities of use¹. However, there are key limitations of the model that could be addressed going forward, as well as future research directions inspired by our modeling paradigm.

¹ see for instance Forest Pursuit as Preprocessing

Multiple sources and hidden nodes

One of the key assumptions that makes the likelihood of the (marked) Random Spanning Forest tractable is that there is only one “source” node (the random walk starting node), sampled from a uniform categorical distribution. However, by explicitly adding the “root” node that is implied by the random forest distribution (see Figure 7.3), we would immediately gain the possibility of multiple sources. A source would become any node adjacent to the root in a sampled spanning tree. To prevent every node from being activated, the spanning tree could be “pruned” at some depth away from the root (a parameter that we could model with e.g. a Geometric distribution). How many sources get selected in a minimum spanning tree could be controlled through the weights given to edges incident to the root (which, if the reader recalls, is already represented by \(\beta\)). Another possibility is the use of independent Bernoulli activations of nodes as sources (rather than a categorical selector), though heavy regularization would be required to avoid each node always becoming its own source.
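The root-node idea can be sketched with a standard sampler. The code below is an illustration only, not the model developed in this thesis: it augments a small graph with a root joined to every node by an edge of weight `beta`, samples a random spanning tree with Wilson's loop-erased random walk algorithm, reads off the root's children as sources, and prunes at a Geometric depth. The `ROOT` label, the function names, and the example weights are all assumptions.

```python
import random

ROOT = "__root__"  # hypothetical label for the added root node


def sample_rooted_tree(adj, beta, rng):
    """Wilson's algorithm on `adj` augmented with a root joined to every node.

    adj:  {node: {neighbor: weight}}, symmetric, with positive weights
    beta: weight of each root edge (larger beta gives more sources on average)
    Returns a parent map describing a spanning tree rooted at ROOT.
    """
    parent = {ROOT: None}
    for start in adj:
        # Walk until the current tree is hit, remembering only the last exit
        # taken from each node; overwriting exits erases loops implicitly.
        last_exit = {}
        node = start
        while node not in parent:
            targets = list(adj[node]) + [ROOT]
            weights = list(adj[node].values()) + [beta]
            last_exit[node] = rng.choices(targets, weights=weights)[0]
            node = last_exit[node]
        # Attach the loop-erased path to the tree.
        node = start
        while node not in parent:
            parent[node] = last_exit[node]
            node = parent[node]
    return parent


def depth(node, parent):
    """Number of tree edges between `node` and ROOT."""
    steps = 0
    while parent[node] is not None:
        node = parent[node]
        steps += 1
    return steps


def sample_activation(adj, beta, stop_prob, rng):
    """Return (sources, activated nodes) for one pruned tree; 0 < stop_prob <= 1."""
    parent = sample_rooted_tree(adj, beta, rng)
    max_depth = 1
    while rng.random() > stop_prob:  # Geometric(stop_prob) depth on 1, 2, ...
        max_depth += 1
    sources = {v for v, p in parent.items() if p == ROOT}
    active = {v for v in adj if depth(v, parent) <= max_depth}
    return sources, active


path = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0},
        "c": {"b": 1.0, "d": 1.0}, "d": {"c": 1.0}}
sources, active = sample_activation(path, beta=0.3, stop_prob=0.5, rng=random.Random(0))
print(sorted(sources), sorted(active))
```

Removing the root from a sampled tree leaves a forest with one component per source, which is what allows a single sample to represent several simultaneous starting nodes.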

Another assumption is that there are no hidden nodes. A hidden node differs from a false-negative node in that we do not know it exists at all, yet it is a source of dependency nonetheless. One way to frame this is as a need to add a node so that observed distances between nodes are maximally tree-like. [7] uses a greedy algorithm (TreeRep) to learn a tree structure (hyperbolic metric) that has minimum distortion to a given distance metric. If the distortion is too great, they add a Steiner node and continue. Automatically building a tree and simultaneously adding Steiner nodes that reduce distortion would effectively result in recommendations for adding new nodes to the graph. An analyst might be able to review these suggestions, interactively, to find “hidden” nodes.
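One simple way to quantify how tree-like a set of distances is uses the classical four-point condition: in a tree metric, for any four nodes, the two largest of the three pairwise distance sums are equal. The sketch below is a diagnostic only and is not TreeRep. The function names and toy distances are assumptions for illustration.

```python
from itertools import combinations


def symmetric(pair_distances):
    """Expand {(u, v): d} into a nested dict that can be indexed in either order."""
    dist = {}
    for (u, v), d in pair_distances.items():
        dist.setdefault(u, {})[v] = d
        dist.setdefault(v, {})[u] = d
    return dist


def four_point_delta(dist, nodes):
    """Largest violation of the four-point condition; 0.0 for an exact tree metric."""
    worst = 0.0
    for x, y, z, w in combinations(nodes, 4):
        sums = sorted([
            dist[x][y] + dist[z][w],
            dist[x][z] + dist[y][w],
            dist[x][w] + dist[y][z],
        ])
        # In a tree metric the two largest sums coincide.
        worst = max(worst, (sums[2] - sums[1]) / 2.0)
    return worst


# Shortest-path distances on a star (a tree) and on a 4-cycle (not a tree).
star = symmetric({("c", "a"): 1, ("c", "b"): 1, ("c", "d"): 1,
                  ("a", "b"): 2, ("a", "d"): 2, ("b", "d"): 2})
cycle = symmetric({(0, 1): 1, (1, 2): 1, (2, 3): 1, (0, 3): 1,
                   (0, 2): 2, (1, 3): 2})
print(four_point_delta(star, ["a", "b", "c", "d"]))  # 0.0
print(four_point_delta(cycle, [0, 1, 2, 3]))         # 1.0
```

A large value on observed distances would be one possible signal that no tree on the known nodes fits well, which is the situation in which adding a Steiner node might be considered.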

[7]

R. Sonthalia and A. C. Gilbert, “Tree! I am no tree! I am a low dimensional hyperbolic embedding,” arXiv:2005.03847, Oct. 2020. doi: 10.48550/arXiv.2005.03847.

Generalizing inner products on incidences

On the subject of hyperbolic geometry, many authors have recently investigated the usefulness of hyperbolic space for embedding graphs (as vectors) in a way that preserves hierarchies and sparsity naturally [8], [9], [10], [11], [12]. If trees can be embedded losslessly [13] into \(\mathcal{H}^2\)², then traversing a tree as a random walker could be represented as a trajectory in an \(\mathcal{H}^{2+1}\) de-Sitter space. Various techniques exist for finding embeddings of graphs as “causal-sets” in a Lorentzian spacetime [14], and this approach could be combined with a way to smoothly sample a discretized space with (hyperbolic) Voronoi cells [15], [16].
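For readers unfamiliar with the hyperboloid (Lorentz) model of \(\mathcal{H}^2\), the inner product that replaces the Euclidean one is short to write down. The sketch below uses the standard Minkowski form with signature (-, +, +) and the distance \(d(x, y) = \operatorname{arccosh}(-\langle x, y \rangle_L)\). The function names are assumptions, and nothing here is specific to Forest Pursuit.

```python
import math


def lorentz_inner(x, y):
    """Minkowski inner product with signature (-, +, +)."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lift_to_hyperboloid(u, v):
    """Place planar coordinates (u, v) on the upper sheet where <x, x>_L = -1."""
    return (math.sqrt(1.0 + u * u + v * v), u, v)


def hyperbolic_distance(x, y):
    """Geodesic distance between two points on the hyperboloid."""
    # The clamp guards against rounding that pushes the argument just below 1.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))


origin = lift_to_hyperboloid(0.0, 0.0)
point = lift_to_hyperboloid(math.sinh(2.0), 0.0)  # lies at hyperbolic distance 2 from the origin
print(lorentz_inner(point, point))         # close to -1
print(hyperbolic_distance(origin, point))  # close to 2
```

Swapping this inner product in place of the Euclidean one is the kind of generalization this section has in mind. How it interacts with the incidence-structure formalism is left open here.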

[8]

I. Chami, A. Gu, V. Chatziafratis, and C. Ré, “From trees to continuous embeddings and back: Hyperbolic hierarchical clustering,” in Advances in neural information processing systems, Curran Associates, Inc., 2020, pp. 15065–15076. Accessed: Jul. 19, 2022. [Online]. Available: https://proceedings.neurips.cc/paper/2020/hash/ac10ec1ace51b2d973cd87973a98d3ab-Abstract.html

[9]

O. Ganea, G. Becigneul, and T. Hofmann, “Hyperbolic entailment cones for learning hierarchical embeddings,” presented at the International conference on machine learning, PMLR, Jul. 2018, pp. 1646–1655. Accessed: Feb. 01, 2023. [Online]. Available: https://proceedings.mlr.press/v80/ganea18a.html

[10]

M. Nickel and D. Kiela, “Learning continuous hierarchies in the lorentz model of hyperbolic geometry,” presented at the International conference on machine learning, PMLR, Jul. 2018, pp. 3779–3788. Accessed: Feb. 01, 2023. [Online]. Available: https://proceedings.mlr.press/v80/nickel18a.html

[11]

F. Sala, C. D. Sa, A. Gu, and C. Re, “Representation tradeoffs for hyperbolic embeddings,” presented at the International conference on machine learning, PMLR, Jul. 2018, pp. 4460–4469. Accessed: Feb. 01, 2023. [Online]. Available: https://proceedings.mlr.press/v80/sala18a.html

[12]

L. Linzhuo, W. Lingfei, and E. James, “Social centralization and semantic collapse: Hyperbolic embeddings of networks and text,” Poetics, vol. 78, p. 101428, Feb. 2020, doi: 10.1016/j.poetic.2019.101428.

[13]

R. Sarkar, “Low distortion delaunay embedding of trees in hyperbolic plane,” in Graph drawing, M. van Kreveld and B. Speckmann, Eds., in Lecture notes in computer science. Berlin, Heidelberg: Springer, 2012, pp. 355–366. doi: 10.1007/978-3-642-25878-7_34.

² \(\mathcal{H}^2\) is the 2D hyperbolic manifold, which the hyperboloid (Lorentz) model represents as a surface in 3D Minkowski space.

[14]

J. R. Clough and T. S. Evans, “Embedding graphs in lorentzian spacetime,” PLOS ONE, vol. 12, no. 11, p. e0187301, Nov. 2017, doi: 10.1371/journal.pone.0187301.

[15]

F. Nielsen and R. Nock, “Hyperbolic voronoi diagrams made easy,” in 2010 international conference on computational science and its applications, Mar. 2010, pp. 74–80. doi: 10.1109/ICCSA.2010.37.

[16]

R. T. Q. Chen, B. Amos, and M. Nickel, “Semi-discrete normalizing flows through differentiable tessellation,” in Advances in neural information processing systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., Curran Associates, Inc., 2022, pp. 14878–14889. Available: https://proceedings.neurips.cc/paper_files/paper/2022/file/5f61939af1699c82dab00ed36c887968-Paper-Conference.pdf

Application areas and case studies

Lastly, a critical test of Forest Pursuit will be its application to a wider variety of domains under the scrutiny of each domain’s experts. Within the network science community itself, for instance, there has been recent interest in assessing the methodological reasons for observed assortativity [17]. Knowing the conditions under which controlling for clique-bias also reduces assortativity would be useful when deciding between network cleaning techniques.

[17]

D. N. Fisher, M. J. Silk, and D. W. Franks, “The perceived assortativity of social networks: Methodological problems and solutions,” in Trends in social network analysis, Springer International Publishing, 2017, pp. 1–19. doi: 10.1007/978-3-319-53420-6_1.

[18]

T. J. Prescott, L. D. Newton, N. U. Mir, P. W. R. Woodruff, and R. W. Parks, “A new dissimilarity measure for finding semantic structure in category fluency data with implications for understanding memory organization in schizophrenia.” Neuropsychology, vol. 20, no. 6, pp. 685–699, Nov. 2006, doi: 10.1037/0894-4105.20.6.685.

[19]

N. Arias-Trejo et al., “Semantic verbal fluency: Network analysis in alzheimer’s and parkinson’s disease,” Journal of Cognitive Psychology, vol. 33, no. 5, pp. 557–567, Jun. 2021, doi: 10.1080/20445911.2021.1943414.

Further afield, the semantic Verbal Fluency tests discussed in Forest Pursuit Animal Network are often administered to assist in the diagnosis of Alzheimer’s disease and schizophrenia. While inferred network structures [18], [19] have been used in experiments to detect early-onset neurological disease from topological differences, it could be useful to re-assess these outcomes once clique-bias has been better accounted for.

All of these applications would be further assisted by Forest Pursuit’s ability to

Infer network estimates quickly, on streaming data, and
Incorporate prior (and incoming) knowledge from domain experts on edges.

Together, these properties should make for an ideal human-in-the-loop analysis tool. Indeed, for any qualitative study on nodes and discovering relationships between them, decreasing the annotation load (number of edges to assess)—while increasing edge diversity—will be critical to correctly inventory the important categories of relationships or dependencies³. Building tools that enable practitioners and researchers to undertake complex grounded coding [20] tasks like this is a rich area for possible Human-Systems Integration efforts going forward [21], [22].

³ see Edge Connective Efficiency and Diversity

[20]

J. Saldaña, “The coding manual for qualitative researchers,” 2021.

[21]

J. F. Fung et al., Human-in-the-loop technical document annotation: Developing and validating a system to provide machine-assistance for domain-specific text analysis. National Institute of Standards and Technology (U.S.), 2024. doi: 10.6028/nist.tn.2287.

[22]

C. Harper, A. Kumer, S. Stuart, and E. Meszaros, “AI-Informed Approaches to Metadata Tagging for Improved Resource Discovery,” in The Rise of AI: Implications and Applications of Artificial Intelligence in Academic Libraries, in PIL, no. 78., ACRL, 2022. Accessed: Mar. 08, 2024. [Online]. Available: https://www.choice360.org/libtech-insight/using-ai-for-metadata-tagging-to-improve-resource-discovery/

Summary of contributions

To develop the practice of taking measurement error into account, we have proposed a combination of problem framing, measurement aggregation techniques, and methods for bias correction when recovering network structure from observed random walk visits.

This thesis provides:

An interpretation of network recovery as an inverse problem comparable to sparse approximation, with concise definitions of data, edge, and node vector spaces from an underlying incidence-structure formalism.
A taxonomy of structural assumptions used in literature to make network inference tractable:
1. Local, global, or resource/information-flow structural constraints;
2. Inverse-problem status (direct or indirect edge observation); and
3. Whether activation observations are pre-aggregated before estimation.
A method, Forest Pursuit, to address the need for a model with:
1. Local observation constraints,
2. An inverse-problem assumption for indirect edge observation, and
3. Dependence on the bipartite nature of node observations.
A dataset and benchmarking toolkit (MENDR) to reproducibly compare algorithmic ability to recover network structure from random walk activations, which has been applied to demonstrate the scaling and accuracy of Forest Pursuit relative to other methods.
A generalization of Forest Pursuit, developed as a probabilistic model that treats it as a sparse dictionary learning technique, together with an expectation-maximization scheme for estimating it.
Applications of Forest Pursuit as case studies in scientific collaboration networks, classic literature analysis, technical language processing, and semantic verbal fluency tests.

We lay this foundation in the hope of further improving the ability of practitioners to explore the structure of their data in a principled manner.