Measuring Network Dependencies from Node Activations
This thesis develops a method for inferring dependency networks from binary data by describing and accounting for sources of systematic bias. Node “activation” data is a common source of complex networks, such as scientific collaboration networks from publications, to keyword and tag co-occurrence graphs from documents, epidemiological networks from infection records, and customer product recommendations from buyer carts, to name only a few. However,the way these networks get created often lead to dense edge “hairballs” and heavy post-processing needs. To more accurately reconstruct dependency networks, we propose a flexible technique that generalizes co-occurrence counting, and allows domain-knowledge to inform why nodes activate (and how to use that for inference).
Using our technique, we obtain “Desire Path Density” estimates of latent networks: families of possible networks based on probability distributions over individual edge activations, rather than node co-occurrences. We show that co-occurrence counting is a special case of this estimatation technique—and that it assumes node activations arise exclusively from cliques. This assumption leads to “clique-bias” in the recovered graphs, which results in the unwanted hairball effect. Our method, called Forest Pursuit, is a different application of desire path density estimation that attempts to address this bias. It assumes node activations are generated by sampling from distributions over dependency trees.
Forest Pursuit outperforms other algorithms at accurate network recovery, is scalable to very large networks sizes, and can be executed in a streaming or parallel manner over observations. It approximates Steiner trees for each set of activated nodes before aggregating them to estimate the latent global network. We additionally extend Forest Pursuit to be a probabilistic model whose parameters may be infered through Expectation Maximization. This modification shows improved performance over one-shot Forest Pursuit, at the cost of computation time.
Our introductory chapters provide a unified framework to analyze node activation data and edge (dependency) occurrences in related vector spaces, and provide a taxonomy of network recovery assumption types so that practitioners can find and describe sources of systematic bias from those assumptions. Then, along with derivations for Desire Path Density estimation and Forest Pursuit, we verify our method and compare its performance agains others. We created a dataset of synthetic experiments (MENDR), made publically available as a reference and test-bench for the community. Finally, we use several case studies to demonstrate the impacts of using Forest Pursuit for network analysis in a realistic setting,with insights into the topological differences between networks inferred by other methods and ours. By using Forest Pursuit, practitioners can correct for clique-bias and better use their latent networks to uncover important behavior of complex systems.
Foreward
DISCLAIMER FOR PRIOR WORK
Portions of Latent Graphs with Desire Paths, Approximate Recovery in Near-linear Time by Forest Pursuit, Modifications & Extensions, along with network figures from Qualitative Application of Relationship Recovery and Recovery from Working Memory & Partial Orders, are planned for submission to PLOS Complex Systems later this year.
Approximate Recovery in Near-linear Time by Forest Pursuit references MENDR
, a standard reference dataset and test-bench for network recovery algorithms created in the course of completing this thesis. It is awaiting DOI assignment through NIST internal review processes. Likewise for affinis
, a python library for dependency inference from binary activations containing our reference implementations of various algorithms (including Forest Pursuit).
Portions of Recovery from Working Memory & Partial Orders involving INVITE and recovery from partially ordered node sets are taken from [1] in the Journal of Mechanical Design, and its predecessor [2] from the 45th ASME Design Automation Conference.