Recovery from Working Memory & Partial Orders

“So an analytically astute observer would find that a salmon is more closely related to a camel than it is to a hagfish. On what basis, then, do we justify grouping salmon under the label fish, but not camels?”

– Samuel Beswick (sarcastically)

This whole time we have assumed that the ordering of nodes was unknown, or at the very least unreliable. However, there are frequently cases, especially in text processing applications, where we have some sense of an ordering on activations. By a partial order on a set (a “poset”) we mean a relation under which any two elements are either comparable, with one greater (before) or less (after) than the other, or incomparable. Posets correspond precisely to directed acyclic graphs via reachability: the reachability relation of any DAG is a partial order, and every partial order arises this way.

Our original example with authors might be thought of as a poset: (i) precedes/asks (f) and (j), (j) precedes/asks (b), but (b) and (f) are incomparable. We don’t know if (i) asked (f) before or after (j) asked (b).1 We have updated the original example with explicit partial orders in Figure 9.1. Note that nodes in some lists could be re-arranged while keeping the partial order the same (keeping all arrows pointing to the right).

1  Interestingly, in the case where the node activations are given a total order (in the form of timestamps), [1] derive an algorithm called NetInf. It utilizes sums over minimum spanning trees that satisfy time constraints, similar to a special case of Forest Pursuit where all activations have a total ordering.

[1]
M. Gomez-Rodriguez, J. Leskovec, and A. Krause, “Inferring networks of diffusion and influence,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 5, no. 4, pp. 1–37, 2012, doi: 10.1145/2086737.2086741.
Figure 9.1: Observations as partially-ordered sets

Sadly our co-authorship example does not often include the (partial) order of author additions,2 but other common network recovery problems do have an inherent order to them. A very common need is to recover concept association networks, whether from lists of tags or directly from a corpus of written language. What’s needed is an assumption on how the observed partial order of concepts is generated. [2] proposes a “foraging” mechanism, so that concepts get sequentially recalled from “semantic patches” of nearby concepts in memory. The partial order comes from our ability to maintain more than one concept in working memory, so that the next concept can be “foraged” from any of the other recently recalled ones [3], [4].

2 though some domains do imbue special meaning to the listed author order, a fact that might be interesting to investigate in future network recovery efforts!

[2]
T. T. Hills, P. M. Todd, and M. N. Jones, “Foraging in semantic fields: How we search through memory,” Topics in Cognitive Science, vol. 7, no. 3, pp. 513–534, Jun. 2015, doi: 10.1111/tops.12151.
[4]
T. T. Hills and T. Pachur, “Dynamic search and working memory in social recall.” Journal of Experimental Psychology: Learning, Memory, and Cognition, vol. 38, no. 1, pp. 218–228, 2012, doi: 10.1037/a0025161.
[6]
M. P. Brundage, R. Sexton, M. Hodkiewicz, A. Dima, and S. Lukens, “Technical language processing: Unlocking maintenance knowledge,” Manufacturing Letters, vol. 27, pp. 42–46, Jan. 2021, doi: 10.1016/j.mfglet.2020.11.001.
[7]
R. Sexton and M. Fuge, “Using semantic fluency models improves network reconstruction accuracy of tacit engineering knowledge,” in Volume 2A: 45th design automation conference, American Society of Mechanical Engineers, Aug. 2019. doi: 10.1115/detc2019-98429.
[8]
R. Sexton and M. Fuge, “Organizing tagged knowledge: Similarity measures and semantic fluency in structure mining,” Journal of Mechanical Design, vol. 142, no. 3, Jan. 2020, doi: 10.1115/1.4045686.

In this section, we briefly cover a method for network inference by [5] that utilizes partial order information from ordered lists of concepts, called INVITE. We use it to demonstrate improvement over bag-of-words and Markov assumptions for downstream technical language processing [6] tasks, as originally demonstrated in [7] and [8].

Finally, we show that using Forest Pursuit for partially ordered data can still be quite useful for network backboning, at a fraction of the computational cost. We investigate a network recovery task from verbal/semantic fluency data [9], which involves recovery of a network of animal relationships from memory and recall experiments. Even without directly using partial order information, proper data preparation along with previously-discussed (un-ordered) recovery methods can lead to significantly improved network backboning and analysis capability.

Technical Language Processing with INVITE

Maintenance work orders are often represented as categorized data, though increasingly quantitative use of a technician’s descriptions is being pursued by maintenance management operations [10], [11]. Tags are another way to flexibly structure this otherwise “unstructured” data, which Table 9.1 shows in comparison to more traditional categorization.

[11]
R. Sexton, M. Hodkiewicz, and M. P. Brundage, “Categorization errors for data entry in maintenance work-orders,” Annual Conference of the PHM Society, vol. 11, no. 1, Sep. 2019, doi: 10.36001/phmconf.2019.v11i1.790.
Table 9.1: Maintenance Work Order as categorized vs. tagged data

Whether entered directly or generated from text by keyword extraction, the tags will tend to have ordering information readily available. A traditional way to model this kind of text is through either bag-of-words (the co-occurrence node activation data already discussed) or as a sequence of order-\(n\) Markov model emissions. An order-\(n\) Markov model (\(\text{MC}_n\)) estimates the probability of observing the \(i\)th item \(t_i\) in a sequence \(T\) as \[ P(t_i|T) \approx P(t_i | t_{i-1}, \cdots,t_{i-n}) \]
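As a concrete contrast, consider how each baseline counts evidence for edges from the same tag lists. This is a minimal sketch under our own illustrative data and variable names, not the reference implementation:

```python
from collections import Counter
from itertools import combinations

# hypothetical tag lists, in the order a technician entered them
tag_lists = [["hydraulic", "hose", "leak"], ["engine", "hose", "replace"]]

# Bag-of-words: every pair in a record co-occurs (the clique assumption).
cooc = Counter()
for tags in tag_lists:
    for a, b in combinations(sorted(set(tags)), 2):
        cooc[(a, b)] += 1

# Order-1 Markov (MC1): only adjacent pairs, in observed order, count.
mc1 = Counter()
for tags in tag_lists:
    for a, b in zip(tags, tags[1:]):
        mc1[(a, b)] += 1  # directed transition a -> b
```

The Markov counter only ever credits edges along the reported sequence, which is exactly the precision/recall trade-off discussed next.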

Unlike the clique bias from before, assuming Markov jumps for each observation leads to a different kind of bias, with higher precision but reduced recall, as shown in Figure 9.2.

Figure 9.2: Partial-order edge measurements with Markov assumption

Without knowing the underlying dependency relationships, it’s difficult to estimate which edges were used by a random-walker, since subsequent visits in memory to a “tag” are not being reported once a technician first adds it. [5] call this an “Initial-visit-emitting” random walk, or INVITE for short. To more accurately recover the network, they suggest maximizing the absorption probability for each step of a partial order, individually, knowing which nodes have already been activated.

Optimizing absorbing-state probabilities

Say the set of components or concepts that have a corresponding tag in our system is denoted by the node-set \(N\). A user-given set of \(T\) tags3 for a specific record can be denoted as a Random Walk (RW) trajectory \(\mathbf{t}=\{t_1, t_2, t_3, \cdots t_{T}\}\), where \(T\leq |N|\). This limit on the size of \(T\) assumes tags form a set of unique entries. Any transitions between previously visited tags in \(\mathbf{t}\) will not be directly observed, making the transitions observed in \(\mathbf{t}\) strictly non-Markov, and allowing a potentially infinite number of possible paths to arrive at the next tag through previously visited ones. Instead of computing directly over this intractable model for generating \(\mathbf{t}\), the key insight of the original INVITE paper [5] comes from partitioning \(\mathbf{t}\) into \(T-1\) Markov chains with absorbing states. Previously visited tags are “transient” states, and unseen tags are “absorbing”. It is then possible to calculate the absorption probability for the \(k\)th transition (\(t_k \rightarrow t_{k+1}\)) using the fundamental matrix of each partition. If the partition at this jump consists of \(q\) transient states with transition matrix among themselves \(\mathbf{Q}^{(k)}_{q\times q}\), and \(r\) absorbing states with transitions into them from \(q\) as \(\mathbf{R}^{(k)}_{q\times r}\), the Markov transition matrix \(\mathbf{M}^{(k)}_{n\times n}\) has the form \[ \mathbf{M}^{(k)} = \begin{pmatrix} \mathbf{Q}^{(k)} & \mathbf{R}^{(k)} \\ \mathbf{0} & \mathbf{I} \end{pmatrix} \tag{9.1}\]

3 While some sources use “tagging” as a proxy for a set of strictly un-ordered labels (as in multi-label classification), we preserve the mechanism by which the tags were generated in the first place, i.e., in a specific order.

[5]
K.-S. Jun, J. Zhu, T. T. Rogers, Z. Yang, et al., “Human memory search as initial-visit emitting random walk,” Advances in Neural Information Processing Systems, vol. 28, pp. 2389–2393, 2015.
[12]
P. G. Doyle and J. L. Snell, “Random walks and electric networks.” arXiv, 2000. doi: 10.48550/ARXIV.MATH/0001057.

Here \(\mathbf{0}\), \(\mathbf{I}\) represent the lack of transitions between/from absorbing states. It follows from [12] that the probability \(P\) of a chain starting at \(t_k\) being absorbed into state \(t_{k+1}\) is given as \[ \begin{gathered} P\left(t_{k+1} \middle| t_{1:k},\mathbf{M}\right) = \left.\mathbf{N}^{(k)}\mathbf{R}^{(k)}\right|_{q,1}\\ \mathbf{N}^{(k)} = \left( \mathbf{I}-\mathbf{Q}^{(k)} \right) ^{-1} \end{gathered} \tag{9.2}\]
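In code, each step’s absorption probability is a small linear solve. The following is a minimal numpy sketch of Equations 9.1 and 9.2 (function and variable names are ours, and we assume \(\mathbf{M}\) is row-stochastic):

```python
import numpy as np

def invite_step_prob(M, visited, target):
    """P(t_{k+1} | t_{1:k}, M): probability that a walk from visited[-1],
    wandering only among already-visited states, is absorbed at `target`.
    """
    transient = list(visited)                      # the q transient states
    Q = M[np.ix_(transient, transient)]            # Eq. 9.1: Q block
    r = M[np.ix_(transient, [target])]             # one column of the R block
    N = np.linalg.inv(np.eye(len(transient)) - Q)  # fundamental matrix (Eq. 9.2)
    return (N @ r)[-1, 0]                          # chain starts at t_k (last row)
```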

The probability of being absorbed at \(k+1\) conditioned on jumps \(1:k\) is thus equivalent to the probability of observing the \(k+1\) INVITE tag. If we approximate an a priori distribution of tag probabilities to initialize our chain as \(t_1\sim\text{Cat}(n,\theta)\) (which could be empirically derived or simulated), then the likelihood of our observed tag chain \(\mathbf{t}\), given a transition matrix, is: \[ \mathcal{L}\left(\mathbf{t}| \theta, \mathbf{M}\right) = \theta(t_1)\prod_{k=1}^{T-1} P\left(t_{k+1}\,\middle|\ t_{1:k},\mathbf{M}\right) \]

Finally, if we observe a set of tag lists \(\mathbf{C} = \left\{ \mathbf{t}_1, \mathbf{t}_2, \cdots, \mathbf{t}_{c} \right\}\), and assume \(\theta\) can be estimated independently of \(\mathbf{M}\), then we can frame the problem of structure mining on observed INVITE data as a minimization of negative log-likelihood. A point estimate for our association network given \(\mathbf{M}\) can be found as: \[ \mathbf{M}^* \leftarrow \operatorname*{argmin}_{\mathbf{M}} \quad \sum_{i=1}^{C} \sum_{k=1}^{T_i-1} -\log P \left(t^{(i)}_{k+1} \middle| t^{(i)}_{1:k},\mathbf{M}\right) \]

This (deeply nested) likelihood can now be optimized using standard solvers, and our reference implementation uses stochastic gradient descent via PyTorch [13].4
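Below is a hedged PyTorch sketch of that minimization, not the reference code: we assume a softmax parameterization to keep \(\mathbf{M}\) row-stochastic, and toy index chains stand in for real tag lists.

```python
import torch

tag_chains = [[0, 3, 2], [1, 2, 4]]          # toy chains of tag indices
n = 5                                        # number of tags/nodes
logits = torch.randn(n, n, requires_grad=True)
opt = torch.optim.SGD([logits], lr=0.1)

def step_nll(M, visited, target):
    Q = M[visited][:, visited]               # transitions among visited tags
    r = M[visited][:, [target]]              # transitions into the next tag
    N = torch.linalg.inv(torch.eye(len(visited)) - Q)
    return -torch.log((N @ r)[-1, 0])        # -log of Eq. 9.2

for epoch in range(100):
    opt.zero_grad()
    M = torch.softmax(logits, dim=1)         # row-stochastic transition matrix
    loss = sum(step_nll(M, t[:k], t[k])
               for t in tag_chains for k in range(1, len(t)))
    loss.backward()
    opt.step()
```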

[13]
A. Paszke et al., “Automatic differentiation in PyTorch,” in NIPS-w, 2017.

4  The \(n\)-gram Markov transition models (MC1, MC2) used for comparison against INVITE were trained using pomegranate [14].

[14]
J. Schreiber, “Pomegranate: Fast and flexible probabilistic modeling in python,” Journal of Machine Learning Research, vol. 18, no. 164, pp. 1–6, 2018.

Application: Mining Excavator MWOs

To assess the applicability of the INVITE-based similarity measure to real-world scenarios, we apply our model to tags annotated for a mining dataset pertaining to 8 similarly-sized excavators at various sites across Australia [15], [16].

[15]
M. R. Hodkiewicz, Z. Batsioudis, T. Radomiljac, and M. T. Ho, “Why autonomous assets are good for reliability–the impact of ‘operator-related component’ failures on heavy mobile equipment reliability,” in Annual conference of the prognostics and health management society 2017, 2017.
[16]
M. Hodkiewicz and M. T.-W. Ho, “Cleaning historical maintenance work order data for reliability analysis,” Journal of Quality in Maintenance Engineering, vol. 22, no. 2, pp. 146–163, 2016.
[17]
R. B. Sexton and M. B. Brundage, “Nestor: A tool for natural language annotation of short texts,” Journal of Research of the National Institute of Standards and Technology, vol. 124, Nov. 2019, doi: 10.6028/jres.124.029.
[10]
R. Sexton, M. Hodkiewicz, M. P. Brundage, and T. Smoker, “Benchmarking for keyword extraction methodologies in maintenance work orders,” Annual Conference of the PHM Society, vol. 10, no. 1, Sep. 2018, doi: 10.36001/phmconf.2018.v10i1.541.

The tags were created by a subject-matter expert spending 1 hour in the annotation assistance tool nestor [17], using a methodology outlined in a previous benchmarking study for that annotation method [10].

That work assessed the ability of tags to estimate survival curves and mean time-to-failure, compared against a custom-designed keyword extraction tool that classifies maintenance issues by subsystem. While certain sets of tags were able to predict time-to-failure with high accuracy for certain subsystems, a key problem identified in that work is knowing a priori which tags best indicate that a subsystem is failing.

Which tags best represent a given subsystem?

Some tags are sufficient (albeit not necessary) conditions for indicating a subsystem. That the “hydraulic” tag indicates a Hydraulic System MWO is obvious, but a “valve” tag might indicate one just as well, with “hydraulic” implied though not present. Consequently, we can treat the problem of assigning tags to represent a subsystem as a semi-supervised multi-class classification problem in the style of [18]. As in that work, we need a selection of tag\(\rightarrow\)subsystem assignments, as well as a network of weighted tag-tag edges.

[18]
K. Avrachenkov, P. Chebotarev, and A. Mishenin, “Semi-supervised learning with regularized laplacian,” Optimization Methods and Software, vol. 32, no. 2, pp. 222–236, Mar. 2017, doi: 10.1080/10556788.2016.1193176.
[19]
M. Ho, “A shared reliability database for mobile mining equipment,” PhD thesis, University of Western Australia, 2015.

Then, if we can compare the semi-supervised tag classifications to a ground-truth classification by a human annotator (which are available for the excavator dataset thanks to [19]), we can assess the ability of each network to capture the human annotator’s internal/cognitive tag relationship structure.

Which network assigns tags to subsystems most like a domain expert?

To test the ability of the similarity measures to accomplish this, the top three most common subsystems in the data were used as classes, namely, Hydraulic System, Engine, and Bucket. The tags “hydraulic”, “engine”, and “bucket” were assigned to those subsystems as known labels, respectively. Tags were filtered to only include ones of high importance and sufficient information: only work orders containing at least 3 unique tags, and only tags that occurred at least 10 times within those work orders, were included for this analysis (\(C=263\) MWOs, \(N=40\) tags). Then the number of occurrences of every tag can be compared across subsystems, giving each tag a ground-truth multinomial (categorical) probability distribution for occurring within each subsystem. We determine ground-truth classification labels as subsystems that account for \(\geq60\%\) of each tag’s occurrences. Tags more balanced than that are considered “unknown subsystem”.
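The filtering and ground-truth labeling might look like the following pandas sketch; the file and column names (`mwo_id`, `tag`, `subsystem`) are hypothetical stand-ins for the excavator annotations:

```python
import pandas as pd

df = pd.read_csv("excavator_tags.csv")   # hypothetical long-format tag table

# keep MWOs with >= 3 unique tags, then tags occurring >= 10 times in those MWOs
df = df.groupby("mwo_id").filter(lambda g: g["tag"].nunique() >= 3)
df = df.groupby("tag").filter(lambda g: len(g) >= 10)

# per-tag multinomial over subsystems; threshold at 60% for a hard label
dist = pd.crosstab(df["tag"], df["subsystem"], normalize="index")
labels = dist.idxmax(axis=1).where(dist.max(axis=1) >= 0.6, "unknown")
```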

To perform semi-supervised classification on the recovered relationship graphs, we use a label-spreading algorithm described in [20], which itself was inspired by spreading activation networks in experimental psychology [21], [22]. The result of this algorithm is a score for each tag-class pair, with the classification being the maximally scored class for each tag. These class assignments can then be compared to the ground-truth labels, which we have done by weighted macro-averaging of the \(F_1\)-score (see Figure 9.3 (a)).
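The iteration from [20] is compact enough to state directly. Below is a minimal implementation (our variable names) operating on a recovered tag-tag weight matrix \(W\) and one-hot seed labels \(Y\):

```python
import numpy as np

def label_spread(W, Y, alpha=0.9, iters=200):
    """Zhou et al. label spreading: F <- alpha*S@F + (1-alpha)*Y."""
    d = W.sum(axis=1).astype(float)
    d[d == 0] = 1.0                       # guard isolated nodes
    S = W / np.sqrt(np.outer(d, d))       # D^{-1/2} W D^{-1/2}
    F = Y.astype(float)
    for _ in range(iters):
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F                              # per-tag class scores

# e.g. seed the rows for "hydraulic", "engine", "bucket" in Y with their
# subsystems, then classify every tag by F.argmax(axis=1).
```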

[20]
D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” in Advances in neural information processing systems, 2004, pp. 321–328.
[21]
J. R. Anderson, The architecture of cognition. Psychology Press, 2013.
[22]
J. Shrager, T. Hogg, and B. A. Huberman, “Observation of phase transitions in spreading activation networks,” Science, vol. 236, no. 4805, pp. 1092–1094, 1987.
Figure 9.3: Semisupervised MWO Tag Classification with Network Label Propagation

The INVITE-based similarity measure far outperforms the other measures as a preprocessor for label-spreading, when measured by average \(F_1\)-score. However, since these “classifications” are actually thresholded multinomial distributions (with some tags regularly occurring across multiple subsystems), how do we know whether an underlying structure has actually been recovered, rather than simply a black-box classifier that happens to perform well in this setting?

To begin answering this question, we might ask whether the relative scores returned by label-spreading are similar to the original multinomial distributions themselves, rather than only comparing the overall classifications. To find out, we use softmax normalization5 to transform each tag’s scores into a “predicted multinomial”, before finally calculating the Kullback-Leibler divergence (KLD) between the true and predicted multinomials for every tag. The total KLD, summed over all tags, is also shown in Figure 9.3 (b), along with positions of each tag’s multinomial as projected onto the 2-simplex for the true and \(F_1\)-optimal predicted distributions. Once again, INVITE performs much better at this task, over a wide range of \(\sigma\) (lower is better).
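A sketch of that evaluation, assuming `scores` from label-spreading and `true_dists` from the tag multinomials (both synthesized here), with the temperature search of footnote 5 done over log-temperature for numerical safety (our choice):

```python
import numpy as np
from scipy.special import softmax
from scipy.stats import entropy          # entropy(p, q) = KL(p || q)
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
scores = rng.normal(size=(40, 3))                  # tags x classes
true_dists = rng.dirichlet(np.ones(3), size=40)    # ground-truth multinomials

def total_kld(log_temp, scores, true_dists):
    pred = softmax(scores / np.exp(log_temp), axis=1)
    return sum(entropy(p, q) for p, q in zip(true_dists, pred))

res = minimize_scalar(total_kld, bracket=(-2.0, 2.0),
                      args=(scores, true_dists), method="brent")
best_temp = np.exp(res.x)
```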

5 For visualization, a temperature parameter was added to softmax, and this was optimized for minimum KLD via Brent’s method [23] for each similarity measure independently to provide an equal footing for comparison.

[23]
R. P. Brent, “An algorithm with guaranteed convergence for finding a zero of a function,” The Computer Journal, vol. 14, no. 4, pp. 422–425, 1971.

A reason for the performance disparity can be seen in the simplex projections: recovered topology via INVITE-similarity does a much better job of separating the three classes, while not letting any single tag overcompensate by dominating a subsystem’s area. Even the “unknown” tags are correctly placed roughly between Bucket and Hydraulic System regions, reflecting the true topology of the system.

Forest Pursuit Animal Network

For our last case study, we return to Forest Pursuit as a useful tool for analysis, even when that means ignoring partial ordering information. Note that Equation 9.2 is effectively the same as a (\(\beta=1\)) forest matrix if the transition (normalized adjacency) matrix were replaced by a (sub-)graph Laplacian. Intuitively, the INVITE loss function sums absorption (log-)probabilities at each new node activation: i.e., at each step of a tree’s creation. Because the probability of a sampled tree is precisely proportional to the product of its edge weights, weighting a tree by its absorption probabilities and running INVITE should have a mathematically similar effect to the FP estimate. The similarity would be exact whenever the tree that a set of nodes traveled along was the mode of the tree distribution: the MST.
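To make the correspondence concrete (our notation, with the forest matrix written in its regularized-Laplacian form): \[ \mathbf{N}^{(k)} = \left(\mathbf{I} - \mathbf{Q}^{(k)}\right)^{-1} \quad \longleftrightarrow \quad \left(\mathbf{I} + \beta\mathbf{L}\right)^{-1}\Big|_{\beta=1} \] so the sub-chain’s transition matrix \(\mathbf{Q}^{(k)}\) plays the role of the negated (sub-)graph Laplacian \(-\mathbf{L}\).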

In other words, whenever the random walks did use the minimum distance to reach each node, the two methods should be equivalent. While this rarely happens for any individual walk, the effect of many random walks will average out to this MST, precisely because it is the mode of the tree distribution from which we assume node activations are sampled under marked Random Spanning Forests.

To illustrate FP’s efficacy in network estimation despite ignoring partial order information, we turn to a classic network recovery problem in this space: semantic networks from fluency data [9], [24], [25].

[24]
T. J. Prescott, L. D. Newton, N. U. Mir, P. W. R. Woodruff, and R. W. Parks, “A new dissimilarity measure for finding semantic structure in category fluency data with implications for understanding memory organization in schizophrenia.” Neuropsychology, vol. 20, no. 6, pp. 685–699, Nov. 2006, doi: 10.1037/0894-4105.20.6.685.

Domain-aware preprocessing

Verbal (semantic) fluency tests ask participants to list as many items belonging to a prompted category as they can in the available time. For instance, an “animals” prompt could lead to an answer like “dog, cat, lion, tiger, bear, wolf…”, etc. Like before, the general idea is that recall of each item derives from a random walk in a cognitive “memory space”, with emissions (usually) only when a new animal is encountered. A participant might have backtracked internally to dog from bear before jumping to wolf, for example.

Using a high-dimensional multilabel embedding space is possible, with one column per animal, but the lists tend to be quite long and give networks with “hairball” tendencies. However, if our intention is to find dependencies between animals for a large number of participants, we might reasonably limit the co-occurrence to only a nearby subset of each list. Effectively, we can limit the set of possible co-occurring animals based on our domain knowledge, namely, the limits of working memory. Conservatively, with a well-cited work claiming 7 \(\pm\) 2 semantic units can be held in working memory at a time [3], we can limit the set of possible dependencies for a given item to the 5 items on either side. The 5 previous might have originated a jump to the current term, and the 5 after might be subsequent targets, for a rolling window of 10 terms at a time.6
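This windowing is simple to implement; a minimal sketch (our function) that turns one fluency list into binary “observations”:

```python
def rolling_windows(items, size=10):
    """Split a fluency list into overlapping working-memory windows."""
    if len(items) <= size:
        return [set(items)]
    return [set(items[i:i + size]) for i in range(len(items) - size + 1)]

windows = rolling_windows(
    ["dog", "cat", "lion", "tiger", "bear", "wolf", "fox", "deer",
     "moose", "goose", "duck", "swan"])
# each window becomes one row of the new (binary) activation matrix
```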

[3]
G. A. Miller, “The magical number seven, plus or minus two: Some limits on our capacity for processing information.” Psychological Review, vol. 63, no. 2, pp. 81–97, Mar. 1956, doi: 10.1037/h0043158.

6  This is precisely the logic that leads to the use of Skip-Gram and Continuous-Bag-of-Words transformations of text to weakly supervise word2vec and GloVe models[26], [27].

[26]
J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.
[27]
T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[9]
J. C. Zemla and J. L. Austerweil, “Estimating semantic networks of groups and individuals from fluency data,” Computational brain & behavior, vol. 1, pp. 36–58, 2018, doi: 10.31234/osf.io/kg45r.

For our case study, we use lists of animals submitted by participants as described in [9], [25]. We limit the animals under consideration to those that occurred in more than 30 lists (for a good sample size), and keep only lists containing at least two animals. This resulted in 100 animals over 293 fluency lists. However, we ignore this filtering when creating the 10-animal rolling windows, to avoid including unrelated animals in prematurely filtered windows. After re-applying the filter, 100 animals appear across 8020 working-memory windows. Figure 9.4 shows the effects of this preprocessing on the marginal distributions.

Figure 9.4: Effects of rolling-window activations on observation data

Doing this preprocessing (for rolling-windows of 10) shifts relative animal frequencies downward (since there are many more “observations” from the rolling window), while also shifting the number of animals-per-observation to be strictly less than 10. As desired, the pairwise cosine similarity of all vectors \(\mathbf{x}'_{j'}, \mathbf{x}'_{j}\) is significantly reduced. While many participants might cover similar animals overall, we want to investigate animal dependencies locally, and we don’t expect individuals to always recall animals in the same memory “area” the whole time.

Edge Connective Efficiency and Diversity

To compare the results of different backboning techniques, we introduce a new simple measure to quantify a network’s sparsity, in terms of how many edges more than \(n-1\) (the minimum number needed to connect all \(n\) nodes) are being used. \[ \begin{gathered} \psi_n(e) = \frac{e_{\text{min}}}{e}\frac{e_{\text{max}}-e}{e_{\text{max}}-e_{\text{min}}}\\ e_{\text{max}} = \frac{n(n-1)}{2} \quad e_{\text{min}} = n-1 \end{gathered} \tag{9.3}\]

We call \(\psi\) the graph’s “connective efficiency”: it ranges from 0 when the graph is fully connected, to 1 when it is a tree, to values greater than 1 when it has too few edges to be connected at all. This measure is intended to compare graphs that have had edges removed until they are about to be disconnected, such as with the Doubly Stochastic Filter (DS) [28]. However, values greater than 1 also give insight into how much sparser than a tree some graph is.
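Equation 9.3 is a one-liner in practice; here `e` is the number of edges surviving backboning on an `n`-node graph:

```python
def connective_efficiency(e, n):
    """psi_n(e) from Equation 9.3."""
    e_min, e_max = n - 1, n * (n - 1) // 2
    return (e_min / e) * (e_max - e) / (e_max - e_min)

connective_efficiency(99, 100)    # a spanning tree on 100 nodes -> 1.0
```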

[28]
P. B. Slater, “A two-stage algorithm for extracting the multiscale backbone of complex weighted networks,” Proceedings of the National Academy of Sciences, vol. 106, no. 26, pp. E66–E66, Jun. 2009, doi: 10.1073/pnas.0904725106.
[25]
J. Goñi et al., “The semantic organization of the animal category: Evidence from semantic verbal fluency and network theory,” Cognitive processing, vol. 12, no. 2, pp. 183–196, 2011.

The DS animal network is shown in Figure 9.5. With large, deeply connected clusters centered around the contexts animals are found in, Figure 9.5 looks very similar to the network recovered for Fig. 4 in [25]. Clusters approximate animal types by their location relative to humans: farm/livestock, ocean/water creatures, “African” and jungle animals, small indoor mammals, small outdoor mammals, etc.

Figure 9.5: Verbal Fluency (animals) Network Backbone (Doubly-Stochastic)

This is largely where the literature on semantic fluency leaves its network recovery solution, with clusters based on human proximity or physical location. However, there are other ways people might relate animals than simply by location. Additionally, clique bias is quite strong in this network: why must every farm animal be mutually connected if it’s possible to recall any of them through one or two “hub” animals? This is related to the remarkable inefficiency of this backbone: \(\psi=0.35\) is rather closer to fully connected than sparse. Additionally, the two animals that seem to appear regardless of the others are cat and dog, which ironically makes DS penalize their proximity to any of the clusters. Both end up clustered with ocean animals due to their tendency to be listed near fish.

The Chow-Liu tree network is shown in Figure 9.6, and goes some way to alleviating these issues. Clusters are largely intact, instead represented by large branches/subtrees off the main group. However, some community relationships have been sacrificed to maintain strong individual edges, such as monkey+giraffe for location similarity at the expense of separating two halves of the pink cluster across a wide distance. More alternate paths between creatures (i.e. loops) are needed to better represent our perception of animal relationships.

Figure 9.6: Verbal Fluency (animals) Dependency Network (Chow-Liu Tree)

The other dependency network recovery method is GLASSO, which we have similarly thresholded at the minimum-connected point. It only slightly improves on connective efficiency (\(\psi=0.44\)), though the cliques are replaced with much more dispersed connections throughout the graph. We also see that reasonable inter-group connections are better represented, such as rabbit+squirrel, though cat and dog are still isolated due to overrepresentation throughout the dataset.

Figure 9.7: Verbal Fluency (animals) Dependency Network (GLASSO)

Subjectively, the GLASSO network, with so many edges, is still difficult for an analyst to synthesize into useful knowledge, while the DS network only really managed to communicate one “kind” of knowledge (the context clusters). We would ideally prefer a backbone that provides a wider diversity of important edge “types”, so an analyst can better understand the kinds of animal relationships humans perceive.

To illustrate this, we show the Forest Pursuit (FP) network in Figure 9.8. It has also been filtered to the minimum-connected point, like DS and GLASSO, though in this case the connective efficiency at that threshold is a staggering \(\psi=0.88\).

Figure 9.8: Verbal Fluency (animals) Dependency Network (Forest Pursuit)

Unlike the other networks that push “generalist” nodes like cat and dog onto long, distant chains, those chains are used in the FP network to hold rare subgroups of clusters, treating them as “gated” by the prominent “hubs” of those groups. For example, cat is correctly linked to mouse and lion (in addition to bird), while lemur is pushed down a longer chain of primates, “gated” by gorilla. Similarly with eel through lobster and crab.

A much broader edge-type diversity is also made apparent, with many non-context-based relationships revealed by the improved sparsity. An analyst has an easier job of creating “edge-type inventories”, making the FP backbone an excellent exploratory assistant: animals can be related because they are:

  • Co-located
  • Taxonomically similar (cheetah+leopard)
  • Famous predator/prey pairs (cat+mouse)
  • Pop-culture references (lion\(\rightarrow\)tiger\(\rightarrow\)bear)7
  • Similar in ecological niche/role (koala+sloth)
  • Lexically similar/rhyming (moose+goose)
  • Related through conservation or public awareness (panda+gorilla)
  • etc.

7 Note that the dependency-based methods correctly interpret these three as not being mutually connected in a triad, but specifically with this ordering (tiger in the middle).

This is further reflected in how FP alters the way centrality measures behave. Replicating Figure 8.7 for these graphs, Figure 9.9 shows the change in rank across the top 15 animals for the DS, GLASSO, and FP networks.

Figure 9.9: Changes in animal centrality ranking for FP vs co-occurrence, GLASSO

The DS centrality finds the most densely connected clique and gives all of its members incredibly high values. Meanwhile, none of the top 5 most common animals (dog, cat, lion, tiger, bear)8 have high centrality at all. Both GLASSO and DS rank farm animals (chicken, goat, cow) as the most important, despite the idea that goat could likely be reached from e.g. cow quite often. FP adds more variety, giving hub animals from different communities high centrality scores, each of which could lead to a variety of different paths. While subjective, the ranks from FP appear to be a more holistic inventory of “lynchpin” animals, which provide nearby coverage for a large number of others.

8 Interestingly, dog never appears in the centrality measures, and none of the networks connect dog to any animal other than cat. Meanwhile, wolf is associated more with fox, coyote, dingo, etc., which are notably all predators of farm animals.
With the primary community structures being what they are (context/location-based), it seems that humans tend to put dogs in a category all their own.

Thresholded Structure Preservation

Another beneficial feature of Forest Pursuit is how it fares under increased thresholding. Because co-occurrence methods prioritize cliques, those cliques will remain connected as other edges are removed, effectively destroying the global structure of the animal network past a certain threshold. As seen in Figure 9.10, when only the top 2% of edges are kept, they serve to connect separated islands of animal communities.

(a) co-occurrence retains local communities at the cost of global structure
(b) clique-bias correction preserves central structure by disconnecting rare nodes
Figure 9.10: Differences in structural preservation with over-thresholding.

Meanwhile, the FP network is quite robust to this excessive thresholding, with the global structure preserved at 2%. The removed edges have simply detached some of the rarer animals from the network entirely. These isolates not only reflect a more metrologically-sound idea (that rarer nodes would be disconnected at higher certainty thresholds), but are also beneficial to analysts, since manually re-connecting isolates of rare animals is simpler than manually determining a global reconnection strategy for island groups.

Forest Pursuit as Preprocessing

Because Forest Pursuit creates a representation of the observed data in edge space, we can use Equation 6.4 in the forward direction, creating a “new” design matrix \(X\gets BR\). As discussed in Problem Specification, Equation 6.4 will create a design matrix of interaction counts for each node (its degree in the Steiner tree approximation), rather than a binary “on/off” indicator.
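As a hedged sketch of what this looks like (shapes and orientation are our assumption): if \(R\) holds each observation’s recovered tree as an edge-indicator row and \(B\) is the unsigned edge-node incidence matrix, their product counts each node’s degree in that tree.

```python
import numpy as np

R = np.array([[1, 1, 0],       # obs 1 uses edges (0,1) and (1,2)
              [0, 1, 1]])      # obs 2 uses edges (1,2) and (2,3)
B = np.array([[1, 1, 0, 0],    # edge (0,1)
              [0, 1, 1, 0],    # edge (1,2)
              [0, 0, 1, 1]])   # edge (2,3)

X_new = R @ B                  # node interaction counts (tree degrees)
# X_new[0] == [1, 2, 1, 0]: node 1 has degree 2 in the first tree
```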

By supplying other algorithms with this new estimate for \(X\), FP becomes a kind of preprocessing on the data itself. We can use it to bias the other methods toward greater sparsity in the backbone, without explicitly relying on point estimates for each observation (any tree with those node degrees would do the same). As shown in Figure 9.11, GLASSO and DS both increase their connective efficiency under FP preprocessing.

(a) Doubly Stochastic filter with FP (node degree) preprocessing
(b) GLASSO Precision estimate with FP (node-degree) preprocessing
Figure 9.11: Forest Pursuit preprocessing for Doubly-Stochastic and GLASSO recovered networks

This is likely due to the increase in “signal-to-noise” ratio for each datapoint, since observations are only similar when they have a node interacting the same “amount”, not merely when similar nodes are activated.

Because the entries are now integer counts, FP preprocessing might also give a better path to using distribution-based embedding and clustering techniques, such as Hellinger distances between multinomial sample counts. The same goes for many other techniques from text processing that rely on multinomial assumptions (i.e. techniques otherwise inapplicable to binary data).
FP preprocessing, with further empirical and theoretical validation, might prove to be a powerful tool for practitioners to flexibly backbone and analyze their networks with a variety of new techniques.