Recovery from Working Memory & Partial Orders
“So an analytically astute observer would find that a salmon is more closely related to a camel than it is to a hagfish. On what basis, then, do we justify grouping salmon under the label fish, but not camels?”
– Samuel Beswick (sarcastically)
This whole time we have assumed that the ordering of nodes was unknown, or at the very least unreliable. However, there are frequently cases, especially in text processing applications, where we have some sense of an ordering on activations. By a partial order on a set (a “poset”), we mean a relation under which any two elements are either comparable as greater (before) or less (after), or incomparable. Finite posets correspond precisely to directed acyclic graphs via reachability: every DAG induces a poset on its nodes, and every finite poset arises this way.
Our original example with authors might be thought of as a poset: (i) precedes/asks (f) and (j), (j) precedes/asks (b), but (b) and (f) are incomparable. We don’t know if (i) asked (f) before or after (j) asked (b).1 We have updated the original example with explicit partial orders in Figure 9.1. Note that nodes in some lists could be re-arranged while keeping the partial order the same (keeping all arrows pointing to the right).
1 Interestingly, in the case where the node activations are given a total order (in the form of timestamps), [1] derive an algorithm called NetInf. It utilizes sums over minimum spanning trees that satisfy time constraints, similar to a special case of Forest Pursuit where all activations have a total ordering.
Sadly our co-authorship example does not often include the (partial) order of author additions,2 but other common network recovery problems do have an inherent order to them. A very common need is to recover concept association networks, whether from lists of tags or directly from a corpus of written language. What’s needed is an assumption on how the observed partial order of concepts is generated. [2] proposes a “foraging” mechanism, so that concepts get sequentially recalled from “semantic patches” of nearby concepts in memory. The partial order comes from our ability to maintain more than one concept in working memory, so that the next concept can be “foraged” from any of the other recently recalled ones[3], [4].
2 though some domains do imbue special meaning to the listed author order, a fact that might be interesting to investigate in future network recovery efforts!
In this section, we briefly cover a method for network inference by [5] that utilizes partial order information from ordered lists of concepts, called INVITE. We use it to demonstrate improvement over bag-of-words and Markov assumptions for downstream technical language processing [6] tasks, as originally demonstrated in [7] and [8].
Finally, we show that using Forest Pursuit for partially ordered data can still be quite useful for network backboning, and for a fraction of the computational cost. We investigate a network recovery task from verbal/semantic fluency data [9], which involves recovery of a network of animal relationships from memory and recall experiments. Even without directly using partial order information, proper data preparation along with previously-discussed (un-ordered) recovery methods can lead to significantly improved network backboning and analysis capability.
Technical Language Processing with INVITE
Maintenance work orders are often represented as categorized data, though increasingly quantitative use of a technician’s descriptions is being pursued by maintenance management operations [10], [11]. Tags are another way to flexibly structure this otherwise “unstructured” data, which Table 9.1 shows in comparison to more traditional categorization.
Whether entered directly or generated from text by keyword extraction, the tags will tend to have ordering information readily available. A traditional way to model this kind of text is through either bag-of-words (the co-occurrence node activation data already discussed) or as a sequence of order-n Markov model emissions. An order-n Markov model \(\text{MC}_n\) estimates the probability of observing the \(i\)th item \(t_i\) in a sequence \(T\) as \[ P(t_i|T) \approx P(t_i | t_{i-1}, \cdots,t_{i-n}) \]
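As a minimal sketch (with hypothetical tag sequences, and order 1 for brevity), such a transition model can be estimated by normalizing bigram counts:

```python
from collections import Counter, defaultdict

def fit_mc1(sequences):
    """Estimate order-1 Markov transition probabilities P(t_i | t_{i-1})
    from a list of tag sequences, by normalizing bigram counts."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, curr in zip(seq, seq[1:]):
            counts[prev][curr] += 1
    return {
        prev: {t: c / sum(nxt.values()) for t, c in nxt.items()}
        for prev, nxt in counts.items()
    }

# Hypothetical maintenance-tag sequences
seqs = [["hydraulic", "leak", "hose"], ["hydraulic", "hose"], ["leak", "hose"]]
P = fit_mc1(seqs)
# P["hydraulic"] → {"leak": 0.5, "hose": 0.5}
```

Higher-order models condition on longer tag histories in the same way, at the cost of exponentially more parameters to estimate.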
Unlike the clique bias from before, assuming markov jumps for each observation leads to a different kind of bias, with higher precision but reduced recall as shown in Figure 9.2.
Without knowing the underlying dependency relationships, it’s difficult to estimate which edges were used by a random-walker, since subsequent visits in memory to a “tag” are not being reported once a technician first adds it. [5] call this an “Initial-visit-emitting” random walk, or INVITE for short. To more accurately recover the network, they suggest maximizing the absorption probability for each step of a partial order, individually, knowing which nodes have already been activated.
Optimizing absorbing-state probabilities
Say the set of components or concepts that have a corresponding tag in our system is denoted by the node-set \(N\). A user-given list of \(T\) tags3 for a specific record can be denoted as a Random Walk (RW) trajectory \(\mathbf{t}=\{t_1, t_2, t_3, \cdots t_{T}\}\), where \(T\leq |N|\). This limit on the size of \(T\) assumes tags form a set of unique entries. Any transitions between previously visited tags in \(\mathbf{t}\) will not be directly observed, making the transitions observed in \(\mathbf{t}\) strictly non-Markov, and allowing a potentially infinite number of possible paths to arrive at the next tag through previously visited ones. Instead of directly computing over this intractable model for generating \(\mathbf{t}\), the key insight from the original INVITE paper [5] comes from partitioning \(\mathbf{t}\) into \(T-1\) Markov chains with absorbing states. Previously visited tags are “transient” states, and unseen tags are “absorbing”. It is then possible to calculate the absorption probability for the \(k\)th transition (\(t_k \rightarrow t_{k+1}\)) using the fundamental matrix of each partition. If the partition at this jump consists of \(q\) transient states with transition matrix among themselves \(\mathbf{Q}^{(k)}_{q\times q}\), and \(r\) absorbing states with transitions into them from \(q\) as \(\mathbf{R}^{(k)}_{q\times r}\), the Markov transition matrix \(\mathbf{M}^{(k)}_{n\times n}\) has the form \[ \mathbf{M}^{(k)} = \begin{pmatrix} \mathbf{Q}^{(k)} & \mathbf{R}^{(k)} \\ \mathbf{0} & \mathbf{I} \end{pmatrix} \tag{9.1}\]
3 While some sources use “tagging” as a proxy for a set of strictly un-ordered labels (as in multi-label classification), we preserve the mechanism by which the tags were generated in the first place, i.e., in a specific order.
Here \(\mathbf{0}\), \(\mathbf{I}\) represent the lack of transitions between/from absorbing states. It follows from [12] that the probability \(P\) of a chain starting at \(t_k\) being absorbed into state \(t_{k+1}\) is given as \[ \begin{gathered} P\left(t_{k+1} \middle| t_{1:k},\mathbf{M}\right) = \left.\mathbf{N}^{(k)}\mathbf{R}^{(k)}\right|_{q,1}\\ \mathbf{N}^{(k)} = \left( \mathbf{I}-\mathbf{Q}^{(k)} \right) ^{-1} \end{gathered} \tag{9.2}\]
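As a concrete sketch of Equation 9.2 (the 4-state chain below is hypothetical), the absorption probabilities fall directly out of the fundamental matrix:

```python
import numpy as np

def absorption_probs(M, transient, absorbing):
    """Absorption probabilities N @ R (Eq. 9.2) for a Markov chain M
    partitioned into `transient` and `absorbing` state index lists."""
    Q = M[np.ix_(transient, transient)]
    R = M[np.ix_(transient, absorbing)]
    N = np.linalg.inv(np.eye(len(transient)) - Q)  # fundamental matrix
    return N @ R

# Hypothetical chain: states 0,1 transient; states 2,3 absorbing
M = np.array([[0.2, 0.3, 0.4, 0.1],
              [0.3, 0.1, 0.2, 0.4],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
B = absorption_probs(M, [0, 1], [2, 3])
# B[-1] is the row for a chain started at the last transient state (t_k);
# each row sums to 1, since absorption is certain
```

The entry indexed \(q,1\) in Equation 9.2 is exactly one element of this matrix: the chain starts at the most recent tag and we read off the probability of absorption at \(t_{k+1}\).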
The probability of being absorbed at \(k+1\) conditioned on jumps \(1:k\) is thus equivalent to the probability of observing the \(k+1\) INVITE tag. If we approximate an a priori distribution of tag probabilities to initialize our chain as \(t_1\sim\text{Cat}(n,\theta)\) (which could be empirically derived or simulated), then the likelihood of our observed tag chain \(\mathbf{t}\), given a transition matrix, is: \[ \mathcal{L}\left(\mathbf{t}| \theta, \mathbf{M}\right) = \theta(t_1)\prod_{k=1}^{T-1} P\left(t_{k+1}\,\middle|\ t_{1:k},\mathbf{M}\right) \]
Finally, if we observe a set of tag lists \(\mathbf{C} = \left\{ \mathbf{t}_1, \mathbf{t}_2, \cdots, \mathbf{t}_{c} \right\}\), and assume \(\theta\) can be estimated independently of \(\mathbf{M}\), then we can frame the problem of structure mining on observed INVITE data as a minimization of negative log-likelihood. A point estimate \(\mathbf{M}^*\) for our association network can be found as: \[ \mathbf{M}^* \leftarrow \operatorname*{argmin}_{\mathbf{M}} \quad \sum_{i=1}^{C} \sum_{k=1}^{T_i-1} -\log P \left(t^{(i)}_{k+1} \middle| t^{(i)}_{1:k},\mathbf{M}\right) \]
This (deeply nested) likelihood can now be optimized using standard solvers, and our reference implementation uses stochastic gradient descent via PyTorch [13].4
4 The n-gram Markov transition models (MC1, MC2) used for comparison against INVITE were trained using pomegranate [14].
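The following is a minimal sketch of that optimization (not the reference implementation): it parameterizes \(\mathbf{M}\) as a row-wise softmax over unconstrained weights, and descends the summed negative log absorption probabilities for a few hypothetical chains.

```python
import torch

def invite_nll(W, chains):
    """Negative log-likelihood of INVITE chains under the transition
    matrix M = softmax(W, dim=1). Each chain is a list of node indices."""
    M = torch.softmax(W, dim=1)
    nll = 0.0
    for t in chains:
        for k in range(1, len(t)):
            seen, nxt = t[:k], t[k]
            Q = M[seen][:, seen]                       # transient block
            r = M[seen][:, nxt]                        # column of R for t_{k+1}
            N = torch.linalg.inv(torch.eye(k) - Q)     # fundamental matrix
            p = (N @ r)[-1]                            # absorbed at t_{k+1}
            nll = nll - torch.log(p)
    return nll

n = 5  # toy network size
W = torch.zeros(n, n, requires_grad=True)
opt = torch.optim.SGD([W], lr=0.1)
chains = [[0, 1, 2], [0, 2, 1], [3, 4, 0]]  # hypothetical tag lists
for _ in range(100):
    opt.zero_grad()
    loss = invite_nll(W, chains)
    loss.backward()
    opt.step()
```

The softmax keeps each row of \(\mathbf{M}\) a valid transition distribution; since the transient block is strictly substochastic, \(\mathbf{I}-\mathbf{Q}\) stays invertible and gradients flow through the matrix inverse.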
Application: Mining Excavator MWOs
To assess the applicability of the INVITE-based similarity measure to real-world scenarios, we apply our model to tags annotated for a mining dataset pertaining to 8 similarly-sized excavators at various sites across Australia [15], [16].
The tags were created by a subject-matter expert spending 1 hour of time in the annotation assistance tool nestor [17], using a methodology outlined in a previous benchmarking study for that annotation method [10].
That work compared tags against a custom-designed keyword extraction tool (based on classifying maintenance issues by subsystem) in their ability to estimate survival curves and mean time-to-failure. While certain sets of tags were able to predict time-to-failure with high accuracy for certain subsystems, a key problem identified in that work is knowing a priori which tags best indicate when a subsystem is failing.
Which tags best represent a given subsystem?
Some tags are sufficient (though not necessary) conditions to indicate a subsystem. That the “hydraulic” tag indicates a Hydraulic System MWO is obvious, but so might a “valve” tag—“hydraulic” is implied but not present. Consequently, we can treat the problem of assigning tags to represent a subsystem as a semi-supervised multi-class classification problem in the style of [18]. Like in that work, we need a selection of tag\(\rightarrow\)subsystem assignments, as well as a network of weighted tag-tag edges.
Then, if we can compare the semi-supervised tag classifications to a ground-truth classification by a human annotator (which are available for the excavator dataset thanks to [19]), we can assess the ability of each network to capture the human annotator’s internal/cognitive tag relationship structure.
Forest Pursuit Animal Network
For our last case study, we return to Forest Pursuit as a useful tool for analysis, even when that means ignoring partial ordering information. Note that Equation 9.2 is effectively the same as a (\(\beta=1\)) forest matrix if the transition (normalized adjacency) matrix were replaced by a (sub-)graph Laplacian. Intuitively, the INVITE loss function sums over absorption (log-)probabilities at each new node activation: i.e., each step of a tree’s creation. Because the probability of a sampled tree is precisely proportional to the product of its edge weights, weighting a tree by its absorption probabilities and running INVITE should have a mathematically similar effect to the FP estimate. The similarity would be exact whenever the tree the walk traveled along was the mode of the tree distribution: the MST.
In other words, whenever a random walk did use the minimum distance to reach each node, the two methods should be equivalent. While this rarely happens for any single walk, the effect of many random walks averages out to this MST, precisely because the MST is the mode of the tree distribution from which we assume node activations are sampled under marked Random Spanning Forests.
To illustrate FP’s efficacy in network estimation despite ignoring partial order information, we turn to a classic network recovery problem in this space: semantic networks from fluency data [9], [24], [25].
Domain-aware preprocessing
Verbal (semantic) fluency tests involve asking participants to list as many items as they can belonging to a prompted category in the available time. For instance, an “animals” prompt could lead to an answer like “dog, cat, lion, tiger, bear, wolf…”, etc. Like before, the general idea is that recall of each item derives from a random walk in a cognitive “memory space”, with emissions (usually) only when a new animal is encountered. A participant might have backtracked internally to dog from bear before jumping to wolf, for example.
Using a high-dimensional multilabel embedding space is possible, with one column per animal, but the lists tend to be quite long and give networks with “hairball” tendencies. However, if our intention is to find dependencies between animals for a large number of participants, we might reasonably limit the co-occurrence to only a nearby subset of each list. Effectively, we can limit the set of possible co-occurring animals based on our domain knowledge, namely, the limits of working memory. Conservatively, with a well-cited work claiming 7 \(\pm\) 2 semantic units at a time in working memory [3], we can limit the set of possible dependencies for a given item to the 5 items on either side. The 5 previous might have originated a jump to the current term, and the 5 after might be subsequent targets, for a rolling window of 10 terms at-a-time.6
6 This is precisely the logic that leads to the use of Skip-Gram and Continuous-Bag-of-Words transformations of text to weakly supervise word2vec and GloVe models [26], [27].
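A sketch of this windowing (function name hypothetical; a shortened window is shown for display):

```python
def rolling_windows(items, size=10):
    """Overlapping windows of `size` consecutive recalls, so each item
    can only co-occur with its ~size/2 neighbors on either side."""
    if len(items) <= size:
        return [list(items)]
    return [items[i:i + size] for i in range(len(items) - size + 1)]

rolling_windows(["dog", "cat", "lion", "tiger", "bear", "wolf"], size=4)
# → [['dog', 'cat', 'lion', 'tiger'],
#    ['cat', 'lion', 'tiger', 'bear'],
#    ['lion', 'tiger', 'bear', 'wolf']]
```

Each window then becomes one “observation” for the un-ordered recovery methods, in place of the original full-length list.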
For our case study, we use lists of animals submitted by participants as described in [9], [25]. We limit the animals under consideration to those that occurred in more than 30 lists for a good sample size, as well as lists with at least two animals. This resulted in 100 animals over 293 fluency lists. However, we ignore this filtering when creating the 10-animal rolling windows, to avoid inclusion of unrelated animals into prematurely filtered windows. After re-applying the filter, 100 animals appear across 8020 working-memory windows. Figure 9.4 shows the effects of this preprocessing on marginal distributions.
Doing this preprocessing (for rolling-windows of 10) shifts relative animal frequencies downward (since there are many more “observations” from the rolling window), while also shifting the number of animals-per-observation to be strictly less than 10. As desired, the pairwise cosine similarity of all vectors \(\mathbf{x}'_{j'}, \mathbf{x}'_{j}\) is significantly reduced. While many participants might cover similar animals overall, we want to investigate animal dependencies locally, and we don’t expect individuals to always recall animals in the same memory “area” the whole time.
Edge Connective Efficiency and Diversity
To compare the results of different backboning techniques, we introduce a new simple measure to quantify a network’s sparsity, in terms of how many edges more than \(n-1\) (the minimum number needed to connect all \(n\) nodes) are being used. \[ \begin{gathered} \psi_n(e) = \frac{e_{\text{min}}}{e}\frac{e_{\text{max}}-e}{e_{\text{max}}-e_{\text{min}}}\\ e_{\text{max}} = \frac{n(n-1)}{2} \quad e_{\text{min}} = n-1 \end{gathered} \tag{9.3}\]
We call \(\psi\) the graph’s “connective efficiency”: it ranges from 0 when the graph is fully connected, to 1 when it is a tree, to >1 when it has insufficient edges to be connected. This measure is intended to compare graphs that have had edges removed until they are about to be disconnected, such as with the Doubly Stochastic Filter (DS) [28]. However, values greater than 1 also give insight into how much sparser than a tree some graph is.
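Equation 9.3 is straightforward to compute; a quick sketch:

```python
def connective_efficiency(n_nodes, n_edges):
    """Connective efficiency psi (Eq. 9.3): 0 for a complete graph,
    1 for a spanning tree, >1 when too few edges remain to connect."""
    e_min = n_nodes - 1
    e_max = n_nodes * (n_nodes - 1) // 2
    return (e_min / n_edges) * (e_max - n_edges) / (e_max - e_min)

connective_efficiency(10, 9)    # spanning tree -> 1.0
connective_efficiency(10, 45)   # complete graph -> 0.0
```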
The DS animal network is shown in Figure 9.5. With large, deeply connected clusters centered around the contexts animals are found in, Figure 9.5 looks very similar to the network recovered to make Fig. 4 in [25]. Clusters approximate animal types by their location relative to humans: farm/livestock, ocean/water creatures, “African” and jungle animals, small indoor mammals, small outdoor mammals, etc.
This is largely how the literature on semantic fluency leaves their network recovery solution, with clusters based on human proximity or physical location. However, there are other ways people might relate animals than simply by location. Additionally, clique bias is quite strong in this network: why must every farm animal be mutually connected if it is possible to recall any of them through one or two “hub” animals? This is related to the incredible inefficiency of this backbone: at \(\psi=0.35\) it is rather closer to fully connected than sparse. Meanwhile, the two animals that seem to appear regardless of others are cat and dog, which makes DS penalize their proximity to any of the clusters. Both are ironically clustered with ocean animals due to their tendency to be listed near fish.
The Chow-Liu tree network is shown in Figure 9.6, and goes some way to alleviating these issues. Clusters are largely intact, instead represented by large branches/subtrees off the main group. However, some community relationships have been sacrificed to maintain strong individual edges, such as monkey+giraffe for location similarity at the expense of separating two halves of the pink cluster across a wide distance. More alternate paths between creatures (i.e. loops) are needed to better represent our perception of animal relationships.
The other dependency network recovery method is GLASSO, which we have similarly thresholded at the minimum-connected point. It only slightly improves on connective efficiency (\(\psi=0.44\)), though the cliques are replaced with much more dispersed connections throughout the graph. We also see that reasonable inter-group connections are better represented, such as rabbit+squirrel, though cat and dog are still isolated due to overrepresentation throughout the dataset.
Subjectively, the GLASSO network is still difficult for an analyst to synthesize into useful knowledge, with so many edges, while the DS network only really managed to communicate one “kind” of knowledge (the context clusters). We would ideally prefer a backbone that provides a wider diversity of important edge “types”, for an analyst to better understand the kinds of animal relationships humans perceive.
To illustrate this, we show the Forest Pursuit (FP) network in Figure 9.8. It has also been filtered to minimum-connected, like DS and GLASSO, though in this case the connective efficiency to reach that threshold is a staggering \(\psi=0.88\).
Unlike the other networks that push “generalist” nodes like cat and dog onto long, distant chains, those chains are used in the FP network to hold rare subgroups of clusters, treating them as “gated” by the prominent “hubs” of those groups. For example, cat is correctly linked to mouse and lion (in addition to bird), while lemur is pushed down a longer chain of primates, “gated” by gorilla. Similarly with eel through lobster and crab.
A much broader edge-type diversity is also made apparent, with many non-context-based relationships made obvious by the improved sparsity. An analyst has an easier job of creating “edge-type inventories”, making the FP backbone an excellent exploratory assistant: animals can be related because they are:
- Co-located
- Taxonomically similar (cheetah+leopard)
- Famous predator/prey pairs (cat+mouse)
- Pop-culture references (lion\(\rightarrow\)tiger\(\rightarrow\)bear)7
- Similar in ecological niche/role (koala+sloth)
- Lexically similar/rhyming (moose+goose)
- Related through conservation or public awareness (panda+gorilla)
- etc.
7 Note that the dependency-based methods correctly interpret these three as not being mutually connected in a triad, but specifically with this ordering (tiger in the middle).
This is further reflected in how FP alters the way centrality measures behave. Replicating Figure 8.7 for these graphs, Figure 9.9 shows the change in rank across the top 15 animals for the DS, GLASSO, and FP networks.
The DS centrality finds the most densely connected clique and gives all of its members incredibly high values. Meanwhile, none of the top 5 most common animals (dog,cat,lion,tiger,bear)8 have high centrality at all. Both GLASSO and DS have farm animals (chicken,goat,cow) as the most important, despite the fact that goat could likely be reached from e.g. cow quite often. FP adds more variety, giving hub animals from different communities high centrality scores, each of which could lead to a variety of different paths. While subjective, the ranks from FP appear to be a more holistic inventory of “lynchpin” animals, which provide nearby coverage for a large number of others.
8 Interestingly, dog never appears in centrality measures, and none of the networks connect dog to any other animal than cat. Meanwhile, wolf is associated more with fox, coyote, dingo, etc., which are notably all predators of farm animals.
With the primary community structures being what they are (context/location-based), it seems that humans tend to put dogs in a category all their own.
Thresholded Structure Preservation
Another beneficial feature of Forest Pursuit is how it fares under increased thresholding. Because co-occurrence methods prioritize cliques, those cliques will remain connected as other edges are removed, effectively destroying the global structure of the animal network past a certain threshold. As seen in Figure 9.10, when keeping only the top 2% of edges, the surviving edges merely connect separated islands of animal communities.
Meanwhile, the FP network is quite robust to this excessive thresholding, with the global structure preserved at 2%. The removed edges have simply detached some of the rarer animals from the network entirely. These isolates not only reflect a more metrologically-sound idea (that rarer nodes would be disconnected at higher certainty thresholds), but are also beneficial to analysts, since manually re-connecting isolates of rare animals is simpler than manually determining a global reconnection strategy for island groups.
Forest Pursuit as Preprocessing
Because Forest Pursuit creates a representation of the observed data in edge space, we can use Equation 6.4 in the forward direction, creating a “new” design matrix \(X\gets BR\). As discussed in Problem Specification, Equation 6.4 will create a design matrix of interaction counts for each node (its degree in the Steiner tree approximation), rather than a binary “on/off” indicator.
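As a small sketch (assuming, hypothetically, that \(B\) is the unsigned node-edge incidence matrix and each column of \(R\) flags one observation’s recovered tree edges):

```python
import numpy as np

# Unsigned incidence B for a 3-node path graph: edges (0,1) and (1,2)
B = np.array([[1, 0],
              [1, 1],
              [0, 1]])

# One observation whose recovered tree used both edges
R = np.array([[1],
              [1]])

X = B @ R  # each entry: a node's degree in that observation's tree
# X.ravel() → [1, 2, 1]; the interior node interacted twice
```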
Supplying other algorithms with this new estimate for \(X\) makes FP a kind of preprocessing on the data itself. We can use it to bias the other methods toward greater sparsity in the backbone, without explicitly relying on point estimates for each observation (any tree with those node degrees would do the same). As shown in Figure 9.11, GLASSO and DS both increase their connective efficiency under FP preprocessing.
This is likely due to the increase in “signal-to-noise” ratio for each datapoint, since observations are only similar when they have a node interacting the same “amount”, not merely when similar nodes are activated.
Because the entries are now integer counts, FP preprocessing might also give a better path to using distribution-based embedding and clustering techniques, such as Hellinger distances between multinomial sample counts. This goes for many other techniques from text processing that rely on multinomial assumptions (i.e. techniques otherwise inapplicable to binary data).
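For instance, a Hellinger distance between two such count vectors might be sketched as:

```python
import numpy as np

def hellinger(counts_a, counts_b):
    """Hellinger distance between two multinomial count vectors,
    normalized to probability distributions first; ranges 0 to 1."""
    p = np.asarray(counts_a) / np.sum(counts_a)
    q = np.asarray(counts_b) / np.sum(counts_b)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

hellinger([2, 1, 0], [2, 1, 0])  # identical counts -> 0.0
hellinger([1, 0], [0, 1])        # disjoint support -> 1.0
```

Binary activations collapse all of this to set overlap; the integer counts from FP preprocessing carry the extra distributional information such distances need.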
With further empirical and theoretical validation, FP preprocessing might prove to be a powerful tool for practitioners to flexibly backbone and analyse their networks with a variety of new techniques.