Qualitative Application of Relationship Recovery

“Please tell me what this is all about, papa. Please, please tell me what this is. Please, what is it?”

– Cosette, Les Misérables (1998)

A more realistic test of Forest Pursuit’s behavior is in unsupervised, qualitative analysis of complex systems and networks. In this chapter we will recreate two well-known bipartite datasets from network science, and compare several network recreation methods on them. As we have discussed previously[1], there is some interest in quantifying network assortativity for social networks, and whether the positive assortativity effect is a methodological artifact. Assuming an analyst is interested in social network dependencies—like which nodes are directly influencing which others—we provide a glimpse at how correcting for clique bias with Forest Pursuit (FP) affects assortativity.

[1]

D. N. Fisher, M. J. Silk, and D. W. Franks, “The perceived assortativity of social networks: Methodological problems and solutions,” in Trends in social network analysis, Springer International Publishing, 2017, pp. 1–19. doi: 10.1007/978-3-319-53420-6_1.

In the second case study, we also investigate how centrality measures can be adversely affected when clique-bias is not accounted for, especially when appearance in cliques is known to be rare for certain node types.

Network Science Collaboration Network

Following in the steps of [2], we recreate a network from co-authorships in works cited by two literature review papers, [3] and [4]. The list of all cited papers was reconstructed via web-of-science queries, and these were used to construct author “activation” observations.¹ Nodes and papers were retained only if they had two or more papers and authors listed, respectively. A baseline co-authorship network was initially constructed using the largest connected component, and a modularity-maximizing community detection algorithm [[6];Findingevaluatingcommunity_Newman2004] was used to highlight community structure. As recommended in [2], we also show a reduced course-grained “quotient” graph of the communities and their interconnection to help visualize and understand the large-scale structure each network presents.
This baseline network can be seen in Figure 8.1, where (to keep with the treatment in [2]) we have added a quotient network of the communities themselves.

[3]

M. E. J. Newman, “The structure and function of complex networks,” SIAM Review, vol. 45, no. 2, pp. 167–256, Jan. 2003, doi: 10.1137/s003614450342480.

[4]

S. BOCCALETTI, V. LATORA, Y. MORENO, M. CHAVEZ, and D. HWANG, “Complex networks: Structure and dynamics,” Physics Reports, vol. 424, no. 4–5, pp. 175–308, Feb. 2006, doi: 10.1016/j.physrep.2005.10.009.

¹ An additional paper ([5]) was added to ensure the the largest connected component contained necessary communities that were originally missing compared to [2].

[5]

K. Sneppen and M. E. J. Newman, “Coherent noise, scale invariance and intermittency in large systems,” Physica D: Nonlinear Phenomena, vol. 110, no. 3–4, pp. 209–222, Dec. 1997, doi: 10.1016/s0167-2789(97)00128-0.

[2]

M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E, vol. 69, no. 2, p. 026113, Feb. 2004, doi: 10.1103/physreve.69.026113.

[6]

A. Clauset, M. E. J. Newman, and C. Moore, “Finding community structure in very large networks,” Physical Review E, vol. 70, no. 6, p. 066111, Dec. 2004, doi: 10.1103/physreve.70.066111.

Figure 8.1: 134 Network scientists connected by co-authorship

Figure 8.2: Chow-Liu tree of NetSci collaborator dependency relationships

The detected assortativity \(r=0.059\) is small, but positive. As expected, each community tends to have a dominant “clique”, since many members of these groups tend to mutually author papers together (simultaneously).

However, if we are after a “social network” that describes social influence of individuals with respect to others, then we are hoping to recover collaborative dependency relationships.

Who is causing who to join papers?
Who are the central “collaborators” that influence the paper writing of many others?

With some background knowledge of how university labs tend to work (with students, post-docs, advisors, colleagues, etc.), we might not believe that every member of a community has significant influence on the writing of every other member. So, to estimate a dependency network of collaboration relationships, we could turn to a Chow-Liu tree, which is shown in Figure 8.2.

This dependency is much more sparse, and indeed the assortativity coefficient has dropped to \(r=-0.251\). In academic writing, we might even expect a lowered assortativity, since students that may not go on to be prolific in their original field would still seek out prolific advisors, initially. Unfortunately, enforcing a tree structure has some negative side effects. Overlaying the original community structure from Figure 8.1 onto the tree and calculating a quotient graph shows that the community structure is impacted by the change. Many of the communities are “unrolled” into long chains of authors, since trees cannot allow small loops or cycles. This makes for a shortest-path distance between community hubs (in the same field) to be upwards of 9-10 jumps, which goes against the scale-free/small-world nature we expect from social systems.

Figure 8.3: Forest Pursuit estimate of NetSci collaborator dependency relationships

With Forest Pursuit, we can attempt to correct for “clique bias” in the measurement of dependency relationships, while allowing for flexibility in the global network structure. The FP recovered collaborator graph is shown in Figure 8.3.

The communities from the co-occurrence network are entirely preserved, with a nearly identical community structure in the graph quotient compared to the co-occurrence graph. But now we see an assortativity that is close to zero (slightly negative at \(r=-0.069\)). This brings the assortativity more in line with the World Wide Web networks, or even close to results for random preferential attachment networks that are zero in the limit[7].

[7]

M. E. J. Newman, “Assortative mixing in networks,” Physical Review Letters, vol. 89, no. 20, p. 208701, Oct. 2002, doi: 10.1103/physrevlett.89.208701.

Relative to the tree network, there are not so many long chains, as many authors have been allowed to “loop-back” and have relationships with nearby colleagues, reducing the path distance between communities in the process. Still, like the trees, FP tends to reduce the degree of each node, which is summarized in Figure 8.4.

Figure 8.4: Degree distributions of FP vs co-occurrence social networks

From a domain modeling perspective, it might be reasonable to assume that any given author has 2-3 influential collaboration relationships shown, especially considering students and post-docs are likely to constitute a good number of nodes. Even from a logistics perspective (during a given time window working with an advisor) the number of times a student asks/is asked to participate in a paper has to be limited, given average publishing rates.
Only rare individuals would have upwards of 10 influential relationships, with a mean closer to the 2-3 of the FP network, rather than a full quarter having 10 and the median being 4-5 (as in the co-occurrence network). In addition, recall that this is a social network derived from literature review paper citations, not exhaustive inventories of each author’s work, so every influential relationship should not be shown in this sample.

So, which network is preferable? It depends.

The premise of FP is to find a max. likelihood network such that the observed activations (authors on a paper) arise from a random walk along dependencies (author relationships of influence causing others to join). If influential (dependency) relationships are being measured, then the logistics of maintaining 2-3 of them seems more realistic than 5-10 (especially only from the sampled papers). Additionally, we expect students having few influential relationships would attach preferentially to advisors with more, so a positive assortativity might be problematic. FP agrees with these conclusions, making it consistent with that set of a priori domain beliefs vs. the co-occurrence network alone. These kinds of qualitative assessments are not rigorous, but the modeling questions required to choose FP (or not) go a long way to reducing error in trueness. They help modelers validate whether the recovered network was made in a consistent way with their domain knowledge.

Les Misérables Character Network

Another famous network derives from [8] via mining character co-occurrences in chapters of Les Misérables (1862), Victor Hugo’s sprawling saga of [in]equality and [in]justice in 19th-century France. This dataset is often reported as being inherently a graph, but we have reconstructed the underlying bipartite observations for the purpose of removing clique bias from the network reconstruction. In the original network, an extra count is added to the weight of an edge every time a character occurs in a chapter with another. Once again, we have only retained characters/chapters that appeared with two or more chapters/characters, respectively. In Figure 8.5, we run the same community detection scheme to improve the visual understandability of the network, along with inset quotient graphs.

[8]

D. E. Knuth, The stanford GraphBase: A platform for combinatorial computing, vol. 1. AcM Press New York, 1993.

Figure 8.5: Les Miserables character co-occurrence network

Note that the communities demonstrate clear clique-like behavior. On one hand, this does make assessment of clusters easier, since characters in roughly the same scenes are densely grouped together. However, this network makes it difficult to parse relationships, such that backboning would become necessary.

We should also ask what we want out of this “social” network. Another way we might think about a social network of characters is “which characters influence the appearance (or not) of which other characters?” By extension, “which characters are significant, in the sense that they dictate the appearance of more characters than others?” From a domain modeling perspective, this is like asking how an author is deciding which characters to include in a chapter. By not correcting for clique-bias, we are implicitly assuming that Victor Hugo would be setting out to write a chapter and immediately writes down a list of every character, independently, should appear. Instead, it’s likely that the desired appearance of certain characters in a chapter leads to the inclusion of others, as the plot requires.

Because the inclusion of certain characters leads to the inclusion of others (by this model), we can reasonably model the authorial “character inclusion process” as spreading from character to character. With this in mind, we apply Forest Pursuit to correct for clique-bias, and show the resulting dependency network (with the original community partition) in Figure 8.6.

Figure 8.6: Les Miserables character dependency network (Forest Pursuit)

The network edge probabilities have been thresholded in the same manner as the DS (minimum connectivity) filter, only including necessary edges to still retain overall connectivity. As before, our community structure shown in the quotient graph has been preserved, even with the significant edge density reduction.

Applying these networks, we might wish to distinguish through them which characters are “main” and which are “supporting”, though more on a gradient than in a binary/classification sense. One way to do this is through centrality measures [9], [10], which estimate the “importance” of nodes in various ways. Eigenvector centrality, specifically, finds the importance of nodes relative to the importance of each node’s neighbors² It does this by estimating the stationary distribution of a random walk on the network (what is the probability after many jumps of ending up in each node?)

[9]

M. Newman, Mathematics of networks. Oxford University Press, 2018. doi: 10.1093/oso/9780198805090.003.0006.

[10]

M. Coscia, The atlas for the aspiring network scientist. Michele Coscia, 2021. Available: https://arxiv.org/abs/2101.00863

² Regularized eigenvector centrality is equivalent to PageRank centrality, i.e. the seminal work from [11], of Google fame.

[11]

L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank citation ranking: Bringing order to the web.” Stanford InfoLab; Stanford InfoLab, Technical Report 1999-66, Nov. 1999. Available: http://ilpubs.stanford.edu:8090/422/

[12]

S. Fortunato, M. Boguñá, A. Flammini, and F. Menczer, “Approximating PageRank from in-degree,” in Algorithms and models for the web-graph, Springer Berlin Heidelberg, pp. 59–71. doi: 10.1007/978-3-540-78808-9_6.

The topology of a network greatly impacts the centrality measure recovered. In fact, a well-known result shows that much of the variance in eigenvector/PageRank centrality values can be explained with node in-degree alone.[12] Of course, we do tend to see nodes with high-degree as hubs, and more important for it—that is, we would if we weren’t so concerned with having to deal with clique bias in the network. The tendency of eigenvector centrality to rely on degree (and therefore reward densely-connected cliques with high importance) prevents us from using it over more expensive methods like betweenness centrality. Instead, we might be able to differentiate between high-degree nodes in cliques and “actual” hubs by controlling for clique bias ahead of time with Forest Pursuit.

In Figure 8.7 we show a bump-plot of the ranks of the top 15 most central nodes for both the original co-occurrence network and the FP network estimate. In the original network, centrality rewards characters appearing in mutual cliques as being central, especially the “revolutionary” group (shown in orange in Figure 8.6 and Figure 8.5). Are these really the most important characters? Important characters like Cosette and Fantine are nowhere near the top. We wish to find important characters in the sense that they have influence over the appearance of many other characters. In that case, even Javert (though being an important recurring character) largely plays the role of generic law enforcement throughout, not necessarily pulling other characters in or out of scenes with him as much as others like Marius. More than Javert, Marius plays central motivating roles across several distinct social groups throughout the novel. All of these problems are corrected to some degree in the FP estimate.

Figure 8.7: Changes in character centrality ranking for FP vs co-occurrence

Why are the centrality levels for Fantine, Eponine, and Cosette so much higher in FP? Why were they so low in the first place? This brings up another benefit of correcting clique-bias: estimating character importance despite systematic interaction-reduction bias from authors. As made famous by the so-called “Bechdel test” [13], authors tend to systematically reduce the agency and interaction rates of female characters, an effect that has been studied from many angles covering multiple centuries [14]. We highlight every female character in Figure 8.7 (bold/thicker paths), which shows that the co-occurrence network had one single female character in the top 15 most central (Eponine, ranked 14th). This is because their degrees are overshadowed by large cliques! By correcting for clique-bias, every female character significantly increases in centrality ranking This way, the occurrence of many characters can be explained through these female characters, rather than assuming each character separately connects with every other, biasing importances toward large groups of men.

[13]

C. van Raalte, “1. No small-talk in paradise: Why elysium fails the bechdel test, and why we should care,” in Media, margins and popular culture, 1st ed., H. Savigny, E. Thorsen, D. Jackson, and J. Alexander, Eds., London: Imprint: Palgrave Macmillan, 2015.

[14]

O. Stuhler, “The gender agency gap in fiction writing (1850 to 2010),” Proceedings of the National Academy of Sciences, vol. 121, no. 29, Jul. 2024, doi: 10.1073/pnas.2319514121.

In context, titles of entire books and volumes of Les Misérables are named after women we would consider crucial to the plot (e.g. Fantine, Cosette, and Eponine).
As [14] indicates, however, authors are likely to under-represent their agency and interactions with other characters throughout the story, so networks based primarily on co-occurrence will likewise systematically underestimate their importance.