Vector representations of incidence
“By stripping away all the extra baggage of distance, length, angle, continuity, betweenness, etc. and retaining only the notion of incidence, we find that what remains is still quite fascinating and highly nontrivial.”
– G. Eric Moorhouse
To provide a sufficiently rigorous foundation for network recovery from binary occurrence data, we need a way to represent networks and occurrences that lends itself to connecting the two in a structured way. We build on the incidence structure and matrix product formalism from the previous chapter, introducing various ways to build graphs as incidence structures that have direct representations as matrices. This extends to representing occurrences as matrices of hyperedge vectors, and it allows us to interpret different representations of graphs (or hypergraphs) as connected by simple matrix operations.
Traditionally[1], [2], we might say a graph on nodes (or, “vertices”) \(v\in V=\{1,\cdots,n\}\) and “edges” \(E\) is a tuple: \[ G=(V,E) \quad \textrm{s.t.} \quad E\subseteq V \times V \]
1 The indicator function \(\mathbf{1}_A(x)\) is 1 for values of \(x\) in the set \(A\), and 0 otherwise.
In incidence geometry terms, this would be similar to making two duplicate sets of the same nodes, and defining a graph as the set of incidences between nodes. The adjacency matrix \(A\) of \(G\), degree matrix \(D\), and graph/discrete Laplacian \(L\) are then defined as:1 \[ \begin{aligned} A(u,v) & =\mathbf{1}_E((u,v)) \quad &A : V\times V\rightarrow \mathbb{B} \\ D(u,v) & =\mathrm{diag}({\textstyle\sum}_V A(u,v))\quad &D : V\times V\rightarrow \mathbb{N} \\ L(u,v) & = D(u,v) - A(u,v) \quad &L : V\times V\rightarrow \mathbb{Z} \end{aligned} \]
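For concreteness, here is a minimal numpy sketch of these three matrices for a small, made-up undirected graph (the node count, edges, and variable names are purely illustrative):

```python
import numpy as np

# Hypothetical undirected graph on n = 4 nodes, with edges given as (0-indexed) node pairs.
n = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

# Adjacency matrix: A(u, v) = 1 iff (u, v) is an edge (symmetrized, since the graph is undirected).
A = np.zeros((n, n), dtype=int)
for u, v in edges:
    A[u, v] = A[v, u] = 1

D = np.diag(A.sum(axis=1))  # degree matrix: row sums of A on the diagonal
L = D - A                   # graph (discrete) Laplacian

assert (L.sum(axis=1) == 0).all()  # each Laplacian row sums to zero
```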
However, if edges and their recovery are so important to us, defining them explicitly as node-node incidences becomes problematic when we wish to estimate whether an edge exists, given noisy pairs of node co-occurrences. Additionally, we have to be very careful to distinguish whether our graph is (un)directed, weighted, simple, etc., and hope that the edge set has been filtered to the appropriate subset of \(V\times V\) in each case. Instead, we propose a less ambiguous framework for vectorizing graphs, based on their underlying incidence structure.
Graphs as incidence structures
To do so, we give edges their own set of identifiers, \(e\in E=\{1,\cdots,\omega\}\). Now, define graphs as incidence structures between nodes and edges such that every edge is incident to either two nodes or none:
\[ G = (V,E,\mathcal{I}) \quad s.t. \quad \mathcal{I} \subseteq E\times V \tag{3.1}\]
Variations on graphs can often be conveniently defined as constraints on \(\mathcal{I}\):
- Self-loops can be prohibited by only allowing unique flags for a given relation2
- Multigraphs are similarly described by whether we allow pairs of vertices to appear with more than one edge3
2 never two flags with the same pair, i.e., \(\mathcal{I}\) is a set, not a multiset.
3 in the set of flags containing nodes \(u\) or \(v\), only one \(e\) may be incident to both of them.
Together, these constraints define “simple” graphs. Similarly, we can equip Equation 3.1 with a function \(B\) that allows \(\mathcal{I}\) to encode information about the specific kinds of incidence relations under discussion. We give \(B\) a range of possible flag values \(S\):
\[ G = (V,E,\mathcal{I},B) \quad s.t. \quad \mathcal{I} \subseteq E\times V\quad B:\mathcal{I}\rightarrow S \tag{3.2}\]
- Undirected, unweighted graphs only need single elements: “incidence exists” i.e., \(S=\{1\}\)
- Directed graphs can use two elements e.g., a “spin” for \(S=\{-1,1\}\)
- Weighted, undirected graphs are supported on positive scalars e.g., \(S=\mathbb{R}^+\)
- Weighted, directed graphs are supported on any scalar e.g., \(S=\mathbb{R}_{\neq0}\)
If the “absence” of incidence needs to be modeled explicitly, a “null” stand-in (0, False) can be added to each \(S\), which is useful for representing each structure as arrays for use with linear algebra (i.e., \(\{0,1\}\), \(\{-1,0,1\}\), \(\mathbb{R}^+_0\), and \(\mathbb{R}\), respectively). By doing so, we can also place an exact limit on the maximum possible size of \(\omega=|E|\) in the simple graph case, and indicate edges by their unique ID, such that for \(E=\{1,\cdots,{n\choose2} \}\), \(\mathcal{I}= E\times V\) is no longer a subset relation but an equality. Instead of existence in \(\mathcal{I}\), we explicitly use the incidence relation \(S\) to tell us whether each possible edge “exists” or not, simplifying our graph definition further4:
4 if we allow multi-edges, then the combinatoric limit on \(\omega\) no longer holds.
\[ \begin{gathered} G = (V,E,B) \quad s.t. \quad B : E\times V \rightarrow S\\ v \in V = \{1,\cdots, n\}\quad \\ e \in E = \left\{1,\cdots, {n\choose 2} \right\} \end{gathered} \tag{3.3}\]
The representation of \(B\) in Equation 3.3 bears a remarkable similarity to our original description of design matrices in Equation 2.1. In fact, as a matrix, \(B(e,v)\) is called the incidence matrix: every row has two non-zero entries, and every column contains a number of non-zero entries equal to the corresponding node’s degree in \(G\). Traditionally, we use an oriented incidence matrix, such that each row has exactly one positive and one negative (non-zero) value.5 Even for undirected graphs, the choice of which entry is positive and which is negative is usually left ambiguous, since much of the math used later is symmetric w.r.t. direction6.
5 In fact, this would make \(B^*(v,e)\) equivalent to a graphical matroid, another common formalism that generalizes graph-like structures to vector space representations.
6 though not always!
Embedding incidences in vector space
A formalism for graphs that starts with incidence matrices would benefit from a canonical oriented incidence matrix, rather than a family that is ambiguous w.r.t. edge orientation. To start, we can be more precise by taking each row (edge) vector and partitioning it in two: one part for each non-zero column (node) that edge is incident to. Every incidence can be represented individually as a standard basis vector \(\mathbf{e}\) in the feature space of \(B\), scaled by the corresponding value of \(S\).
Let \(V_e\) be the set of nodes with (non-zero) incidence to edge \(e\). Then the incidence vectors are:
\[
\delta_e(v) = B(e,v)\mathbf{e}_v \quad \forall v\in V_e
\tag{3.4}\] And the (unoriented) incidence matrix vectors are recovered as sums over the incidence vectors for each edge: \[
\mathbf{b}_e = \sum_{v\in V_e} \delta_e(v)
\tag{3.5}\]
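As a small illustration of Equation 3.4 and Equation 3.5 in numpy (the edge, its incident nodes, and all names here are made up for the example):

```python
import numpy as np

n = 4
basis = np.eye(n, dtype=int)  # standard basis vectors e_v of the node (feature) space

# One undirected, unweighted edge e, incident to nodes u = 1 and v = 3, with S = {1}.
B_e = {1: 1, 3: 1}            # the values B(e, v) for v in V_e

deltas = {v: s * basis[v] for v, s in B_e.items()}  # Equation 3.4: delta_e(v) = B(e, v) * e_v
b_e = sum(deltas.values())                          # Equation 3.5: the edge's row of the unoriented incidence matrix

assert (b_e == np.array([0, 1, 0, 1])).all()
```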
A traditional approach might then define undirected graphs as equivalent, in some sense, to multidigraphs, where each edge is really two directed edges in opposing directions. This does give the matrix \(B\) the correct range for its entries in this formalism (the directed-graph range, \(S=\{-1,0,1\}\)), and the edge identity based on sums would hold. However, the resulting set of incidences would have twice as many edges as our combinatoric limit for simple graphs allows, and it would prevent the more elegant definition of graph types through the set \(S\). Plus, it would necessitate averaging weights over different edge IDs to arrive at a single undirected “edge weight”, along with many other implementation details that make keeping track of specifics difficult for practitioners.
Instead, we would like a canonical oriented incidence matrix, which can be derived from the vectorized incidences in the undirected range of \(B\) (the standard basis vectors). Without loss of generality, let \(u_e,v_e\in V_e\) be nodes such that \(u_e<v_e\).7 Using this, we can unambiguously define a partition \(\mathbf{b}_e=\delta_e(u_e) + \delta_e(v_e)\) of the incidences on \(e\), along with a new derived incidence matrix, \(B_o\), whose oriented rows are: \[B_o(e,\cdot)=\mathbf{b}^o_e = \delta_e(u_e)-\delta_e(v_e)\] In other words, while the unoriented incidence matrix is the “foundational” representation for graphs in our formalism, the (canonical) oriented one can be derived, even if negative incidence values are not in \(S\).8
7 the inequality is strict because self-loops are not allowed.
8 This works as long as we are in at least a ring, since semirings in general do not need to define additive inverse operations. In that case, we would be limited to the unoriented incidence matrix.
\[ \small B^{\omega \times n} = \begin{array}{c c} & \begin{array}{cccccccccc} a & b & c & d & e & f & g & h & i & j\\ \end{array} \\ \begin{array}{c c } e_1\\e_2\\e_3\\e_4\\e_5\\e_6\\e_7\\e_8\\e_9\\e_{10}\\e_{11}\\ \vdots \\e_\omega\\\end{array} & \left[ \begin{array}{cccccccccc} 1 & . & . & . & . & 1 & . & . & . & . \\ 1 & . & . & . & 1 & . & . & . & . & . \\ . & . & . & . & 1 & 1 & . & . & . & . \\ . & . & . & . & . & 1 & . & . & 1 & . \\ . & . & . & . & 1 & . & 1 & . & . & . \\ . & . & . & . & 1 & . & . & 1 & . & . \\ . & 1 & . & . & . & . & . & . & . & 1 \\ . & . & . & . & . & . & . & . & 1 & 1 \\ . & . & 1 & . & . & . & 1 & . & . & . \\ . & . & . & . & . & . & 1 & 1 & . & . \\ . & . & . & 1 & . & . & . & 1 & . & . \\ &&&& \vdots &&&&& \\ &&&& \vdots &&&&& \end{array} \right] \end{array} \]
But now that we have removed the information on “which nodes an edge connects” from our definition of edges (since every edge is just a scalar ID), how do we construct \(V_e\) without a circular dependency on \(B\) to find non-zero entries? Because we uniquely identify edges up to the combinatoric limit, we can still provide a unique ordering of the nodes in \(V_e\), without searching over the entirety of \(B\)’s domain. Using an identity from [3], we have closed-form equations both to retrieve the IDs of nodes \(u,v\) (given an edge \(e\)) and to retrieve an edge \(e\) (given two nodes \(u,v\)), for any simple graph with \(n\) vertices. \[ \begin{gathered} u_n(e) = n - 2 - \left\lfloor\frac{\sqrt{-8e + 4n(n - 1) - 7}-1}{2}\right\rfloor\\ v_n(e) = e + u_n(e) + 1 -\frac{1}{2} \left(n (n - 1) - (n - u_n(e))^2 + n-u_n(e)\right)\\ e_n(u,v) = \frac{1}{2}\left(n(n - 1) - ((n - u)^2 - n+u)\right) + v - u - 1 \end{gathered} \tag{3.6}\] This ease of calculation lets us drop some excess notation and refer to our (un)oriented incidence matrices directly in terms of each edge’s incidence on its \(u\) or its \(v\). \[ B = B_u + B_v \qquad B_o \equiv B_u - B_v \]
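A minimal implementation of Equation 3.6, assuming 0-indexed node and edge IDs (the convention under which the closed forms compose to the identity); the function names are ours for illustration:

```python
from math import comb, floor, sqrt

def u_of(e, n):
    """Smaller node ID u for edge ID e on n nodes (Equation 3.6, 0-indexed)."""
    return n - 2 - floor((sqrt(-8 * e + 4 * n * (n - 1) - 7) - 1) / 2)

def v_of(e, n):
    """Larger node ID v for edge ID e on n nodes."""
    u = u_of(e, n)
    return e + u + 1 - (n * (n - 1)) // 2 + ((n - u) * (n - u - 1)) // 2

def e_of(u, v, n):
    """Unique edge ID for a node pair u < v on n nodes."""
    return (n * (n - 1)) // 2 - ((n - u) * (n - u - 1)) // 2 + v - u - 1

# Round trip over every possible edge of a simple graph on n nodes.
n = 6
assert all(e_of(u_of(e, n), v_of(e, n), n) == e for e in range(comb(n, 2)))
```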
Inner products on \(B\)
With all of this background, the other representations of graphs can be seen as derivations from the canonical incidence matrices. The Laplacian, which is usually introduced either in terms of adjacency/degree or as the Gram matrix of the oriented edge vectors, is also related to the Gram matrix between all pairs of incidences on \((u,v)\). The other identities are simply consequences of the polarization identity: \[ \begin{split} L & = B_o^TB_o\\ & = \|B_u - B_v \|^2 \\ & = 2\|B_u\|^2 +2\|B_v\|^2 - \|B_u + B_v \|^2 \\ & = 2D - B^TB = D-A \end{split} \tag{3.7}\]
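A quick numerical check of Equation 3.7 (numpy; the toy graph and all names are illustrative, with each edge stored as a pair \(u < v\)):

```python
import numpy as np

n = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]  # each pair already satisfies u < v
omega = len(edges)

# Partitioned incidences: one non-zero entry per row, for the lower (u) and higher (v) endpoint.
B_u = np.zeros((omega, n), dtype=int)
B_v = np.zeros((omega, n), dtype=int)
for e, (u, v) in enumerate(edges):
    B_u[e, u] = 1
    B_v[e, v] = 1

B = B_u + B_v    # unoriented incidence matrix
B_o = B_u - B_v  # canonical oriented incidence matrix

G = B.T @ B              # Gram matrix of B: degrees on the diagonal, adjacency off it
D = np.diag(np.diag(G))
A = G - D

L = B_o.T @ B_o
assert (L == 2 * D - G).all() and (L == D - A).all()  # Equation 3.7
```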
The Laplacian is often used in defining random walks and Markov chains, where the degree of each node should be normalized to 1, which can be accomplished either by row- or column-normalizing it: \(L^{\textrm{rw}}=D^{-1}L\) or \(LD^{-1}\), respectively. If this degree normalization is desired without de-symmetrizing \(L\), we can still perform an operation similar to normed kernels in Equation 2.7, called the symmetric normalized Laplacian. \[ \tilde{L} = D^{-\tfrac{1}{2}}LD^{-\tfrac{1}{2}} = \frac{L(u,v)}{\sqrt{L(u,u)L(v,v)}} \tag{3.8}\]
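For completeness, a self-contained sketch of both normalizations (assuming no isolated nodes, so every degree is non-zero; the toy adjacency matrix is made up):

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
deg = A.sum(axis=1)
L = np.diag(deg) - A

L_rw = L / deg[:, None]              # D^{-1} L: row-normalized (random-walk) Laplacian
d = 1.0 / np.sqrt(deg)
L_sym = d[:, None] * L * d[None, :]  # D^{-1/2} L D^{-1/2}: symmetric normalized Laplacian (Equation 3.8)
```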
Edge Metrology, Edge Vectors
Implicit in the use of \(B\) with the design matrix mechanics from the previous chapter is a treatment of edges as “observations” (the data space), and nodes as features. If an edge is an observation, then unfortunately we cannot really quantify uncertainty over repeated measurements of edges with the simple mechanics from Additive Smoothing (because each edge is its corresponding observation, and IDs are not duplicated).
So far we have seen two ways of representing the entire graph in matrix form: the incidence matrix \(B\) (or \(B_o\)), and the inner-product matrices derived from it (\(L\), \(A\)). Since we can recover node IDs from edge IDs by Equation 3.6, we can use a single vector to represent an entire graph structure by its edges alone. Then the data dimension for vectors in “edge space” can once again represent observations, with nodes implied by Equation 3.6. This is done either by contracting \(B\) along the node (column) dimension, or by unrolling the upper triangle of \(L\) or \(A\) according to Equation 3.6.9 If each vector represents a value of \(B\) associated with a corresponding edge, then \(m\) of these vectors would be equivalent to \(m\) observations of \({n \choose 2}\) edges on the same set of \(n\) nodes. Formally, for the \(i\)th observed structure on \(n\) nodes: \[ \begin{gathered} R(i,e) = \min(\{B_i(e, u_n(e)),B_i(e,v_n(e))\}) \\ \quad \textrm{s.t.} \quad R:I\times E \rightarrow S\\ i\in I = \{1,\cdots,m\} \qquad e \in E=\left\{1,\cdots,\omega\right\}\\ n=\tfrac{1}{2}(1+\sqrt{8\omega+1}) \end{gathered} \tag{3.9}\] This representation formalizes what practitioners call “edgelists” into a data structure that can unambiguously distinguish directed, undirected, and weighted graphs. In addition, it allows for repeated measurements of edges over the same set of nodes, while flexibly growing when new nodes arrive.10 We are now able to encode the kinds of observations our hypothetical social scientist would be making of author collaboration interactions as vectors, shown in Figure 3.2.
9 Equation 3.9 uses a minimum to accomplish the contraction, but any reduction over the two nodes shared by an edge would accomplish the same, especially since we rarely see separate values of edge weight per node, the way incidences can.
10 For instance, say observations are stored as sparse entries via \(R\), and a new node arrives. First, the participating nodes can be recovered in a vectorized manner through Equation 3.6. Then, a new node id increases \(n\), followed by reassignment of the edge IDs with \(e_n(u,v)\).
\[ \small R^{m\times\omega}= \begin{array}{c c}\small & \begin{array}{ccccccccccc} e_1 & e_2 & e_3 & e_4 & e_5 & e_6 & e_7 & e_8 & e_9 & e_{10} & e_{11}\\ \end{array} \\ \begin{array}{c c } x_1\\x_2\\ \vdots \end{array} & \left[ \begin{array}{ccccccccccc} \ .\; & \;. \; & \;. \; & \;. \; & \;1 \; & \;. \; & \;. \; & \;. \; & \;1 \; & \;1 \; & . \ \\ \ .\; & \;. \; & \;. \; & \;1 \; & \;. \; & \;. \; & \;1 \; & \;1 \; & \;. \; & \;. \; & . \ \\ &&&& \vdots &&&&& \end{array} \right] \end{array} \]
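A sketch of assembling such an edge-space matrix from per-observation edge lists (0-indexed IDs; the mapping \(e_n(u,v)\) is re-implemented inline, and the observations are made up):

```python
import numpy as np
from math import comb

def e_of(u, v, n):
    """Unique edge ID for node pair u < v on n nodes (Equation 3.6, 0-indexed)."""
    return (n * (n - 1)) // 2 - ((n - u) * (n - u - 1)) // 2 + v - u - 1

# m hypothetical observations, each a list of observed undirected, unweighted edges on n nodes.
n = 5
observations = [
    [(0, 1), (1, 2), (2, 4)],
    [(0, 1), (0, 3), (3, 4)],
]

# R(i, e) = 1 if edge e appears in observation i; columns span all C(n, 2) possible edges.
R = np.zeros((len(observations), comb(n, 2)), dtype=int)
for i, obs in enumerate(observations):
    for u, v in obs:
        R[i, e_of(u, v, n)] = 1
```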
Node activation, bipartite graphs, and hypergraphs
What if an incidence structure allows more than two incidences per element of the “line” set? In our binary design matrix, we might consider each observation its own “line”, such that it is incident to all of the activated nodes. This is no longer a graph of edges and nodes, but rather a more general object.
Every incidence structure can be seen as an incidence matrix, which in turn can be thought of as a bipartite graph. In this sense, the incidence matrix is treated as a bi-adjacency matrix: a block of a larger adjacency matrix having two off-diagonal non-zero blocks
\[ A_{BP} = \begin{pmatrix} 0_{n,n} & X^T \\ X & 0_{m,m} \end{pmatrix} \]
The graph having this adjacency structure has two sets of nodes that do not intraconnect (ergo, “bipartite”). The resulting structure for our toy example is shown next to the incidence matrix in Figure 3.3.
\[ \small X^{m\times n}= \begin{array}{c c} & \begin{array}{cccccccccc} a & b & c & d & e & f & g & h & i & j\\ \end{array} \\ \begin{array}{c c } x_1\\x_2\\x_3\\x_4\\ \vdots \\ x_m\end{array} & \left[ \begin{array}{c c c c c c c c c c} . & . & 1 & . & 1 & . & 1 & 1 & . & . \\ 1 & . & . & . & 1 & 1 & . & . & . & . \\ . & 1 & . & . & . & 1 & . & . & 1 & 1 \\ . & . & . & 1 & 1 & . & . & 1 & . & . \\ &&&& \vdots &&&&& \\ &&&& \vdots &&&&& \end{array} \right] \end{array} \]
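A minimal sketch of assembling \(A_{BP}\) from a made-up binary occurrence matrix:

```python
import numpy as np

# Hypothetical occurrence matrix: m = 3 observations (rows) over n = 4 nodes (columns).
X = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 0, 0, 1]])
m, n = X.shape

# Bipartite adjacency: nodes and observations are the two parts, with no edges within a part.
A_bp = np.block([
    [np.zeros((n, n), dtype=int), X.T],
    [X, np.zeros((m, m), dtype=int)],
])
assert (A_bp == A_bp.T).all()
```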
When the set of lines is a family of subsets of points, the incidences between points and lines form a hypergraph. Hypergraphs are usually thought of as graphs where edges can connect more than two nodes, which again can be made into an incidence structure in the same way our graphs are. This isomorphism lets many familiar ideas on graphs (e.g., walks, paths, Laplacians, etc.) be reformulated in terms of hypergraphs. For a more in-depth treatment of algebraic graph theory, see [4].
Inner products on hyperedges
Rather than go into similar detail about hypergraphs, we want to focus on the implications of performing inner-product calculations with the hypergraph’s edge vectors. \(\mathbf{x}_i^T\mathbf{x}_{i'}\) will not necessarily be a binary value, but a sum over all nodes shared by the two edges. As mentioned previously, the Gram matrix for \(X\) will count co-occurrences in the off-diagonals, with marginal counts on the diagonal. This means that the relation between edge-vector entries and node activations is not so straightforward: each entry has a “magnitude”, so forcing a binary edge activation from the adjacency or Laplacian forms (using Equation 3.6) will necessarily lose information. This information loss is one of the key precautions practitioners are advised to consider in [2], since transformation between any of the representations for complex systems will change the information they encode.
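Concretely (with a made-up occurrence matrix), both inner products can be computed in a line each; the off-diagonal of the Gram matrix counts node co-occurrences, while its diagonal holds the marginal counts:

```python
import numpy as np

# Rows are observations (hyperedges), columns are nodes.
X = np.array([[1, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 1, 1]])

overlap = X @ X.T  # (i, i') entry: number of nodes shared by hyperedges i and i'
gram = X.T @ X     # off-diagonal: node co-occurrence counts; diagonal: marginal node counts

assert (np.diag(gram) == X.sum(axis=0)).all()
```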
Combining Occurrence & Dependence
As an aside, there are several ways that known graph structures can be exploited to regularize distance or association measures for hypergraphic data. Graphs of dependency information can provide useful kernels on the graph.11 For an in-depth analysis of common kernels on graphs, including the now-famous PageRank, Heat, and Commute-time kernels, see [5]. In [6], the “commute time” kernel is said to generate a “generalized Euclidean distance”, and using an adjacency matrix kernel before normalization by marginals gives what are known as “soft cosine” measures, which incorporate prior knowledge, e.g., for semantic similarities between words.
11 Note that a “kernel on a graph” works by defining distances between weighted vectors of nodes. A “graph kernel,” on the other hand, describes a kernel that defines distances between different graphs.
Importantly, each of these methods assumes that a network structure is already available. What often happens in practice, however, is the use of a filtered/thresholded version of the hypergraphic Gram matrix as the network, because no a priori network is available. This runs the risk of over-estimating the proximity of nodes in similar regions of node-space, reinforcing local correlations at the expense of long-distance path estimates. A more formal analysis of this problem, which we term “clique bias,” can be found in The Gambit of the Inner Product.
As covered in the introduction, finding the underlying dependency network from hypergraphic data is the core concern of network recovery. Reliably recovering this structure would then allow for less biased use of these graph kernels in real-life applications.