Complexity Reduction as a Complement to Mechanistic Interpretability

Draft Paper.

Complexity Reduction in Networks as a Philosophical-Sociological Complement to the Mathematical Philosophy of Mechanistic Interpretability

How a Systems-Theoretic Model of Knowledge Formation Can Extend the AISI Framework

Frank Pieper, February 2026

I. Introduction: Two Perspectives on the Same Problem

Kola Ayonrinde and Louis Jaburi, working at the UK AI Security Institute (AISI), have produced two remarkable papers on the philosophy of Mechanistic Interpretability. The first, A Mathematical Philosophy of Explanations in Mechanistic Interpretability, develops a philosophical foundation for understanding neural networks as containing implicit explanations that can be extracted and understood. The second, Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability, builds on this foundation by proposing a pluralist framework for evaluating the quality of such explanations. Together, these papers establish an information-theoretic epistemology for AI: understanding is compression, structure is compressibility, and good explanations exploit regularities in data.

This essay argues that the Principle of Complexity Reduction in Networks (PKRN), a philosophical-sociological model of knowledge formation in communication networks, constitutes a natural and substantive complement to the AISI framework. The PKRN describes how communicative actors — humans, teams, organisations — form stable knowledge structures through sense-imputation, selection, and stabilisation. It arrives at the same core insight as the AISI papers — that structure is compression-driven — but from a radically different theoretical tradition: systems theory, semiotics, and the philosophy of signs. The convergence is not accidental. It reflects a deep structural homology in the way knowledge is produced, whether by neural networks or by social systems.

Yet convergence is only half the story. The PKRN also introduces dimensions that the AISI framework, by the nature of its object of study, does not address: the genesis of structure in time, the role of sense-imputation as the initiating condition for information processing, the treatment of contingency as an open and revisable dimension, and the embedding of knowledge in multi-layered networks of social, semantic, and semiotic relations. These additions do not contradict the AISI framework. They extend it into territory that matters deeply once we ask not only how AI systems explain themselves, but how their explanations relate to the explanatory structures that humans and organisations actually produce.

II. The Shared Foundation: Compression as Epistemological Principle

The most fundamental point of convergence lies in the concept of structure itself. Ayonrinde and Jaburi define structure in explicitly information-theoretic terms: a system contains structure if its generating process can be expressed more concisely than the observations of the system. Structure is compressibility. They invoke Wilkenfeld’s argument that understanding and compression are closely linked: we understand a phenomenon if we have an explanation that compresses the data into a more concise form such that we could reproduce the data from the explanation or use the explanation to predict future data.

The PKRN arrives at the same definition from a different starting point. In communication networks, actors face a surplus of signals. To remain capable of action, they must select from this surplus. They do so by imputing sense — by forming a hypothesis about what matters. This imputation drives a process of stabilisation: patterns that prove useful in practice become routines, conventions, and eventually the stable structures that organise the network’s behaviour. Heterogeneous signal fields are reduced to fewer, manageable units. The result is the same: a generating description that is shorter than the totality of observations.

What makes this convergence significant is that it is not merely verbal. The AISI papers formalise their claim through Kolmogorov complexity, Shannon description length, and Bayesian likelihood measures. The PKRN provides a complementary formalisation through mathematical category theory: the stabilised patterns are described as colimits — universal constructions in which relations between substructures condense into a new, emergent whole. A colimit is that object which summarises the given relations in the most economical way possible. Both formalisations express the same intuition: structure is the minimal description of a complex field. But they do so from different mathematical traditions, which suggests that the underlying principle is robust rather than artefactual.
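The information-theoretic definition of structure used in this section can be illustrated with a small sketch. Kolmogorov complexity is not computable, so the sketch below uses the output length of a general-purpose compressor (Python's zlib) as a crude, assumed stand-in; the two byte sequences and the name compression_ratio are illustrative choices, not part of either framework.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by raw size; a lower value means more structure."""
    return len(zlib.compress(data, 9)) / len(data)


# A highly regular observation sequence: its generating rule is tiny.
structured = b"ABAB" * 2500            # 10,000 bytes
# An irregular sequence of the same length from the OS entropy source.
unstructured = os.urandom(10_000)

ratio_structured = compression_ratio(structured)
ratio_unstructured = compression_ratio(unstructured)

print(f"structured:   {ratio_structured:.3f}")
print(f"unstructured: {ratio_unstructured:.3f}")
```

The regular sequence shrinks to a small fraction of its raw size, while the random one does not shrink at all. In the vocabulary of this section, only the first has a generating description shorter than the totality of observations.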

III. Connection Point One: Unification and Network Densification

The Explanatory Virtues Framework introduces Unification as a key virtue. Formally, Unification measures the degree to which an explanation accounts for dependencies between observations that would be invisible if each observation were considered in isolation. The mathematical definition captures the difference between the joint likelihood of all observations under the explanation and the product of their individual likelihoods. In plain terms: a unifying explanation reveals connections between phenomena that were previously separate.
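A minimal worked example may help fix the idea. It assumes a log-ratio reading of the verbal definition above, namely log P(x1, x2 | E) minus the sum of log P(xi | E); the exact formula in the Explanatory Virtues Framework may differ in detail. The two toy explanations and all names in the code are hypothetical.

```python
import math
from typing import Dict, Tuple

# An "explanation" is modelled here as a joint distribution over two
# binary observations (an assumption made purely for illustration).
Joint = Dict[Tuple[int, int], float]


def marginal(joint: Joint, index: int, value: int) -> float:
    """Probability that observation `index` takes `value`."""
    return sum(p for outcome, p in joint.items() if outcome[index] == value)


def unification(joint: Joint, observed: Tuple[int, int]) -> float:
    """Log joint likelihood minus the log of the product of marginals."""
    log_joint = math.log(joint[observed])
    log_product = sum(
        math.log(marginal(joint, i, v)) for i, v in enumerate(observed)
    )
    return log_joint - log_product


# Explanation A: the two observations always agree.
coupled: Joint = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}
# Explanation B: two independent fair coin flips.
independent: Joint = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

score_coupled = unification(coupled, (1, 1))
score_independent = unification(independent, (1, 1))
```

For the observed pair (1, 1), the coupled explanation scores log 2 and the independent one scores 0. The positive score records a dependency that disappears when each observation is considered in isolation.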

This is precisely what the PKRN describes as network densification through cluster formation. According to the Small-World theory of Watts and Strogatz, networks develop dense local clusters with strong internal connections, linked to one another by weak but strategically important bridge nodes. Within clusters, meaning stabilises rapidly through repeated exchange. Between clusters, meaning must be translated — a process that requires effort and semantic work.

The structural parallel is exact. What the AISI framework calls Unification — the integration of previously isolated data points into a common explanatory structure — is what the PKRN calls cluster formation: the emergence of dense relational structures in which heterogeneous signals are bound together by shared sense-imputations. When an explanation in Mechanistic Interpretability achieves high Unification, it does in the formal domain of theory evaluation what the formation of a knowledge cluster does in the social domain of organisational learning: it creates edges between previously disconnected subgraphs.
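The network side of this parallel can be sketched with a toy graph. The code below is a deterministic simplification and not the random-rewiring procedure of Watts and Strogatz: it builds a ring lattice of dense local neighbourhoods and then adds two fixed long-range bridges. The graph size, the bridge positions, and all function names are assumptions chosen for illustration.

```python
from collections import deque
from typing import Dict, Set

Graph = Dict[int, Set[int]]


def ring_lattice(n: int, k: int) -> Graph:
    """Each node is linked to its k nearest neighbours (k/2 on each side)."""
    graph: Graph = {node: set() for node in range(n)}
    for node in range(n):
        for offset in range(1, k // 2 + 1):
            neighbour = (node + offset) % n
            graph[node].add(neighbour)
            graph[neighbour].add(node)
    return graph


def average_clustering(graph: Graph) -> float:
    """Mean fraction of a node's neighbour pairs that are themselves linked."""
    total = 0.0
    for neighbours in graph.values():
        degree = len(neighbours)
        if degree < 2:
            continue
        links = sum(
            1 for a in neighbours for b in neighbours if a < b and b in graph[a]
        )
        total += links / (degree * (degree - 1) / 2)
    return total / len(graph)


def average_path_length(graph: Graph) -> float:
    """Mean shortest-path distance over all reachable ordered pairs (BFS)."""
    total, pairs = 0, 0
    for source in graph:
        distance = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for neighbour in graph[current]:
                if neighbour not in distance:
                    distance[neighbour] = distance[current] + 1
                    queue.append(neighbour)
        total += sum(distance.values())
        pairs += len(distance) - 1
    return total / pairs


lattice = ring_lattice(n=40, k=4)

# Copy the lattice and add two hypothetical bridges across the ring.
bridged = {node: set(neighbours) for node, neighbours in lattice.items()}
for a, b in [(0, 20), (10, 30)]:
    bridged[a].add(b)
    bridged[b].add(a)

lattice_clustering = average_clustering(lattice)
lattice_path = average_path_length(lattice)
bridged_clustering = average_clustering(bridged)
bridged_path = average_path_length(bridged)
```

In this toy case the local clustering barely changes (0.5 for the lattice, about 0.48 with bridges), while the average path length drops. A few edges between previously distant regions shorten the routes between them without dissolving the dense local clusters.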

The PKRN adds a dimension that the formal definition of Unification does not capture: the cost structure of integration. In communication networks, the effort required to stabilise meaning increases disproportionately with semantic distance. This is why clusters have limited size, why meaning travels across boundaries only with significant translation effort, and why the degree of achievable Unification is constrained by the network’s topology. This observation is relevant for Mechanistic Interpretability because it suggests that the degree of achievable Unification in MI explanations may depend not only on the properties of the neural network but also on the structure of the interpreter’s conceptual network — a point that resonates with Ayonrinde and Jaburi’s own recognition that explanations are theory-laden.

IV. Connection Point Two: Hard-to-Varyness and Stable Attractors

The Deutschian virtue of Hard-to-Varyness is one of the most distinctive contributions of the Explanatory Virtues Framework. An explanation is hard to vary if it sits at a local maximum of the function hv(E) = log(Acc(E)) − k(E), where Acc is accuracy and k is a complexity measure. The intuition is that a good explanation cannot be easily modified to accommodate contradictory evidence. It is not a flexible framework that can be stretched to fit anything; it is a precise mechanism whose components are tightly interdependent.
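The local-maximum condition can be made concrete with a small sketch. The accuracy and complexity values below are invented for illustration, the complexity term is assumed to be expressed in the same units as the natural logarithm, and the notion of a "neighbouring variation" is simply stipulated by hand; none of this is taken from the Explanatory Virtues Framework itself.

```python
import math
from typing import Dict, List, Tuple


def hv(accuracy: float, complexity: float) -> float:
    """Hard-to-varyness score hv(E) = log(Acc(E)) - k(E)."""
    return math.log(accuracy) - complexity


# Hypothetical candidates: name -> (accuracy, complexity).
candidates: Dict[str, Tuple[float, float]] = {
    "E": (0.95, 1.0),        # the explanation under evaluation
    "E_minus": (0.60, 0.9),  # a component removed: simpler, far less accurate
    "E_plus": (0.96, 1.5),   # an ad hoc patch: barely more accurate, costlier
}

# Hypothetical neighbourhood: which variations count as "nearby".
variations: Dict[str, List[str]] = {"E": ["E_minus", "E_plus"]}


def is_hard_to_vary(name: str) -> bool:
    """True if every listed variation scores strictly lower than `name`."""
    own = hv(*candidates[name])
    return all(own > hv(*candidates[other]) for other in variations[name])


scores = {name: hv(acc, k) for name, (acc, k) in candidates.items()}
e_is_local_maximum = is_hard_to_vary("E")
```

With these numbers, removing a component costs more accuracy than it saves in complexity, and the patch costs more complexity than it gains in accuracy, so E sits at a local maximum. The result depends entirely on which variations are admitted as neighbours, which is itself an interpretive choice.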

The PKRN describes a structurally identical phenomenon in the formation of stable patterns in communication networks. Structures stabilise where the investment in sense-stabilisation pays off — where the sustained benefit exceeds the cost of maintaining the pattern. These stable patterns are local optima in the tension field between utility and complexity. A successful business model, a well-established terminology, a functioning organisational routine — these are, in the language of the Explanatory Virtues Framework, hard-to-vary structures. They persist not because they are absolutely optimal, but because any local modification would reduce their fit to the communicative environment.

The PKRN enriches this picture by describing the dynamics around these stable points. In the language of dynamical systems, the stabilised patterns behave like solitons — wave-like structures that maintain their form even as they move through a changing medium. A soliton is stable not because the medium is static, but because its internal structure resists perturbation. This metaphor captures something that the static optimisation landscape of the Hard-to-Varyness function does not: the fact that stable structures in real knowledge systems must maintain their coherence while the environment changes around them.

Moreover, the PKRN describes how these stable structures eventually dissolve. Drawing on Thomas Kuhn’s theory of scientific revolutions, it shows that anomalies — signals that deviate from the expectations encoded in the stable pattern — accumulate over time. When they exceed a threshold, the structure breaks apart and is replaced by a new configuration. This dynamic is directly relevant to interpretability: an explanation that is hard to vary at one point in time may become easy to vary as the model changes or as new data become available. The virtue of Hard-to-Varyness, seen through the PKRN lens, is not a static property but a temporal one.

V. Connection Point Three: Ur-Explanations and Suspended Difference

Perhaps the deepest connection between the two frameworks concerns the concept of ur-explanations. Ayonrinde and Jaburi define ur-explanations as the idealised explanations of model behaviour on an input distribution, given in terms of the model’s learned internal structures. These are not explanations that someone formulates from outside; they are explanations that are always already present within a trained, generalising model. The internal computations over learned representations constitute not only a prediction but also an explanation of the process by which the model came to its result. The goal of Mechanistic Interpretability is to extract these ur-explanations — to make explicit what is already implicitly there.

The PKRN develops an analogous concept that it calls “suspended difference” (stillgelegte Differenz). When a communication network stabilises a pattern, it does not eliminate the differences that were present before stabilisation. It suspends them. The alternatives that were not chosen, the signals that were not selected, the meanings that were not stabilised: all of these remain present as latent possibilities. The stable structure is not a final state but a provisional settlement, a temporary quieting of difference.

In both cases, the structure contains more than what is visible on the surface. The ur-explanation of a neural network contains the compressed residue of all the training dynamics that shaped it. The suspended difference in a knowledge structure contains the memory of all the alternatives that were not pursued. Both are, in a precise sense, frozen compression results — structural sediments that carry within them the history of their own formation.

But here the parallel reaches its limit and the complementary difference becomes visible. The ur-explanation of a neural network is implicit and must be extracted through interpretive effort. It is, so to speak, sealed. One can open it through analysis (and that is the project of Mechanistic Interpretability), but the model itself cannot revise its own ur-explanations without retraining. The suspended difference in an organisational knowledge structure, by contrast, remains operationally available. Organisations can reopen decisions. They can revisit alternatives. They can use the memory of what was excluded as a resource for learning and adaptation.

This difference matters for the AISI framework because it illuminates a structural limitation of the current concept of explanatory faithfulness. When Ayonrinde and Jaburi define explanatory faithfulness as the degree to which an explanation matches the model’s ur-explanation, they presuppose that the ur-explanation is fixed. The PKRN suggests that in living knowledge systems, the equivalent of the ur-explanation is not fixed but oscillating — it carries within itself the possibility of its own revision. A truly comprehensive account of explanatory quality would need to address this temporal dimension: not only how well an explanation fits the current structure, but how well it captures the structure’s potential for change.

VI. Connection Point Four: The Three-Layered Network and the Limits of Model-Level Explanation

One of the most productive extensions that the PKRN offers concerns the architecture of knowledge itself. Ayonrinde and Jaburi are explicit about the fact that Mechanistic Interpretability produces model-level explanations — explanations that concern the neural network in isolation from the broader system. They acknowledge that this is a limitation: system-level behaviours, multi-agent dynamics, and extended cognitive architectures may require explanation at a level that model-level analysis cannot reach.

The PKRN provides a framework for thinking about what such system-level explanations might look like. Following Renn, Wintergruen, Lalli, Laubichler, and Valleriani, it describes knowledge as embedded in a three-layered network:

The social network comprises the relationships between actors — individuals, groups, organisations. This is the layer of trust, power, collaboration, and conflict. Knowledge here is carried by the patterns of interaction themselves: who communicates with whom, how often, under what conditions.

The semantic network comprises concepts, models, interpretive frameworks, and theories. This is the layer of meaning. Knowledge here resides in the connections between ideas — in the way that terms relate to one another, that models frame observations, that theories structure expectations.

The semiotic network comprises signs, documents, artefacts, and formal codes. This is the layer of inscription. Knowledge here is materialised: in texts, in software, in physical products, in institutional rules. Unlike social and semantic knowledge, semiotic knowledge is comparatively stable — documents do not change their content spontaneously.

The interplay between these three layers is constitutive for the formation and transformation of knowledge. An innovation might begin as a new idea in the semantic network, spread through social relationships, and eventually be stabilised in documents and artefacts. Conversely, a change in external representations — a new software system, a revised contract template — can alter the semantic landscape and thereby the social dynamics of the organisation.

This three-layered architecture offers Mechanistic Interpretability a conceptual model for what system-level explanation could mean. Neural networks, in their deployment context, are embedded in precisely such multi-layered structures. The model’s outputs enter social networks (they are read, discussed, acted upon), they influence semantic networks (they shape the concepts people use), and they become inscribed in semiotic networks (they are stored, cited, embedded in documents and workflows). A comprehensive account of how AI-generated explanations function would need to trace these pathways across all three layers — something that the PKRN’s architecture is designed to do.

VII. Connection Point Five: Temporal Architecture and the Missing Dimension

The deepest and most consequential extension that the PKRN offers to the AISI framework concerns time. Ayonrinde and Jaburi’s framework is, by design, largely atemporal. It evaluates explanations at a given moment: how well does the explanation fit the data? How simple is it? How hard is it to vary? These are properties of a static configuration. The framework does not systematically address how explanations change over time, how the process of explanation itself unfolds, or how the temporal structure of knowledge production affects the quality of explanations.

The PKRN places time at the centre of its analysis. It distinguishes between structural chunking — the spatial organisation of knowledge into clusters and stable patterns — and temporal chunking — the modulation of time through rhythms, cycles, review periods, and planning horizons. These two dimensions are not independent; they interact in a double spiral. Temporal cycles force structural patterns to be reassessed; structural changes reshape the temporal organisation of attention and evaluation.

Temporal chunking means that the network constructs windows of actionable present. Instead of continuously monitoring all signals, organisations establish periods during which they freeze their information state, make decisions, and then update. This is not a limitation but a structural necessity: without temporal chunking, the system would be overwhelmed by the continuous flow of signals and unable to act. Quarterly reports, sprint cycles, annual strategy reviews — these are all instances of temporal chunking that create the conditions for structured knowledge production.

For Mechanistic Interpretability, this temporal dimension is relevant in at least two ways. First, it provides a framework for understanding how MI explanations are produced and consumed over time. The process of interpreting a neural network is not instantaneous; it unfolds in cycles of hypothesis formation, testing, revision, and refinement. The quality of an explanation depends not only on its formal properties at a given moment but also on the temporal process by which it was produced and the temporal context in which it is used.

Second, the concept of temporal chunking illuminates a structural feature of neural networks themselves. A trained model is, in the PKRN’s terms, a system whose temporal chunking has been completed: its time is folded and sealed in parameters. The model cannot revisit its own temporal process. It cannot reopen the decisions that shaped its weights. By contrast, the interpreter of the model operates in open time — with the ability to revise, to take new perspectives, to update explanations. The asymmetry between the closed temporality of the model and the open temporality of the interpreter is, the PKRN suggests, not a peripheral observation but a central structural feature of the interpretability enterprise.

VIII. Connection Point Six: Sense-Imputation and the Genesis of Information

The AISI papers operate within a framework where information is, in a certain sense, already given. The model has been trained on data. The data contain regularities. The model compresses these regularities. Interpretability consists in recovering the compressed structures. At no point does the framework need to ask where the data came from or how they acquired their significance. For the purposes of evaluating explanations, the data simply are.

The PKRN begins from a different premise. In communication networks, information is not given; it is produced. The crucial step is sense-imputation (Sinnunterstellung): before any signal can be processed, before any regularity can be detected, an actor must make an assumption about the potential relevance of the signal. Without this initial hypothesis (“this might matter!”), no information processing begins at all. Sense-imputation is the generative act that transforms mere signals into potential information.

This concept has no direct equivalent in the AISI framework, and it need not have one for the purposes of evaluating MI explanations. But it becomes important the moment we ask a broader question: What are the conditions under which good explanations can be produced? If explanations are compressions of data, and data become informative only through sense-imputation, then the quality of an explanation depends on the quality of the prior sense-imputations that shaped the data’s significance.

For Mechanistic Interpretability, this has a practical implication. The AISI papers acknowledge that explanations are theory-laden and value-laden — that the interpreter’s priors, goals, and conceptual frameworks shape what counts as a good explanation. The PKRN provides a systematic account of how these priors are formed: through sense-imputation in communicative networks, through the stabilisation of shared concepts, through the establishment of what counts as relevant. The interpreter of a neural network is not an isolated agent applying formal criteria; they are a node in a social-semantic-semiotic network whose sense-imputations have been shaped by that network’s history.

This perspective suggests that the Principle of Explanatory Optimism, Ayonrinde and Jaburi’s conjecture that the algorithmic structures of generalising neural networks are human-understandable, is not a purely cognitive claim. It is a claim about the compatibility between the network structures of AI and the network structures of human knowledge. Whether a model’s concepts are “alien” in Ayonrinde and Jaburi’s sense depends not only on the intrinsic complexity of those concepts but also on the structure of the human knowledge networks that would need to receive and integrate them.

IX. Philosophical Underpinnings: Where the PKRN Deepens the AISI Framework

The AISI papers draw on a rich philosophical tradition: Popper on falsifiability, Deutsch on hard-to-vary explanations, Kuhn on theoretical virtues, Hempel on nomological explanation, Marr on levels of analysis. These references are primarily from the philosophy of science and from epistemology in the analytic tradition. They provide an excellent foundation for evaluating the formal properties of explanations.

The PKRN draws on a different, complementary philosophical tradition that adds depth in areas where the AISI framework remains silent.

From Josef Simon’s philosophy of signs comes the concept of the pragmatic stop-rule. Simon argues that we understand a sign when we stop asking for its meaning, that is, when further interpretation becomes unnecessary because the sign functions adequately in practice. This is the philosophical counterpart to the AISI concept of Simplicity: an explanation is simple enough when further elaboration would add complexity without adding understanding. But Simon’s formulation adds a crucial nuance. The stop-point is not determined by the sign alone but by the interpreter’s practical context. The “right” level of compression depends on what the interpreter needs to do, a point that resonates with Ayonrinde and Jaburi’s recognition of the value-ladenness of explanations.

From Alain Badiou’s ontology comes the concept of counting-as-one. Badiou argues that unity is not a property of things but a result of an operation: we count something as one unit in order to be able to work with it. Every identification of an entity (a feature, a circuit, a representation) is an act of counting-as-one that constitutes the entity as an object of analysis. For Mechanistic Interpretability, this means that the “features” and “circuits” that MI discovers are not simply found in the network; they are constituted by the interpretive operation that identifies them. This is not a criticism of MI; it is a philosophical deepening of its self-understanding. The AISI papers come close to this insight when they note that representations require not only Information and Use but also Misrepresentation criteria. Badiou’s framework makes the underlying operation explicit.

From Bernhard Waldenfels’ phenomenology of order comes the insight that every ordering operation simultaneously includes and excludes. What is selected into a structure is visible; what is excluded from it is pushed to the margins but does not disappear. Waldenfels calls this “order in twilight”: the recognition that all order is partial, contested, and permeable. For Mechanistic Interpretability, this means that every explanation illuminates some aspects of the network while necessarily leaving others in shadow. The choice of which aspects to illuminate is not neutral; it reflects the interpreter’s values, interests, and theoretical commitments.

These three philosophical contributions (Simon’s pragmatic stop, Badiou’s counting-as-one, Waldenfels’ order in twilight) provide the AISI framework with a reflexive dimension. They make visible the operations that the framework itself performs: the decision to stop explaining, the constitution of objects of analysis, the inclusion and exclusion that every explanatory choice entails. This reflexive awareness does not undermine the formal rigour of the Explanatory Virtues Framework; it deepens it by revealing the conditions of its own possibility.

X. Practical Implications: What the Complementarity Means

The complementarity between the AISI framework and the PKRN is not merely theoretical. It has practical consequences for how we think about and evaluate AI systems.

First, the PKRN suggests that the evaluation of MI explanations should take into account the network context in which explanations are produced and used. An explanation that is formally excellent — simple, unified, hard to vary — may fail in practice if it does not connect to the semantic and social networks of its intended audience. The three-layered network architecture provides a framework for thinking about the pragmatic conditions of explanatory success.

Second, the temporal dimension that the PKRN introduces suggests that the evaluation of explanations should not be purely synchronic. A good explanation is not only one that fits the current data but one that facilitates the ongoing process of understanding — that opens pathways for further investigation, that makes its own limitations visible, that can be revised as understanding evolves. The Kuhnian virtue of Fruitfulness captures part of this idea, but the PKRN’s account of temporal chunking and autopoietic self-renewal provides a more comprehensive framework.

Third, the concept of sense-imputation has implications for the design of interpretability workflows. If the quality of explanations depends on the prior sense-imputations that shape the interpreter’s attention, then improving interpretability is not only a matter of developing better formal tools. It is also a matter of cultivating the right interpretive dispositions — of training attention, of diversifying perspectives, of creating institutional structures that support sustained and reflexive inquiry.

Fourth, and perhaps most importantly, the PKRN’s account of suspended difference provides a framework for thinking about one of the most pressing questions in AI governance: not only what AI systems know, but what they have excluded in the process of knowing. Every compression involves a loss. Every regularity is extracted at the cost of ignoring irregularities. The PKRN makes this loss visible and suggests that a mature governance framework should attend not only to what AI explanations reveal but also to what they conceal.

XI. The Road Ahead: Towards a Shared Vocabulary

Ayonrinde and Jaburi conclude their second paper with a programme for the future of Mechanistic Interpretability. They identify three research directions: the development of a principled concept of Simplicity, the pursuit of Unification and Co-Explanation, and the derivation of universal (nomological) principles for neural networks. In each of these directions, the PKRN offers a complementary perspective.

On Simplicity: the PKRN’s concept of the pragmatic stop-rule (following Josef Simon) and the operation of counting-as-one (following Alain Badiou) provide a philosophical foundation for what counts as “simple enough.” Simplicity is not an absolute property but a relational one, defined by the needs and capacities of the interpreter within their communicative network.

On Unification: the PKRN’s description of cluster formation, network densification, and the three-layered architecture of knowledge provides a concrete model for what Unification means in practice. Unifying explanations are those that create connections across previously separate domains — across social, semantic, and semiotic layers of the knowledge network.

On nomological principles: the PKRN itself may be understood as a candidate for such a principle. If the same compression-driven logic of structure formation operates in neural networks, in organisational communication, and in cultural processes of temporal modelling, then this logic itself is a nomological principle of knowledge production — not a law of physics, but a structural invariant of information-processing systems.

The convergence between the AISI framework and the PKRN is, in the end, more than a convenient alignment of two independent research programmes. It points towards a deeper insight: that the structures of knowledge are universal in a specific sense. They are universal not because they are produced by the same mechanisms — neural networks and organisations produce knowledge in fundamentally different ways — but because they are constrained by the same structural logic. Compression of difference, stabilisation of regularity, and the formation of generalisable structures: this is the grammar of knowledge production, regardless of the medium.

The AISI papers provide the formal apparatus for evaluating this grammar in the domain of artificial intelligence. The PKRN provides the philosophical-sociological apparatus for understanding how this grammar operates in the domain of human and organisational knowledge. Together, they open the possibility of a genuinely interdisciplinary account of knowledge — one that can speak with equal precision about the internal structures of neural networks and about the communicative structures of the organisations that build, deploy, and govern them.

References

Ayonrinde, K. & Jaburi, L. (2025). A Mathematical Philosophy of Explanations in Mechanistic Interpretability. The Strange Science: Part I.i. UK AI Security Institute.
Ayonrinde, K. & Jaburi, L. (2025). Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability. The Strange Science: Part I.ii. UK AI Security Institute.
Badiou, A. (2005/2016). Das Sein und das Ereignis. Zurich-Berlin: Diaphanes.
Kuhn, T. S. (1962). The Structure of Scientific Revolutions. Chicago: University of Chicago Press.
Pieper, F. (2026). Komplexitaetsreduktion in Netzwerken: Ein universelles Prinzip. Working Paper.
Renn, J., Wintergruen, D., Lalli, R., Laubichler, M. & Valleriani, M. (2016). Netzwerke als Wissensspeicher. In J. Mittelstrass & U. Ruediger (Eds.), Die Zukunft der Wissensspeicher. Konstanz: UVK.
Simon, J. (1989). Philosophie des Zeichens. Berlin: de Gruyter.
Waldenfels, B. (1987). Ordnung im Zwielicht. Frankfurt: Suhrkamp.
Watts, D. J. & Strogatz, S. H. (1998). Collective dynamics of ‘small-world’ networks. Nature, 393(6684), 440-442.
