Issue #51 // Mapping Biology's Dark Matter
An in-depth guide to working with biological knowledge graphs
Enjoy long-form writing? This article has >10,000 words, making it the longest piece on Sequence & Destroy yet. If you want to see more long-form technical guides of this sort, tap the 🖤 in the header above. It's a small gesture that goes a long way in helping me understand what you value and in growing this newsletter. Thanks so much!
When we sequence a human genome, we generate ~100 gigabytes of raw data. When we analyze gene expression across multiple tissues, we produce matrices with hundreds of thousands of data points. When we study protein-protein interactions, we map thousands of binary relationships. And when we examine the scientific literature, we confront over 35 million published papers in PubMed alone. Yet, despite this abundance of data, or perhaps because of it, our understanding of biological systems remains fragmented. We can see pieces of the puzzle, but it's too large and unwieldy to assemble the complete picture.
As someone interested in developing new technologies and analysis tools, I've become increasingly convinced that our ability to integrate diverse, heterogeneous biological data will determine the pace of scientific discovery over the next decade. One way of accomplishing this is with biological knowledge graphs (BKGs). Biological knowledge graphs aren't just data visualization tools. They're interactive computational models that capture the inherent interconnectedness of biological data, and by extension biological systems, while maintaining the richness of context needed for meaningful, "human-ready" interpretation. By encoding biological entities as nodes and their relationships as edges, BKGs create navigable representations of biological knowledge that transcend one-off, siloed databases.
In this issue of Sequence & Destroy, we’ll work through the theoretical foundations and practical implementation of biological knowledge graphs, with particular attention to integrating diverse data types: omics data, scientific literature, time-series measurements, and biological ontologies. My goal is not just to explain what BKGs are, but to provide you with the conceptual and technical understanding needed to build and utilize them in your own work.
THE ANATOMY OF A KNOWLEDGE GRAPH
Why Graphs for Representing Biology?
Biology is fundamentally relational1. Genes regulate other genes, proteins modify and interact with other proteins, and diseases affect phenotypes. Traditional relational databases, which organize data into tables (aka "relations") with rows and columns, excel at storing and retrieving well-structured data with predefined relationships. However, they impose rigid schemas that struggle to accommodate the complex, non-linear, and evolving nature of biological knowledge.
This becomes clear when attempting to model branched pathway systems, for example. When we change the concentration of one metabolite, we don't just alter its immediate product; we potentially affect countless downstream processes through regulatory cascades and metabolic shifts. A graph model seamlessly accommodates these multi-step, non-linear relationships.
Consider a simple example: a single nucleotide polymorphism (SNP) in the coding region of a gene. This variant alters the amino acid sequence of the encoded protein, which changes its binding affinity with certain molecules, in turn modifying a signaling pathway that affects cellular behavior, ultimately contributing to disease manifestation. In a relational database, capturing this causal chain requires complex joins across multiple tables. In a graph, it's a direct path from SNP to disease, with each intermediate entity and relationship explicitly represented, as shown below.

Graphs offer more than just a convenient representation, though. They enable computations that are cumbersome or impossible to perform in relational models. For example, finding all genes two regulatory steps away from a target gene becomes a simple traversal operation. Additionally, identifying common mechanisms between seemingly unrelated diseases becomes a matter of finding shared subgraphs, and prioritizing candidate genes based on network connectivity becomes straightforward through centrality metrics2.
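To make this concrete, here's a minimal sketch in Python using NetworkX, on a toy regulatory network I've made up for illustration, showing how "all genes within two regulatory steps of a target" reduces to a single traversal call:

import networkx as nx

# Toy directed regulatory network (hypothetical gene names)
G = nx.DiGraph()
G.add_edges_from([
    ("TF1", "GENE_A"), ("GENE_A", "GENE_B"),
    ("TF1", "GENE_C"), ("GENE_C", "GENE_D"),
])

# All genes reachable from TF1 in at most two regulatory steps
within_two_steps = nx.single_source_shortest_path_length(G, "TF1", cutoff=2)
print(within_two_steps)  # e.g. {'TF1': 0, 'GENE_A': 1, 'GENE_C': 1, 'GENE_B': 2, 'GENE_D': 2}

# A simple centrality metric for prioritizing candidate genes
print(nx.degree_centrality(G))

Asking the same question of a relational schema would require one self-join on an interaction table per traversal step.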
The Building Blocks of Biological Knowledge Graphs
Biological Knowledge Graphs are composed of three basic building blocks: nodes, edges, and properties. Nodes, also known as vertices, represent biological entities and the granularity of these entities determines the utility of the graph. For example, consider bulk RNA-sequencing data; while you could represent an entire RNA-seq experiment’s data as a single node in a graph, this approach obscures the gene-level information that makes the data valuable in the first place. Instead, each gene can be represented as a node, with the relationships between genes encoded in edges, and experimental metadata attached as properties. Similarly, for single-cell RNA-seq data, individual cells become nodes, connected to gene nodes through expression relationships3.
When working with scientific literature, we face a similar question about granularity. Treating each word as a node would create an unwieldy graph with minimal biological meaning. Instead, we extract biomedical concepts—genes, proteins, diseases, phenotypes, compounds—that serve as the backbone of the graph. So, for example, if we were working with a paper about SARS-CoV-2 we might extract entities like "ACE2 receptor," "viral entry," "cytokine storm," and "pulmonary infiltrates," each becoming a node connected to other entities through relationships described in the text.

Whereas nodes represent biological entities, edges, often called relationships, connect nodes and represent biological interactions, associations, or causal links. The semantics of these edges (i.e., what they mean and how they're established) determine the analytical power of the graph. In protein-protein interaction networks, edges typically represent physical binding events. In co-expression networks, they represent statistical correlation. In pathway databases, they may represent enzymatic actions or regulatory influences. Importantly, all of these different types of relationships can exist in a single graph.
The edges described above, and the interactions they represent, are fairly straightforward. Identifying edges becomes much more challenging when connecting entities across different data types. For example, how do we establish meaningful edges between transcriptomic and metabolomic data, or between literature-derived concepts and time-series physiological measurements? This is where the integration techniques I'll discuss later come in. But before we get to that, we need to understand the third building block of BKGs: properties.
Properties enrich both nodes and edges with additional information4. Node properties might include expression values, genomic coordinates, structural features, or ontological classifications. Edge properties might include interaction strengths, statistical confidence scores, experimental methods, or literature citations. These properties transform a simple topological structure into a rich knowledge repository that supports nuanced analysis. In my experience, the judicious use of properties often distinguishes a useful BKG from an overwhelming one. It takes practice to find the "Goldilocks zone": too few properties, and the graph lacks the context needed for meaningful interpretation; too many properties distributed arbitrarily, and the graph becomes cluttered and unwieldy.
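As a quick illustration of how nodes, edges, and properties fit together, here's a minimal sketch using NetworkX; the gene names, property names, and values are hypothetical placeholders:

import networkx as nx

G = nx.MultiDiGraph()

# Nodes carry entity-level properties
G.add_node("GeneA", entity_type="gene", chromosome="7", log2_fold_change=3.2)
G.add_node("GeneB", entity_type="gene", chromosome="12", log2_fold_change=1.1)

# Edges carry relationship-level properties (strength, confidence, provenance, condition)
G.add_edge("GeneA", "GeneB", key="COEXPRESSED_WITH",
           pearson_r=0.87, q_value=0.003, condition="hypoxia", source="study_42")

print(G.nodes["GeneA"])
print(G.get_edge_data("GeneA", "GeneB"))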
CONSTRUCTING BIOLOGICAL KNOWLEDGE GRAPHS
Data Selection and Preparation
Building a biological knowledge graph involves a series of technical decisions and processes that transform diverse data into an integrated, queryable knowledge structure. The first step in BKG construction is identifying the relevant data sources for your biological question. Are you investigating disease mechanisms? Drug responses? Countermeasures for aging? The answer will guide your selection of data types and specific datasets.
Personally, I've found it useful to take a two-pronged approach to data selection, with each BKG I build having two conceptual layers: a core graph layer and an extension layer. The foundation of the BKG is the core graph, which provides the broader biological context necessary for study-specific data interpretation, ensuring that any given analysis can draw upon the full scope of documented biological relationships rather than being limited to only the molecules we happen to measure in a given experiment. As a result, I prefer to take a study-agnostic approach to core graph development, pulling in all of the proteins in UniProt, metabolites in HMDB, pathway and gene ontology term mappings, etcetera.
With this approach, we can then build extension layers upon the core graph layer, incorporating condition-specific experimental data and creating study-specific nodes, relationships, and properties. This layer contains genes, proteins, metabolites, and other measured molecules (collectively termed "entities") that are directly quantified in an experimental condition, as well as pathways/GO terms that are enriched for up- and down-regulated entities identified via differential expression analysis. Additionally, the extension graph layer includes other relationships derived directly from experimental data, such as condition-specific co-expression, co-enrichment, and differential expression relationships.
Incorporating this study-specific data into an extension layer typically involves preprocessing steps familiar to all bioinformatics scientists: quality control, filtering and normalization, and batch correction (and optionally downstream analysis, such as differential expression analysis). The goal is to convert raw measurements into biologically meaningful signals that can be represented as nodes or edge properties in the graph.
For literature data, preparation involves assembling a corpus of relevant publications and applying natural language processing techniques to extract structured knowledge. This might include full-text processing, which extracts information from unstructured and structured text without semantic considerations; abstract mining, which uses semantic background knowledge to extract key information and insights; or leveraging existing literature-mining databases like STRING for protein interactions or DisGeNET for gene-disease associations. For ontological data, preparation involves selecting appropriate controlled vocabularies and relationship frameworks. For example, Gene Ontology provides standardized terms related to biological processes, cellular components, and molecular functions, as well as relationships among them, that can form a semantic backbone for your graph.
In practice, starting with a focused question and a limited set of high-quality data sources produces more immediately useful graphs than attempting to integrate everything at once. My recommendation is to begin with the core data types most relevant to your question, establish a robust integration framework, and then incrementally expand as needed.
Node Creation — Representing Biological Entities
Creating nodes from diverse biological data sources requires both domain-specific knowledge and technical expertise. For example, a graph schema might define five different node types, each with different properties that can be associated with it (or edges connecting it to other nodes).
The process of creating nodes typically involves data parsing, entity recognition, identifier mapping, and property assignment. Each node receives a unique identifier within the graph, while maintaining references to external identifiers in relevant databases. Below I’ll break this down step-by-step:
Step 1 — Data Parsing: In this step, raw data from various sources (like CSV files) is read and broken down into component parts. For example, when processing an RNA-seq dataset, data parsing could involve reading the expression table and recognizing column headers as sample identifiers and row labels as gene identifiers.
Step 2 — Entity Recognition: This step identifies the distinct biological entities that will become nodes in the graph. It involves determining what constitutes a valid entity (gene, protein, metabolite, disease, etcetera), resolving ambiguities when the same term might refer to different entities, or extracting entities from unstructured text using NLP techniques. For example, when processing scientific literature, entity recognition might identify "BRCA1" as a gene, "breast cancer" as a disease, and "DNA repair" as a biological process.
Step 3 — Identifier Mapping: This step ensures consistency across data sources by mapping entities to standardized identifiers. This involves converting between different naming conventions (e.g., gene symbols to Ensembl IDs or protein names to UniProt IDs), resolving synonyms to canonical forms (i.e., identifying and standardizing different words or phrases that refer to the same concept or entity), and handling species-specific identifiers. For example, the human gene "TP53" might be mapped to Ensembl ID ENSG00000141510, Entrez Gene ID 7157, and UniProt ID P04637, ensuring that information about this gene from different sources can be properly integrated.
Step 4 — Property Assignment: This step attaches relevant attributes to each node, enriching it with biological context. Properties might include quantitative measurements (expression levels, abundance), physical characteristics (sequence, structure), classification information (gene biotype, protein family), and metadata (source database, confidence scores). For a gene node, properties might include its chromosomal location, transcript count, and expression values across different experimental conditions.
Step 5 — Unique Identifier Within the Graph: Each node in the graph receives an internal identifier (often auto-generated) that uniquely distinguishes it within the graph database. This identifier allows efficient lookup and traversal operations, remains consistent even if external identifiers change, and enables proper relationship creation between nodes. For instance, in a Neo4j database, each node might receive an auto-generated numeric ID that serves as its primary reference within the graph structure.
Step 6 — References to External Identifiers: While nodes have internal graph identifiers, they maintain links to standard external identifiers that are recognized in the broader biological community. This facilitates data integration with external resources, enables validation against reference databases, and supports export of results in standard formats. For example, a protein node might have an internal graph ID of "n12345" but maintain references to its UniProt ID (P04637), RefSeq accession (NP_000537), and PDB structures (2XWR, 4HJE) in its properties.
This approach ensures that each biological entity is uniquely and consistently represented in the knowledge graph while maintaining connections to the broader ecosystem of biological databases and nomenclatures.
In practice, this often requires custom ETL (Extract, Transform, Load) pipelines tailored to each data source. For example, when integrating RNA-seq data, your pipeline might parse gene expression matrices, map gene identifiers to a standard namespace, calculate expression statistics, create node records with appropriate properties, validate node uniqueness, and load nodes into the graph database. Similar pipelines would process protein data, metabolite data, and other entity types, each with source-specific logic but producing consistently structured node records.
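For example, a stripped-down version of such a pipeline might look like the sketch below. It's pandas-based, the file and column names are hypothetical, and a production pipeline would add validation, logging, and bulk loading:

import pandas as pd

# Hypothetical inputs: an expression matrix (genes x samples) and a symbol-to-Ensembl mapping
expr = pd.read_csv("expression_matrix.csv", index_col=0)            # rows = gene symbols
id_map = pd.read_csv("symbol_to_ensembl.csv", index_col="symbol")   # column: ensembl_id

node_records = []
for symbol, row in expr.iterrows():
    ensembl_id = id_map["ensembl_id"].get(symbol)
    if ensembl_id is None:
        continue  # skip genes we cannot map to the standard namespace
    node_records.append({
        "node_id": f"gene:{ensembl_id}",       # internal, graph-unique identifier
        "label": "Gene",
        "symbol": symbol,
        "ensembl_id": ensembl_id,              # external reference identifier
        "mean_expression": float(row.mean()),  # simple per-gene summary statistic
    })

nodes_df = pd.DataFrame(node_records).drop_duplicates(subset="node_id")
nodes_df.to_csv("gene_nodes.csv", index=False)  # ready for bulk import into the graph database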
Edge Creation — Establishing Biological Relationships
Creating meaningful edges between nodes is perhaps the most challenging aspect of BKG construction. This process transforms a collection of disconnected entities into a navigable knowledge network. For relationships within a single data type, edge creation often relies on established biological knowledge or statistical associations. For example, within gene expression data, edges might represent co-expression relationships, established through correlation analysis, mutual information calculations, or network inference algorithms like WGCNA (weighted gene co-expression network analysis). These edges typically carry properties indicating the strength and direction of association5.
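As a rough sketch of how such co-expression edges might be derived, here's a Pearson-correlation version in Python; the expression matrix and threshold are hypothetical, and WGCNA itself (an R package) does considerably more than this:

import itertools
import numpy as np
import pandas as pd

# Hypothetical expression matrix: rows = genes, columns = samples
expr = pd.read_csv("expression_matrix.csv", index_col=0)

edges = []
for gene_a, gene_b in itertools.combinations(expr.index, 2):
    r = np.corrcoef(expr.loc[gene_a], expr.loc[gene_b])[0, 1]
    if abs(r) >= 0.8:  # arbitrary threshold, for illustration only
        edges.append({
            "source": gene_a,
            "target": gene_b,
            "relation": "COEXPRESSED_WITH",
            "pearson_r": float(r),  # strength and direction of the association
        })

pd.DataFrame(edges).to_csv("coexpression_edges.csv", index=False)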
With protein data, edges might represent physical interactions detected through techniques like protein-protein interaction network analysis, affinity purification mass spectrometry, or predicted through structural modeling. Properties might include detection methods, binding affinities, and cellular contexts.

For relationships across different data types, edge creation becomes more complex and might employ several approaches. For example, statistical integration establishes edges based on quantitative relationships between entities of different types, such as correlating gene expression levels with metabolite concentrations across samples to reveal gene-metabolite relationships. Similarly, relating protein phosphorylation states to downstream transcriptional changes can establish potential causal connections.
The technical implementation typically involves data alignment to ensure comparable samples or conditions, correlation or regression analysis, significance testing with multiple hypothesis correction, filtering based on effect size and statistical thresholds, and creating edge records with appropriate properties (i.e., generating the actual connections between nodes and enriching them with relevant biological information), as sketched below.
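Here's a minimal sketch of that workflow, assuming two hypothetical matrices that share sample columns; the thresholds are arbitrary and chosen purely for illustration:

import itertools
import pandas as pd
from scipy.stats import pearsonr
from statsmodels.stats.multitest import multipletests

# Hypothetical matrices with samples as columns (alignment already performed)
genes = pd.read_csv("gene_expression.csv", index_col=0)
metabolites = pd.read_csv("metabolite_levels.csv", index_col=0)
shared = genes.columns.intersection(metabolites.columns)

results = []
for g, m in itertools.product(genes.index, metabolites.index):
    r, p = pearsonr(genes.loc[g, shared], metabolites.loc[m, shared])
    results.append({"gene": g, "metabolite": m, "r": r, "p": p})

df = pd.DataFrame(results)
# Benjamini-Hochberg correction across all tested gene-metabolite pairs
df["q"] = multipletests(df["p"], method="fdr_bh")[1]

# Keep only strong, significant associations as candidate cross-omics edges
edges = df[(df["q"] < 0.05) & (df["r"].abs() > 0.5)]
edges.to_csv("gene_metabolite_edges.csv", index=False)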
Another approach to cross-data type integration is knowledge-based integration which leverages existing biological knowledge to establish cross-type relationships. For example, if we know that Gene A encodes Protein B, which catalyzes the production of Metabolite C, we can encode these relationships even if they weren't directly measured in our experimental data. This approach often involves mining biological databases (Reactome, KEGG, UniProt) for established relationships, mapping entities to standardized identifiers, validating relationship relevance to the biological context, and creating edge records with provenance information.
A third option is literature-derived integration, which uses text mining to identify relationships described in scientific publications. For example, we can use natural language processing techniques to extract relationships like "Protein X phosphorylates Protein Y" or "Compound Z inhibits Enzyme W." The technical process includes named entity recognition to identify biological entities in text, relationship extraction to identify connections between entities, event extraction for complex multi-entity relationships, normalization to standard identifiers, confidence scoring based on extraction reliability and publication metrics, and creation of edge records with citation information.
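To give a flavor of what relationship extraction looks like in code, here's a deliberately naive, pattern-based sketch; real pipelines use trained NER and relation-extraction models rather than regular expressions, and the sentences here are made up:

import re

# Toy sentences standing in for a literature corpus
sentences = [
    "Protein X phosphorylates Protein Y in response to stress.",
    "Compound Z inhibits Enzyme W at nanomolar concentrations.",
]

# Naive patterns for two relationship types
patterns = {
    "PHOSPHORYLATES": re.compile(r"(\w[\w\s]*?)\s+phosphorylates\s+([\w\s]+?)(?:\s+in|\.|$)"),
    "INHIBITS": re.compile(r"(\w[\w\s]*?)\s+inhibits\s+([\w\s]+?)(?:\s+at|\.|$)"),
}

edges = []
for sentence in sentences:
    for relation, pattern in patterns.items():
        for subj, obj in pattern.findall(sentence):
            edges.append({
                "source": subj.strip(),
                "target": obj.strip(),
                "relation": relation,
                "evidence": sentence,  # keep provenance for later confidence scoring
            })

print(edges)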
Integrating Text and Numerical Data — The Bridge Between Worlds
Integrating text-derived knowledge with numerical data presents unique challenges and opportunities. In this section I'll present three ways of going about this: entity-mediated integration, evidential integration, and context-based integration.
Entity-mediated integration uses biological entities as bridges between literature and experimental data. For instance, if text mining identifies a relationship between Gene A and Disease B, and your transcriptomic data shows differential expression of Gene A in a disease model, the gene serves as the connection point between these information sources. To implement this approach we can extract biomedical concepts and relationships from literature, map these concepts to the same standardized identifiers used for omics data, create nodes for concepts not already represented in the graph, and establish edges between text-derived and experimentally-derived nodes.
Evidential integration treats different data sources as complementary evidence for biological relationships. A relationship suggested by literature might be strengthened by supporting experimental data, or vice versa. This approach requires defining an evidence model that quantifies confidence from different sources, implementing a fusion algorithm that combines evidence scores, and representing confidence as edge properties in the graph.
Finally, context-based integration uses contextual information to establish relationships between text and numerical data. For example, if literature discusses Gene A in the context of cellular process B, and your proteomics data shows altered abundance of proteins involved in process B, these shared contexts establish indirect relationships. Implementing this approach involves extracting contextual information from text (cell types, conditions, processes), mapping experimental conditions to the same contextual framework, and creating context nodes that connect entities from different data types.
Technical Implementation Details for BKGs
Before discussing the analysis of Biological Knowledge Graphs, we’ll examine the technical aspects of implementing them, focusing on practical considerations that affect functionality and performance. First, we need to select a graph database to work with. The choice of graph database technology significantly impacts what you can do with your BKG and how efficiently you can do it. Some notable options, each with distinct advantages, are as follows:
Option 1 — Neo4j: Neo4j provides a property graph model that accommodates richly attributed nodes and edges, making it suitable for biological data with many different properties. Its Cypher query language offers an intuitive syntax for graph patterns, and its ecosystem includes visualization tools and analytics libraries. Often, Neo4j is used for projects requiring rapid prototyping and interactive exploration of complex biological networks.
Option 2 — RDF Triplestores: RDF triplestores represent data as subject-predicate-object triples, aligning with semantic web standards. They excel at implementing ontology-driven schemas and support SPARQL queries. These systems are particularly valuable when interoperability with external knowledge bases is a priority.
Option 3 — HyperGraphDB: HyperGraphDB supports hypergraphs where edges can connect multiple nodes simultaneously—a useful feature for representing complex biological processes like metabolic reactions with multiple substrates and products.
In practice, when choosing a graph database, your choice should consider the complexity of your biological entities and relationships, the scale of your data (some systems handle billions of edges more efficiently), the need for standards compliance and external interoperability, and the development resources available for implementation.
Beyond the database technology, you need an architecture for continuous data integration. Several approaches exist. For example, centralized ETL processes all data sources through a unified pipeline before loading into the graph. This approach ensures consistency but may create bottlenecks for large-scale integration. Federated integration, on the other hand, maintains source data in native storage systems and creates a unified query interface across them. This reduces data duplication but may impact query performance. Finally, incremental updates enable continuous integration of new data without rebuilding the entire graph. This is essential for incorporating new publications or experimental results as they become available. Often, hybrid approaches work best—using centralized ETL for stable reference data while implementing incremental updates for rapidly changing sources like scientific literature.
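As an example of what an incremental update can look like in practice, here's a minimal sketch using the official Neo4j Python driver; the connection details, labels, relationship type, and record values are placeholders, and MERGE is what makes the load idempotent (re-running it never duplicates nodes or edges):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CYPHER = """
MERGE (g:Gene {ensembl_id: $ensembl_id})
  SET g.symbol = $symbol
MERGE (p:Pathway {pathway_id: $pathway_id})
MERGE (g)-[r:PARTICIPATES_IN]->(p)
  SET r.source = $source
"""

new_records = [
    {"ensembl_id": "ENSG00000141510", "symbol": "TP53",
     "pathway_id": "PATHWAY_001",  # placeholder pathway identifier
     "source": "reference_db_2024"},
]

with driver.session() as session:
    for record in new_records:
        session.run(CYPHER, **record)
driver.close()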
ANALYSIS OF BIOLOGICAL KNOWLEDGE GRAPHS
Graph Traversal — Navigating Biological Relationships
Once constructed, a Biological Knowledge Graph enables analyses that can reveal insights difficult to obtain through traditional bioinformatics approaches. Graph traversal—the process of visiting nodes and edges in a systematic manner—forms the foundation of knowledge graph analysis6. Two of the simplest graph traversal methods to start with are breadth-first search and depth-first search, which are depicted below:

Breadth-First Search (BFS) explores all neighbors at the current depth before moving deeper. In biological contexts, BFS efficiently answers questions like: "What proteins interact directly with Protein X?" or "What genes are one regulatory step away from Gene Y?" The BFS algorithm works by starting at a source node (e.g., a gene of interest), exploring all immediate neighbors, moving to the next depth level, and repeating this process until a specified depth is reached or a target is found.
BFS is particularly valuable for finding shortest paths between biological entities, which often reveal the most direct mechanisms connecting them. For instance, to understand how a drug might affect a disease-associated pathway, BFS can find the shortest connection between the drug target and pathway components.
Depth-First Search (DFS), on the other hand, explores as far as possible along each branch before backtracking. This algorithm excels at tracing extended paths through biological networks, such as signal transduction cascades or metabolic pathways. DFS implementation involves starting at a source node, selecting one neighbor and recursively exploring its branches, backtracking when a branch is fully explored, and continuing until all reachable nodes are visited. In practice, DFS helps answer questions like: "What are all possible downstream effects of inhibiting this kinase?" or "Through what routes can this metabolite be synthesized from precursors?"
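Both traversals are one-liners in NetworkX. Here's a small sketch on a toy signaling graph; the node names and edges are invented for illustration:

import networkx as nx

# Toy signaling network (hypothetical)
G = nx.DiGraph([
    ("DrugTarget", "KinaseA"), ("KinaseA", "TF_B"),
    ("TF_B", "GeneC"), ("KinaseA", "PhosphataseD"),
])

# BFS: shortest connection from a drug target to a downstream pathway component
print(nx.shortest_path(G, "DrugTarget", "GeneC"))
# ['DrugTarget', 'KinaseA', 'TF_B', 'GeneC']

# DFS: everything reachable downstream of the kinase, explored branch by branch
print(list(nx.dfs_preorder_nodes(G, "KinaseA")))
# ['KinaseA', 'TF_B', 'GeneC', 'PhosphataseD']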
Next, we have path finding algorithms, like Dijkstra's or A*, which find optimal paths between biological entities based on edge weights. These weights might represent interaction strengths, confidence scores, or other biological metrics. For implementation we define a biologically meaningful edge weight (e.g., binding affinity, statistical confidence), apply the algorithm to find the path minimizing the total weight, and interpret the path in the biological context. These algorithms help identify the most reliable or strongest connections between biological entities, such as the highest-confidence path connecting a genetic variant to a disease phenotype.
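In NetworkX, the same idea looks something like the sketch below, where I've converted hypothetical confidence scores into costs so that Dijkstra prefers high-confidence connections:

import networkx as nx

G = nx.Graph()
# cost = 1 - confidence, so the lowest-cost path is the highest-confidence path
G.add_edge("VariantX", "GeneY", cost=0.1)
G.add_edge("GeneY", "ProteinZ", cost=0.2)
G.add_edge("VariantX", "ProteinZ", cost=0.9)  # direct but low-confidence link
G.add_edge("ProteinZ", "DiseaseD", cost=0.1)

path = nx.dijkstra_path(G, "VariantX", "DiseaseD", weight="cost")
print(path)  # ['VariantX', 'GeneY', 'ProteinZ', 'DiseaseD']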
Finally, we have random walks, which traverse the graph in a stochastic manner, providing a foundation for node embedding techniques and centrality metrics. In biological networks, random walks can identify functional modules and prioritize genes or proteins based on network topology. Random walk implementations might start at seed nodes (e.g., known disease genes), take probabilistic steps through the network, record visit frequencies for each node, and rank nodes by these frequencies to prioritize candidates. This approach underlies algorithms like PageRank and has been adapted for gene prioritization in disease studies (e.g., the Random Walk with Restart method).
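Random Walk with Restart is closely related to personalized PageRank, which NetworkX exposes directly; here's a minimal sketch on a toy interaction network with hypothetical seed genes:

import networkx as nx

# Toy interaction network (hypothetical)
G = nx.Graph([
    ("SEED1", "GeneA"), ("SEED2", "GeneA"),
    ("GeneA", "GeneB"), ("GeneB", "GeneC"), ("SEED2", "GeneD"),
])

# Walks restart at the seed (known disease) genes with probability 1 - alpha
seeds = {"SEED1": 0.5, "SEED2": 0.5}
scores = nx.pagerank(G, alpha=0.85, personalization=seeds)

# Rank non-seed genes by visit probability to prioritize candidates
ranked = sorted((n for n in scores if n not in seeds), key=scores.get, reverse=True)
print(ranked)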
Querying Biological Knowledge Graphs — Extracting Actionable Insights
Effective querying transforms a BKG from a data repository into a discovery platform. Different graph databases support different query languages, each with strengths for particular biological questions.
Cypher (Neo4j) provides an intuitive, pattern-matching syntax. For example, to find all proteins that interact with a kinase and are involved in apoptosis, you may write something like:
MATCH (k:Protein {type: 'kinase'})-[:INTERACTS_WITH]->(p:Protein)-[:INVOLVED_IN]->(a:BiologicalProcess {name: 'apoptosis'})
RETURN k.name, p.name
This pattern-based approach aligns naturally with how biologists think about molecular relationships, giving outputs like:
+---------------+------------------+
| k.name | p.name |
+---------------+------------------+
| CDK1 | BAX |
| PKC-delta | BAD |
| GSK3B | MCL1 |
| ROCK1 | CASP3 |
| ... | ... |
+---------------+------------------+
SPARQL (RDF triplestores) enables precise semantic queries across linked data. A SPARQL query to find drugs that target proteins in a specific pathway might look like:
PREFIX drugbank: <http://bio2rdf.org/drugbank_vocabulary:>
PREFIX go: <http://purl.obolibrary.org/obo/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?drug ?protein ?pathway
WHERE {
  ?drug drugbank:target ?protein .
  ?protein go:participates_in ?pathway .
  ?pathway rdfs:label "insulin signaling pathway" .
}
SPARQL excels at leveraging ontological relationships and standardized vocabularies, giving outputs like:
+------------------+-----------------+------------+
| ?drug            | ?protein        | ?pathway   |
+------------------+-----------------+------------+
| drugbank:DB01283 | uniprot:P06213  | go:0008286 |
| drugbank:DB00030 | uniprot:P01308  | go:0008286 |
| ...              | ...             | ...        |
+------------------+-----------------+------------+
Gremlin (Apache TinkerPop) provides a traversal-based query language that can express complex multi-step paths through the graph. For example, to find all metabolites affected by both a drug and a genetic variant:
g.V().hasLabel("Drug").has("name", "Metformin")
.out("TARGETS").hasLabel("Protein")
.out("CATALYZES").hasLabel("Reaction")
.out("PRODUCES").hasLabel("Metabolite")
.as("drugAffectedMetabolites")
.V().hasLabel("Variant").has("rsid", "rs1801282")
.out("AFFECTS").hasLabel("Gene")
.out("ENCODES").hasLabel("Protein")
.out("CATALYZES").hasLabel("Reaction")
.in("CONSUMES").hasLabel("Metabolite")
.where(P.within("drugAffectedMetabolites"))
.valueMap()
Which may give an output like:
[{"id": "m12345",
"name": "Glucose-6-phosphate",
"hmdb_id": "HMDB0001401",
"molecular_weight": 260.14,
"cellular_location": "cytoplasm",
"pathway": ["Glycolysis", "Pentose phosphate pathway"]
},...etc]
In practice, effective biological querying requires understanding the graph schema (node labels, relationship types, properties), translating biological questions into graph patterns, optimizing queries for performance, particularly for large-scale graphs, and interpreting results in their biological context.
Note: It's helpful to develop a library of template queries for common biological questions, which can be parameterized and reused across projects.
Network Analysis Metrics — Finding Biological Significance in Topology
Graph theory provides powerful analytical tools that can extract biological meaning from network structures. These metrics help identify important entities, functional modules, and system properties. First, we’ll discuss a handful of centrality measures, which quantify the importance of nodes based on their position in the network.
Degree centrality simply counts the number of connections for each node. In biological networks, high-degree nodes (e.g., hubs) often represent important genes or proteins involved in many functions. The p53 tumor suppressor protein, for example, typically shows high degree centrality due to its numerous interactions with DNA, other proteins, and regulatory elements. To calculate degree centrality you can count incoming and outgoing edges for each node, normalize by the maximum possible connections if comparing across networks, and rank nodes by their normalized scores.
Betweenness centrality measures how often a node appears on shortest paths between other nodes. Nodes with high betweenness often represent biological entities that bridge different functional modules or pathways—potential intervention points for modulating biological systems. Calculating betweenness centrality involves computing all shortest paths between node pairs, counting how many paths each node appears in, normalizing by the total number of paths, and ranking nodes by their normalized scores.
Closeness centrality calculates how close a node is to all other nodes in the network. In biological contexts, high closeness often indicates entities that can efficiently influence many others—potential master regulators or signaling hubs. To calculate closeness centrality you compute shortest path lengths from each node to all others, take the reciprocal of the sum of these lengths, and normalize and rank nodes.
Finally, eigenvector centrality measures node importance based on the importance of its neighbors. This recursive definition identifies nodes connected to many other important nodes—often revealing key regulatory entities in biological networks.
In practice, centrality analysis helps prioritize biological entities for further investigation. For example, when analyzing a disease-associated gene network, high-centrality genes, or hubs, that haven't previously been linked to the disease become prime candidates for functional validation.
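All four metrics are available off the shelf in NetworkX; here's a short sketch computing them on a toy hub-and-spoke interaction graph (a real analysis would run this on the BKG, or a subgraph of it):

import networkx as nx

# Toy protein-protein interaction graph (hypothetical)
G = nx.Graph([
    ("HUB", "P1"), ("HUB", "P2"), ("HUB", "P3"),
    ("P3", "P4"), ("P4", "P5"),
])

centralities = {
    "degree": nx.degree_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "closeness": nx.closeness_centrality(G),
    "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
}

# Report the top-ranked node under each metric to prioritize follow-up candidates
for name, scores in centralities.items():
    top = max(scores, key=scores.get)
    print(f"{name:>12}: top node = {top}")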
Community Detection algorithms identify densely connected subgraphs that often correspond to functional modules in biological networks. These communities might represent protein complexes, metabolic pathways, or co-regulated gene sets. Several algorithms exist, including the Louvain method, which optimizes modularity through hierarchical clustering; Markov clustering, which simulates flow through the network; and spectral clustering, which leverages the graph's eigenvalue decomposition.
For biological applications, it's often helpful to apply multiple community detection algorithms, compare the resulting communities for consistency, validate communities against known biological modules, and perform enrichment analysis to identify community functions, as sketched below.
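For instance, recent versions of NetworkX ship a Louvain implementation; here's a sketch of detecting communities on a toy graph with two densely connected groups (the node names are placeholders, and in practice you'd follow up with enrichment analysis on each community):

import networkx as nx
from networkx.algorithms.community import louvain_communities

# Toy graph: two tight modules joined by a single bridge edge (hypothetical genes)
G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("A", "C"),   # module 1
                  ("D", "E"), ("E", "F"), ("D", "F"),   # module 2
                  ("C", "D")])                          # bridge

communities = louvain_communities(G, seed=42)
for i, community in enumerate(communities):
    print(f"community {i}: {sorted(community)}")
# e.g. community 0: ['A', 'B', 'C'], community 1: ['D', 'E', 'F']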
Finally, we have network motifs: recurring patterns of connectivity that appear more frequently than expected by chance. In biological networks, motifs often represent fundamental regulatory circuits with specific functions, such as feedback loops, feed-forward loops, or bistable switches. Motif analysis involves defining patterns of interest (like three-node feedback loops), counting their occurrences in the graph, comparing against random graphs to establish statistical significance, and mapping significant motifs to their biological contexts. These structural patterns often reveal design principles of biological systems, such as the prevalence of negative feedback for homeostasis or positive feedback for bistable decision-making.
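As an example of what motif counting looks like in code, here's a naive feed-forward loop counter (A regulates B, B regulates C, and A also regulates C directly); significance testing against degree-preserving random graphs is omitted for brevity:

import networkx as nx

# Toy regulatory network (hypothetical)
G = nx.DiGraph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])

def count_feed_forward_loops(graph):
    """Count A->B, B->C, A->C patterns (edge signs ignored for simplicity)."""
    count = 0
    for a, b in graph.edges():
        for c in graph.successors(b):
            if c not in (a, b) and graph.has_edge(a, c):
                count += 1
    return count

print(count_feed_forward_loops(G))  # 1, the A->B->C / A->C loop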
CASE STUDIES & APPLICATIONS
Drug Repurposing Through Network Analysis
Drug development costs billions of dollars and takes decades, and the situation is only getting worse, as described by Eroom's Law. Repurposing existing drugs for new indications offers a faster, more economical path to treatments. Knowledge graphs excel at identifying repurposing opportunities by connecting drugs to diseases through unexpected molecular mechanisms.

To use BKGs to identify drug repurposing opportunities, we might construct a BKG integrating drug-target interactions from DrugBank and ChEMBL, protein-protein interactions from STRING and BioGRID, gene-disease associations from OMIM and GWAS catalogs, and pathway information from Reactome and KEGG.
We can then apply a path-finding algorithm to identify indirect connections between approved drugs and diseases without direct treatment indications. This approach has previously been used to repurpose topiramate, an anticonvulsant, for inflammatory bowel disease, as discussed in a paper titled A Literature-Based Knowledge Graph Embedding Method for Identifying Drug Repurposing Opportunities in Rare Diseases.
Multi Omics Integration for Elucidating Disease Mechanisms
Understanding complex diseases requires integrating multiple molecular layers, such as genetic variants from GWAS and whole-genome sequencing, transcriptomic data, proteomic and phosphoproteomic profiles, metabolomic measurements, and relevant literature-derived relationships. After these types of data are compiled and integrated within a BKG, community detection algorithms can be used to identify modules in which genetic variants, differentially expressed genes, altered proteins, and metabolite changes converge on a coherent biological mechanism.
The technical approach entails creating nodes for entities at each omics layer, establishing statistical edges based on correlation analysis, integrating known biological relationships from literature, applying Louvain community detection, and performing functional enrichment on the identified communities. This multi-omics perspective can reveal intervention points that may be missed when analyzing any single data type in isolation.
Literature-Based Discovery Through Semantic Integration
Scientific literature contains vast knowledge that, when properly integrated, can reveal non-obvious connections. For example, in a paper titled Alzheimer’s Disease Knowledge Graph Enhances Knowledge Discovery and Disease Prediction, the investigators constructed a literature-derived knowledge graph for neurodegenerative diseases by processing 800 abstracts related to neurodegeneration, extracting biological entities and relationships using NLP, normalizing entities to standard identifiers and creating a semantic network of interconnected concepts.
Through this approach, they shed light on existing treatments, potential targets, and diagnostic methods for Alzheimer's disease, aiding in drug repurposing and the identification of biomarkers.
Implementing this type of analytical approach requires developing domain-specific NER models for neuroscience concepts, creating relationship extraction rules for neurodegenerative mechanisms, implementing a semantic similarity metric based on graph distance, and ranking compounds by their semantic proximity to disease mechanisms. This case illustrates how BKGs can accelerate hypothesis generation and target identification by connecting information scattered across hundreds of publications.
TECHNICAL AND BIOLOGICAL CHALLENGES
Scalability and Computational Efficiency
Building and utilizing biological knowledge graphs presents several challenges that require thoughtful solutions. As biological data continues to grow exponentially, scalability becomes critical. Several strategies help manage large-scale BKGs, as outlined below:
Distributed graph processing leverages frameworks like Apache Giraph or GraphX to distribute computation across clusters. This enables analysis of graphs with billions of edges—scales now common in comprehensive biological knowledge bases.
Graph Compression techniques reduce memory requirements without losing significant information. Bit-packed adjacency lists, quasi-succinct indices, and other compression schemes can reduce storage needs by orders of magnitude.
Approximate Query Processing sacrifices perfect accuracy for speed when appropriate. For large-scale biological networks, algorithms like approximate nearest neighbors or probabilistic counting often provide practical results much faster than exact methods.
Strategic Subgraphing extracts relevant portions of the full graph for specific analyses. Rather than querying the entire knowledge graph for every question, extract the subgraph containing nodes and edges relevant to your biological context.
Data Quality and Uncertainty Representation
Biological data comes with inherent uncertainty and quality variations. A robust BKG must represent this uncertainty rather than treating all relationships as equally reliable. Some methods for accomplishing this are:
Edge Weighting, which encodes confidence or strength in relationship edges. These weights might represent statistical significance (p-values, q-values), effect sizes (correlation coefficients, fold changes), evidence strength (number of supporting experiments), or literature support (citation counts, journal impact).
Provenance Tracking, which maintains records of data sources for each node and edge. This allows filtering based on source reliability and explains why particular relationships appear in the graph.
Contradictory Relationship Handling, which addresses cases where different sources suggest conflicting relationships. Rather than arbitrarily selecting one, represent both with their supporting evidence, allowing downstream analysis to make informed decisions.
Temporal Versioning, which captures how biological knowledge changes over time. This is particularly important for literature-derived relationships, where scientific consensus evolves as new evidence emerges.
Biological Interpretation and Validation
A BKG is only valuable if its insights translate to biological understanding and experimental validation. To ensure the validity of graph-derived insights, it's important to run experiments using testable hypotheses derived from the BKG; the greater the extent to which computational predictions are borne out by experimental results, the more confident we can be in the BKG's validity.
Another method of ensuring validity is biological plausibility filtering, which is a method for applying domain knowledge to assess whether computationally identified relationships make biological sense. This might involve checking cellular co-localization of interacting proteins, verifying compatible biological processes, or ensuring temporal consistency in developmental contexts.
Finally, iterative refinement approaches can incorporate experimental results back into the graph, creating a virtuous cycle of prediction, validation, and knowledge enhancement.
THE ONGOING EVOLUTION OF BKGs
As technology advances and biological understanding deepens, several emerging approaches promise to enhance the utility of biological knowledge graphs.
Temporal Knowledge Graphs (TKGs) explicitly model how biological relationships change over time—critical for understanding developmental processes, disease progression, and treatment responses. These graphs add a temporal dimension to edges, representing when relationships are active or how they change in strength over time. Implementing TKGs entails time-stamping edges with validity periods, using time series as edge properties, creating snapshot graphs for discrete time points, and developing temporal query languages. This approach is of particular interest to me, as I frequently work with time-series physiological data.
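One lightweight way to prototype this is to attach validity intervals as edge properties and filter on them at query time; a minimal sketch, with made-up dates and relationships:

from datetime import date
import networkx as nx

G = nx.MultiDiGraph()
# Each edge carries the interval during which the relationship was observed to hold
G.add_edge("DrugA", "SymptomScore", key="IMPROVES",
           valid_from=date(2024, 1, 1), valid_to=date(2024, 3, 1), strength=0.6)
G.add_edge("DrugA", "SymptomScore", key="NO_EFFECT",
           valid_from=date(2024, 3, 1), valid_to=date(2024, 6, 1), strength=0.0)

def edges_active_on(graph, day):
    """Return only the relationships whose validity interval covers the given day."""
    return [
        (u, v, k, data) for u, v, k, data in graph.edges(keys=True, data=True)
        if data["valid_from"] <= day < data["valid_to"]
    ]

print(edges_active_on(G, date(2024, 2, 1)))  # only the IMPROVES edge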
Graph Neural Networks (GNNs) combine the representational power of graphs with the predictive capabilities of deep learning. These models learn from graph structure and node/edge features to predict new relationships or node properties. Applications include predicting protein functions from interaction patterns, identifying disease-gene associations, forecasting drug side effects from molecular structure and target interactions, and suggesting potential drug combinations based on network effects.
There is also a push to integrate explainable AI into knowledge graphs, using machine learning approaches to not only generate predictions, but also mechanistic explanations. By grounding ML predictions in a knowledge graph, these approaches produce interpretable results that biologists can evaluate and extend.
Conclusion — From Data Integration to Biological Understanding
Biological knowledge graphs represent a fundamental shift in how we approach biological data integration—from merging tables to connecting concepts in their full biological context. By preserving the inherent relationships between biological entities, biological knowledge graphs enable navigation across molecular scales, data types, and knowledge domains.
For the bioinformatics scientist working to make sense of diverse datasets, knowledge graphs offer both a conceptual framework and a practical technology. They transform the challenge of data overload into an opportunity for discovery by revealing connections that remain hidden when analyzing datasets in isolation.
As you start building your own biological knowledge graphs, remember that the goal is not the graph itself, but the biological insights it enables. Start with clear questions, choose appropriate data sources, implement rigorous integration methods, and apply analytical approaches that align with your biological objectives.
The most powerful aspect of the knowledge graph paradigm is not its technical sophistication but its alignment with how biology actually works—as a complex, interconnected system where context determines function and where relationships are as important as the entities themselves. By representing biology in this natural form, we create not just a database but a computational model that captures the essence of living systems themselves.
Did you enjoy this piece? If so, you may enjoy the following issues of Sequence & Destroy:
The relationality of data is a key focus of my previous article, Connecting The Dots: A Guide to Multi Omics Data, which is all about understanding the spatial and temporal relationships between genomic, epigenomic, transcriptomic, proteomic, metabolomic, and phenotypic data and how they are interconnected.
Common centrality metrics include node degree, degree centrality, betweenness centrality, and eigenvector centrality. If you want a primer on static and dynamic network models, you can check out my previously published article The Biology Network. Additionally, I've written about how different centrality metrics can be used for drug target discovery in Leveraging Network Analysis For Drug Target Discovery.
In practice this could look like gene-[UP/DOWN_REGULATED]->cell type, which is a common format used in BKGs for representing relationships between biological entities.
Both nodes and edges can be augmented with properties. Imagine we combine data across multiple experiments in a single BKG: each node (gene, protein, etc.) can have properties associated with it, such as study IDs or additional contextual metadata. Additionally, imagine we have two protein nodes connected by a "coexpressed with" edge; in this case the edge may have properties including the correlation of the coexpression relationship, or its statistical significance.
It's important to note that relationships are often condition-specific, and as a result edges may need to include condition-specific properties. For example, we may find that gene A is co-expressed with gene B in our dataset, and as a result we can represent this relationship as GeneA-[COEXPRESSED_WITH]->GeneB in our BKG. However, this relationship may be specific to our experimental condition. So as not to misrepresent this relationship, or create spurious paths in our graph when adding other experiments' data to it, we should always enrich relationships with the appropriate metadata.
Graph traversals can become much more complex, and nuanced, than what’s described in this section. For example, part of my PhD research involves creating new traversal methods to predict clinical symptomology following novel pathogen exposure, which utilizes concepts from information theory, network propagation algorithms, and other methods.