Cosegregation

Nuclear profile of a genome. (A) Nucleus, (B) nuclear profile, (C) loci (green dots) where parts of the target gene are found.

Cosegregation, in genealogy, refers to the tendency of two or more genes located close together on the same chromosome to be inherited together during cell division. Due to their physical proximity, these genes are considered genetically linked.^[1]

In genetics, the term may also refer to the estimated probability of interaction between multiple loci or specific regions within a target gene. This probability is assessed using data derived from nuclear profiles (NPs), which are thin slices taken from a cell nucleus. Within each NP, the presence or absence of particular loci is evaluated.^[2]

These interaction probabilities—referred to as cosegregation values—are used in mathematical models such as SLICE^[3] and normalized linkage disequilibrium. These models contribute to the generation of 3D genome architecture maps as part of genome architecture mapping (GAM) techniques. The resulting 3D renderings provide insights into genomic density and the radial positioning of loci within the nucleus.

Articles using co-segregation methodologies
Title	Description
Complex multi-enhancer contacts captured by Genome Architecture Mapping (GAM).^[3]	This study used co-segregation between pairs of loci to quantify normalized linkage disequilibrium.
A simple method for cosegregation analysis to evaluate the pathogenicity of unclassified variants; BRCA1 and BRCA2 as an example.^[4]	Combining co-segregation analysis with a multifactorial approach gave highly conclusive results when classifying unclassified variants.
Considerations in assessing germline variant pathogenicity using co-segregation analysis.^[5]	This article found that Bayes factor co-segregation analysis, combined with a strong penetrance model, yields higher accuracy than meiosis counting.
A Comparison of Cosegregation Analysis Methods for the Clinical Setting^[6]	Compares the utility of full-likelihood Bayes factors, cosegregation likelihood ratios, and meiosis counting for evaluating the pathogenicity of genetic variants.
Dissecting the co-segregation probability from genome architecture mapping^[7]	Assesses the utility of cosegregation in Genome Architecture Mapping, finding normalized probability calculations to be a reasonable representation of inter-locus distance.

Some of the earliest known studies to use cosegregation in genealogy date back to the early 1980s. Around this time, scientists were conducting experiments on vegetative organisms to determine whether unique sequences of chloroplast DNA exist. The experiment tracked the chloroplast gene in each generation by clustering the genes in nucleoids to reduce the number of segregating units. This study was done in the Zoology Department at Duke University,^[8] where Karen P. VanWinkle-Swift used pedigree diagrams to show how the traits and sequences were passed down from parent to child.

In genetics, cosegregation is also used within genome architecture mapping (GAM) to identify the compaction and adjacency of genomic windows. In a 2017 study, cosegregation was used as part of GAM to understand the gene-expression-specific contacts that organize the genome in mammalian nuclei.^[3] The study produced complex 3D structures that displayed interactions within certain regions of chromatin contact and showed that GAM is a useful tool that expands the genome biologist's ability to finely dissect 3D chromatin structures in cell types and valuable human samples. A study in 2021 "discovered extensive 'melting' of long genes when they are highly expressed and/or have high chromatin accessibility. The contacts most specific of neuron subtypes contain genes associated with specialized processes, such as addiction and synaptic plasticity, which harbour putative binding sites for neuronal transcription factors within accessible chromatin regions."^[9] Both of these studies used mice as models due to their anatomical, physiological, and genetic similarity to humans.^[10]

Usage

Overview

In genetics, cosegregation analysis is used to examine how multiple genetic factors are inherited together and how their interactions contribute to biological traits or conditions. Cosegregation is particularly useful in cases where a single gene does not completely explain the presence of a specific trait. By detecting patterns where genetic variants occur together, researchers can identify relationships between certain genes and analyze how combinations of factors influence outcomes. For example, when a disorder is associated with a particular gene but is not consistently observed in those who carry that gene, cosegregation analysis can identify additional interacting genes that may contribute to the condition.

Cancer Research

Cosegregation is actively used in medical fields such as cancer research.^[11] Many forms of cancer are not caused by a single mutation or gene, but rather by a combination of multiple changes that disrupt normal processes. Using cosegregation analysis, researchers can highlight the strongest connections between genes in cases where cancer develops. This approach helps to reveal complex relationships and interactions between genes, supporting further research into the diagnosis and treatment of specific cancer types.

Computational Biology

Cosegregation analysis is used widely in computational biology to study relationships between regions of a genome and to quantify patterns of genomic association. In this area of study, genomic loci or windows are analyzed to observe how frequently they are detected together in a sample. Detection frequencies are used to calculate cosegregation values which can then be normalized to show the strength of each connection.^[12]

Examples of using cosegregation in genetics

An example of an application using cosegregation is finding the normalized linkage disequilibrium (NLD) between two loci. Given a 2D dataset (row = genomic window, column = nuclear profile (NP)), a "1" is recorded if a window is detected in an NP and a "0" otherwise. From this data, the NLD can be found using the base linkage disequilibrium and its theoretical maximum ($d_{max}$). The numbers of NPs in which loci (genomic windows) $A$ and $B$ are present are used to find the detection frequencies, $f_{A}$ and $f_{B}$, and the co-segregation, $f_{AB}$. After the NLD is found between each pair of loci, it is placed into another dataset to be visualized and analyzed to determine how interconnected a locus is. This example was executed using Python for computing the NLD and for visualizing the data and results. Using the NLD, further analysis can place the windows into "communities". To showcase this, the accompanying graph shows the community of one of the windows with the highest centrality, where centrality is the average of the window's NLDs.
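The centrality step described above can be sketched in Python. This is a minimal illustration, assuming centrality is the mean NLD of a window with every other window (excluding the diagonal is an assumption made here); the matrix values and the function name `average_nld_centrality` are hypothetical.

```python
import numpy as np

# Hypothetical symmetric NLD matrix for four genomic windows (illustrative values only).
nld = np.array([
    [1.0, 0.8, 0.7, -0.2],
    [0.8, 1.0, 0.4, 0.0],
    [0.7, 0.4, 1.0, -0.4],
    [-0.2, 0.0, -0.4, 1.0],
])


def average_nld_centrality(nld):
    """Mean NLD of each window with every other window (diagonal excluded)."""
    n = nld.shape[0]
    off_diagonal_sum = nld.sum(axis=1) - np.diag(nld)
    return off_diagonal_sum / (n - 1)


centrality = average_nld_centrality(nld)
most_central = int(np.argmax(centrality))  # index of the most interconnected window
print(centrality, most_central)
```

With these illustrative values, window 0 has the highest average NLD and would be the natural starting point for inspecting a community.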

An alternative to normalized linkage disequilibrium is normalized pointwise mutual information (NPMI). NPMI measures how closely two loci are associated by taking the log of their joint cosegregation probability, $f_{AB}$, divided by the product of their individual detection probabilities, $f_{A}f_{B}$. This log is then divided by the negative log of the joint probability, $-\log(f_{AB})$, to normalize the result.

Both NLD and NPMI range between -1 and 1 and reflect how the joint cosegregation probability deviates from what would be expected if the two loci were independent. However, they differ in scope as NLD measures linear relationships, while NPMI can capture more complex, non-linear relationships between the loci.^[13]

Formulas for the example above
Calculations	Formulas^[3]
Detection Frequency	$\left({\frac {A}{N}}\right)$ or $f_{A}$
Linkage	$\left({\frac {AB}{N}}\right)-\left(\left({\frac {A}{N}}\right)\left({\frac {B}{N}}\right)\right)$ or $f_{AB}-(f_{A}f_{B})$
Linkage maximum (dmax)	$d_{max}={\begin{cases}\min(f_{A}f_{B},(1-f_{A})(1-f_{B})),&{\text{when linkage}}<0\\\min(f_{B}(1-f_{A}),f_{A}(1-f_{B})),&{\text{when linkage}}\geq 0.\end{cases}}$
Normalized Linkage Disequilibrium (NLD)	${\text{NLD}}={\frac {\text{linkage}}{d_{max}}}$
Normalized Pointwise Mutual Information (NPMI)	${\text{NPMI}}=-{\frac {\log \left({\frac {f_{AB}}{f_{A}f_{B}}}\right)}{\log(f_{AB})}}$
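The formulas in the table can be checked with a short worked example. The sketch below is one possible transcription into Python, not a reference implementation; the counts are hypothetical, and the handling of the degenerate cases (a zero maximum, or a joint frequency of 0 or 1) is an assumption that the table does not specify.

```python
import math


def nld(f_a, f_b, f_ab):
    """Normalized linkage disequilibrium from the table's formulas."""
    linkage = f_ab - f_a * f_b
    if linkage < 0:
        d_max = min(f_a * f_b, (1 - f_a) * (1 - f_b))
    else:
        d_max = min(f_b * (1 - f_a), f_a * (1 - f_b))
    # Assumption: return 0 when d_max is 0 (a window seen in no NP or in every NP).
    return linkage / d_max if d_max > 0 else 0.0


def npmi(f_a, f_b, f_ab):
    """Normalized pointwise mutual information from the table's formula."""
    if f_ab == 0:
        return -1.0  # limiting value when the windows never co-occur
    if f_ab == 1:
        return 0.0  # assumption: the ratio is 0/0 here, treated as independence
    return -math.log(f_ab / (f_a * f_b)) / math.log(f_ab)


# Hypothetical counts: N = 10 NPs, window A in 5, window B in 4, both in 3.
n, a, b, ab = 10, 5, 4, 3
f_a, f_b, f_ab = a / n, b / n, ab / n
print(nld(f_a, f_b, f_ab), npmi(f_a, f_b, f_ab))
```

For these counts the linkage is 0.3 - 0.2 = 0.1 and the maximum is 0.2, so the NLD is about 0.5, while the NPMI is about 0.34. Both are positive, indicating that the windows co-occur more often than independence would predict.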

Formula

Pseudo-code showcasing the implementation of co-segregation in data science.

Formula for finding co-segregation given a GAM table showing whether a locus is present in a slice of a genomic region
Formula^[3]	Variables
$\left({\frac {AB}{N}}\right)$ or $f_{AB}$	$AB$ is the number of nuclear profiles (NPs) in which both genomic windows $A$ and $B$ are detected, $N$ is the total number of NPs, and $f_{AB}$ is the co-segregation frequency of $A$ and $B$.

This formula is straightforward to program, as seen in the pseudo-code in the figure to the right. The code was written to satisfy the example described above.
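A minimal Python sketch of this formula follows, assuming the GAM table is a binary array with one row per genomic window and one column per NP. The table values and the function name `cosegregation_matrix` are hypothetical.

```python
import numpy as np


def cosegregation_matrix(gam_table):
    """Return f_AB for every pair of windows (rows = windows, columns = NPs)."""
    table = np.asarray(gam_table, dtype=int)
    n_profiles = table.shape[1]
    # Entry (A, B) of table @ table.T counts the NPs that contain both windows.
    return (table @ table.T) / n_profiles


# Hypothetical GAM table: 3 windows observed across 5 nuclear profiles.
gam_table = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 1, 0, 0],
]
coseg = cosegregation_matrix(gam_table)
print(coseg)
```

The diagonal of the result holds each window's detection frequency, since a window always co-occurs with itself.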

Advantages

Given a large dataset of nuclear profiles, cosegregation is highly scalable due to its relatively simple mathematical formulation. Larger datasets also yield more reliable frequency estimates. As depicted in the figure below, adding data increases the computation time only linearly.

Cosegregation analysis can also incorporate multiple loci of interest when determining the interaction probability. Since each additional locus introduces only a single additional computation, the method exhibits linear time complexity. The picture below shows how the number of loci affects the detection frequency equation.
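The linear cost per locus can be illustrated with a short sketch. It assumes the joint cosegregation of several loci is the fraction of NPs containing all of them; the table and the function name `joint_cosegregation` are hypothetical.

```python
import numpy as np


def joint_cosegregation(gam_table, windows):
    """Fraction of NPs in which every listed window is detected."""
    table = np.asarray(gam_table, dtype=bool)
    present = np.ones(table.shape[1], dtype=bool)
    for w in windows:  # one element-wise AND per additional locus
        present &= table[w]
    return float(present.mean())


# Hypothetical GAM table: 3 windows observed across 5 nuclear profiles.
gam_table = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 1, 0, 0],
]
print(joint_cosegregation(gam_table, [0, 1]))     # two loci
print(joint_cosegregation(gam_table, [0, 1, 2]))  # three loci, one extra AND
```

Each extra locus adds a single pass over the NPs, so the work grows linearly in both the number of loci and the number of profiles.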

Cosegregation analysis is also valuable in computational biology because it enables genomic data to be represented as matrices and networks. These representations allow for the application of graph-based methods, such as community detection and centrality analysis, to identify clusters of interacting genomic regions and highly connected loci.
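As an illustration of these graph-based methods, the sketch below applies NetworkX to a small hypothetical adjacency matrix. Greedy modularity maximization and degree centrality are used here only as examples of community detection and centrality analysis; the source does not specify which algorithms are used in practice.

```python
import networkx as nx
import numpy as np
from networkx.algorithms import community

# Hypothetical adjacency matrix: two triangles of windows joined by one edge.
adjacency = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])

G = nx.from_numpy_array(adjacency)

# Community detection: groups of windows that are densely connected.
communities = [sorted(c) for c in community.greedy_modularity_communities(G)]

# Centrality analysis: windows connected to the largest share of other windows.
centrality = nx.degree_centrality(G)
hubs = sorted(G.nodes, key=centrality.get, reverse=True)[:2]
print(communities, hubs)
```

On this toy network the two triangles form separate communities, and the two windows that bridge them have the highest degree centrality.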

In addition, cosegregation values can be visualized using heatmaps and network diagrams, which improve the interpretability of complex genomic interaction patterns. These visualizations help reveal structural features such as chromatin domains and interaction hubs.

The resulting numerical values can be used to infer properties such as radial positioning, chromatin compaction, and the strength of interactions between genomic regions.^[14]

Limitations

Effective cosegregation analysis depends on the quality and size of the dataset. Small inaccuracies in detection frequency can be amplified, leading to misleading interaction signals. As a result, large and well-controlled datasets are required to ensure reliable results. Cosegregation identifies statistical relationships between genomic loci but does not establish causation. For example, locus cosegregation can identify genes that frequently co-occur, but these relationships may represent correlated, anti-correlated, or independent interactions. Therefore, additional analytical methods, such as normalized linkage disequilibrium, are often required to validate and interpret these relationships.

For example, consider a hypothetical dataset involving several genes associated with cancer. Here we examine a suspect gene and three other genes that are suspected to be involved in the process. The chart shows a hypothetical dataset of 10 people, their cancer status, and whether they possess each of the four genes of interest. Looking at the chart, there is a clear connection between the suspect gene and Gene A. There is also a less obvious interaction between the suspect gene and Gene C that only takes place when Gene B is absent. Cosegregation could plausibly have a hard time detecting that relationship. Gene B is commonly present with Gene A, and that combination does result in cancer. In a real dataset with hundreds or even thousands of genes being examined, one could erroneously conclude that Gene B contributes to the cancer when, in reality, it does not and can actually prevent it. This illustrates how indirect or conditional relationships between variables may be difficult to detect using cosegregation alone.

Additionally, cosegregation analysis is sensitive to threshold selection when constructing interaction networks. Different cutoff values used to define significant interactions can substantially alter the structure of the resulting network, affecting downstream analyses such as community detection and hub identification.
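The effect of the cutoff can be seen on a toy example. The sketch below builds a network from a hypothetical NLD matrix at several cutoffs and reports how many edges and connected components result; the values, the cutoffs, and the function name `network_at_cutoff` are illustrative assumptions.

```python
import networkx as nx
import numpy as np

# Hypothetical symmetric NLD matrix for four genomic windows.
scores = np.array([
    [1.0, 0.8, 0.7, -0.2],
    [0.8, 1.0, 0.4, 0.0],
    [0.7, 0.4, 1.0, -0.4],
    [-0.2, 0.0, -0.4, 1.0],
])


def network_at_cutoff(scores, cutoff):
    """Link windows i and j only when their score exceeds the cutoff."""
    n = scores.shape[0]
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if scores[i, j] > cutoff:
                G.add_edge(i, j, weight=float(scores[i, j]))
    return G


for cutoff in (0.3, 0.5, 0.75):
    G = network_at_cutoff(scores, cutoff)
    print(cutoff, G.number_of_edges(), nx.number_connected_components(G))
```

Raising the cutoff from 0.3 to 0.75 removes two of the three edges and splits the network into more components, which would change any communities or hubs derived from it.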

Another limitation is that many mapping techniques capture both meaningful and random genomic contacts. Regions that are closer in linear genomic distance are more likely to appear correlated due to random interactions, which can inflate cosegregation scores. Techniques such as Genome Architecture Mapping (GAM) help mitigate some of these limitations by enabling the estimation of expected interaction frequencies and reducing noise from random contacts. GAM also avoids biases introduced by ligation-based methods and may require fewer samples compared to chromosome conformation capture approaches.^[15]

Visualizations

Matrices

A matrix is a rectangular array of numbers (entries) that can be added, subtracted, multiplied, and divided using standard arithmetic operations. In the case of co-segregation, graph theory is used to see whether a variable shares an edge or vertex with another variable in a network of nodes. Graph theory is the mathematical study of pairwise relations between objects, represented as nodes (vertices) that are connected to other nodes by edges.

The image above depicts the conversion from a cosegregation matrix to an adjacency matrix. This is one use of a matrix in genome architecture mapping, where scientists use cryosectioning to find colocalization between DNA regions, genomes, and/or alleles. In that example, cosegregation describes how the data are linked in terms of the distance between specific windows in a genome. The values in the cosegregation matrix were found using the formula above. Comparing windows $A$ and $B$, the formula finds the intersection of nuclear profiles between the respective windows. The genomic windows are the nodes, and the adjacency matrix is the matrix depiction of the edges connecting each node.
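One way to sketch this conversion in Python is to threshold the cosegregation matrix. The cutoff of 0.3, the matrix values, and the function name `to_adjacency` are assumptions for illustration; the source does not state which cutoff rule is used.

```python
import numpy as np


def to_adjacency(coseg, cutoff):
    """Binary adjacency matrix: 1 where cosegregation exceeds the cutoff."""
    coseg = np.asarray(coseg, dtype=float)
    adjacency = (coseg > cutoff).astype(int)
    np.fill_diagonal(adjacency, 0)  # a window is not linked to itself
    return adjacency


# Hypothetical cosegregation matrix for three genomic windows.
coseg = [
    [0.6, 0.4, 0.0],
    [0.4, 0.6, 0.1],
    [0.0, 0.1, 0.2],
]
adjacency = to_adjacency(coseg, cutoff=0.3)
print(adjacency)
```

Here only windows 0 and 1 cosegregate often enough to be joined by an edge.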

Heat maps

A heat map is a visual representation of an $m \times n$ matrix that can show different phenomena on a two-dimensional scale. Heat maps use a range of color intensities based on the values and scale of the data. In code, heat maps can be created with Python libraries such as plotly.express, matplotlib, and seaborn (seaborn.heatmap). To interpret co-segregation, heat maps are used to visualize a matrix of binary values, 1 or 0, which indicate the commonalities between two or more variables: does variable x match variable y, yes or no, 1 or 0.

"The primary benefit of using heat maps is that they make otherwise dull or impenetrable data understandable. Many people understand heat maps intuitively, without even needing to be told that those warmer colors indicate a denser focus of interactions."^[16]

The limitations section shows two heat maps (also placed below for easy viewing) depicting the difference between normalized and non-normalized data. Comparing the two helps the researcher identify patterns based on the intensity of the color gradients and the clustering of data points. Co-segregation results, as seen above, can take different forms, and visualizing them as heat maps can, like matrices, help researchers understand which genomic windows are connected.

The following heat maps represent this 1 or 0 value color range; it is important to use a diverging color set (as seen in these examples) that makes any distinctions in the data easy to both visualize and interpret scientifically/statistically.^[17]

In $n \times n$ heat maps, it is common to see a strong diagonal line (1 or 0), since any element will match when compared to itself on the opposite axis.

The heat map below is a different representation of the data which uses the normalized linkage table instead of the resulting adjacency matrix. This visualization gives more variation (from -1 to 1 instead of only 0 or 1) and better shows the advantages of using a heat map.
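A heat map of this kind can be sketched with matplotlib. The example below uses a hypothetical NLD matrix and the built-in diverging colormap `coolwarm`, with the color scale fixed to the range -1 to 1; the data and styling choices are illustrative assumptions, not the settings used for the article's figures.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical symmetric NLD matrix for four genomic windows.
nld = np.array([
    [1.0, 0.8, 0.7, -0.2],
    [0.8, 1.0, 0.4, 0.0],
    [0.7, 0.4, 1.0, -0.4],
    [-0.2, 0.0, -0.4, 1.0],
])

fig, ax = plt.subplots(figsize=(4, 4))
# A diverging colormap centered on 0 separates positive from negative linkage.
image = ax.imshow(nld, cmap="coolwarm", vmin=-1, vmax=1)
fig.colorbar(image, ax=ax, label="NLD")
ax.set_xlabel("Genomic window")
ax.set_ylabel("Genomic window")
ax.set_title("Normalized linkage (hypothetical data)")
```

To view the result, call `fig.savefig(...)` with a file name of your choice, or remove the backend line and call `plt.show()` in an interactive session.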

One limitation of heat maps is that some software does not allow specific points on the graph to be located, especially if there are many variables. Libraries such as plotly.express can create interactive heat maps where the user can hover over a point on the graph and read its exact value. Another limitation is that heat maps do not represent real-time data. Since heat maps work by aggregating data over time, they do not show recent changes in behavior compared to the more dominant patterns already present.^[16] To visualize a dataset more dynamically, one could adapt an implementation to use network diagrams (see next).

Network diagrams

A network diagram is a visual representation of a network, which consists of distinct nodes and edges, or the interactions between these nodes.^[18] In genetics, network diagrams can be created using co-segregation adjacency matrices.

To convert an adjacency matrix to a network diagram, one must translate the matrix elements into visual nodes and edges, where non-zero values indicate connections between nodes, thereby creating a graphical representation of the genetic interactions. Below is an image of a network diagram created using the NetworkX library in Python.

NetworkX in Python

To create network diagrams programmatically, people often use the NetworkX library in Python. NetworkX is a library that provides built-in classes and functions for creating network graphs. These functions will help you create graph objects, add nodes and edges, and draw the network. For visualizing graphs, Matplotlib is typically recommended, as it works well with NetworkX.

Example:

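    import matplotlib.pyplot as plt
    import networkx as nx

    # Hypothetical adjacency matrix: table[i][j] > 0 links windows i and j
    table = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]

    # Graph object
    G = nx.Graph()

    # Add nodes (one per genomic window)
    for i in range(len(table)):
        G.add_node(i)

    # Add edges for each pair (i, j) with a positive entry
    for i in range(len(table)):
        for j in range(i + 1, len(table)):
            value = table[i][j]
            if value > 0:
                G.add_edge(i, j, weight=value)

    # Draw/plot network
    plt.figure(figsize=(6, 6))
    pos = nx.spring_layout(G)  # or layout of choosing
    nx.draw(G, pos, with_labels=True, node_color='lightblue',
            node_size=800, edge_color='gray',
            font_weight='bold')
    plt.title("Network Graph from Adjacency Matrix")
    plt.show()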

^[19]

This graph represents the interactions between nodes, where each node corresponds to a genomic window and each edge indicates a cosegregation relationship. NetworkX also supports edge and node attributes. For instance, nx.draw accepts options such as with_labels and node_color that change the appearance of the drawing, and NetworkX supports multiple layout types, such as random_layout, spring_layout, and shell_layout, allowing you to visualize your graph better and obtain more information.^[19]
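Edge and node attributes can be sketched as follows. The example builds the graph directly from a weighted matrix with `nx.from_numpy_array`, which stores each non-zero entry as a `weight` attribute, and then attaches a label to every node. The matrix values and the `window_N` labels are hypothetical.

```python
import networkx as nx
import numpy as np

# Hypothetical weighted adjacency matrix (e.g. cosegregation values above a cutoff).
weights = np.array([
    [0.0, 0.8, 0.0],
    [0.8, 0.0, 0.4],
    [0.0, 0.4, 0.0],
])

# Non-zero entries become edges carrying a "weight" attribute.
G = nx.from_numpy_array(weights)

# Node attributes: attach a readable label to each genomic window.
labels = {node: f"window_{node}" for node in G.nodes}
nx.set_node_attributes(G, labels, name="label")

edge_weights = nx.get_edge_attributes(G, "weight")
print(edge_weights)
print(G.nodes[2]["label"])
```

The stored weights and labels can then be passed to drawing functions, for example to scale edge widths or label nodes.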

Graph Layouts

NetworkX provides multiple layout types when visualizing your network graphs. These layouts determine where each node and edge will be positioned to reveal different patterns or relationships among the nodes.^[20]

Spring Layout

The spring layout is the default algorithm when visualizing network graphs. This layout positions nodes using the Fruchterman-Reingold force-directed algorithm, which treats edges as springs that hold connected nodes close together, while treating nodes as repelling objects. The simulation continues until the node positions reach a state of equilibrium.^[21] This produces organized diagrams that reveal clusters.^[22]
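Because the algorithm starts from random positions, repeated runs can give different pictures of the same graph. The sketch below, on a hypothetical six-node graph, passes a fixed `seed` to `nx.spring_layout` so that the layout is reproducible; the seed value itself is arbitrary.

```python
import networkx as nx
import numpy as np

# Hypothetical network: two triangles of windows joined by one edge.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)])

# The same seed gives the same starting positions and therefore the same layout.
pos_a = nx.spring_layout(G, seed=7)
pos_b = nx.spring_layout(G, seed=7)

same = all(np.allclose(pos_a[node], pos_b[node]) for node in G.nodes)
print(same)
```

The returned dictionary maps each node to its 2D coordinates and can be passed to `nx.draw` as the `pos` argument.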

Shell Layout

The shell layout positions nodes in concentric circles,^[23] which in cosegregation helps visualize nodes organized by functional categories, levels, or chromosomal positions. Nodes are arranged on one or more concentric circles (shells), with each shell specified as a list of nodes. Nodes in the same shell are the same distance from the center. Edges are drawn between connected nodes regardless of which shells they lie in.
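A minimal sketch of a two-shell layout follows. The grouping of windows into an inner and an outer shell is hypothetical; in practice the shells might correspond to chromosomes or functional categories.

```python
import networkx as nx
import numpy as np

# Hypothetical network of six genomic windows.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)])

# Each shell is given as a list of nodes, innermost shell first.
shells = [[0, 1, 2], [3, 4, 5]]
pos = nx.shell_layout(G, nlist=shells)

# Distance of every node from the center (the origin by default).
radii = {node: float(np.linalg.norm(pos[node])) for node in G.nodes}
print(radii)
```

Nodes listed in the same shell end up at the same radius, and the second shell lies outside the first.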

Circular Layout

The circular layout arranges nodes evenly in a circle, with edges drawn between them based on the graph’s connectivity.^[24] The circular layout is often used to capture ring and star topologies but can be used for other networks as well.^[25] It can be used for social networks, WWW graphs, and clustering networks. When looking at symmetry or relationships between nodes in a consistent order, the circular layout makes these patterns easier to see.^[26]

In cosegregation networks, circular layouts can be used to show relationships among genomic regions in a consistent ordering.

Random Layout

The random layout assigns each node a random position within a unit square.^[27] Node placement doesn’t reflect any structural properties of the graph, so it is rarely used for analysis. However, this layout is simple and quick to generate, and is often used for comparison against more meaningful layout methods or to provide initial positions for iterative algorithms.

Spectral Layout

Unlike layouts such as spring or circular, the spectral layout is based on the mathematical properties of the graph—specifically the eigenvectors of the graph Laplacian matrix.^[28] This allows it to position nodes in a way that reflects the overall structure of the network. The spectral layout is particularly useful for identifying clusters or partitions within a graph, since nodes that are more closely related tend to be positioned closer together based on the graph’s structure.^[29] This makes it especially valuable for cosegregation analysis, where the goal is to detect groups of nodes that consistently associate with each other.
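The clustering behavior can be illustrated on a small hypothetical network of two tightly connected groups joined by a single edge. In this sketch, the first coordinate returned by `nx.spectral_layout` places the two groups on opposite sides of zero; the graph and the grouping are assumptions chosen to make the effect easy to see.

```python
import networkx as nx

# Hypothetical network: two triangles of windows joined by one edge.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)])

# Positions are derived from eigenvectors of the graph Laplacian.
pos = nx.spectral_layout(G)

# Split the nodes by the sign of their first coordinate.
left = sorted(node for node in G.nodes if pos[node][0] < 0)
right = sorted(node for node in G.nodes if pos[node][0] > 0)
print(left, right)
```

Which group lands on the negative side is arbitrary, because an eigenvector's sign is not fixed, but the two triangles are separated either way.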