{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4. Finding biological circuit motifs\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Design principles**\n",
    "\n",
    "* Larger circuits can be understood in terms of recurring functional modules.\n",
    "* Searching for such motifs can help identify these modules.\n",
    "\n",
    "**Key concepts**\n",
    "\n",
    "* Circuit motifs can be identified by comparing the graphs of actual circuits with randomized ensembles of statistically similar circuits.\n",
    "\n",
    "**Techniques**\n",
    "\n",
    "* Graph theory approaches\n",
    "\n",
    "_This lesson is based on Chapter 6 of the first edition of Alon, Introduction to Systems Biology._\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Circuit motifs\n",
    "\n",
    "Today we return to, and expand on, a concept briefly introduced in the previous lecture: circuit motifs. \n",
    "\n",
    "### Historical context\n",
    "In the early 2000s, the combination of automation and high-throughput assays began to transform biology. Previously, molecular biologists had generally focused on analyzing specific systems involving handfuls of genes and their corresponding functions. But with the introduction of DNA microarrays and other technolopgies able to read out expression of thousands of genes in parallel, it became possible to move assays to the genome scale. Instead of analyzing a few genes at a time, researchers began to ask how every gene responded to a perturbation, which genes were co-regulated by a particular transcription factor, and so on. \n",
    "\n",
    "A new challenge emerged: could one use these high-throughput approaches to gain a broader view of the structure of cellular circuits and to \"see,\" for the first time, the  complete circuit diagram of a cell? A top down, and relatively quantitative, view of cellular circuitry, it was hoped, could enable the construction of mathematical models that would enable us to understand, predict, and, ultimately, control cellular behavior with greater precision and insight than previously possible. \n",
    "\n",
    "It turns out that obtaining large data sets and extracting meaningful insights from them are two very different things! While the goals of this systems biology vision have not yet been fully realized, the approaches have transformed the way we do biology and the way we concpetualize questions about cellular function.\n",
    "\n",
    "One of the most important scientific breakthroughs during this period was the discovery of **circuit motifs.** The identification of motifs showed how one could use high throughput data sets to obtain powerful new functional insights into genetic circuits.\n",
    "\n",
    "Today we are going to try to define motifs and show how they can be systematically discovered from maps of regulatory interactions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A circuit can be represented as a directed graph\n",
    "\n",
    "A regulatory circuit can be thought of abstractly as a **directed graph**--a set of nodes and arrows connecting them. For the transcriptional circuitry of a bacterium, each node could represent an operon, and each arrow could represent regulation of one operon (tip of the arrow) by a transcription factor in another operon (base of the arrow).\n",
    "\n",
    "A **circuit motif** (or ‘network motif’) is defined as an over-represented regulatory pattern--or **sub-graph** within the overall circuit or graph. To be a 'motif', and not just a pattern, it should occur significantly more frequently than one would expect under some reasonable null hypothesis. \n",
    "\n",
    "The enrichment of the motif within the larger circuitry indicates that it has been selected by evolution, repeatedly. This selection suggests that the motif could play an important role as a functional module within the larger circuit.\n",
    "\n",
    "<div style=\"width: 300px; margin: auto\">\n",
    "\n",
    "![simple graph](figs/simple_graph.png)\n",
    "\n",
    "</div>\n",
    "\n",
    "In the last lecture, we briefly introduced this principle in the context of single-gene autoregulation. \n",
    "\n",
    "We noted:\n",
    "\n",
    "* _E. coli_ has ≈424 operons, and ≈519 transcriptional regulatory interactions (\"arrows\")\n",
    "* We only expect about one autoregulatory arrow \"by chance,\" assuming a given arrow selects its target operon randomly.\n",
    "* However, we observe ≈40 _auto_regulatory arrows (32 negative, 8 positive)\n",
    "* Autoregulation seems to be over-represented, and is therefore a \"motif.\" This suggests that it might provide a useful function.\n",
    "* To think about that function might be, we analyzed the dynamics of negative and positive autoregulatory circuits. We found that negative autoregulation accelerates turn-on times, and positive autoregulation, when coupled with ultrasensitivity, can generate bistability.\n",
    "\n",
    "Today we will try to develop this approach in more detail and apply it to larger motifs, involving 3 genes. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Random graphs enable motif identification\n",
    "\n",
    "Consider a directed graph representing a hypothetical regulatory circuit (left). To find over-represented patterns in this graph, we can compare it to an ensemble of randomized variants and look for sub-graphs that occur more frequently in this circuit than in most of the variants. \n",
    " \n",
    "<div style=\"width: 600px; margin: auto\">\n",
    "\n",
    "![unlabeled circuit ensemble](figs/unlabeled_circuit_ensemble.png)\n",
    "\n",
    "</div>\n",
    " \n",
    "*Schematic comparison between a specific circuit (left) and a set of randomized variants with the same incoming and outgoing arrow distributions (right). Image modified from [Milo et al, _Science_, 2002](https://doi.org/10.1126/science.298.5594.824)*.\n",
    "\n",
    "\n",
    "To make the comparison as fair as possible, we need to design variant graphs that maintain the key statistical properties of the original graph. Specifically, we insist that all variant graphs have the same number of nodes and arrows. But this is not really enough: It wouldn't make sense to compare this graph, whose arrows are distributed over all the nodes to a variant in which, say, all arrows stem from a single source node, or converge on a single target node. Therefore, we also need to impose a stronger constraint: maintaining the exact distribution of incoming and outgoing arrows for each node. \n",
    "\n",
    "If you examine the graphs above, you can see that they were constructed to have the same numbered nodes. Furthermore, the number of arrows exiting and entering each numbered node is exactly the same in all graphs. For example, node 4 always has 2 outgoing arrows. Node 6 always has one incoming and two outgoing arrows, node 12 always has 2 incoming arrows, and so on. \n",
    "\n",
    "To generate the randomized graphs, imagine cutting each arrow in half with a scissors to generate a dangling \"+\" end connected to an arrowhead and a dangling \"-\" end connected to a source node. Then one would tie each \"+\" end to a randomly chosen \"-\" end. Voila, a new randomized graph is guaranteed, through this procedure, to maintain the same joint distribution of incoming and outgoing arrows. For more details, see the algorithms in [Shen-Orr et al, _Nat. Genet._, 2002](https://doi.org/10.1038/ng881) and in [M. E. J. Newman et al, _Phys. Rev. E_, 2001](https://doi.org/10.1103/PhysRevE.64.026118).\n",
    "\n",
    "To implement this type of algorithm, as described in [Milo et al, _Science_, 2002](https://doi.org/10.1126/science.298.5594.824), it is useful to represent the directed graph as a binary non-symmetric square matrix, $\\mathsf{M}$. An example of a matrix representation of a graph is shown below. Here, an arrow from node $i$ to node $j$ is represented by $M_{i,j}=1$. The absence of such an arrow would be represented by $M_{i,j}=0$. (This type of matrix, representing a corresponding graph, is called an **adjacency matrix.**) With this algorithm, one can generate a large ensemble of randomized graphs. Then, one can ask which sub-graphs are significantly more (or less) prevalent in the actual graph than in the randomized variant graphs. \n",
    "\n",
    "<br />\n",
    "\n",
    "<div style=\"width: 500px; margin: auto\">\n",
    "\n",
    "![matrix representations 2](figs/matrix_representations2.png)\n",
    "\n",
    "</div>\n",
    "\n",
    "\n",
    "_Different views of the same graph, using dots and arrows (left) or as a binary matrix, shown either with 0s and 1s or in black and white (right)_.\n",
    "\n",
    "<br/>\n",
    "\n",
    "**Questions:**\n",
    "- What algorithm would you use to generate the randomized graphs (matrices)?\n",
    "- How would you control for the bias of smaller motifs in the frequency of larger motifs?\n",
    "\n",
    "<!--\n",
    "\n",
    "JB note: I'm cutting this because it's not really necessary.\n",
    "\n",
    "Note: The supplementary information of Milo et al, _Science_, 2002, available [here](http://www.weizmann.ac.il/mcb/UriAlon/sites/mcb.UriAlon/files/uploads/NMpaper/networkmotifssmd.pdf).\n",
    "\n",
    "-->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Finding 3-node motifs in a bacterial transcriptional network\n",
    "\n",
    "Now that we have an algorithm to identify motifs, let's use it on the E. coli network. First we will focus on 3-node sub-graphs. These are big enough to be quite interesting, but small enough to be easily conceptualized and analyzed. There are 13 possible 3-node graphs that exclude direct autoregulation, which are shown here in a drawing from the Alon book.\n",
    "\n",
    "<div style=\"width: 400px; margin: auto\">\n",
    "\n",
    "![all three node motifs](figs/all_3_motifs.png)\n",
    "\n",
    "</div>\n",
    "\n",
    "_Image adapted from [Milo, et al., Science, 2002](https://doi.org/10.1126/science.298.5594.824)._\n",
    "\n",
    "Now look back at the schematic graph/matrix representation above. Can you tell which of these sub-graphs are over-represented? That graph was constructed to contain a preponderance of one particular sub-graph:\n",
    "\n",
    "<div style=\"width: 50px; margin: auto\">\n",
    "\n",
    "![FFL motif](figs/FFL-motif.png)\n",
    "\n",
    "</div>\n",
    "\n",
    "This sub-graph, termed the **Feed-Forward Loop (FFL)**, has one node that regulates a target node two ways: directly, and indirectly through the third node. \n",
    "\n",
    "To visualize this, we can highlight in red all of the FFLs in the graphs shown above, the one at left being artificially constructed and those to the right being randomly generated graphs. \n",
    "\n",
    "<div style=\"width: 600px; margin: auto\">\n",
    "\n",
    "![labeled circuit ensemble](figs/labeled_circuit_ensemble.png)\n",
    "\n",
    "</div>\n",
    "\n",
    "We see that the FFL motif is over-represented. Of course, this is a schematic graph deliberately constructed to illustrate over-representation of this sub-graph. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What happens in real circuits?\n",
    "   \n",
    "To address this question, Alon & colleagues went through the _E. coli_ transcriptional circuit, and systematically counted how many times they observed each of the thirteen possible three-node sub-graphs shown above. For each one, they compared the number of observations in the real circuit to the number of times that sub-graph was observed in each member of a large ensemble of randomized circuits. They used a **z-score** to quantify over- or under-representation in units of the standard deviation of the number of occurrences for the sub-graph in the randomized circuits: $z = (n_{obs}-\\langle n\\rangle)/\\sigma$, where $n_{obs}$ is the number of times the sub-graph was observed in the actual circuit, $\\langle n\\rangle$ and $\\sigma$ denote the mean and standard deviation of the number of times it was observed in each of the randomized circuits. \n",
    "    \n",
    "They did this for several transcriptional circuits, including _E. coli_, yeast (two versions), and _B. subtilis_. Here are the results:\n",
    "\n",
    "<div style=\"width: 600px; margin: auto\">\n",
    "\n",
    "![profile of 3 motifs](figs/profile_of_3_motifs.png)\n",
    "\n",
    "</div>\n",
    "\n",
    "\n",
    "Several features are striking in this plot:\n",
    "\n",
    "* There is strong over-representation of feed-forward loops (FFLs). In _E. coli_, one expects to see **7±5** FFLs by chance, but one observes this pattern **42 times** in the real circuit. \n",
    "* The over-representation of FFLs is  conserved across three distinct organisms, suggesting it is a general property of transcriptional circuits.\n",
    "* Most other sub-graphs are neither over- nor under-represented.\n",
    "* Three sub-graphs are statistically under-represented. They occur significantly less often than one would expect by chance, provoking the question of what problems they might present as components of larger circuits. These three sub-graphs are all sub-graphs of the FFL sub-graph. Thus, it's not that divergent regulation (sub-graph 1) is rare, it's just that when it does occur, it does so in the context of the FFL.   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How do sub-graph abundances scale?\n",
    "\n",
    "Next, let's try to get a better intuition for how often we expect to see various sub-graphs appear in graphs of different sizes. \n",
    "\n",
    "We will ask: What is the probability of obtaining an n-vertex sub-graph, $G$, by chance in a large random graph with $N$ nodes and $E$ edges? Clearly the answer will depend on the number of nodes, $n$, and edges, $g$, in $G$. For the FFL, $n=3, g=3$. \n",
    "\n",
    "Many biological networks are **sparse**, meaning the probability, $p$, of any potential edge occurring is low. Since the number of possible edges is $N^2$, so that $p = E/N^2 \\ll 1$. For _E. coli_, $p≈0.002$. We will therefore focus on this regime. \n",
    "\n",
    "The question then becomes, for large sparse graphs, how should the expected number of occurrences of sub-graph, $G$, scale with the size of the circuit, or number of nodes, $N$?. For this argument, we will ignore numerical prefactors, and focus only on the dependence on $N$. \n",
    "\n",
    "1. Within the overall network the number of potential instances of $G$ is the binomial coefficient $\\binom{N}{n}$, where $n$ is the number of nodes in $G$, e.g. $n=3$ for the FFL. \n",
    "2. The probability that any one of these potential instances is the desired sub-graph, is proportional to $p^g$, where $g$ denotes the number of edges in the $G$, e.g. $g=3$ for the FFL. There is also a permutation factor, representing the number of different ways that you could make the desired sub-graph from a set of 3 nodes. We neglect this factor because it is independent of $N$. \n",
    "3. Since $N \\gg n$, we can use the well-known [Stirling approximation](https://en.wikipedia.org/wiki/Stirling's_approximation) of the binomial coefficient: \n",
    "\\begin{align}\n",
    "\\phantom{blah} \\\\\n",
    "\\binom{N}{n} \\approx \\frac{N^n e^n}{\\sqrt{2\\pi} n^{n+1/2}}\\\\[1em]\n",
    "\\phantom{blah}\n",
    "\\end{align}\n",
    "This is $N^n$ times numerical factors that are independent of $N$.\n",
    "4. Thus, the expected number of occurrences scales as $N^n p^g$. \n",
    "5. Rewriting this expression provides more insight into the scaling of sub-graph abundance:\n",
    "\\begin{align}\n",
    "&\\phantom{blah} \\\\\n",
    "N_G & \\sim N^n p^g \\\\[1em]\n",
    "&= N^n \\left( \\frac{E}{N^2} \\right)^g \\\\[1em]\n",
    "&= N^{n-2g} E^g \\\\[1em]\n",
    "&= N^{n-g} \\left( \\frac{E}{N} \\right)^g\\\\[1em]\n",
    "\\phantom{blah}\n",
    "\\end{align}\n",
    "The thing to notice here is that the N-dependence can be captured by an extensive term that scales as $N^{n-g}$ times a function fo the graph's mean connectivity, $E/N$, which is an intensive (N-independent) property. \n",
    "\n",
    "This result leads to interesting and counterintuitive expectations:\n",
    "\n",
    "<div style=\"width: 250px; margin: auto\">\n",
    "\n",
    "![motif scaling](figs/motif_scaling.png)\n",
    "\n",
    "</div>\n",
    "\n",
    "* Sparsely connected sub-graphs increase in frequency with $N$\n",
    "* Densely connected sub-graphs decrease in frequency with $N$. This means that as the graph grows larger there are actually *fewer* of these motifs.\n",
    "* And in between, sub-graphs with the same number of nodes and edges, such as the FFL, actually have a constant expected number regardless of $N$. \n",
    "\n",
    "Those last two points initially struck me as very counter-intuitive. How can the absolute number of times a graph appears in the network be independent of the size of that network? Surely, if you have more network you should have more of all subgraphs. But, loosely speaking, as the number of nodes increases the probability of an arrowhead landing on a node within a potential sub-graph decreaes. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Beyond transcriptional circuits\n",
    "\n",
    "It's important to remember that these are expectations for (a particular class of) random networks. Real networks will of course differ, because they are designed, or evolved, to perform specific functions. In fact, different types of systems seem to show quite distinct motif profiles:\n",
    "\n",
    "<div style=\"width: 600px; margin: auto\">\n",
    "\n",
    "![superfamilies](figs/superfamilies.png)\n",
    "\n",
    "</div>\n",
    "\n",
    "_Image taken from [Milo, et al., Science, 2004](https://doi.org/10.1126/science.1089167)_\n",
    "\n",
    "\n",
    "Here, the top group are the transcriptional circuits we already saw. The next group are biological signal transduction circuits of different types from different species. The third group are networks like the world-wide web, for which a directed edge indicates a link from one web page to another. Finally, the lowest group represents motifs in language graphs, in which nodes are words and edges are derived from the probability that one word follows another. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## There are many kinds of FFLs\n",
    "\n",
    "Our description of regulatory interactions so far has been oversimplified in several ways: \n",
    "* We have collapsed positive and negative regulation together. \n",
    "* We have not considered how multiple regulators combine to control expression of a mutual target operon. \n",
    "* We have ignored the quantitative strength of the regulation. \n",
    "\n",
    "Understanding the function of a motif requires thinking about these aspects more carefully. \n",
    "\n",
    "We can classify the overall FFL motif into $2^3=8$ different categories depending on which of its 3 arrows are positive or negative:\n",
    "\n",
    "<div style=\"width: 300px; margin: auto\">\n",
    "\n",
    "![FFL classes](figs/FFL_classes.png)\n",
    "\n",
    "</div>\n",
    "\n",
    "_Image taken from [Alon, Nature Rev. Genet., 2007](https://doi.org/10.1038/nrg2102)_\n",
    "\n",
    "We can then further classify the FFLS, according to how the regulatory arrows converging on the third node (now labeled \"Z\") combine. In AND regulation, both X and Y need to be simultaneously present at high levels for Z to be expressed. in OR regulation, either input is sufficient to activate Z. \n",
    "\n",
    "In the next lecture, we will consider what functions the various FFLs can perform for cells.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Aside: _sequence motifs_ identify cis-regulatory sites\n",
    "\n",
    "The motif principle is quite general. Before it was applied to network motifs, it was used to identify functional motifs in genome sequences. For example, [Bussemaker et al, _PNAS_, 2000](https://doi.org/10.1073/pnas.180265397) presented an algorithm called _MobyDick_ that identifies \"words\" (letter sequence motifs), in any \"text\" (sequence of letters). They tested it out on the text of the classic novel Moby Dick, with spaces and punctuation removed, recovering a dictionary of English words that appear in the novel. Then, they used it on the yeast genome sequence to identify cis-regulatory sites, including binding sites\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "* Motifs are statistically over-represented patterns in a circuit.\n",
    "* To find motifs, we construct ensembles of random circuits with the same arrow distributions for comparison.\n",
    "* The abundances of different sub-graphs can scale in opposite and even counter-intuitive ways as the size of the overall circuit grows\n",
    "* This approach can be applied to a wide variety of systems that can be represented as graphs.\n",
    "* Transcriptional regulatory circuits appear to have a single major motif: the feed-forward loop.\n",
    "* There are multiple types of FFLs.\n",
    "\n",
    "In the next lecture, we will explore the functions of this motif.\n"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "position": {
    "height": "730.4000244140625px",
    "left": "486.6000061035156px",
    "right": "20px",
    "top": "120px",
    "width": "517.4000244140625px"
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}