Elisa Chao*1, Connor Chato*1, Reid Vender1,2, Art F. Y. Poon1

1Department of Pathology and Laboratory Medicine, Western University, London, Ontario, Canada.

2School of Medicine, Queen's University, Kingston, Ontario, Canada.

*denotes equal contribution

In the field of epidemiology, source attribution refers to a category of methods with the objective of reconstructing the transmission of an infectious disease from a source population, individual or location, to subsequent hosts - i.e., "who infected whom".

Source attribution tends to be a problem of statistical inference, because transmission events are seldom observed directly and may have occurred in the distant past. Thus, there is an unavoidable level of uncertainty when reconstructing transmission events from residual evidence, such as the spatial distribution of the disease. As a result, source attribution models often employ Bayesian methods that can accommodate substantial uncertainty in model parameters.

Many infectious diseases are routinely detected or characterized through genetic sequencing, which can be faster than culturing isolates in a reference laboratory and can identify specific strains of the pathogen at substantially higher precision than laboratory assays, such as antibody-based assays or drug susceptibility tests. On the other hand, analyzing the genetic (or whole genome) sequence data requires specialized computational methods to fit models of transmission. Consequently, source attribution is a highly interdisciplinary area of molecular epidemiology that incorporates concepts and skills from mathematical statistics and modeling, microbiology, public health and computational biology.

Due to the associated stigma and the criminalization of transmission for specific infectious diseases, source attribution at the level of individuals can be a controversial use of data that was originally collected in a healthcare setting, with potentially severe legal consequences for individuals who become identified as putative sources. In these contexts, the development and application of source attribution methods may involve trade-offs between public health responsibilities and individual rights to data privacy.

## Microbial subtyping

Microbial subtyping or strain typing is the use of laboratory methods to assign a microbial sample to one of potentially many predefined types or categories. Today it is more common to use genetic or whole-genome sequencing to characterize the microbial sample at the level of its nucleotide sequence. However, other molecular methods such as restriction fragment length polymorphism played an important role in microbial subtyping before genetic sequencing became an affordable and ubiquitous technology in reference laboratories. Characterizing a microbial sample at a molecular level enables one to distinguish one specimen from another.

The assignment of specimens to types or subtypes forms the basis of source attribution for pathogens with relatively slow rates of evolution, such that few mutations are observed on an epidemiological time scale[1]. For example, if host A is infected by a pathogen of subtype Z, then it is more likely that the pathogen was transmitted from another host (B) who was also diagnosed with a type Z infection than from a host with an infection of another type. In other words, transmission from host B is a more parsimonious explanation if there is a relatively small probability that the pathogen population in another host evolved from a different subtype to subtype Z prior to transmission. If host A carries the same infection subtype as many other hosts, then we have no information to differentiate one potential source from another. Our ability to identify a subset of potential sources from the population depends, therefore, on having a sufficient number of different subtypes. For this reason, sequence-based typing methods confer an advantage over other laboratory methods (such as serotyping or pulsed-field gel electrophoresis) because an enormous number of potential subtypes can be resolved at the level of the genetic sequence, and many pathogens accumulate high levels of genetic diversity. Encountering too many types in the population, however, makes it likely that every individual carries a unique subtype. Hence, there exists an intermediate level of type resolution that confers the greatest amount of information for source attribution. This limit may be overcome by using a clustering method to group multiple sequence variants into the same subtype.

### Single and multi-locus typing

Before whole-genome sequencing became more cost-effective, targeting specific parts of the pathogen genome was an important step to facilitate microbial subtyping. For example, the ribosomal gene 16S is a standard target for identifying bacteria, in part because it is present across all known species and contains a mixture of conserved and variable regions. Within a pathogen species, sequencing targets tended to be selected on the basis of their length, ubiquity and exposure to diversifying selection, which may be dictated by the function of the gene product for expressed sequences. For example, so-called "housekeeping genes" have indispensable biological functions, such as copying genetic material or building proteins. These genes are often preferred candidates for microbial subtyping because they are less likely to be absent from a given genome. This is particularly relevant for bacteria, where genetic material is frequently exchanged through horizontal gene transfer.

Targeting multiple regions (loci) of the pathogen genome confers greater precision in distinguishing one lineage from another, because it increases the chance of observing informative genetic differences among infections. This approach is referred to as multi-locus sequence typing (MLST)[2]. Like single-locus typing, MLST still requires the selection of specific loci to target for sequencing. To make subtyping consistent across laboratories, a reference database must be maintained that maps sequences from single or multiple loci to a fixed notation of allele numbers or designations.

#### Whole genome sequencing

Although single- and multiple-locus subtyping is still predominantly used for molecular epidemiology, ongoing improvements in sequencing technologies and computing power continue to lower the barrier to whole-genome sequencing. Next-generation sequencing (NGS) methods provide cost-effective methods to generate whole genome sequences from a given sample by individually amplifying and sequencing templates in parallel using customized technologies such as sequencing-by-synthesis[3]. Shotgun sequencing applications of NGS generate full-length genome sequences by shearing the nucleic acid extracted from the sample into small fragments that are converted into a sequencing library, and then using a de novo sequence assembler program to reconstitute the genome sequence from the sequence fragments (short reads)[4]. Alternatively, short reads can be mapped to a reference genome sequence that has been converted into an index for efficient lookup of exact substring matches. This approach can be faster than de novo assembly, but relies on having a reference genome that is sufficiently similar to the genome sequence of the sample. While NGS makes it feasible to simultaneously generate full-length genome sequences from hundreds of pathogen samples in a single run, it introduces a number of other challenges. For instance, NGS platforms tend to have higher sequencing error rates than conventional sequencing, and regions of the genome with long stretches of repetitive sequence can be difficult to reassemble.
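The idea of mapping short reads against an indexed reference can be illustrated with a toy example. The sketch below is a simplification for illustration only: it assumes a plain hash table of k-mers as the index and gap-free alignment, whereas production read mappers commonly rely on more compact index structures. The function names `build_kmer_index` and `map_read`, and the `max_mismatches` parameter, are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer (substring of length k) to its start positions in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return sorted reference positions where the read aligns without gaps.

    Exact k-mer matches act as seeds; each seed implies a candidate start
    position, which is then verified by counting mismatches.
    """
    hits = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            if sum(x != y for x, y in zip(read, window)) <= max_mismatches:
                hits.add(start)
    return sorted(hits)


reference = "ACGTTGCATGCCGATAGCTA"      # arbitrary toy reference
index = build_kmer_index(reference, k=4)
print(map_read("GCATGCCG", reference, index, k=4))   # exact copy of positions 5-12
print(map_read("GCATGACG", reference, index, k=4))   # same read with one mismatch
```

A read that is too divergent from the reference shares no exact seed with it and is left unmapped, which is the dependence on a sufficiently similar reference genome noted above.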

If investigators have access to both an NGS facility and bioinformatic support for processing the raw data, then whole genome sequencing can confer a significant advantage for source attribution over single- or multiple-locus subtyping. For example, whole-genome sequence data revealed differences between isolates of Burkholderia pseudomallei from Australia and Cambodia that had otherwise appeared to be identical by multi-locus subtyping due to convergent evolution[5]. Whole-genome sequencing has also been utilized in several recent studies to resolve transmission networks of Mycobacterium tuberculosis in greater detail, because isolates with identical multi-locus subtypes (MIRU-VNTR profiles) were frequently separated by large numbers of nucleotide differences[6][7].

### Genetic clustering

When applied to genetic sequences, a clustering method is a set of rules for assigning the sequences to a smaller number of clusters such that members of the same cluster are more genetically similar to each other than sequences in other clusters. Put another way, a clustering method defines a partition on the set of genetic sequences using some similarity measure. Clustering is inherently subjective and there are usually no formal guidelines for setting the clustering criteria. Consequently, cluster definitions can vary substantially from one study to the next. On the other hand, clustering is an intuitive process that can be accomplished by a wide variety of approaches; because of this flexibility, numerous different methods of genetic clustering have been described in the literature[8].

#### Defining subtypes

Although many viruses have compact genomes that are feasible to sequence to completion using conventional sequencing techniques, routine clinical genotyping tends to target specific regions of the virus genome. Virus genes encoding proteins that are exposed to host-specific selection make popular targets for subtyping because they tend to be the most diverse, and may have significant associations with transmission and pathogenesis. For instance, the influenza A virus genes encoding hemagglutinin and neuraminidase are both used for subtyping. Since the gene composition of virus genomes tends to be more consistent than bacteria, however, most genomic regions are informative for subtyping and laboratories may use different targets. The use of different targets is feasible for virus subtyping because subtypes are defined as clusters of genetically similar sequences, where different fragments are compared to one of a number of subtype reference genomes. Indeed, it is often not feasible to employ a one-to-one map of unique virus sequences to subtype definitions because many viruses exhibit a relatively rapid rate of evolution.

Genetic clustering provides a way of dealing with rapidly evolving pathogens for which there can be an enormous number of distinct genetic sequences. If each subtype must correspond to a unique sequence variant, then one could potentially have to track an unwieldy number of microbial subtypes for these pathogens. The number of subtypes can be greatly reduced by expanding the definition of microbial subtypes from individually unique sequence variants to clusters of similar sequences. For example, pairwise distance clustering is a nonparametric approach in which clusters are assembled from pairs of sequences that fall within a threshold distance of each other. The distance between sequences is computed by a genetic distance measure (a mathematical formula that maps two sequences to a non-negative real number) that quantifies the evolutionary divergence between the sequences under some model of molecular evolution. Because this approach makes direct comparisons between sequences, it does not account for their common ancestry.
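Pairwise distance clustering can be sketched in a few lines. The example below is illustrative only: it assumes pre-aligned sequences of equal length, uses the raw proportion of differing sites (the p-distance) in place of a model-based distance, and treats clusters as connected components of the graph that links every pair within the threshold. The names `p_distance` and `distance_clusters` and the toy sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_clusters(seqs, threshold):
    """Group sequences into connected components of the graph whose edges
    join every pair of sequences with distance <= threshold."""
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent pointers to the representative of x's cluster.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)      # merge the two clusters

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=lambda c: sorted(c))


toy = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",   # 1 difference from A
    "C": "ACGTACGAAT",   # 1 difference from B, 2 differences from A
    "D": "TTGTCCGAGG",   # distant from all others
}
print(distance_clusters(toy, threshold=0.1))
```

In this toy data, A and C end up in the same cluster even though their own distance (0.2) exceeds the threshold, because both are linked to B. This chaining behaviour is a property of the connected-components definition assumed here; other clustering criteria would split them.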

Other nonparametric methods derive clusters from the phylogenetic tree that is reconstructed from the sequences to model how they are related to common ancestors. A cluster may be defined, for instance, by the set of all tips that descend from a specific ancestral node in the tree, which is referred to as a subtree. Subtree clusters can be extracted from the tree based on the statistical confidence in the composition of the subtree (see Node support), or by comparing the distribution of branch lengths within the subtree relative to the rest of the tree. Since common ancestry is inferred from the genetic similarity of descendants, pairwise distance and subtree clusters can yield similar results. For example, an early proposal for a hierarchical nomenclature system to classify HIV-1 infections recommended the use of both pairwise distance and phylogenetic analyses for designating a new subtype or sub-subtype [9].

#### Transmission clustering

Genetic clustering is frequently used for molecular epidemiology applications in which clusters of infections are interpreted to be epidemiologically related. For instance, infections that share a high degree of genetic similarity are implied to be related through a series of recent transmission events. In addition, the phylogenetic tree relating infections can be similar in shape to their transmission history[10]; however, there are several reasons why a phylogenetic tree will be discordant from the underlying transmission history (see Inferring transmission history from the phylogeny). Consequently, clusters based on genetic distances or phylogenetic criteria are often referred to as "transmission clusters". For pairwise distance clusters, for instance, each link between cases in the same cluster represents only their genetic similarity and not a transmission event. Variation in transmission rates is not the only determinant of genetic clustering; in fact, several studies have demonstrated that clustering is also strongly influenced by the time between infection and sampling[11][8]. Furthermore, the addition of newly obtained genetic sequence data to a set of clusters can rearrange the associations between infections. For example, clusters that were originally separate and distinct may be merged through a previously unsampled sequence. In phylogenetic applications, a new sequence may intersect the original phylogenetic tree at a point that separates two infections that were previously located in adjacent positions in the tree. For these reasons and more, genetic clustering is not by itself a viable method of source attribution (see Forensic applications of phylogenetic clustering).

## "Dutch"/Hald models and Bayesian inference

The "Dutch model" [12] was originally developed to estimate the most likely source of a number of foodborne illnesses due to infection by Salmonella by comparing the relative frequencies of bacterial subtypes in different commercial livestock populations (including poultry, swine and cattle) through routine surveillance programs. For a given subtype, the expected number of human cases attributed to each source is proportional to the relative frequencies of that subtype among sources:

${\displaystyle \lambda _{ij}={\frac {p_{ij}}{\sum _{j}p_{ij}}}n_{i}}$

where ${\displaystyle p_{ij}}$ is the proportion of cases in the ${\displaystyle j}$-th source population associated with subtype ${\displaystyle i}$, and ${\displaystyle n_{i}}$ is the number of cases of subtype ${\displaystyle i}$ in the recipient (human) population. For example, if the frequencies of subtype X among three potential sources were 0.8, 0.5 and 0.1, respectively, then the expected number of cases (out of a total of 100) from the second source is ${\displaystyle 0.5/(0.8+0.5+0.1)\times 100=35.7}$. This simple formula is a maximum likelihood estimator when the total force of infection from each source into the human population is uniform, e.g., the sources have equal population sizes.
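The worked example above can be reproduced with a short function. This is a minimal sketch of the proportional allocation only; the function name `dutch_expected_cases` is hypothetical, and the inputs are the arbitrary frequencies from the example.

```python
def dutch_expected_cases(p_i, n_i):
    """Expected human cases of one subtype attributed to each source.

    p_i : frequency of the subtype in each source population (p_ij for fixed i)
    n_i : number of human cases of that subtype
    """
    total = sum(p_i)
    if total == 0:
        raise ValueError("subtype was not observed in any source")
    return [p / total * n_i for p in p_i]


expected = dutch_expected_cases([0.8, 0.5, 0.1], n_i=100)
print([round(x, 1) for x in expected])  # [57.1, 35.7, 7.1]
```

The attributed counts always sum to the number of human cases of that subtype, so the model only redistributes observed cases among sources.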

Fig 1. Summary of Hald model parameters. Arbitrary numbers are provided for observed quantities, such as the proportion of infections due to subtype 1 in source population 2 (${\displaystyle p_{12}}$). The marginal effect associated with source population 2 (${\displaystyle a_{2}}$) is represented by an open rectangle (solid line); while the total size of this source population ${\displaystyle M_{2}}$ is observed, ${\displaystyle a_{2}}$ must be estimated by regression. Similarly, the marginal effect associated with subtype 3 (${\displaystyle q_{3}}$, indicated by a rectangular shaded region) is simultaneously estimated by regression.

Subsequently, this model was extended by Hald and colleagues [13] to account for variation among sources and subtypes using Bayesian inference methods. This extension, typically referred to as the Hald model, has become a standard model in source attribution for food-borne illnesses. The observed number of cases of each subtype in the human population was assumed to be a Poisson-distributed outcome with a mean ${\displaystyle \lambda _{i}}$ for the i-th subtype, after adjusting for cases related to travel and outbreaks:

${\displaystyle \lambda _{i}=\sum _{j}\lambda _{ij}=\sum _{j}q_{i}M_{j}a_{j}p_{ij}}$

where ${\displaystyle q_{i}}$ is the marginal effect of the i-th subtype (e.g., elevated infectiousness of a bacterial variant), ${\displaystyle M_{j}}$ is the observed total "mass" of the j-th food source, ${\displaystyle a_{j}}$ is the marginal effect of the j-th food source, and ${\displaystyle p_{ij}}$ is the same observed case proportion as the original "Dutch" model. This model is visualized in Figure 1.
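To make the roles of the parameters concrete, the sketch below computes the expected number of cases per subtype by summing ${\displaystyle q_{i}M_{j}a_{j}p_{ij}}$ over sources ${\displaystyle j}$, and evaluates the Poisson log-likelihood of a set of observed counts. All numerical values are arbitrary and chosen for illustration; the function names are hypothetical, and this is not the code used by Hald and colleagues.

```python
import math


def hald_expected(q, M, a, p):
    """Return lam_ij[i][j] = q[i] * M[j] * a[j] * p[i][j] and the sums over sources."""
    lam_ij = [[q[i] * M[j] * a[j] * p[i][j] for j in range(len(M))]
              for i in range(len(q))]
    lam_i = [sum(row) for row in lam_ij]
    return lam_ij, lam_i


def poisson_loglik(counts, lam_i):
    """Log-probability of observed human case counts, assuming independent
    Poisson outcomes with a strictly positive mean lam_i[i] for each subtype."""
    return sum(y * math.log(lam) - lam - math.lgamma(y + 1)
               for y, lam in zip(counts, lam_i))


# Two subtypes (rows of p) and two food sources (columns of p); arbitrary values.
q = [1.0, 2.0]                 # subtype effects (unobserved in practice)
M = [100.0, 50.0]              # observed "mass" of each food source
a = [0.01, 0.02]               # source effects (unobserved in practice)
p = [[0.5, 0.1], [0.2, 0.4]]   # observed subtype proportions by source

lam_ij, lam_i = hald_expected(q, M, a, p)
print(lam_i, poisson_loglik([1, 0], lam_i))
```

Because `q` and `a` only enter through their product with the observed quantities, different combinations can give the same expected counts, which is one reason some parameters must be fixed or constrained when fitting the model.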

### Bayesian inference

The addition of a large number of parameters to the model described by Hald and colleagues yielded a more realistic model. However, it was too complex to solve for exact maximum likelihood estimates, in contrast to the original "Dutch" model. Many of the parameters could not be directly measured, such as the relative transmission risk associated with a specific food source. Consequently, Hald and colleagues adopted a Bayesian approach to estimate the model parameters. A similar approach has also been used to reconstruct the contribution of different environmental and livestock reservoirs of the bacterium Campylobacter jejuni to an outbreak of food poisoning in England [14], where the migration of different subtypes among reservoirs was jointly estimated by Bayesian methods.

Although Bayesian inference is discussed extensively elsewhere, it plays an important role in computationally intensive methods of source attribution, so we provide a brief and accessible introduction to the subject in this context. Bayesian inference stipulates that every parameter is described by a probability distribution that represents our 'belief' about its true value. For example, we are inclined to believe that the chance of a coin toss resulting in "heads" is roughly 50%. Our belief about this chance as a parameter can be represented by a roughly bell-shaped distribution centered at 0.5 over the interval 0 to 1. This is referred to as the prior distribution, because it reflects our belief about parameters prior to seeing the data.

#### Prior and posterior belief

Hald and colleagues used uniform prior distributions for many of their parameters to express the prior belief that the true value fell within a continuous range with specific upper and lower limits. When this range is set to span a wide range of values, it becomes an uninformative prior --- we are being intentionally vague about our prior belief. They constrained some parameters to take the same numerical value as others. For example, the effects of domestic and imported supplies of the same food source were linked in this manner. This assumption expressed a strong belief that a given food source carried the same transmission risk irrespective of its origin, and simplified the model so it was more feasible to fit to the data. Other parameters were set to a fixed reference value to further simplify the model.

To update our belief, we need to have a model that describes the probability of different outcomes of an experiment, such as performing ten coin tosses. For example, the probability of a series of coin tosses can be modelled by the binomial distribution. Hald and colleagues employed a Poisson model that describes the probability of ${\displaystyle Y}$ rare events that occur at a rate ${\displaystyle \lambda }$. As described above, the rate of cases due to a specific bacterial subtype was the sum of transmission rates across all potential sources. The Hald model was more realistic than the "Dutch" model because it allowed transmission rates to vary between subtypes and food sources. However, it was not feasible to directly measure these different rates --- these parameters needed to be estimated indirectly from the data.

One approach to estimate these rate parameters is to find a single combination of parameter values that maximizes the probability of the data for this model (maximum likelihood). In a complex model where substantially different parameter combinations can explain the data with similar likelihood, however, it can be more informative to summarize the parameter values supported by the data with another statistical distribution. This posterior distribution represents our updated belief about the model parameters after incorporating the information gained from collecting new data. For example, if the coin in our experiment is not balanced, we may update our belief to relocate the distribution so it is centered over a lower or higher probability of landing 'heads'.
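The coin-toss example is one of the few cases where the posterior distribution can be written down exactly. The sketch below assumes a Beta prior, which is a convenient (but not the only) way to represent a bell-shaped belief centred at 0.5; with a binomial model, the posterior is again a Beta distribution whose parameters are the prior parameters plus the observed numbers of heads and tails. The prior parameters and the toss counts are arbitrary, and the function names are hypothetical.

```python
def beta_binomial_update(prior_a, prior_b, heads, tails):
    """Posterior Beta parameters after observing coin tosses (conjugate update)."""
    return prior_a + heads, prior_b + tails


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


prior = (5.0, 5.0)                       # bell-shaped belief centred at 0.5
posterior = beta_binomial_update(*prior, heads=8, tails=2)
print(beta_mean(*prior), beta_mean(*posterior))  # 0.5 0.65
```

After eight heads in ten tosses, the belief shifts towards a higher probability of heads, but not all the way to the observed proportion of 0.8, because the prior still carries weight.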

#### Markov chain Monte Carlo

In practical applications of Bayesian inference, it is often not possible to compute an exact solution for the posterior distribution given the data, likelihood function (model) and prior distribution. In these cases, the next best solution is to generate a random sample of parameter values from their joint posterior distribution. Since we are able to calculate the posterior probability for a specific set of hypothetical parameter values, we can try to sample from the posterior distribution by randomly sampling a set of parameter values ${\displaystyle \theta }$ from the prior distribution, and then calculating the posterior probability ${\displaystyle P(\theta )}$ at this point. Next, we draw a random value ${\displaystyle x}$ from a uniform distribution from 0 to some constant ${\displaystyle M}$ that is greater than any posterior probability ${\displaystyle P}$. We reject ${\displaystyle \theta }$ from our sample if ${\displaystyle x>P(\theta )}$, and draw another set of parameter values from the prior distribution.

Using this rejection sampling method, we can eventually accumulate a target number of parameter values. This is an example of a Monte Carlo method -- we are trying to solve a problem by repeated random sampling. However, it is not an efficient sampling method. For instance, there may be an enormous range of parameter values that have a low posterior probability, so most samples are rejected. Without knowing the maximum possible posterior probability, we might also set the maximum ${\displaystyle M}$ so high that most samples are rejected.
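Rejection sampling can be sketched for the coin-toss example. The code below makes simplifying assumptions that are not part of the Hald study: the prior is uniform on the interval from 0 to 1, the posterior is only evaluated up to a constant (the binomial kernel), and the bound ${\displaystyle M}$ is set to the known peak of that kernel. The function names and toss counts are hypothetical.

```python
import random


def unnormalized_posterior(theta, heads, tails):
    """Binomial likelihood kernel times a uniform prior density of 1 on [0, 1]."""
    return theta ** heads * (1.0 - theta) ** tails


def rejection_sample(n, heads, tails, rng):
    """Collect n accepted values of theta and report how many draws were needed."""
    # The kernel peaks at the observed proportion of heads, which gives a valid
    # bound M here; in general the peak is unknown and M must be guessed.
    mode = heads / (heads + tails)
    M = unnormalized_posterior(mode, heads, tails)
    accepted, tries = [], 0
    while len(accepted) < n:
        tries += 1
        theta = rng.random()                 # draw from the uniform prior
        x = rng.uniform(0.0, M)
        if x <= unnormalized_posterior(theta, heads, tails):
            accepted.append(theta)           # otherwise theta is rejected
    return accepted, tries


rng = random.Random(1)
sample, tries = rejection_sample(2000, heads=8, tails=2, rng=rng)
print(sum(sample) / len(sample), len(sample) / tries)
```

Even with the tightest possible bound, most draws in this one-parameter example are discarded, and the acceptance rate falls further if ${\displaystyle M}$ is set conservatively high or the number of parameters grows.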

In the original study, Hald and colleagues fit their model to the data using Markov chain Monte Carlo (MCMC), which is a popular extension of rejection sampling that can address these limitations. Instead of drawing candidate parameter values from the prior distribution, the next set of values is constrained to be similar to the last parameter values that were accepted. For this reason, MCMC sampling is less likely to reject the next set of parameters. This "proposed" set is accepted if it has a higher posterior probability; even if it has a lower posterior probability, the MCMC method allows for some random chance of accepting it nonetheless. The entire process is repeated for a large number of iterations.

Imagine, for example, that a pair of blindfolded hikers linked by a rope are traversing a mountain range. At each stage, the lead hiker climbs or walks in a random direction. If the change in elevation is positive, then the second hiker follows. If it is negative, then there is some random chance the second hiker will still follow; otherwise, they reel the first hiker back on the rope. In the long run, the amount of time the hikers spend in a particular area will be proportional to the overall elevation. Attaining this state is referred to as "convergence". This simplified analogy serves to illustrate some inherent difficulties with MCMC. If we do not spend enough time exploring, for example, then the location of the hikers will be more informative about where they started than about the regional topography. For this reason, MCMC samples need to be run for many iterations, especially if the parameter space is enormous. There is also no way of guaranteeing that the MCMC sample has converged; there is always the possibility that a remote peak was never "sampled" by the hikers. This issue can be ameliorated by running multiple MCMC samples that start in different locations in parameter space; for example, Hald and colleagues ran five replicate MCMC samples for 30,000 iterations.
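The hiker analogy corresponds closely to a random-walk Metropolis sampler, which is one simple member of the MCMC family. The sketch below applies it to the coin-toss example under a uniform prior; it is an illustration of the general idea and not the sampler used by Hald and colleagues. The step size, chain length, burn-in and starting points are arbitrary choices, and the function names are hypothetical.

```python
import math
import random


def log_posterior(theta, heads, tails):
    """Log of the binomial likelihood kernel under a uniform prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + tails * math.log(1.0 - theta)


def metropolis(n_iter, heads, tails, start, step, rng):
    """Random-walk Metropolis chain for the probability of heads."""
    theta, logp = start, log_posterior(start, heads, tails)
    chain = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)   # stay close to the current value
        logp_new = log_posterior(proposal, heads, tails)
        # Uphill moves are always accepted; downhill moves are accepted with
        # probability equal to the ratio of posterior densities.
        if math.log(1.0 - rng.random()) < logp_new - logp:
            theta, logp = proposal, logp_new
        chain.append(theta)                       # rejected moves repeat the old value
    return chain


rng = random.Random(42)
burn_in = 2000
for start in (0.1, 0.5, 0.9):                     # replicate chains, different starts
    chain = metropolis(20000, heads=8, tails=2, start=start, step=0.2, rng=rng)
    kept = chain[burn_in:]
    print(start, sum(kept) / len(kept))
```

If the replicate chains report similar averages after discarding the early iterations, that is consistent with convergence, although, as noted above, it does not prove it.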

## Phylogenetic methods

A phylogenetic tree or phylogeny is a hypothesis about the common ancestry of species or populations. In the context of molecular epidemiology, phylogenies are used to relate infections in different hosts and are usually reconstructed from genetic sequences of each pathogen population. To reconstruct the phylogeny, the sequences must cover the same parts of the pathogen genome; for example, sequences that represent multiple copies of the same gene from different infections. It is this residual similarity (homology) between diverging populations that implies recent common ancestry. A molecular phylogeny comprises "tips" or "leaves" that represent different genetic sequences that are connected by branches to a series of common ancestors that eventually converge to a "root". The composition of the ancestral sequence at the root, the order of branching events and the relative amount of change along each branch are all quantities that must be extrapolated from the observed sequences at the tips.

There are multiple approaches to reconstruct a phylogenetic tree from genetic sequence variation that we will only briefly describe here (see Yang and Rannala [15] for a recent review). Distance-based methods use a hierarchical clustering method to build up a tree based on some measure of the genetic dissimilarity between two sequences. Most distance measures in current use adjust for the occurrence of multiple substitutions at the same nucleotide or amino acid of the sequence, which would otherwise cause the measure to systematically underestimate the amount of evolutionary divergence between sequences. Some source attribution methods use a genetic distance threshold to rule out potential sources of infection for individuals whose pathogen sequences exceed a threshold distance to the sequence from another individual[16][17]. Importantly, it is not possible to infer the direction of transmission from this criterion alone.
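The effect of correcting for multiple substitutions can be seen with the Jukes-Cantor distance, which is used here only as the simplest example of a model-based correction; it assumes equal base frequencies and equal rates among all substitution types. If ${\displaystyle p}$ is the observed proportion of differing sites, the corrected distance is ${\displaystyle d=-{\tfrac {3}{4}}\ln(1-{\tfrac {4}{3}}p)}$. The function name `jc69_distance` and the toy sequences are hypothetical.

```python
import math


def jc69_distance(a, b):
    """Jukes-Cantor corrected distance between two aligned nucleotide sequences,
    in expected substitutions per site."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    if p >= 0.75:
        return math.inf      # sequences are saturated; the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


seq1 = "ACGTACGTAC"
seq2 = "TCGTACGAAT"          # differs from seq1 at 3 of 10 sites (p = 0.3)
print(jc69_distance(seq1, seq2))
```

The corrected distance is always at least as large as the raw proportion of differences, and the gap widens as sequences become more divergent.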

Likelihood-based methods employ a model of sequence evolution to compute the probability of the observed sequence data for a given tree. The investigator must then employ some heuristic algorithm to search the space of all possible trees to find one that maximizes this probability: the maximum likelihood estimate. Compared to distance-based methods, likelihood-based methods are computationally intensive but can accommodate more biological complexity such as variation in rates of evolution among sites or over time, thereby providing more accurate reconstructions of the phylogenetic tree [18]. Furthermore, likelihood-based methods can be incorporated into a Bayesian inference framework, which makes this class of methods particularly suitable for source attribution applications.

### Inferring transmission history from the phylogeny

The basic premise in applying phylogenetics to source attribution is that the shape of the phylogenetic tree approximates the transmission history, which can also be represented by a tree where each split into two branches represents the transmission of an infection from one host to another. It is generally easier to reconstruct a phylogenetic tree from genetic sequences than to reconstruct the transmission tree from other sources of information, such as contact tracing. Because of the visual and conceptual similarity between phylogenetic and transmission trees, it is a common mistake to interpret the branching points (splits) of the phylogeny as representing transmission events. In fact, the transmission event may have occurred at any point along the two branches that separate one sampled infection from the other in the virus phylogeny (Figure 2A). Thus, the transmission tree only constrains the shape of the phylogenetic tree. There are several reasons why these trees are not congruent, including: the time scales of transmission and pathogen evolution, incomplete sampling, pathogen evolution within hosts, and secondary infection of the same host.

Figure 2: Some challenges equating phylogeny with transmission history. The solid lines represent the phylogenetic relationship between infections that have been sampled from two or three different hosts. Shaded regions correspond to the location of virus lineages in different hosts, as indicated by colour. The transmission of a lineage from one host to another is represented by a gap between shaded regions and highlighted with a red arrow. (A) Although hosts 1 and 2 are closely related, the phylogeny does not indicate whether the infection was transmitted from host 2 to 1 (as shown), or vice versa. The transmission event may be located anywhere along the two branches connecting the hosts. (B) An infection may have been transmitted through any number of unsampled hosts before reaching the host that was sampled. (C) An unsampled host may be the source of infections transmitted to both hosts 1 and 2. (D) If pathogens establish a large diverse population within each host, the branches of the phylogeny may occur in a different order than the transmission history; as shown, hosts 1 and 3 are more closely related in the transmission history, but not in the phylogeny.

#### Time scales

First, the phylogenetic tree must be reconstructed from observed differences between infections that have been sampled from the host population --- this tree is almost never directly observed. If little to no evolution has occurred among these infections (in other words, if they are genetically almost identical), then it is difficult to reconstruct the order in which different lineages descended from their common ancestors. For example, this limitation is the driving force behind the growing adoption of whole-genome sequencing for the molecular epidemiology and source attribution[6] of the bacterium Mycobacterium tuberculosis. Whole-genome sequencing makes it possible to distinguish between infections that would otherwise be assigned to the same subtype according to the standard multi-locus genotyping method[7], which targets only 24 loci of the M. tuberculosis genome; this genome comprises roughly 4.3 million nucleotides encoding over 4,000 genes.

Even if the pathogen evolves relatively rapidly, a transmission history on a compressed time scale will make it difficult to distinguish common ancestors. Conversely, if there has been too much evolution because we are working with a transmission tree on an extremely long time scale, then the genetic similarity that implies common ancestry will have eroded. Even if there has been substantial evolution among the lineages, reconstructing their common ancestry is progressively more uncertain the deeper we go towards the root of the tree, away from the observed data.

#### Incomplete sampling

Equating the phylogenetic tree with the transmission history implicitly assumes that genetic sequences have been obtained from every infected host in the epidemic. In practice, only a fraction of infected hosts are represented in the sequence data. The existence of an unknown and inevitably substantial number of unsampled infected hosts is a major challenge for source attribution. Even if the phylogenetic tree indicates that two infections are more closely related to each other than to any other sampled infection, one cannot rule out the existence of one or more unsampled hosts who are intermediate links in the "transmission chain" separating the known hosts (Figure 2B). Similarly, an unsampled infection may have been the source population for both observed infections at the tips of the tree (Figure 2C). By itself, the phylogenetic tree does not explicitly discriminate among these alternative transmission scenarios.

#### Evolution within hosts

The shape of the phylogenetic tree may diverge from the underlying transmission history because of the evolution of diverse populations of the pathogen within each host. The individual copies of the pathogen genome that are transmitted to the next host are, by definition, no longer in the source population. The split in the phylogenetic tree represents the common ancestor that relates the transmitted lineages with the lineages that remained and persisted in the source population. If we follow both sets of lineages back in time, the time of the transmission event is the most recent possible time that they could converge to a common ancestor. Put another way, this event represents one extreme of a continuous range where the common ancestor is located further back in time.

The amount of time we have to follow two or more lineages back in time until we encounter a common ancestor is described by Kingman's coalescent model. Simply put, this amount of time should be greater as the historical size of the population gets larger, because there are more potential ancestors to choose from. Conversely, the time to common ancestors should get shorter with smaller population size. Consequently, the shape of the tree relating viruses from the same host contains information about the size and dynamics of the entire virus population comprising that infection.
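The dependence of coalescence times on population size can be illustrated with a minimal sketch. It assumes the simplest form of Kingman's coalescent: a haploid population of constant size N, with time measured in generations, so that each pair of lineages coalesces at rate 1/N. The function names are illustrative and are not taken from any cited software.

```python
import random

def expected_tmrca(n_lineages, pop_size):
    """Expected time (in generations) for n lineages to reach one ancestor."""
    return 2.0 * pop_size * (1.0 - 1.0 / n_lineages)

def simulate_tmrca(n_lineages, pop_size, rng):
    """Draw one time to the most recent common ancestor."""
    elapsed = 0.0
    for k in range(n_lineages, 1, -1):
        # With k lineages, each of the k(k-1)/2 pairs coalesces at rate 1/N.
        rate = k * (k - 1) / (2.0 * pop_size)
        elapsed += rng.expovariate(rate)
    return elapsed

rng = random.Random(1)
for pop_size in (100, 1000):
    draws = [simulate_tmrca(5, pop_size, rng) for _ in range(20000)]
    # Simulated mean should track the analytical expectation.
    print(pop_size, expected_tmrca(5, pop_size), sum(draws) / len(draws))
```

Under these assumptions the expected time to the common ancestor grows in direct proportion to the population size. Within-host populations that change in size over time would require a time-varying coalescent rate, which this sketch does not cover.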

The coalescence of lineages in a large, diverse pathogen population within each host represents a significant challenge for source attribution. For example, if a host has transmitted their infection to two others, then there are three sets of lineages whose ancestry can be traced in the source population. As a result, there is some chance that the virus phylogeny, which represents the convergence of lineages to their common ancestors, would imply a different order of transmission events if taken at face value (Figure 2D).

#### Clearance and secondary infection

Many infections can be spontaneously cleared by the host's immune system. If a host that has cleared a previously diagnosed infection becomes re-infected from another source, then it is possible for the same host to be represented by different infections in the phylogenetic and transmission trees, respectively. In addition, some individuals may become multiply infected from different sources. For example, roughly one-third of infections by hepatitis C virus are spontaneously cleared within the first six months of infection[19]. This previous exposure, however, does not confer immunity to re-infection by the same virus[20]. In addition, co-infection by multiple strains of hepatitis C virus that persist simultaneously within the same host can occur relatively frequently in some populations with a high rate of transmission, such as people who inject drugs using shared equipment (ranging from 14% to 39%)[21]. The persistence of strains from additional exposures may be missed by conventional genetic sequencing techniques if they are present at low frequencies within the host, necessitating the use of "next-generation" sequencing technologies. For these reasons, the epidemiological linkage of hepatitis C virus infections through genetic similarity may be a transient phenomenon, leading some investigators to recommend using multiple virus sequences sampled from different time points of each infection for molecular epidemiology applications[22].

### Ancestral host-state reconstruction

Ancestral reconstruction is the application of a model of evolution to a phylogenetic tree to reconstruct character states, ranging from nucleotide sequences to phenotypes, at the different ancestral nodes of the tree down to the root[23]. In the context of source attribution, ancestral reconstruction interprets the location of each pathogen lineage (among different host individuals) as an evolving character state. This is similar to applications of ancestral reconstruction in phylogeography[24], where the geographic location of an ancestral population is reconstructed from the current locations of its sampled descendants under some model of migration.

Migration models generally fall into two categories: discrete-state and continuous-state models. Discrete-state or island migration models assume that a given lineage is in one of a finite number of locations, and that it changes location at a constant rate over time according to a continuous-time Markov process, analogous to the models used for molecular evolution. For example, the software Phyloscanner[25] employs ancestral reconstruction to infer the location of a transmitted pathogen lineage in one host individual or another. Ancestral reconstruction with a discrete-state migration model has also been utilized to reconstruct the early spread of HIV-1 in association with development of transport networks and increasing population density in central Africa[26]. Continuous-state migration models are more similar to models of Brownian motion in that a lineage may occupy any point within a defined space. Although continuous models can be more realistic than discrete migration models, they may be more challenging to fit to data. Moreover, the movement of a pathogen population by transmission from one host to another is represented more naturally as a discrete state. Continuous-state models could be employed for source attribution at the level of geographic regions, especially if precise geolocation data were available; however, we have not yet found such an application of ancestral reconstruction for source attribution in the literature.
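As a rough illustration of a discrete-state model, the sketch below assumes the simplest possible case: a symmetric continuous-time Markov process in which a lineage moves from its current host to each other host at the same rate (analogous to the Jukes-Cantor model of nucleotide substitution). It then computes the posterior probability of the host state at the root of a two-tip tree under a uniform prior. The function names, the symmetric rate and the toy branch lengths are assumptions for illustration, and do not reflect how Phyloscanner or any other cited software is implemented.

```python
import math

def migration_probs(n_hosts, rate, elapsed):
    """Return (P[stay in same host], P[move to one specific other host])."""
    decay = math.exp(-n_hosts * rate * elapsed)
    stay = 1.0 / n_hosts + (n_hosts - 1.0) / n_hosts * decay
    move = (1.0 - decay) / n_hosts
    return stay, move

def root_posterior(n_hosts, rate, tips):
    """Posterior host state at the root, given (host_index, branch_length) tips.

    Every tip is assumed to attach directly to the root, and the prior over
    root states is assumed to be uniform.
    """
    weights = []
    for root_state in range(n_hosts):
        likelihood = 1.0
        for host, length in tips:
            stay, move = migration_probs(n_hosts, rate, length)
            likelihood *= stay if host == root_state else move
        weights.append(likelihood)
    total = sum(weights)
    return [w / total for w in weights]

# Two hosts (0 and 1); the lineage sampled from host 0 sits on a short branch.
print(root_posterior(2, 1.0, [(0, 0.1), (1, 1.0)]))
```

In this toy example the root is assigned to host 0 with higher probability, because a change of host is less likely along the short branch than along the long one.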

### Paraphyly

Paraphyly is a term that originates from the study of cladistics, an evolutionary approach to systematics that groups organisms on the basis of their common ancestry. A group of infections is paraphyletic if they are related by a common ancestor that also has one or more members that are not assigned to this group. In other words, one group is nested within an ancestral group. For example, birds are descended from a common ancestor that in turn shares a common ancestor with all reptiles; thus, birds are nested within the phylogeny of reptiles, making the latter a paraphyletic group. Thus, paraphyly is evidence of evolutionary precedence: the ancestor of all birds was a reptile. In the context of source attribution, paraphyly can be used as evidence that one infection preceded another. It does not provide evidence that the infection was directly transmitted from one individual to another, in part because of incomplete sampling.

The application of paraphyly for source attribution requires that the phylogenetic tree relates multiple copies of the pathogen from both the putative source and recipient hosts. To elaborate, phylogenetic trees relating different infections are often reconstructed from population-based sequences (direct sequencing of the PCR amplification product), where each sequence represents the consensus of the individual pathogen genomes sampled from the infected host. If copies of the pathogen genome are sequenced individually by limiting dilution protocols or next-generation sequencing, then one can reconstruct a tree that represents the genealogy of individual pathogen lineages, rather than the phylogeny of pathogen populations.

If sequences from host A form a monophyletic clade (in which members comprise the complete set of descendants from a common ancestor) that is nested within a paraphyletic clade of sequences from host B, then the tree is consistent with the direction of transmission having originated from host B[27]. Directionality does not imply that host B directly transmitted their infection to host A, because the pathogen may have been transmitted through an unknown number of intermediate unsampled hosts before establishing an infection in host A.
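The nesting criterion can be expressed as a small check on a rooted tree. The following sketch assumes a toy representation in which the tree is a nested tuple and tips are strings; the labels (`src`, `rec`, `bg`) and function names are hypothetical. Following the logic of evolutionary precedence described above, the nested monophyletic group is treated as the putative recipient and the surrounding paraphyletic group as the putative source.

```python
def tips(node):
    """Set of tip labels below a node (a tip is a string, a node is a tuple)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(tips(child) for child in node))

def clades(node):
    """Tip sets of this node and of every node descending from it."""
    found = {tips(node)}
    if not isinstance(node, str):
        for child in node:
            found |= clades(child)
    return found

def consistent_with_direction(tree, source_tips, recipient_tips):
    """True if recipient tips are monophyletic and nested in paraphyletic source tips."""
    all_clades = clades(tree)
    source = frozenset(source_tips)
    recipient = frozenset(recipient_tips)
    return (recipient in all_clades                  # recipient is monophyletic
            and source not in all_clades             # source alone is not a clade
            and (source | recipient) in all_clades)  # together they form a clade

# Toy rooted tree with two background sequences as the outgroup.
tree = (("bg1", "bg2"), ("src1", ("src2", ("src3", ("rec1", "rec2")))))
source, recipient = ["src1", "src2", "src3"], ["rec1", "rec2"]
print(consistent_with_direction(tree, source, recipient))   # forward direction
print(consistent_with_direction(tree, recipient, source))   # reverse direction
```

This check only tests tree topology. It says nothing about whether transmission was direct, and it ignores node support, so it is a sketch of the criterion rather than a complete analysis.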

#### Node support

The statistical confidence in directionality of transmission from a given tree is usually quantified by the support value associated with the node that is ancestral to the nested monophyletic clade. The support of node X is the estimated probability that if we repeated the phylogenetic reconstruction on an equivalent data set, the new tree would contain exactly the same clade, consisting exclusively of all descendants of node X in the original tree. In other words, it quantifies the reproducibility of that node given the data. It should not be interpreted as the probability that the clade below node X appears in the "true" tree[28]. There are generally three approaches to estimating node support:

1. Bootstrapping. Felsenstein adapted the concept of nonparametric bootstrapping to the problem of phylogenetic reconstruction by maximum likelihood[29]. Bootstrapping provides a way to characterize the sampling variation associated with the data without having to collect additional, equivalent samples. To start, one generates a new data set by sampling an equivalent number of nucleotide or amino acid positions at random with replacement from the multiple sequence alignment - this new data set is referred to as a "bootstrap sample". A tree is reconstructed from the bootstrap sample using the same method as the original tree. Since we are sampling sets of homologous characters (columns) from the alignment, the information on the evolutionary history contained at that position is intact. We record the presence or absence of clades from the original tree in the new tree, and then repeat the entire process until a target number of replicate trees have been processed. The frequency at which a given clade is observed in the bootstrap sample of trees quantifies the reproducibility of that node in the original tree.

Bootstrapping is a time-consuming process that scales linearly with the number of replicates, since every bootstrap sample is processed by the same method as the original tree, and post-processing steps are required to enumerate clades. The precision of estimating the node support values increases with the number of bootstrap replicates. For instance, it is not possible to obtain a node support of 99% if fewer than 100 bootstrap samples have been processed.

2. Bayesian sampling. Instead of using bootstrapping to resample the data, one can quantify node support by examining the uncertainty in reconstructing the phylogeny from the given data. Bayesian sampling methods such as Markov chain Monte Carlo (see Hald model) are designed to generate a random sample of parameters from the posterior distribution given the model and data. In this case, the tree is a collection of parameters. A Bayesian estimate of node support can be extracted from this sample of trees by counting the number of trees in which the monophyletic clade that descends from that specific node appears[30]. Bayesian sampling is computationally demanding because the space of all possible trees is enormous, making convergence difficult or not feasible to attain for large data sets[31].

3. Approximate likelihood-ratio testing. Unlike Bayesian sampling, this method is performed on a single estimate of the tree based on maximum likelihood, where the likelihood is the probability of the observed data given the tree and model of evolution. The likelihood ratio test (LRT) is a method for selecting between two models or hypotheses, where the ratio of their likelihoods is a test statistic that is mapped to a null distribution to assess statistical significance. In this application, the null hypothesis is that a branch in the reconstructed tree has a length of zero, which would imply that the descendant clade cannot be distinguished from its background[32]. This makes the LRT a localized analysis: it evaluates the support of a node when the rest of the tree is assumed to be true. On the other hand, this narrow scope makes the approximate LRT method computationally efficient in comparison to Bayesian sampling and bootstrap sampling.
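The resampling and counting steps of the bootstrap approach can be sketched as follows. The example deliberately omits tree reconstruction itself, which would be done by whatever method produced the original tree. It assumes the alignment is a dictionary of equal-length strings and that each tree has already been summarized as a set of clades (each clade a frozenset of tip labels). These data structures and function names are illustrative assumptions.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample whole alignment columns with replacement (one bootstrap sample)."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    # The same column indices are applied to every sequence, so each sampled
    # position keeps its homologous characters together.
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}

def node_support(original_clades, replicate_clade_sets):
    """Fraction of replicate trees that contain each clade of the original tree."""
    n_replicates = len(replicate_clade_sets)
    return {clade: sum(clade in rep for rep in replicate_clade_sets) / n_replicates
            for clade in original_clades}

alignment = {"host1": "ACGTACGT", "host2": "ACGTACGA", "host3": "TCGAACGA"}
print(bootstrap_alignment(alignment, random.Random(0)))

# Hypothetical clade sets extracted from four replicate trees.
clade = frozenset(["host1", "host2"])
replicates = [{clade}, {frozenset(["host2", "host3"])}, {clade}, {clade}]
print(node_support([clade], replicates))
```

Because support is a proportion of replicates, its resolution is limited to one over the number of replicates, which is why small numbers of bootstrap samples cannot yield finely graded support values.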

#### Background sequences

The interpretation of monophyletic and paraphyletic clades is contingent on whether a sufficient number of infections have been sampled from the host population. Sequences from one host can only become paraphyletic relative to sequences from a second host if the tree contains additional sequences from at least one other host in the population. As noted above, there may be unsampled host individuals in a "transmission chain" connecting the putative source to the recipient host (Figure 2B). The incorporation of background sequences from additional hosts in the population is similar to the problem of rooting a phylogeny using an outgroup, where the root represents the earliest point in time in the tree. The location of this "root" in the section of the tree relating the sequences from the two hosts determines which host is interpreted to be the potential source.

There are no formal guidelines for selecting background sequences. Typically, one incorporates sequences that were collected in the same geographic region as the two hosts under investigation. These local sequences are sometimes augmented with additional sequences that are retrieved from public databases based on their genetic similarity (e.g., BLAST), which were not necessarily collected from the same region. Generally, the background data comprise consensus (bulk) sequences where each host is represented by a single sequence, unlike the putative source and recipient hosts from whom multiple clonal sequences have been sampled. Because clonal sequencing is more labor-intensive, such data are usually not available to use as background sequences. The incorporation of different types of sequences (clonal and bulk) into the same phylogeny may bias the interpretation of results, because it is not possible for sequences to be nested within the consensus sequence from a single background host.

### Phylodynamic methods

Phylodynamics is a subdiscipline of molecular epidemiology and phylogenetics that concerns the reconstruction of epidemiological processes, such as the rapid expansion of an epidemic or the emergence of herd immunity in the host population, from the shape of the phylogenetic tree relating infections sampled from the population[33]. More generally, phylodynamics uses tree shape as the primary data source to parameterize models representing the biological processes that influenced the evolutionary relationships among the observed infections. This should not be confused with fitting models of evolution (such as a nucleotide substitution model or molecular clock model) to reconstruct the shape of the tree from the observed characteristics of related populations (infections), which originates from the field of phylogenetics. The relatively rapid evolution of viruses and bacteria makes it feasible to reconstruct the recent dynamics of an epidemic from the shape of the phylogeny reconstructed from infections sampled in the present.

Fig 3. Components of a phylodynamic source attribution analysis. This diagram summarizes the structure of a phylodynamic analysis as a hierarchical Bayesian model. Rectangular nodes represent data sources (sequence alignment, sample collection times), and nodes with rounded corners represent parameter estimates (fitted models) that can in turn be used as data inputs for a subsequent model. Each arrow represents a model inference step that generates samples from the posterior distribution defined by the data and the model. The model associated with each step is represented by a circular node; each model comprises a number of prior beliefs. This schematic displays multiple model nodes to emphasize the existence of other models with different assumptions and prior beliefs that are not necessarily evaluated on the data.

In the context of source attribution, phylodynamic methods predominantly use Bayesian inference to sample transmission trees from the posterior distribution, where the transmission tree is an explicit model of "who infected whom". Because this tree cannot be directly observed, phylodynamic methods for source attribution attempt to reconstruct the transmission tree from its residual effect on the shape of the phylogenetic tree. Although there are established methods for reconstructing phylogenetic trees from the genetic divergence among pathogen populations sampled from different host individuals, there are several reasons why the phylogeny may be a poor approximation of the transmission tree (Figure 2). Phylodynamic methods attempt to reconcile the discordance between the phylogeny and the transmission tree by modeling one or more of the processes responsible for this discordance, and fitting these models to the data (Figure 3). Although these methods can estimate the probability of a direct transmission from one individual to another, this probability is conditional on how well the model (selected from a number of possible models) approximates reality.

#### Phylogenetic uncertainty

Phylodynamic methods may require the simplifying assumption that the phylogenetic tree reconstructed from the data is the "true" tree - that is, an accurate representation of the common ancestry relating the sampled infections. For instance, some phylodynamic methods take a single time-scaled phylogeny as a required input. On the other hand, if the phylogeny is handled as an uncertain estimate derived from the data (including the sequence alignment), then the phylodynamic analysis becomes a hierarchical model in which the problem of phylogenetic reconstruction is nested within the problem of estimating the other model parameters that are conditional on the phylogeny (Figure 3). Sampling both the phylogeny and the transmission tree from their joint posterior distribution should confer more accurate parameter estimates, but the greatly expanded model space also makes convergence more challenging to attain. Such hierarchical methods are often implemented in the software package BEAST2[34] (Bayesian Evolutionary Analysis by Sampling Trees), which provides generic routines for Markov chain Monte Carlo sampling from tree space, and calculating the likelihood of a time-scaled phylogenetic tree given sequence data and sample collection dates.

The limits to reconstructing an accurate phylogeny from a sequence alignment are not the only source of phylogenetic uncertainty. Although alignments are often treated as observed data known without ambiguity, the process of aligning sequences is also uncertain and can become more difficult with the rapid accumulation of sequence insertions and deletions among diverging pathogen lineages. While there are Bayesian methods that address uncertainty in alignment by joint sampling of the alignment along with the phylogeny[35], this approach is computationally complex and is seldom used in the context of source attribution. Furthermore, sequences are themselves uncertain estimates of the genetic composition of individual pathogens or infecting populations. Next-generation sequencing technologies tend to have substantially higher error rates than conventional Sanger sequencing[36], and analysis pipelines must be carefully validated to reduce the effects of sample cross-contamination and adapter contamination.

#### Demographic and transmission models

Some phylodynamic methods make the simplifying assumption that every infection in the epidemic is represented by at least one genetic sequence in the data set[37][38][39] (complete sampling). Although complete sampling may be feasible in circumstances such as an outbreak of disease transmission among farms in a defined geographic region[40], it is generally not possible to rule out unsampled sources in other contexts. This is especially true for infectious diseases that are stigmatized and/or associated with marginalized populations[41], that have a long asymptomatic period[42], or in the context of a generalized epidemic where disease prevalence may substantially exceed the local capacity for sample collection and genetic sequencing.

Several methods attempt to address the presence of unsampled hosts by modeling the growth of the epidemic over time, which predicts the total number of infected hosts at any given time. Put another way, the probability that an infection was transmitted from an unsampled source is determined in part by the total size of the infected population at the time of transmission. These models of epidemic growth are sometimes referred to as demographic models because some are derived from population growth models such as the exponential and logistic growth models. Alternatively, the number of infections can be modeled by a compartmental model that describes the rate at which individual hosts switch from susceptible to infected states, and can be extended to incorporate additional states such as recovery from infection or different stages of infection[43][38]. An important distinction between population growth and compartmental models is that the number of uninfected susceptible hosts is tracked explicitly in the latter.
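The difference between these model classes can be made concrete with a short sketch. It compares closed-form exponential and logistic growth curves with a basic susceptible-infected-recovered (SIR) compartmental model integrated by the forward Euler method. The parameter names (`beta` for the transmission rate, `gamma` for the recovery rate), the step size and the numerical values are assumptions chosen only for illustration.

```python
import math

def exponential_growth(i0, rate, t):
    """Number infected at time t under unbounded exponential growth."""
    return i0 * math.exp(rate * t)

def logistic_growth(i0, rate, capacity, t):
    """Number infected at time t under logistic growth toward a capacity."""
    return capacity / (1.0 + (capacity - i0) / i0 * math.exp(-rate * t))

def sir_trajectory(s0, i0, beta, gamma, t_end, dt=0.01):
    """Forward-Euler integration of a basic SIR model; returns (t, S, I, R) rows."""
    s, i, r = float(s0), float(i0), 0.0
    n = s + i
    path = [(0.0, s, i, r)]
    for step in range(1, int(round(t_end / dt)) + 1):
        new_infections = beta * s * i / n * dt   # susceptible -> infected
        new_recoveries = gamma * i * dt          # infected -> recovered
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        path.append((step * dt, s, i, r))
    return path

print(exponential_growth(1, 0.5, 10))
print(logistic_growth(1, 0.5, 1000, 10))
t, s, i, r = sir_trajectory(999, 1, beta=0.5, gamma=0.1, t_end=10)[-1]
print(t, s, i, r)   # the susceptible count is tracked explicitly
```

Only the compartmental model carries the number of susceptible hosts as a state variable; the two growth curves describe the infected population alone.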

A phylodynamic analysis attempts to parameterize the growth model from the phylogeny, which is either used as a direct proxy of the transmission tree, or interpreted through a population genetic model to account for diversity within hosts (Figure 3). Bayesian methods make it feasible to supplement this task with other data sources, such as the reported case incidence and/or prevalence over time[44]. The transmission process can be mapped to the size of the infected population using either a coalescent model or a forward-time model such as a birth-death or branching process. A coalescent model predicts the amount of time one would need to trace a random sample of infections back in time to reach their common ancestor[45]. The amount of time between the sampled infections and their common ancestor increases with the size of the infected population, just as one is less likely to share a great-grandparent in common with a stranger in a large city than in a small, remote village. Although the coalescent model was originally formulated to describe the ancestry of individuals within a single population, it was recently extended to compartmental models of the spread of infection[46].

Birth-death models describe the proliferation of infections forward in time, where a "birth" event represents the transmission of an infection to an uninfected susceptible host, and a "death" event can represent either the diagnosis and treatment of an infection, or its spontaneous clearance by the host[47]. This class of models was originally formulated to describe the proliferation of species through speciation and extinction [48]. Similarly, branching processes model the growth of an epidemic forward in time where the number of transmissions from each infected host ("offspring") is described by a discrete probability distribution over non-negative integers, such as the negative binomial distribution[49]. Branching process models tend to use the simplifying assumption that this offspring distribution remains constant over time, making this class of models more appropriate for the initial stage of an epidemic where most of the population is uninfected.
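A branching process with a negative binomial offspring distribution can be simulated in a few lines. The sketch below draws the negative binomial as a gamma mixture of Poisson distributions, using only the Python standard library. The mean number of transmissions per host (`r_mean`), the dispersion parameter and the function names are illustrative assumptions, not values or interfaces taken from the cited methods.

```python
import math
import random

def poisson_draw(mean, rng):
    """Poisson draw by Knuth's multiplication method (adequate for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def offspring_draw(r_mean, dispersion, rng):
    """Negative binomial number of transmissions from one infected host."""
    # Each host gets its own transmission rate from a gamma distribution with
    # mean r_mean; a smaller dispersion means more variation among hosts.
    individual_rate = rng.gammavariate(dispersion, r_mean / dispersion)
    return poisson_draw(individual_rate, rng)

def simulate_outbreak(r_mean, dispersion, generations, rng):
    """Number of infections in each generation, starting from a single case."""
    sizes = [1]
    for _ in range(generations):
        current = sum(offspring_draw(r_mean, dispersion, rng)
                      for _ in range(sizes[-1]))
        sizes.append(current)
        if current == 0:   # the outbreak has died out
            break
    return sizes

print(simulate_outbreak(1.5, 0.5, 8, random.Random(42)))
```

The offspring distribution is held constant across generations, matching the simplifying assumption described above, so the simulation does not account for depletion of susceptible hosts.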

#### Within-host diversity

As noted above, the diversification of pathogen populations within each host results in a discordance between the shapes of the pathogen phylogeny and the transmission tree. Phylodynamic methods that treat the phylogeny as equivalent to the transmission tree assume implicitly that the population within each host is small enough to be approximated by a single lineage[40][50][51]. If the within-host population is diverse, then a transmission event will tend to underestimate the time since two lineages split from their common ancestor (Figure 2A); this phenomenon is analogous to the incomplete lineage sorting affecting gene trees relative to the species tree[52]. The resulting discordance between the phylogenetic and transmission trees makes it more difficult to reconstruct the latter from the observed data. Moreover, the effect of within-host diversity becomes even greater if there are incomplete transmission bottlenecks — where a new infection is established by more than one lineage transmitted from the source population — because the common ancestor of pathogen lineages may be located in previous hosts further back in time[53].

Accounting for within-host evolution requires a population genetic model that describes the common ancestry of lineages in a population. The coalescent model is a common choice for this task. This requires making assumptions about how the pathogen population within a host varies in size over time, since the coalescent rate (i.e., the rate at which lineages converge to common ancestors going back in time) is inversely proportional to population size. It is conceivable that a within-host growth model could be parameterized from longitudinal measures of sequence variation or viral load; however, this approach is seldom used in the literature. We note that the coalescent model appears twice in our summary of phylodynamic methods to represent processes at different levels: the transmission of infection among hosts, and the diversification of pathogens within a host. This parallel has been explored by phylodynamic studies using the structured coalescent, where the population comprises two or more subpopulations (demes)[53]. Each deme represents an infected host individual. Due to limited migration of pathogen lineages between demes, two pathogen lineages sampled at random are more likely to share a recent common ancestor if they belong to the same deme. One of the limitations of this model is that the size of the transmission bottleneck (number of lineages traced back to the source deme) is a random outcome, making it difficult to assert that bottlenecks tend to limit the transmitted viruses to a single variant that starts the next infection.

## Controversies

Source attribution is an inherently controversial application of molecular epidemiology because it identifies a specific population or individual as being responsible for the onward transmission of an infectious disease. Because source attribution increasingly requires the specialized and computationally-intensive analysis of complex data, the underlying model assumptions and level of uncertainty in these analyses are often not accessible to knowledge users and stakeholders, including legal professionals and community advocates.

### Molecular forensics and HIV-1 transmission

Outside of a public health context, source attribution has significant legal and ethical implications for people living with HIV, who may be prosecuted for transmitting their infection to another person. The transmission of HIV-1 without disclosing one's infection status is a criminally prosecutable offense in many countries[54], including the United States. For example, defendants in HIV transmission cases in Canada have been charged with aggravated sexual assault, with a "maximum penalty of life imprisonment and mandatory lifetime registration as a sex offender"[55]. Source attribution methods have been utilized as forensic evidence in such criminal cases.

#### Forensic applications of phylogenetic clustering

One of the earliest and best-known examples of an HIV-1 transmission case was the investigation of the so-called "Florida dentist", where an HIV-positive dentist was accused of transmitting his infection to a patient. Although phylogenetic clustering was applied to these data to demonstrate that HIV-1 particles sampled from the dentist were genetically similar to those sampled from the patient[56], clustering is not itself a source attribution method. Clustering methods can only provide evidence that infections are unlikely to be epidemiologically linked because they are too dissimilar relative to other infections in the population[57]. For example, similar phylogenetic methods were used in a subsequent case to demonstrate that the HIV-1 sequence obtained from the patient was far more similar to the sequence from their sexual partner than to the sequence from the dentist[58].

Clustering provides no information on the directionality of transmission (e.g., whether the infection was transmitted from individual A to individual B, or from B to A; Figure 2), nor can it rule out the possibility that one or more other unknown persons (from whom no virus sequences have been obtained) were involved in the transmission history. Despite these known limitations of clustering, statements on the genetic similarity of infections continue to appear in court cases[59]. On the other hand, clusters potentially identify groups of individuals exposed to higher rates of transmission. Consequently, public health applications of clustering that can prevent the onward transmission of HIV-1 are hampered by concerns among people living with HIV that this information might also expose them to criminal prosecution for transmission[60][61].

#### Forensic applications of paraphyly methods

Fig 4. Reproduction of phylogenetic tree from Metzker study. This unrooted tree was reconstructed by maximum likelihood from published HIV-1 RT sequences from Metzker et al.[62] and supplemented with additional sequences from Genbank. Tips representing sequences are coloured by source (see legend), and branches are coloured by bootstrap support (darker shades indicate higher support). The branch (labelled '*') separating sequences from both patients P and V from the "background" sequences, including the original LA (Louisiana) control sequences from the study, had a support of 95%. The branch (labelled '**') cited by the study as evidence that sequences from patient V were nested within a paraphyletic group of sequences from patient P had a support of 100%.

Source attribution methods based on paraphyly have been used in the prosecution of individuals for HIV-1 transmission. One of the earliest examples was published in 2002, where a physician was accused of intentionally injecting blood from one patient (P) who was HIV-1 positive into another patient (V) who had previously been in a relationship with the physician[62]. This study used maximum likelihood methods to reconstruct a phylogenetic tree relating HIV-1 sequences from both patients. Paraphyly of sequences from P implying either direct or indirect transmission to V was reported for the phylogeny reconstructed from RT sequences (Figure 4). However, a second tree reconstructed from the more diverse HIV-1 envelope (env) sequences from the same group was inconclusive on the direction of transmission - only that the env sequences from patients P and V clustered respectively into two monophyletic groups that were jointly distinct from the background.

The use of paraphyly for source attribution was stimulated with the onset of next-generation sequencing, which made it more cost-effective to rapidly sequence large numbers of individual viruses from multiple host individuals. More recent work[63] has also developed a formalized framework for interpreting the distribution of sequences in the phylogeny as being consistent with a direction of transmission. Several studies have since applied this framework to re-analyze or develop forensic evidence for HIV transmission cases in Serbia[64], Taiwan[65], China[66] and Portugal[67]. The growing number of such studies has led to controversy on the ethical and legal implications of this type of phylogenetic analysis for HIV-1[68].

The accuracy of classifying a group of sequences in a phylogeny into monophyletic or paraphyletic groups is highly contingent on the accuracy of tree reconstruction. As described above (see Paraphyly), our statistical confidence in a specific clade in the tree is quantified by the estimated probability that the same clade would be obtained if the tree reconstruction was repeated on an equivalent data set. This support value is not the probability that the clade appears in the "true" tree because this quantity is conditional on the data at hand - however, it is often misinterpreted this way[69]. If the branch separating a nested monophyletic clade of sequences from host A from the paraphyletic group of sequences from host B has a low support value, then the conventional procedure would be to remove that branch from the tree. This would have the result of collapsing the monophyletic and paraphyletic clades so that the tree is inconclusive about either direction of transmission. However, this procedure has not been consistently used in source attribution investigations. For example, the trees displayed in the 2020 study in Taiwan[65] do not support transmission from the defendant to the plaintiff when branches with low support (<80%) are collapsed. Moreover, the result can vary with the region of the virus genome targeted for sequencing[70].
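The effect of collapsing poorly supported branches can be shown on a toy tree. In this sketch each internal node is a `(support, children)` pair and each tip is a string; the tree, its support values and the 0.8 threshold are invented for illustration and are not taken from any of the cited studies.

```python
def collapse(node, threshold):
    """Dissolve internal branches whose support falls below the threshold."""
    if isinstance(node, str):
        return node
    support, children = node
    kept = []
    for child in children:
        child = collapse(child, threshold)
        if not isinstance(child, str) and child[0] < threshold:
            # Weak branch: attach its children directly to the current node,
            # which produces an unresolved polytomy.
            kept.extend(child[1])
        else:
            kept.append(child)
    return (support, kept)

# Sequences A1 and A2 are nested among B1 and B2, but on weakly supported branches.
tree = (1.0, [(1.0, ["bg1", "bg2"]),
              (0.95, ["B1", (0.6, ["B2", (0.55, ["A1", "A2"])])])])
print(collapse(tree, 0.8))
```

After collapsing, the sequences from both hosts descend from a single unresolved node, so the nesting that suggested a direction of transmission is no longer present in the tree.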

The use of paraphyly to infer the direction of transmission was recently evaluated on a prospective cohort of HIV serodiscordant couples (where one partner was HIV positive at the start of the study)[71]. Applying the paraphyly method to next-generation sequence data generated from samples obtained from 33 pairs where the HIV negative partner became infected over the course of the study, the authors found that the direction of transmission was incorrectly reconstructed in about 13% to 21% of cases, depending on which sequences were analyzed. However, a follow-up study involving many of the same authors[72] used a more comprehensive sequencing method to cover the full virus genome in depth from all host individuals, lowering the percentage of misclassified cases to 3.1%.

#### Forensic applications of phylodynamics

A common feature of both clustering and paraphyly methods is that neither approach explicitly tests the hypothesis that an infection was directly transmitted from a specific source population or individual to the recipient. Phylodynamic methods attempt to overcome the discordance between the pathogen phylogeny and the underlying transmission history by modeling the processes that contribute to this discordance, such as the evolution of pathogen populations within each host. The development of phylodynamic methods for source attribution has been a rapidly expanding area, with a large number of published studies and associated software released since 2014 (see Software). Because these methods have tended to be applied to other infectious diseases, including influenza A virus[73], foot-and-mouth disease virus[74] and Mycobacterium tuberculosis[38], they have so far avoided the ethical issues of stigma and criminalization associated with HIV-1. However, applications of phylodynamic source attribution to HIV-1 have begun to appear in the literature[75]. In that early study, the investigators attempted to use a phylodynamic method to reconstruct transmission events among patients receiving treatment at their clinic from HIV-1 sequence data. Although the method used in their study (TransPhylo[49]) attempts, by default, to estimate the proportion of infections that are unsampled, the investigators fixed this proportion at 1%. By so doing, their analysis carried the unrealistic assumption that nearly every person living with HIV-1 in their regional epidemic (comprising at least 1,800 people) was represented in their data set of 139 sequences.
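The scale of this mismatch can be checked with simple arithmetic. The sketch below uses only the figures quoted above (139 sequences, at least 1,800 people, 1% unsampled); the variable names are illustrative. It makes the simplifying assumption that the unsampled proportion applies to the whole regional epidemic and that each sequence represents one person.

```python
n_sequences = 139         # sequences in the data set (from the text)
n_epidemic_min = 1800     # lower bound on the size of the regional epidemic
assumed_unsampled = 0.01  # unsampled proportion fixed by the investigators

# Largest fraction of the epidemic that the data set could represent.
max_sampled = n_sequences / n_epidemic_min
min_unsampled = 1 - max_sampled

# Epidemic size implied by treating 99% of infections as sampled.
implied_epidemic_size = n_sequences / (1 - assumed_unsampled)

print(f"Sampled fraction is at most {max_sampled:.1%}")
print(f"Unsampled fraction is at least {min_unsampled:.1%}")
print(f"Implied epidemic size: about {implied_epidemic_size:.0f} people")
```

Under these assumptions, at most about 8% of the epidemic could have been sampled, whereas the fixed value implies an epidemic of only about 140 people.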

### 2010 cholera outbreak in Haiti

In the aftermath of a magnitude 7.0 earthquake that struck Haiti in 2010, there was a large-scale outbreak of cholera, a gastrointestinal infection caused by the bacterium Vibrio cholerae. Nearly 800,000 Haitians became infected and nearly 10,000 died in one of the most significant outbreaks of cholera in modern history. Initial microbial subtyping using pulsed-field gel electrophoresis indicated that the outbreak strain was most genetically similar to cholera strains sampled in South Asia[76]. To map the plausible source of infection more comprehensively, cholera strains from South Asia and South America were compared to the strains sampled from the Haitian outbreak. Whole genome sequences taken from cases in Haiti shared more sites in common with the sequences taken from South Asia (i.e., Nepal and Bangladesh) than with those from geographic areas closer to Haiti[77]. Direct comparisons were also made between the cholera strains taken from three Nepalese soldiers and three Haitian locals, which were nearly identical in genome sequence, forming a phylogenetic cluster[78]. Based on the evidence gathered by phylogenetic source attribution studies, the role of Nepalese soldiers who were part of the United Nations Stabilization Mission in Haiti (MINUSTAH) in this outbreak was officially recognized by the United Nations in 2016[79].
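The kind of comparison described above, counting the sites that an outbreak sequence shares with sequences from candidate source regions, can be sketched as follows. The sequences and region names are invented toy data, not real Vibrio cholerae genomes, and the function name `shared_sites` is hypothetical. The published analyses used whole genomes and phylogenetic methods rather than this simple count.

```python
def shared_sites(seq_a: str, seq_b: str) -> int:
    """Count aligned positions at which two sequences carry the same base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)


# Toy aligned sequences (hypothetical; for illustration only).
outbreak = "ACGTTGCAAC"
candidates = {
    "candidate_region_1": "ACGTTGCAAT",
    "candidate_region_2": "ACCTTGGAAT",
}

scores = {name: shared_sites(outbreak, seq) for name, seq in candidates.items()}
closest = max(scores, key=scores.get)

print(scores)   # {'candidate_region_1': 9, 'candidate_region_2': 7}
print(closest)  # candidate_region_1
```

A higher count indicates greater genetic similarity, but similarity alone does not establish the direction or route of transmission.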

### 2019/2020 novel coronavirus outbreak

In December 2019, an outbreak of 27 cases of viral pneumonia was reported in association with a seafood market in Wuhan, China. Known respiratory viruses including influenza A virus, respiratory syncytial virus and SARS coronavirus were soon ruled out by laboratory testing. On January 10, 2020, the genome sequence of the novel coronavirus, most closely related to bat SARS-coronaviruses, was released into the public domain. Despite unprecedented quarantine measures, the virus (eventually named SARS-CoV-2) spread to other countries including the United States, with the global total exceeding 97,000 confirmed cases as of March 5, 2020.

This outbreak spurred an unprecedented level of epidemiological and genomic data sharing and real-time analysis, the results of which were communicated through social media prior to peer review. Much of this knowledge translation was mediated through the open-source project Nextstrain[80], which performs phylogenetic analyses on pathogen sequence data as they become available on public and access-restricted databases and uses the results to update web documents in real time. On March 4, 2020, Nextstrain developers released a phylogeny in which a SARS-CoV-2 sequence that was isolated from a German patient occupied an ancestral position relative to a monophyletic clade of sequences sampled from Europe and Mexico. Users of the Twitter social media platform soon commented on the related post from Nextstrain that onward transmission from the German individual seemed to have "led directly to some fraction of the widespread outbreak circulating in Europe today"[81]. These comments were soon followed by criticism from other users that attributing the outbreak in Europe to the German patient as the source individual was drawing conclusions "about directionality of transmission from a sparsely and heterogeneously sampled tree"[82]. In other words, the tree was reconstructed from a highly incomplete sample of cases from the ongoing outbreak, and the addition of other sequences had a substantial probability of modifying the inferred relationship between the German sequence and the clade in question. Nevertheless, the interpretation attributing the European outbreak to a German source propagated through social media, causing some users to call on Germany to apologize[83].

## Software

Numerous computational tools for source attribution have been published, particularly for phylodynamic methods. Table 1 provides a non-exhaustive listing of some of the software in the public domain. Several of these programs are implemented within the Bayesian software package BEAST[34], including SCOTTI, BadTrIP, and beastlier. This listing does not include clustering methods, which are not designed for the purpose of source attribution but may be used to develop microbial subtype definitions. Clustering methods have previously been reviewed in the molecular epidemiology literature[84][8].

Table 1. Summary of source attribution software packages in the public domain.

| Name | Ref | Method | Input | Software license |
|---|---|---|---|---|
| outbreaker2 | [85] | Phylodynamic, Bayesian | Sampling dates, contacts, genetic sequences | Contributor Code of Conduct |
| SCOTTI | [53] | Phylodynamic, Bayesian (MCMC, structured coalescent) | Sampling dates, genetic sequences | GNU General Public License v3.0 |
| seqTrack | [73] | Genetic distance clustering (directed graph, maximum parsimony) | Collection dates, genetic distances | GNU General Public License v3.0 |
| sourceR | [86] | Hald model, Bayesian MCMC | Sampling dates, sampling locations, genetic sequences | GNU General Public License v3.0 |
| TransPhylo | [49] | Phylodynamic, Bayesian (MCMC, branching process) | Sampling dates, genetic sequences | GNU General Public License v2.0 |
| PhyloScanner | [25] | Ancestral reconstruction (maximum parsimony) | Short read alignment (BAM format) | GNU General Public License v3.0 |
| QUENTIN | [87] | Phylodynamic, Bayesian (MCMC) | Genetic sequences | GNU General Public License v3.0 |
| genPomp | [88] | Phylodynamic, Bayesian (sequential Monte Carlo) | Sampling dates, genetic sequences | GNU General Public License v3.0 |
| BadTrIP | [89] | Phylodynamic, Bayesian (MCMC) | Sampling dates, genetic sequences, host infectious interval | GNU General Public License v3.0 |
| phybreak | [90] | Phylodynamic, Bayesian (MCMC) | Sampling dates, genetic sequences | GNU General Public License v3.0 |
| TransPairs | [91] | Phylodynamic, maximum-likelihood (optimum branching) | Phylogeny | GNU General Public License v3.0 |
| beastlier | [92] | Phylodynamic, Bayesian | Sequence alignment, non-infectious dates, sample collection dates | None specified |

## References

1. ^ Barco L, Barrucci F, Olsen JE & Ricci A (2013) Salmonella source attribution based on microbial subtyping Int. J. Food Microbiol. 163:193-203 [PMID: 23562696][DOI]
2. ^ Urwin R & Maiden MC (2003) Multi-locus sequence typing: a tool for global epidemiology Trends Microbiol. 11:479-87 [PMID: 14557031][DOI]
3. ^ Fuller CW, Middendorf LR, Benner SA, Church GM, Harris T, Huang X, Jovanovich SB, Nelson JR, Schloss JA, Schwartz DC & Vezenov DV (2009) The challenges of sequencing by synthesis Nat. Biotechnol. 27:1013-23 [PMID: 19898456][DOI]
4. ^ Kisand V & Lettieri T (2013) Genome sequencing of bacteria: sequencing, de novo assembly and rapid analysis using open source tools BMC Genomics 14:211 [PMID: 23547799][DOI]
5. ^ De Smet B, Sarovich DS, Price EP, Mayo M, Theobald V, Kham C, Heng S, Thong P, Holden MT, Parkhill J, Peacock SJ, Spratt BG, Jacobs JA, Vandamme P & Currie BJ (2015) Whole-genome sequencing confirms that Burkholderia pseudomallei multilocus sequence types common to both Cambodia and Australia are due to homoplasy J. Clin. Microbiol. 53:323-6 [PMID: 25392354][DOI]
6. ^ a b Walker TM, Ip CL, Harrell RH, Evans JT, Kapatai G, Dedicoat MJ, Eyre DW, Wilson DJ, Hawkey PM, Crook DW, Parkhill J, Harris D, Walker AS, Bowden R, Monk P, Smith EG & Peto TE (2013) Whole-genome sequencing to delineate Mycobacterium tuberculosis outbreaks: a retrospective observational study Lancet Infect Dis 13:137-46 [PMID: 23158499][DOI]
7. ^ a b Gardy JL, Johnston JC, Ho Sui SJ, Cook VJ, Shah L, Brodkin E, Rempel S, Moore R, Zhao Y, Holt R, Varhol R, Birol I, Lem M, Sharma MK, Elwood K, Jones SJ, Brinkman FS, Brunham RC & Tang P (2011) Whole-genome sequencing and social-network analysis of a tuberculosis outbreak N. Engl. J. Med. 364:730-9 [PMID: 21345102][DOI]
8. ^ a b c Poon AF (2016) Impacts and shortcomings of genetic clustering methods for infectious disease outbreaks Virus Evol 2:vew031 [PMID: 28058111][DOI]
9. ^ Robertson DL, Anderson JP, Bradac JA, Carr JK, Foley B, Funkhouser RK, Gao F, Hahn BH, Kalish ML, Kuiken C, Learn GH, Leitner T, McCutchan F, Osmanov S, Peeters M, Pieniazek D, Salminen M, Sharp PM, Wolinsky S & Korber B (2000) HIV-1 nomenclature proposal Science 288:55-6 [PMID: 10766634][DOI]
10. ^ Leitner T, Escanilla D, Franzén C, Uhlén M & Albert J (1996) Accurate reconstruction of a known HIV-1 transmission history by phylogenetic tree analysis Proc. Natl. Acad. Sci. U.S.A. 93:10864-9 [PMID: 8855273][DOI]
11. ^ Volz EM, Koopman JS, Ward MJ, Brown AL & Frost SD (2012) Simple epidemiological dynamics explain phylogenetic clustering of HIV from patients with recent infection PLoS Comput. Biol. 8:e1002552 [PMID: 22761556][DOI]
12. ^ Van Pelt W (1999) Surveillance of salmonella: achievements and future directions Euro Surveill. 4:51 [PMID: 12631901][DOI]
13. ^ Hald T, Lo Fo Wong DM & Aarestrup FM (2007) The attribution of human infections with antimicrobial resistant Salmonella bacteria in Denmark to sources of animal origin Foodborne Pathog. Dis. 4:313-26 [PMID: 17883315][DOI]
14. ^ Wilson DJ, Gabriel E, Leatherbarrow AJ, Cheesbrough J, Gee S, Bolton E, Fox A, Fearnhead P, Hart CA & Diggle PJ (2008) Tracing the source of campylobacteriosis PLoS Genet. 4:e1000203 [PMID: 18818764][DOI]
15. ^ Yang Z & Rannala B (2012) Molecular phylogenetics: principles and practice Nat. Rev. Genet. 13:303-14 [PMID: 22456349][DOI]
16. ^ Ypma RJ, Bataille AM, Stegeman A, Koch G, Wallinga J & van Ballegooijen WM (2012) Unravelling transmission trees of infectious diseases by combining genetic and epidemiological data Proc. Biol. Sci. 279:444-50 [PMID: 21733899][DOI]
17. ^ Walker TM, Ip CL, Harrell RH, Evans JT, Kapatai G, Dedicoat MJ, Eyre DW, Wilson DJ, Hawkey PM, Crook DW, Parkhill J, Harris D, Walker AS, Bowden R, Monk P, Smith EG & Peto TE (2013) Whole-genome sequencing to delineate Mycobacterium tuberculosis outbreaks: a retrospective observational study Lancet Infect Dis 13:137-46 [PMID: 23158499][DOI]
18. ^ Guindon S & Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood Syst. Biol. 52:696-704 [PMID: 14530136][DOI]
19. ^ Thomas DL, Thio CL, Martin MP, Qi Y, Ge D, O'Huigin C, Kidd J, Kidd K, Khakoo SI, Alexander G, Goedert JJ, Kirk GD, Donfield SM, Rosen HR, Tobler LH, Busch MP, McHutchison JG, Goldstein DB & Carrington M (2009) Genetic variation in IL28B and spontaneous clearance of hepatitis C virus Nature 461:798-801 [PMID: 19759533][DOI]
20. ^ Bowen DG & Walker CM (2005) Adaptive immune responses in acute and chronic hepatitis C virus infection Nature 436:946-52 [PMID: 16107834][DOI]
21. ^ Cunningham EB, Applegate TL, Lloyd AR, Dore GJ & Grebely J (2015) Mixed HCV infection and reinfection in people who inject drugs--impact on therapy Nat Rev Gastroenterol Hepatol 12:218-30 [PMID: 25782091][DOI]
22. ^ Rose R, Lamers SL, Massaccesi G, Osburn W, Ray SC, Thomas DL, Cox AL & Laeyendecker O (2018) Complex patterns of Hepatitis-C virus longitudinal clustering in a high-risk population Infect. Genet. Evol. 58:77-82 [PMID: 29253674][DOI]
23. ^ Joy JB, Liang RH, McCloskey RM, Nguyen T & Poon AF (2016) Ancestral Reconstruction PLoS Comput. Biol. 12:e1004763 [PMID: 27404731][DOI]
24. ^ Svatun B, Saxton CA, Rölla G & van der Ouderaa F (1989) One-year study of the efficacy of a dentifrice containing zinc citrate and triclosan to maintain gingival health Scand J Dent Res 97:242-6 [PMID: 2740835][DOI]
25. ^ a b Wymant C, Hall M, Ratmann O, Bonsall D, Golubchik T, de Cesare M, Gall A, Cornelissen M & Fraser C (2018) PHYLOSCANNER: Inferring Transmission from Within- and Between-Host Pathogen Genetic Diversity Mol. Biol. Evol. 35:719-733 [PMID: 29186559][DOI]
26. ^ Faria NR, Rambaut A, Suchard MA, Baele G, Bedford T, Ward MJ, Tatem AJ, Sousa JD, Arinaminpathy N, Pépin J, Posada D, Peeters M, Pybus OG & Lemey P (2014) HIV epidemiology. The early spread and epidemic ignition of HIV-1 in human populations Science 346:56-61 [PMID: 25278604][DOI]
27. ^ Romero-Severson EO, Bulla I & Leitner T (2016) Phylogenetically resolving epidemiologic linkage Proc. Natl. Acad. Sci. U.S.A. 113:2690-5 [PMID: 26903617][DOI]
28. ^ Lemoine F, Domelevo Entfellner JB, Wilkinson E, Correia D, Dávila Felipe M, De Oliveira T & Gascuel O (2018) Renewing Felsenstein's phylogenetic bootstrap in the era of big data Nature 556:452-456 [PMID: 29670290][DOI]
29. ^ Felsenstein J (1985) CONFIDENCE LIMITS ON PHYLOGENIES: AN APPROACH USING THE BOOTSTRAP Evolution 39:783-791 [PMID: 28561359][DOI]
30. ^ Larget B & Simon DL (1999) Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees Mol. Biol. Evol. 16:750-9
31. ^ Whidden C & Matsen FA (2015) Quantifying MCMC exploration of phylogenetic tree space Syst. Biol. 64:472-91 [PMID: 25631175][DOI]
32. ^ Anisimova M & Gascuel O (2006) Approximate likelihood-ratio test for branches: A fast, accurate, and powerful alternative Syst. Biol. 55:539-52 [PMID: 16785212][DOI]
33. ^ Volz EM, Koelle K & Bedford T (2013) Viral phylodynamics PLoS Comput. Biol. 9:e1002947 [PMID: 23555203][DOI]
34. ^ a b Bouckaert R, Heled J, Kühnert D, Vaughan T, Wu CH, Xie D, Suchard MA, Rambaut A & Drummond AJ (2014) BEAST 2: a software platform for Bayesian evolutionary analysis PLoS Comput. Biol. 10:e1003537 [PMID: 24722319][DOI]
35. ^ Redelings BD & Suchard MA (2005) Joint Bayesian estimation of alignment and phylogeny Syst. Biol. 54:401-18 [PMID: 16012107][DOI]
36. ^ van Dijk EL, Jaszczyszyn Y, Naquin D & Thermes C (2018) The Third Revolution in Sequencing Technology Trends Genet. 34:666-681 [PMID: 29941292][DOI]
37. ^ Ypma RJ, van Ballegooijen WM & Wallinga J (2013) Relating phylogenetic trees to transmission trees of infectious disease outbreaks Genetics 195:1055-62 [PMID: 24037268][DOI]
38. ^ a b c Didelot X, Gardy J & Colijn C (2014) Bayesian inference of infectious disease transmission from whole-genome sequence data Mol. Biol. Evol. 31:1869-79 [PMID: 24714079][DOI]
39. ^ Worby CJ, O'Neill PD, Kypraios T, Robotham JV, De Angelis D, Cartwright EJ, Peacock SJ & Cooper BS (2016) Reconstructing transmission trees for communicable diseases using densely sampled genetic data Ann Appl Stat 10:395-417 [PMID: 27042253][DOI]
40. ^ a b Cottam EM, Thébaud G, Wadsworth J, Gloster J, Mansley L, Paton DJ, King DP & Haydon DT (2008) Integrating genetic and epidemiological data to determine transmission pathways of foot-and-mouth disease virus Proc. Biol. Sci. 275:887-95 [PMID: 18230598][DOI]
41. ^ Weiss MG, Ramakrishna J & Somma D (2006) Health-related stigma: rethinking concepts and interventions Psychol Health Med 11:277-87 [PMID: 17130065][DOI]
42. ^ Makar AB, McMartin KE, Palese M & Tephly TR (1975) Formate assay in body fluids: application in methanol poisoning Biochem Med 13:117-26 [PMID: 1][DOI]
43. ^ Volz EM & Frost SD (2013) Inferring the source of transmission with phylogenetic data PLoS Comput. Biol. 9:e1003397 [PMID: 24367249][DOI]
44. ^ Magiorkinis G, Sypsa V, Magiorkinis E, Paraskevis D, Katsoulidou A, Belshaw R, Fraser C, Pybus OG & Hatzakis A (2013) Integrating phylodynamics and epidemiology to estimate transmission diversity in viral epidemics PLoS Comput. Biol. 9:e1002876 [PMID: 23382662][DOI]
45. ^ Kingman JFC (1982) The coalescent Stochastic Processes and their Applications 13:235-48 https://www.sciencedirect.com/science/article/pii/0304414982900114
46. ^ Volz EM (2012) Complex population dynamics and the coalescent under neutrality Genetics 190:187-201 [PMID: 22042576][DOI]
47. ^ Stadler T (2009) On incomplete sampling under birth-death models and connections to the sampling-based coalescent J. Theor. Biol. 261:58-66 [PMID: 19631666][DOI]
48. ^ Nee S, May RM & Harvey PH (1994) The reconstructed evolutionary process Philos. Trans. R. Soc. Lond., B, Biol. Sci. 344:305-11 [PMID: 7938201][DOI]
49. ^ a b c Didelot X, Fraser C, Gardy J & Colijn C (2017) Genomic Infectious Disease Epidemiology in Partially Sampled and Ongoing Outbreaks Mol. Biol. Evol. 34:997-1007 [PMID: 28100788][DOI]
50. ^ Volz EM, Koelle K & Bedford T (2013) Viral phylodynamics PLoS Comput. Biol. 9:e1002947 [PMID: 23555203][DOI]
51. ^ Jombart T, Cori A, Didelot X, Cauchemez S, Fraser C & Ferguson N (2014) Bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data PLoS Comput. Biol. 10:e1003457 [PMID: 24465202][DOI]
52. ^ Maddison WP & Knowles LL (2006) Inferring phylogeny despite incomplete lineage sorting Syst. Biol. 55:21-30 [PMID: 16507521][DOI]
53. ^ a b c De Maio N, Wu CH & Wilson DJ (2016) SCOTTI: Efficient Reconstruction of Transmission within Outbreaks with the Structured Coalescent PLoS Comput. Biol. 12:e1005130 [PMID: 27681228][DOI]
54. ^ http://www.hivjustice.net/site/countries/
56. ^ Ou CY, Ciesielski CA, Myers G, Bandea CI, Luo CC, Korber BT, Mullins JI, Schochetman G, Berkelman RL & Economou AN (1992) Molecular epidemiology of HIV transmission in a dental practice Science 256:1165-71 [PMID: 1589796][DOI]
57. ^ Bernard EJ, Azad Y, Vandamme AM, Weait M & Geretti AM (2007) HIV forensics: pitfalls and acceptable standards in the use of phylogenetic analysis as evidence in criminal investigations of HIV transmission HIV Med. 8:382-7 [PMID: 17661846][DOI]
58. ^ Jaffe HW, McCurdy JM, Kalish ML, Liberti T, Metellus G, Bowman BH, Richards SB, Neasman AR & Witte JJ (1994) Lack of HIV transmission in the practice of a dentist with AIDS Ann. Intern. Med. 121:855-9 [PMID: 7978698][DOI]
59. ^ R. v. Ngeruka, 2015 YKTC 10
60. ^ Schairer C, Mehta SR, Vinterbo SA, Hoenigl M, Kalichman M & Little S (2017) Perceptions of molecular epidemiology studies of HIV among stakeholders J Public Health Res 6:992 [PMID: 29291190][DOI]
61. ^ McClelland A, Guta A & Gagnon M (2019) The rise of molecular HIV surveillance: implications on consent and criminalization Critical Public Health 1-7 https://doi.org/10.1080/09581596.2019.1582755
62. ^ a b Metzker ML, Mindell DP, Liu XM, Ptak RG, Gibbs RA & Hillis DM (2002) Molecular evidence of HIV-1 transmission in a criminal case Proc. Natl. Acad. Sci. U.S.A. 99:14292-7 [PMID: 12388776][DOI]
63. ^ Romero-Severson EO, Bulla I & Leitner T (2016) Phylogenetically resolving epidemiologic linkage Proc. Natl. Acad. Sci. U.S.A. 113:2690-5 [PMID: 26903617][DOI]
64. ^ Siljic M, Salemovic D, Cirkovic V, Pesic-Pavlovic I, Ranin J, Todorovic M, Nikolic S, Jevtovic D & Stanojevic M (2017) Forensic application of phylogenetic analyses - Exploration of suspected HIV-1 transmission case Forensic Sci Int Genet 27:100-105 [PMID: 28024238][DOI]
65. ^ a b Li WY, Huang SW, Wang SF, Liu HF, Chou CH, Wu SJ, Huang HD, Lu PL, Fann CSJ, Chen M, Chen YH & Chen YA (2020) Source identification of HIV-1 transmission in three lawsuits Using Ultra-Deep pyrosequencing and phylogenetic analysis J Microbiol Immunol Infect [PMID: 32067946][DOI]
66. ^ Wu J, Hu Z, Yao H, Wang H, Lei Y, Zhong P, Feng Y, Xing H, Shen Y, Jin L, Liu A, Qin Y, Miao L, Su B, Zhang Y & Guo H (2019) The inference of HIV-1 transmission direction between HIV-1 positive couples based on the sequences of HIV-1 quasi-species BMC Infect. Dis. 19:566 [PMID: 31253127][DOI]
67. ^ Romero-Severson EO, Bulla I, Hengartner N, Bártolo I, Abecasis A, Azevedo-Pereira JM, Taveira N & Leitner T (2017) Donor-Recipient Identification in Para- and Poly-phyletic Trees Under Alternative HIV-1 Transmission Hypotheses Using Approximate Bayesian Computation Genetics 207:1089-1101 [PMID: 28912340][DOI]
68. ^ Taylor B & Sapién H (2020) Determining the direction of HIV transmission: Benefits and potential harms of taking phylogenetic analysis one step further Clin. Infect. Dis. https://doi.org/10.1093/cid/ciz1248
69. ^ Berry V & Gascuel O (1996) On the interpretation of bootstrap trees: appropriate threshold of clade selection and induced gain Mol. Biol. Evol. 13:999-1011 https://doi.org/10.1093/molbev/13.7.999
70. ^ Todesco E, Wirden M, Calin R, Simon A, Sayon S, Barin F, Katlama C, Calvez V, Marcelin AG & Hué S (2019) Caution is needed in interpreting HIV transmission chains by ultradeep sequencing AIDS 33:691-699 [PMID: 30585843][DOI]
71. ^ Rose R, Hall M, Redd AD, Lamers S, Barbier AE, Porcella SF, Hudelson SE, Piwowar-Manning E, McCauley M, Gamble T, Wilson EA, Kumwenda J, Hosseinipour MC, Hakim JG, Kumarasamy N, Chariyalertsak S, Pilotto JH, Grinsztejn B, Mills LA, Makhema J, Santos BR, Chen YQ, Quinn TC, Fraser C, Cohen MS, Eshleman SH & Laeyendecker O (2019) Phylogenetic Methods Inconsistently Predict the Direction of HIV Transmission Among Heterosexual Pairs in the HPTN 052 Cohort J. Infect. Dis. 220:1406-1413 [PMID: 30590741][DOI]
72. ^ Zhang Y, Wymant C, Laeyendecker O, Grabowski MK, Hall M, Hudelson S, Piwowar-Manning E, McCauley M, Gamble T, Hosseinipour MC, Kumarasamy N, Hakim JG, Kumwenda J, Mills LA, Santos BR, Grinsztejn B, Pilotto JH, Chariyalertsak S, Makhema J, Chen YQ, Cohen MS, Fraser C & Eshleman SH (2020) Evaluation of phylogenetic methods for inferring the direction of HIV transmission: HPTN 052 Clin. Infect. Dis. [PMID: 31922537][DOI]
73. ^ a b Jombart T, Eggo RM, Dodd PJ & Balloux F (2011) Reconstructing disease outbreaks from genetic data: a graph approach Heredity (Edinb) 106:383-90 [PMID: 20551981][DOI]
74. ^ Morelli MJ, Thébaud G, Chadœuf J, King DP, Haydon DT & Soubeyrand S (2012) A Bayesian inference framework to reconstruct transmission trees using epidemiological and genetic data PLoS Comput. Biol. 8:e1002768 [PMID: 23166481][DOI]
75. ^ Mak L, Perera D, Lang R, Kossinna P, He J, Gill MJ et al. (2020) Evaluation of A Phylogenetic Pipeline to Examine Transmission Networks in A Canadian HIV Cohort Microorganisms 8:196 https://doi.org/10.3390/microorganisms8020196
76. ^ https://www.cdc.gov/media/pressrel/2010/r101101.html
77. ^ Orata FD, Keim PS & Boucher Y (2014) The 2010 cholera outbreak in Haiti: how science solved a controversy PLoS Pathog. 10:e1003967 [PMID: 24699938][DOI]
78. ^ Hendriksen RS, Price LB, Schupp JM, Gillece JD, Kaas RS, Engelthaler DM, Bortolaia V, Pearson T, Waters AE, Upadhyay BP, Shrestha SD, Adhikari S, Shakya G, Keim PS & Aarestrup FM (2011) Population genetics of Vibrio cholerae from Nepal in 2010: evidence on the origin of the Haitian outbreak mBio 2:e00157-11 [PMID: 21862630][DOI]
79. ^ https://www.nytimes.com/2016/08/18/world/americas/united-nations-haiti-cholera.html
80. ^ Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C, Sagulenko P, Bedford T & Neher RA (2018) Nextstrain: real-time tracking of pathogen evolution Bioinformatics 34:4121-4123 [PMID: 29790939][DOI]
83. ^ Kai Kupferschmidt (2020) Mutations can reveal how the coronavirus moves—but they’re easy to overinterpret. Science Magazine, doi:10.1126/science.abb6526
84. ^ Hassan AS, Pybus OG, Sanders EJ, Albert J & Esbjörnsson J (2017) Defining HIV-1 transmission clusters based on sequence data AIDS 31:1211-1222 [PMID: 28353537][DOI]
85. ^ Campbell F, Didelot X, Fitzjohn R, Ferguson N, Cori A & Jombart T (2018) outbreaker2: a modular platform for outbreak reconstruction BMC Bioinformatics 19:363 [PMID: 30343663][DOI]
86. ^ Miller P, Marshall J, French N & Jewell C (2017) sourceR: Classification and source attribution of infectious agents among heterogeneous populations PLoS Comput. Biol. 13:e1005564 [PMID: 28558033][DOI]
87. ^ Skums P, Zelikovsky A, Singh R, Gussler W, Dimitrova Z, Knyazev S, Mandric I, Ramachandran S, Campo D, Jha D, Bunimovich L, Costenbader E, Sexton C, O'Connor S, Xia GL & Khudyakov Y (2018) QUENTIN: reconstruction of disease transmissions from viral quasispecies genomic data Bioinformatics 34:163-170 [PMID: 29304222][DOI]
88. ^ Smith RA, Ionides EL & King AA (2017) Infectious Disease Dynamics Inferred from Genetic Data via Sequential Monte Carlo Mol. Biol. Evol. 34:2065-2084 [PMID: 28402447][DOI]
89. ^ De Maio N, Worby CJ, Wilson DJ & Stoesser N (2018) Bayesian reconstruction of transmission within outbreaks using genomic variants PLoS Comput. Biol. 14:e1006117 [PMID: 29668677][DOI]
90. ^ Klinkenberg D, Backer JA, Didelot X, Colijn C & Wallinga J (2017) Simultaneous inference of phylogenetic and transmission trees in infectious disease outbreaks PLoS Comput. Biol. 13:e1005495 [PMID: 28545083][DOI]
91. ^ Eldholm V, Rieux A, Monteserin J, Lopez JM, Palmero D, Lopez B, Ritacco V, Didelot X & Balloux F (2016) Impact of HIV co-infection on the evolution and transmission of multidrug-resistant tuberculosis Elife 5: [PMID: 27502557][DOI]
92. ^ Hall M, Woolhouse M & Rambaut A (2015) Epidemic Reconstruction in a Phylogenetics Framework: Transmission Trees as Partitions of the Node Set PLoS Comput. Biol. 11:e1004613 [PMID: 26717515][DOI]