3.1. Motivating example: visualizing viral genomes

Figure: Helpful map of ML by scikit-learn (Source)

\(\bowtie\)

We consider an application of dimensionality reduction in biology: single-nucleotide polymorphism (SNP) data from viruses. First, a little background. From Wikipedia:

A single-nucleotide polymorphism (SNP) is a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present at a level of more than 1% in the population. For example, at a specific base position in the human genome, the C nucleotide may appear in most individuals, but in a minority of individuals, the position is occupied by an A. This means that there is a SNP at this specific position, and the two possible nucleotide variations – C or A – are said to be the alleles for this specific position.
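To make the definition concrete, here is a minimal sketch that flags SNP positions in a handful of aligned sequences. The sequences and the helper `snp_positions` are made up for illustration; the only rule taken from the quote is that each variant must be present in more than 1% of the population.

```python
from collections import Counter

# Hypothetical aligned sequences (made up for illustration), one per individual.
sequences = ["ACGT", "ACGT", "AAGT", "ACGT", "AAGT"]


def snp_positions(seqs, min_freq=0.01):
    """Return the positions where at least two nucleotides each exceed min_freq."""
    n = len(seqs)
    positions = []
    for i in range(len(seqs[0])):
        # Count how often each nucleotide appears at position i.
        counts = Counter(s[i] for s in seqs)
        # Keep only the variants that are frequent enough to count as alleles.
        common = [base for base, c in counts.items() if c / n > min_freq]
        if len(common) >= 2:
            positions.append(i)
    return positions


print(snp_positions(sequences))  # [1]
```

In this toy example only position 1 varies (C in three sequences, A in two), so it is the only SNP, and its alleles are C and A.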

Quoting Jombart et al., BMC Genetics (2010), we analyze:

the population structure of seasonal influenza A/H3N2 viruses using hemagglutinin (HA) sequences. Changes in the HA gene are largely responsible for immune escape of the virus (antigenic shift), and allow seasonal influenza to persist by mounting yearly epidemics peaking in winter. These genetic changes also force influenza vaccines to be updated on a yearly basis. […] Assessing the genetic evolution of a pathogen through successive epidemics is of considerable epidemiological interest. In the case of seasonal influenza, we would like to ascertain how genetic changes accumulate among strains from one winter epidemic to the next.

Some details about the Jombart et al. dataset:

For this purpose, we retrieved all sequences of H3N2 hemagglutinin (HA) collected between 2001 and 2007 available from Genbank. Only sequences for which a location (country) and a date (year and month) were available were retained, which allowed us to classify strains into yearly winter epidemics. Because of the temporal lag between influenza epidemics in the two hemispheres, and given the fact that most available sequences were sampled in the northern hemisphere, we restricted our analysis to strains from the northern hemisphere (latitudes above 23.4°north). The final dataset included 1903 strains characterized by 125 SNPs which resulted in a total of 334 alleles. All strains from 2001 to 2007 were classified into six winter epidemics (2001-2006). This was done by assigning all strains from the second half of the year with those from the first half of the following year. For example, the 2005 winter epidemic comprises all strains collected between the 1st of July 2005 and the 30th of June 2006.
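The rule used to group strains into winter epidemics can be sketched in a few lines. The function name `winter_epidemic` is our own choice, not something from the paper; the rule itself (July to December go with their own year, January to June with the previous year) is the one described in the quote.

```python
from datetime import date


def winter_epidemic(collected: date) -> int:
    """Return the year labeling the winter epidemic a sample belongs to.

    Samples from July to December are labeled by their own year; samples
    from January to June are grouped with the second half of the previous year.
    """
    return collected.year if collected.month >= 7 else collected.year - 1


# The example from the quote: both ends of the 2005 winter epidemic.
print(winter_epidemic(date(2005, 7, 1)))   # 2005
print(winter_epidemic(date(2006, 6, 30)))  # 2005
```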

We load a dataset containing a subset of the strains from the dataset described above.

import pandas as pd
df = pd.read_csv('h3n2-snp.csv')

The first five rows are the following.

df.head()

	strain	s6a	s17a	s39g	...	s978a	s979c	s980a
0	AB434107	1.0	1.0	1.0	...	1.0	1.0	1.0
1	AB434108	1.0	1.0	1.0	...	1.0	1.0	1.0
2	CY000113	1.0	1.0	1.0	...	1.0	1.0	1.0
3	CY000209	1.0	1.0	1.0	...	1.0	1.0	1.0
4	CY000217	1.0	1.0	1.0	...	1.0	1.0	1.0

5 rows × 318 columns

Overall it contains \(1642\) strains.

df.shape[0]

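Each row has \(318\) columns, but the first is the strain identifier rather than a measurement. The numerical data therefore lives in a \(317\)-dimensional space.

df.shape[1] - 1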
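Because the table mixes a strain label with numerical columns, it can help to separate the two before any numerical analysis. The sketch below does this on a small made-up table with the same layout (a `strain` column followed by numerical columns); the strain names and values are invented, and only the column name `strain` is taken from the output above.

```python
import pandas as pd

# Toy stand-in for the real table: one label column, then numerical columns.
# The strain names and values below are made up for illustration.
toy = pd.DataFrame({
    "strain": ["X0001", "X0002", "X0003"],
    "s6a": [1.0, 1.0, 0.0],
    "s17a": [1.0, 0.0, 0.0],
})

labels = toy["strain"]                            # identifiers, not features
features = toy.drop(columns="strain").to_numpy()  # numerical data matrix

n_strains, n_features = features.shape
print(n_strains, n_features)  # 3 2
```

Applied to `df`, the same two lines would give one row per strain and one column per numerical feature.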

Obviously, visualizing this data is not straightforward. How can we make sense of it? More specifically, how can we explore any underlying structure it might have? Quoting Wikipedia:

In statistics, exploratory data analysis (EDA) is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. […] Exploratory data analysis has been promoted by John Tukey since 1970 to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments.

In this chapter, we will encounter an important mathematical technique for dimension reduction, which allows us to explore this data, and find interesting structure in it, in just \(2\) dimensions rather than hundreds.
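To give a feel for what a reduction to \(2\) dimensions looks like, here is a rough sketch on synthetic data. It uses one common approach, projecting the centered data onto its top two right singular vectors (principal components analysis). That choice is an assumption made here for illustration, not necessarily the method developed in this chapter, and the data is randomly generated rather than the influenza dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 100 points in 50 dimensions, in two shifted groups.
# (Assumption: this is NOT the influenza dataset, only a toy with structure.)
n, d = 100, 50
X = rng.normal(size=(n, d))
X[: n // 2] += 3.0  # shift the first half so there is structure to find

# Center the columns, then take the SVD of the centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Coordinates of each point along the top two right singular vectors.
Z = Xc @ Vt[:2].T
print(Z.shape)  # (100, 2)
```

Each row of `Z` is a point that can be drawn in the plane. In this toy example, the two groups should separate along the first coordinate.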