Does Biodiversity Prevent Pandemics?

I recently read the surprisingly well-written paper Impacts of biodiversity on the emergence and transmission of infectious diseases by Keesing et al. Here are some quick notes and highlights.

First, the bottom line, which roughly “Yes, but it’s complicated” (when is it ever not 🙂

“In principle, loss of biodiversity could either increase or decrease disease transmission. However, mounting [empirical] evidence indicates that biodiversity loss frequently increases disease transmission. In contrast, areas of naturally high biodiversity may serve as a source pool for new pathogens.”

“The researchers don’t know why the effect occurs. But they speculate that species that are better at buffering disease transmission — for example because they have low rates of reproduction or invest heavily in immunity — …
tend to die out first when diversity declines, whereas species that have high rates of reproduction or invest less in immunity — and thus are more likely to be disease hosts — survive for longer.

The review looks at the question of how pathogen transmission is affected by biodiversity loss. Intuitively, “reducing biodiversity can increase disease transmission when the lost species are either not hosts for the pathogen or are suboptimal ones. For pathogens for which transmission is a function of host density, loss of diversity is most likely to increase transmission if the loss causes an increase in the density of competent hosts.”

So the question is whether there any correlation between suitability as pathogen hosts and which species tend to die out in an area first. Unfortunately, it seems that “resilience in the face of disturbances that cause biodiversity loss, such as habitat destruction and fragmentation, is facilitated by life-history features such as high reproductive output and intrinsic rates of increase. Vertebrates with these features tend to invest minimally in some aspects of adaptive immunity; we hypothesize that this may make them more competent hosts for pathogens and vectors.”

The same process seems to recur at the micro scale: – “Changes in the composition of microbiomes are frequently associated with infection and disease. For example, corals suffering from white plague disease have microbial communities distinctly different from those in healthy corals. In some of these examples, a rich microbial community appears to regulate the abundance of endemic microbial species that can become pathogenic when overly abundant. In other cases, high microbial species diversity can prevent colonization by invasive pathogenic species. For example, the more diverse the microbiome surrounding the roots of wheat plants, the more protected the plants were against invasion by the pathogenic bacterium Pseudomonas aeruginosa.” I wonder why this happens? Since here we’re not talking about pathogen hosts presumably the mechanisms are different. There could be crowding out effects if there are limited resources to be had, but then it’s just density that matters, not diversity.

On the other hand, biodiversity hotspots and humans really don’t mix well. “Indeed, almost half of the zoonotic diseases that have emerged in humans since 1940 resulted from changes in land use.”

I didn’t realise there’s another way overusing antibiotics is bad, but apart from straight up natural selection effects, “human use of antibiotics is thought to select for resistant microbes by eliminating the great diversity of non-resistant microbial strains and species that suppress resistant strains in the absence of antibiotics.”

Actually this quote doesn’t quite make sense to me – do they mean kill off other strains of the pathogen, or kill off other harmless bacteria and reduce diversity (ala the story that sanitizing toilet seats kills off the harmless skin bacteria letting them be colonized by far more dangerous fecal bacteria)?

The paper ends with three interesting recommendations:

First, potential emergence ‘hotspots’ could be predictable on the basis of land-use change and underlying biodiversity patterns; these areas should be targeted for surveillance of endemic wildlife pathogens that have the potential to jump host species40,51.

Second, preserving and protecting intact habitats in these hotspots provides a simple, direct way of reducing human–animal contact and reduces the likelihood of emergence of new pathogens, although methods for achieving reduced contact are not always straightforward51.

And third, to reduce the probability that pathogens become established and transmissible within a new host population once spillover occurs, the husbandry of high-density monocultures of domestic animals, particularly in areas at high risk of spillover, should be subject both to more intensive surveillance and to measures that reduce contact between wildlife and livestock.

The last rec seems particularly cost-effective, since the intersection of biodiversity hotspot and “factory farming site” is probably a lot smaller than either, and high-density highly-sterile monocultures are particularly good incubation sites for pathogens.

Notes on AlphaFold2

These are some cleaned-up notes based on Preetham Venkatesh’s talk to IISc Naturalists on AlphaFold2’s recent CASP win. For a basic summary of what happened, see Deepmind’s blog post AlphaFold: a solution to a 50-year-old grand challenge in biology, and for the details, Mohammed Al Quraishi’s AlphaFold2 @ CASP14: “It feels like one’s child has left home.”

What happened?

Proteins are complex biomolecules that are composed of a string of submolecules (called amino acids) connected to each other in a line. While it is easy nowadays to figure out what amino acids a protein is composed of, and in what order (its sequence), proteins don’t exist in a linear shape, but fold into weird and wonderful tangled-up structures that strongly influence the functions they have in a cell. Figuring out how they go from their original state to their final low-energy structure is the notorious protein folding problem. Instead of understanding the whole dynamics, you could also just try to predict what final structure a given sequence will form. This is very hard, since there are exponentially many possibilities, so just checking which one has the lowest energy doesn’t work. In practice, people try to “condense” many copies of a protein into a crystal – like regular array, which makes it easy to use X-ray diffraction to figure out the structure. This is considered the “true structure” though in reality the protein probably takes on a slightly different form while swimming around in the cytoplasm.

CASP is a biennial competition in protein structure prediction. The organizers pick a bunch of proteins that have just been successfully crystallized, and ask the experimentalists to keep their results under wraps for a while. Competitors submit predictions of the structure of the protein from its sequence, and these predictions are evaluated by comparing to the experimentally determined crystal structures once they are released. DeepMind took part in CASP in 2018, winning by a comfortable margin. This year, they returned and completely blew everyone else out of the water, with accuracy scores so high that for the first time, we can say the protein structure prediction problem is basically solved, instead of mostly unsolved.

There are tons and tons of caveats to this statement, with confusion added by DeepMind’s conflation of protein folding and protein structure prediction (to be fair, everyone does this). 

“Here’s what I think AF2 can do: reliably (>90% of the time) predict to reasonable accuracy (<3-4Å) the lowest energy structure of vanilla (no co-factors, no obligate oligomerization) single protein chains using a list of homologous protein sequences”

Here’s a more accurate summary from Quraishi

Nevertheless, this is a stunning scientific achievement, and a start (or continuation!) of a really exciting time for biology, and for applications of ML to the physical and life sciences.

Protein structure prediction pre-Alphafold2

CASP has two types of problems, free modelling and template modelling. A template is a sequence similar to your query sequence, which has a solved structure which you  can use as a template to tweak your model. Naturally, this makes template modelling much easier than free modelling, where you have nothing to go on. On a smaller scale, you can use the same trick, if subsequences of your protein sequence are known to fold into known structures, or standard patterns like alpha-helices and beta-sheets. The tricky part is combining the known pieces in the correct way. 

What else can you do?

An extremely cool idea that emerged in the lates 90s is using evolutionary data. If two residues (amino acids) are close together in the structure, then if one mutates, the other might be forced to mutate too. Reversing this, if you look at many sequences for the same protein, and notice that two residues always seem to evolve in tandem, that’s a pretty good hint that they’re connected somehow in the final structure.

The current (pre-AlphaFold2) approach was to use multiple sequence alignment (MSA), which essentially means lining up many sequences to extract pairs of co-evolving residues (as described above), which are then fed to a neural network which then predicts a distogram, which you can think of as an adjacency matrix keeping track of which pairs of atoms are close to each other. These pairwise distances are then used, together with templates and known pieces, to predict the final structure (in a manner I don’t completely understand).

AlphaFold2’s innovations

“A folded protein can be thought of as a “spatial graph”, where residues are the nodes and edges connect the residues in close proximity. This graph is important for understanding the physical interactions within proteins, as well as their evolutionary history. For the latest version of AlphaFold, used at CASP14, we created an attention-based neural network system, trained end-to-end, that attempts to interpret the structure of this graph, while reasoning over the implicit graph that it’s building. It uses evolutionarily related sequences, multiple sequence alignment (MSA), and a representation of amino acid residue pairs to refine this graph.”

Image from

One of AlphaFold2’s major innovations was to avoid summarizing the co-evolving pairs from the multiple sequence alignment, but to stick with the raw sequences and use attention (in the ML sense) to figure out which parts mattered. They then build a distogram iteratively, going back and forth between the distogram (a residue-residue matrix) and the MSAs (a sequence-residue matrix). After this, they use yet more transformers, (SE(3)-equivariant ones), and template structures, in another iterative process, to directly predict a final 3D structure.

It looks like one advantage of this is that you can focus on multiple residues at the same time instead of being stuck with pairs. Additionally, you might have very few sequences for the whole protein, but lots for a small part. There will also be lots of noise, making it unclear which sequences to use. Attention lets you have a neural network learn to deal with all of this and pick the right sequences at the right point instead of being forced to do it manually.

SE(3)-equivariant transformers are a new type of transformer architecture specifically designed for modelling 3d point clouds, which are equivariant (similar to invariant but with technical caveats) to permutations, translations and rotations. It’s fairly easy to make networks permutation and translation invariant (eg translation-invariance holds for CNNs), and they use spherical harmonic functions and lots of fancy stuff for the rotational invariance. Essentially, this allows you to output something that captures the important features of a molecular structure (where the atoms are vis-a-vis each other) and not unimportant global features like orientation.

This means that their model is far more end-to-end than previous work, which (I imagine) allows them to make full use of their massive compute budget. And massive it is. Apart from training, which took 128 TPUs running for several weeks, AlphaFold2 seems to require a shockingly large amount of time to actually predict a single protein structure. According to Demis Hassabis, depending on the protein, they used between 5 to 40 GPUs for hours to days. There are two sets of iteration within the process, one when building the distogram, and one when refining the 3D model with the SE(3)-equivariant transformers, so it’s not as simple as a classic forward pass, but it’s still quite unclear why on earth it takes this long.

There is not much more that is known about how the model actually works. Most of what I’ve said above is from people reading the entrails of the diagram above, since DeepMind hasn’t really been forthcoming on the details, unlike last CASP. This is weird and sad and I hope they release a preprint soon.


It seems like this is another instance of the Bitter Lesson – that at scale, search and learning simply beats all the fancy tricks an AI researcher can hard-code. Quraishi and others did try end-to-end models before, but for various reasons, including their lack of compute, it didn’t really work out until AlphaFold2. Preetham speculated that next CASP will have people compressing MSA data into some new kind of representation, possibly one amenable to large language models, since MSA are currently very computationally intensive to work with. 

All I can say is that I am very excited 🙂

blank space

blank space

Answers to some questions I asked Preetham after the talk

Is the de novo/MSA-available distinction related to the free/template distinction or is it totally orthogonal?
De novo vs MSA is availability of co-evolutionary information to predict contacts/constraints. Free/template refers to availability of a structural template, ie, a sequence similar to my query sequence has a solved structure which I can use as a template to build my model. MSA makes use of the nearly 250 million sequences we have in the sequence database to use evolutionary information to predict constraints for folding. Template makes use of the 150,000 structures we have in the pdb (of which the number of unique folds is far lesser) to build our 3D structure upon.

Does running the model require Protein Data Bank access or is the info encoded in the weights?
Since AF2 incorporates template information, it would require pdb access.

Why doesn’t the Protein Data Bank have mammalian proteins?
If I’m not wrong, the reason is that they’re generally harder to synthesize and purify at scale. Mammalian proteins have a lot more complexity (post-translational modifications) that make them very difficult to overexpress in bacterial systems (cheapest). You need fancy-ass cells to express them, and that’s pretty expensive. And of course, getting a protein to crystallize involves a lot of trial and error, so it’s even more expensive and difficult. –answered by Raj M.

Re hassabis saying 5-40gpus, hours to days, could this be a miscommunication and he was referring to the time required to generate final solutions to all the structures in casp14?
I doubt it was a miscommunication. He was specifically asked about inference time for a single protein, and there was extensive discussion on this in the CASP discord channels and even beyond that, and at no point did anyone from DeepMind correct this.

Why did Quraishi et al not manage to make end-to-end work? Lack of compute? Lack of Alpha’s other innovations?
Quraishi tried to completely do away with evolutionary information as much as possible. He only input the query sequence and a PSSM matrix, which is the minimum evolutionary information that can be used. I think this was a big reason his model failed. He also showed that while his predictions were good at a local level, they were completely off on a global level resulting in poor GDT scores.

You mentioned language models might get used next casp. How tho?
Currently using MSAs is extremely computational intensive. There has been quite interesting work on sequence representation now, including a paper from FAIR which came out yesterday, and I expect that we might see usage of these representations instead of raw sequences as inputs. To be fair, the DeepMind team was asked if they used sequence representations and they said it was one of the things they tried but it did not impact performance much. So it remains to be seen for now. Quraishi’s RGN model did see improvement by using sequence representation so that indicates that they might still be useful.

Generating SARS-CoV-2 Protease Inhibitors with Variational Autoencoders

These are some thoughts I wrote up as a retrospective on, a project I worked on a few months ago. In future posts I hope to go into more detail on the tricky details of setting up batch processing in pytorch, and also explain how exactly variational autoencoders work.

In the summer of 2020 I spent some time participating in the Government of India Drug Discovery Hackathon as part of a six member IISc team. One of the projects I worked on involved trying to find molecules that might mess up the functioning of the SARS-CoV-2 main protease, based on looking at data from known inhibitors of SARS-CoV-1 (aka the original SARS). The problem statement suggested using a variational autoencoder trained on SMILES strings. Let’s go into what that means.

SMILES is a standard format for representing complex organic molecules as strings of ASCII characters. Here’s an example string:


and what the molecule looks like:

The goal of this project was to explore the chemical space around molecules that inhibit the original SARS virus. Unfortunately, since molecules are fundamentally discrete objects, it is difficult to smoothly interpolate between them. One possible way around this is to use variational autoencoders.

Autoencoders are a type of generative machine learning model that can be used to generate new samples from a distribution defined by some training examples. A standard autoencoder simply compresses an input (as a vector) by encoding it into a lower dimensional space and then trying to decode this back to the original. The idea is that the information bottleneck created will force the autoencoder to learn a better representation (in the lower dimensional latent space) of only the important information in the data set. One can generate new samples by supplying random inputs to the decoder. 

A variational autoencoder (VAE) is a type of autoencoder that lets you explore the effects of small changes in the latent space, by encoding the data as the mean and variance of a multivariate normal distribution, which is used to generate samples with the aid of a pseudorandom number generator. (in other words, the randomness is disentangled from the rest of the encoding) Altering the mean and variance of the distribution allows you to explore a slightly different space of samples, in a way that is not possible with an standard autoencoder.

At a high level, the code is a pipeline that takes a set of training examples that are the SMILES representations of known SARS-CoV-1 inhibitors, and generates SMILES strings that represent new molecules drawn from the same distribution, which will hopefully have similar properties.

The largest conceptual hurdle in the project was figuring out how to apply a VAE to strings. Normally, especially with images, the standard procedure is to convert everything to a vector of real numbers (which fits easily into the VAE framework) using one-hot encoding. Unfortunately this would not work here, for two reasons. One was that different SMILES strings have different lengths, the second was that SMILES is designed in such a way that the order of characters is very important, and there are long-range dependencies between them. Only a tiny minority of character sequences form valid SMILES strings.

The solution, which was surprisingly simple in retrospect, was to separate the model into three pieces, first, an LSTM-based encoder that would embed the string into a vector space, then a VAE, and then another LSTM-based decoder which would decode the output of the VAE back into a SMILES string.

This worked quite well, with nearly all output being valid SMILES strings. The technical issue that arose was getting this multi-stage model to batch process multiple strings at the same time. For this, the pytorch tutorial on sequence-to-sequence natural language processing ended up being invaluable. I finally used almost exactly the same process as machine translation, except one level down, (so sentences became strings and word tokens became characters) and with a VAE stuck in the middle.

One major design decision I had to figure out was whether to use an attention mechanism as part of the decoder, as this might have made it a lot easier for the VAE to escape from the constraints of the SMILES format and pay more attention to molecular similarities.

Three considerations that influenced my decision were model complexity, data availability and compute constraints. I wasn’t confident I would be able to debug problems with the attention, and I thought it would be better to get a simpler version working sooner, due to time constraints. I also felt that adding attention would significantly increase the size of the model, which would severely impact the speed of my iteration cycle due to compute constraints, and that the amount of data ( less than a thousand samples) wasn’t enough to really see a gain from attention. I eventually decided not to use attention for these reasons.

Here are some of the molecules the VAE came up with. (Yes, there are only two distinct molecules in the picture). It was surprisingly hard to get the VAE to output samples that were distinct from the training data and still valid molecules, but I could usually get at least one new molecule from every 50 samples.

Looking back

In hindsight, it was quite difficult to evaluate the quality of the final generated samples. While it was of course easy to eyeball the molecules and check they made chemical sense and weren’t memorized from the distribution, the goal of the project was to find new inhibitors, and the only way we could evaluate that was by actually running molecular dynamics simulations and checking the energy of the binding between the molecule and the SARS-CoV-2 protease. Of course, this is computationally very intensive, but we did have the resources and expertise to do it. My mistake was delaying the project so long that there wasn’t time to iterate after actual evaluation of the molecules. I should have figured out a minimum viable model as quickly as possible so we could iron out the inevitable bugs and difficulties in the MD training process and have a truly end-to-end training pipeline. More importantly, I should have realised that the constraints affecting my teammates with simulation expertise (which I was perfectly aware of) would significantly influence my development timeline. In retrospect, I should have figured out early on when my teammates would be free to run simulations and adjusted my schedule in that light. All days are not equal. 

Another way I would do things differently is to spend a lot more time looking for more data and try out much larger models. Recent experiments with scaling and papers like double descent make it clear that bigger models simply work better, and at scale, the overfitting problem goes away. I might also try including attention, because it really seems large transformers can do anything.

There isn’t much data on SARS-CoV-inhibitors, so I would need some sort of fine tuning approach, where I initially train the LSTMs on ordinary drugs and then finetune the VAE on inhibitors. 

Finally, I wasted a lot of time confused over whether my ideas on the vague ideas on the LSTM+VAE approach made sense, without actually writing anything down. Perhaps the most important lesson I took from this project is that the fastest way to crystallise an idea is to write it down as code.


The ipython notebook is available at,

On the 2D Rotation Rounding Problem

Here’s a simple dynamical system. You start with a 2D integer vector v = (x,y) . Now you rotate this vector around the origin by a fixed angle. This gives you a vector Mv (where M is the rotation matrix). You now round this vector to the nearest integer lattice point, producing a new integer vector [Mv]. You can repeat this process of rotation and rounding to produce [M[Mv]], [M[M[Mv]]] and so on.
The 2D Rotation Rounding Problem is very easy to state: Does this sequence always end up repeating?

It doesn’t sound very hard, and if you change the question slightly, it becomes quite simple. If there was no rounding the answer is YES if the angle is a root of unity, and NO if it’s not (essentially by definition).

If we always rounded down (towards the origin), there’s only a finite number of integer vectors at the same or smaller distance from the origin and so by the pigeonhole principle , eventually you must repeat.

But by going to the nearest integer point we sometimes round up, and its juuust possible that – to quote Joel – “the planets align perfectly” and you keep getting to round up often enough to escape to infinity and beyond.

I tried testing a few examples.

Here’s an illustration of what happens with v = (83,143) and angle 36  \pi /47

Colors change from dark to light over time

It repeats at the 715th iteration.

In all the examples I tried, the sequence started to repeat within the first 1000 iterations. The authors of Reachability in dynamical systems with rounding never found a non-repeating sequence either, and they conjecture none exist.

The trouble is that large starting vectors are more likely to diverge to infinity (more options to avoid being pigeonholed), so running simulations for a finite number of starting points doesn’t confirm anything.

So the only way forward is to prove that all sequences repeat (or not). But how?

Points that get rounded up (yellow), down (red) or stay the same distance from the origin (blue).

How Come Deep Neural Networks Generalize?

This is a short review of the paper Understanding Deep Learning Requires Rethinking Generalization by Zhang et al, published in ICLR 2017, which I wrote as part of the application to the Google Research India AI Summer School.

The paper Understanding Deep Learning Requires Rethinking Generalization by Zhang et al, examines the question: Why do deep convolutional neural networks generalize well?

More precisely, models like AlexNet and its successors are able to achieve 70% test accuracy (90% considering top-5 guesses) on image classification using the CIFAR10 and ImageNet datasets.

In statistical learning theory, bounds on whether a hypothesis class can generalize to unseen data depend on whether it can shatter (informally, represent every possible input-output mapping) the training data. Intuitively, if the model can memorize the dataset it need not learn anything useful for unseen data. When dealing with highly expressive model classes, regularization methods (such as weight decay) are used to penalize expressivity and hopefully force the model to learn some kind of signal in the data, and not memorize.

Zhang et al show that this can’t be going on in deep conv nets, because these networks easily learn random labels to 100% train accuracy, even with regularization! They can indeed memorize random data, so regularisation is not doing its (theoretical) job of bounding expressivity. (One exception: AlexNet with both weight decay and data augmentation failed to converge on CIFAR10) Another point against the role of regularisation is that dropping it reduces test accuracy by only about 5%, so it’s not the main driver of generalisation ability.

Another hypothesis is that something about the structure of convolution neural networks makes it easy for them to learn images. (perhaps filters help pick out edges)

However, the authors show that conv nets easily learn images of Gaussian noise. It did take longer to learn random labels than random noise, which may indicate that some kind of forced declustering of natural clusters occurs.

The authors also find that regularisation seems to have benefits in optimization and prove a theorem showing that two layer ReLU networks can memorize n data points in d dimensions using 2n + d parameters. Note that the models used have about 1.5 million parameters, while ImageNet has 1.3 milion images (50000 for CIFAR10), so memorisation is certainly a plausible outcomes.

Thus we are still left with the question: How do we explain the impressive generalization ability of these models?

The authors analyse stochastic gradient descent in the context of linear models, and show that it finds the minimum-norm weights, which is promising, as it seems like SGD is implementing some kind of implicit l-2 regularisation. However, preprocesing the data leads to solutions with higher weight norms but better test accuracy.

Clearly, statistical learning theory cannot currently explain why deep (and large) models generalize well. So why do they?

One hypothesis (related to the manifold hypothesis) is that the test data simply isn’t that different from the training data, and these models are simply interpolating the test data based on the clusters they’ve learned. The models have enough parameters to disentangle the data manifolds of the various classes. (see Chris Olah’s post Of course, this still leaves us with the question of how these highly expressive models manage to find the correct manifold representation in the first place.

It’s also possible that there’s some overfitting going on. Madry et al’s recent paper From ImageNet to Image Classification: Contextualizing Progress on Benchmarks finds that roughly 20% of ImageNet images contain more than one object, making the correct annotation unclear. They note that for a third of all multi-object images the ImageNet label does not even match what new annotators deem to be the main object in the image. Yet, even in these cases, models still successfully predict the ImageNet label (instead of what humans consider to be the right label) for the image. This lends support to the theory that ImageNet images are very similar to each other. While considering top-5 accuracy helps alleviate issues with multi-object images, it leads us to overestimate accuracy on single-object images.

To conclude, it’s interesting that this paper was written in 2016, and as far as I can tell we still don’t really know why deep conv nets generalize! I’m pretty optimistic that rapid improvements in visualization methods will help us see what’s going on more clearly.

Qubits and the Kernel Trick: The power of working in a bigger space

I’ve been reading Mermin’s account of Deutsch’s Problem, and it’s given me a new insight into where the power of quantum computing comes from.

If you look at the original problem, you have a function f(x) from {0,1} to {0,1}. There are 4 such functions.

The four distinct functions from 0,1 to 0,1

What can you learn by calling the function once? If you call it with 0 as input, you separate 1,2 from 3,4. By calling f(1) you separate 1,3 from 2,4.

Can you separate 1,4 from 2,3? In other words, can you answer the question “Does f(0) equal f(1)?” with a single function call? It’s impossible: Classically! (Try it and see)

In classical computing you have some bits and the only thing you can do to them is flip em from 0 to 1 and back and permute them around.

Quantum bits have additional properties. They can be rotated! More precisely, if you consider 0 and 1 to be two orthonormal vectors in a 2-dimensional complex vector space, quantum bits can be any unit vector in that space. The idea of a complex linear combination (of the computational basis vectors 0 and 1) is something that makes no sense in classical terms. Yet we can do operations with those combinations, rotate back into one of the classical bits, and get out a classical answer. Deutsch’s algorithm (explained on Wikipedia) uses Hadamard gates, which let you apply the function f (once!) to the qubit represented by (\frac{|0\rangle}{\sqrt{2}} + \frac{|1\rangle}{\sqrt{2}}). After some more hadamard gates you get to measure a bit which is 1 if f(0) equals f(1) and 0 otherwise. Note that you do not get to know either f(0) or f(1), so you haven’t magically gained more information from your single function call, just different information which you could think of as “perpendicular” to the classical way of seeing things.

I just realised that there is something which forms a useful analogy for the power of quantum computing: the kernel trick in SVMs!

Support vector machines work by taking dot products of data vectors with special vectors called support vectors. We might have some data which may not be linearly separable in the space we’re working in, but might be separable in a higher dimensional space. In other words, there might be some fancy function \phi such that working with \phi (x) \cdot \phi (y) lets us linearly separate things even though x \cdot y doesn’t work. The kernel trick is that we can actually compute \phi (x) \cdot \phi (y) from x \cdot y without having to compute \phi(x) or even work in the high-dimensional \phi-space at all. So what we do is we take the dot product x \cdot y (linear), and compute some nonlinear function of it (say we take the square). Magically, this function turns out to be the dot product of \phi (x) \cdot \phi (y) for some function \phi !

Of course, the caveat is that you can’t get just any function this way, but in practice you’ll easily find some function which lets you separate the data.

To work through an example of what I described above: Let x = (x_1, x_2) . Then x \cdot y = x_1y_1 + x_2y_2 . Suppose our kernel function was K(x,y) = (x \cdot y)^2 . Then we get K(x,y) = (x_1y_1 + x_2y_2)^2 = x_1^2y_1^2 + 2x_1x_2y_1y_2 + x_2^2y_2^2 ,which can be written as (x_1^2,\sqrt{2}x_1x_2,x_2^2) \cdot (y_1^2,\sqrt{2}y_1y_2,y_2^2). But this is equal to \phi (x) \cdot \phi (y) , where \phi is the non-linear function \phi (x_1,x_2) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)!

So using kernel functions, by sacrificing other information to only learn about dot products, we’ve managed to leverage the power of computing in the \phi-space, without actually needing to explicitly compute \phi (x).

Analogously in quantum computing, interference among complex linear combinations of classical bits let us compute (classical) answers to questions that cannot be answered using only classical operations. Note that we do not (and cannot) get “other” information like the complex coefficients of the qubit.

Kernel TrickQuantum Computing
xClassical bit ( 0 or 1)
\phi (x) Qubit in quantum superposition ((\frac{|0\rangle}{\sqrt{2}} + \frac{|1\rangle}{\sqrt{2}}))
\phi (x) \cdot \phi (y)Classical indicator function 1_{[f(0)=f(1)]}
K(x,y)Quantum circuit
Because every good analogy deserves a table of correspondences

Book Review: T. Tembarom by Frances Hodgson Burnett

Frances Hodgson Burnett (henceforth FHB) is well known as the author of The Little Princess and The Secret Garden, which are books I vaguely remember as classics from my childhood though I cannot swear to having actually read them. It turns out she was a bestselling author in her day, and wrote at least twenty other books, including the interestingly titled T. Tembarom,  published in 1910.

I decided to read it solely because of the description in AJ Hall’s review.

The book starts with a philosophical argument that our sources of information about the world are skewed away from learning about most people, who are fairly nice and good, as is the nature of human beings, if there isn’t something aberrant about them.

No one has ever made a collection of statistics regarding the enormous number of perfectly sane, kind, friendly, decent creatures who form a large proportion of any mass of human beings anywhere and everywhere—people who are not vicious or cruel or depraved, not as a result of continual self-control, but simply because they do not want to be, because it is more natural and agreeable to be exactly the opposite things; […]  When one reads a daily newspaper filled with dramatic elaborations of crimes and unpleasantness, one sometimes wishes attention might be called to them—to their numbers, to their decencies, to their normal lack of any desire to do violence and their equally normal disposition to lend a hand. […] They really form the majority; if they did not, the people of the earth would have eaten one another alive centuries ago. But though this is surely true, a happy cynicism totally disbelieves in their existence. When a combination of circumstances sufficiently dramatic brings one of them into prominence, he is either called an angel or a fool. He is neither. He is only a human creature who is normal. After this manner Tembarom was wholly normal.

FHB seems to imply that most people are like T. Tembarom, but if there is one thing that the story makes clear, it is that T. Tembarom is most certainly not normal.

T. T. (as his friends call him) is blessed with almost superhuman good cheer, bounding energy, and a near complete inability to feel sorry for himself. Scarcely less unusual is the story’s other protagonist, Little Ann Hutchinson. They are both without flaw and never put a foot wrong throughout the book. If this was a modern novel, would I have thrown the book against the wall for their sheer Mary Sue-ness?

To illustrate, I can’t do better than quote A J Hall’s description of how the book starts (from the review I mentioned) :

T.Tembarom is the preferred name of the hero, whom we first meet at the age of ten, when, his mother having just died in a New York tenement, he spends his last cents, and twenty more cents he borrows from a neighbour on buying some newspapers to sell. After this, by dint of hard work, undaunted good humour, nightschool shorthand, apparent imperviousness to hypothermia and general applied niceness he works his way up to his a lucky break, namely the opportunity write a Society column on a Sunday paper, whose circulation appears to be predominantly in the Bronx, at the princely sum of $25 per week.

The only problem is that he doesn’t have a clue (a) how to gather material for the column; and (b) how to write it up once he’s got it.

In this dilemma he turns to “Little Ann” Hutchinson, who is living in the same boarding house as him, along with her father, who is a disillusioned inventor from Lancashire.


Every young man in the boarding house is in love with Little Ann, who darns everyone’s socks, mainly I think to give her hands something to do while her formidable intellect and talent for headology is clicking into gear.

Little Ann’s sensible suggestion is to start gathering material for Society weddings with the wedding caterers. T.Tembaron runs with this suggestion, even though there’s a blizzard going. Little Ann does not say, “Do not go out into the blizzard scraping acquaintances with wedding caterers.” Little Ann lends him her father’s muffler and tells him to make sure he gets the names of the dress fabrics right. T.Tembaron, having made a huge hit with the caterers, befriends a dressmaker and gets samples so he can learn them by heart and spot them at sight. And when his first attempt at the column is rejected by his editor on the basis that it’s full of purple prose Tembarom has plagiarised out of other newspaper columns and his $25 per hangs by a thread, she ruthlessly sub-edits him, too.

The course of true love running moderately well, and T.Tembaron starting to look hopefully at advertisements for small flats in high rise buildings in the newly built upper 140s and lower 150s, he is devastated by twin blows: first, Mr Hutchinson’s pitch of disillusionment has now become so elevated that only a return to Lancashire will assuage it, and Little Ann will, of course, dutifully accompany him, and, secondly, at the ‘oyster stew’ thrown by way of farewell party to them at the boarding house, a very stuffy London lawyer shows up and reveals T.Tembarom’s real name (and since it’s Temple Temple Barholm, you can see why he’s concealed it for the last quarter of the book) and announces that he’s — ta da! — the missing heir to the village and estates of Temple Barholm, in Lancashire. At which point Mr Hutchinson is struck all of a massive heap and confesses that he was born in that very village! And he learned his letters at the village school! And his aged mother lives in a cottage there still!

And Ann, very sensibly, says no, she isn’t going to marry T.Tembaron straight off, nor is she going to put up with his cancelling their steerage passages for tomorrow and booking them first class staterooms so they can travel along with him in three weeks time: he’s got to get used to being Temple Barholm of Temple Barholm, lord of the manor with £70K per annum, and see all the Society beauties whom he is now eligible to marry, and when he’s had a year of that she’ll think about it.

And from then on the plot moves to Lancashire and more shenanigans ensue.

I found the book paradoxically very slow to read but entertaining. The kind of book that is supposed to be your sole entertainment for the entire week. And this appears to be a linear function of age, since I once tried reading The Woodlanders by Thomas Hardy, and that was clearly supposed to occupy you for a month. It took me weeks to read the first half of T. Tembarom, but sometime ago I was stuck on a train with nothing else to do (as God and FHB intended) and the second half was quite the page-turner.

The best part of the book was how inspired I was by T. Tembarom’s unflagging determination and even less flaggable good cheer. The man spends significant portions of the book trudging through blizzards and being icily snubbed by high society beauties. And while he might not be grinning at those particular times (he spends most of the book with an ‘extraordinarily friendly grin’), he sure as heck doesn’t let them get him down either.

The other thing I got out of it was a renewed appreciation for how lucky I am to be born in the 21st century. Nearly all the choices I have right now in terms of career options and people I could talk to are more interesting than those available to a character who is a literal duke.

Something that really stands out is how the position of women in society has changed. There are two characters, upper-class women who see no possible future before them apart from making a rich marriage and starving in genteel poverty. However, it’s hard to tell how much of this is the British upper-class refusal to stoop to ‘trade’.

I was far more frustrated by Little Ann Hutchinson’s attitude to her father and FHB’s implicit approval of her attitude.
The man is utterly without the slightest shred of self-awareness and humility. Little Ann spends half the book carefully steering him out of the clutches of various conmen and massaging his male ego. While Mr Hutchinson bloviates about the native shrewdness of Lancashire folk and how “women are not up to much at business”, Little Ann, under the guise of “I am but a girl and know nothing of business of men’s affairs”, exclaims about how adroitly he handles all the speculators that come to them and how clever he must be to find the loopholes in their proposal. Quickly enough Mr. Hutchinson begins to think that it was actually him who figured out all the holes in the conmen’s stories, and talks with extreme condescension towards little Ann. In Little Ann’s place I would started punching walls. On the other hand, I’m not sure how I feel about the way Ann treats her father’s character as an obstacle to be worked around. Why not be a bit more direct? It feels a little manipulative, though it’s very clear she loves him, and wants him to be happy and keep his illusions unshattered (while making sure they don’t get scammed).
While I’m not sure I could do it, it seems as if, given the constraints of her time and place in society, little Ann behaves in  the way that is optimal for achieving all her varied goals.
I suppose I give the man some credit for realizing how much better things go when Ann’s around, and making sure she’s always with him when embarking on a new business transaction.

While the book was published in 1910, it feels like it’s describing a much older time, and indeed the mention of President Garfield’s assassination puts the story in the 1880s.

Would I recommend this book?
Tough to say. I certainly don’t regret reading it, and I intend to continue singing the phrase ‘T. Tembar-om-om’ to the tune of the hymn ‘Here I Am, Lord’, anytime I need to be reminded of the power of approaching life with an “extraordinarily friendly grin”.

I did, however, approach it in exactly the right spirit, (thanks to A J Hall) prepared to look past FHB’s insane Victorian morality and style to find the good parts.

It’s always an interesting exercise to read old books, since even if the plot isn’t very compelling, you can always turn it into an anthropological study of the author’s milieu.

If you can stand the pace, and T. T. thinking about how he wants to embrace little Ann for the hundredth time, T. Tembarom‘s free on Gutenberg – have fun!

Summer Project 2018 : Intro

  • A lot of people have been asking me what I’m going to be doing this summer, and I have been very vague, for the excellent reason that until today, I didn’t know either. I’ve been spouting buzzwords like functional programming and nlp and machine learning in every direction.
  • I will be working with Prof Siddhartha Gadgil from the IISc Math Department.
  • The overarching idea is to take mathematics as it is done today and rest it on the foundation of homotopy type theory (which is more scalable, so this is not as crazy as trying to do the same with set theory). This requires several steps:
  • Step 1: Convert mathematics as written in papers to something controlled.
  • Step 2: Translate the controlled language to HoTT
  • Step 2 requires a lot of knowledge of homotopy type theory, so I’ll be working on step 1
  • Step 1 has two parts
    • Step 1A : Parse latex source (the typesetting languge math papers are written in – you thought they’re written in English?) and convert it into a dependency tree, which is mostly the same words, but tagged with their part of speech, and what other words they refer to.
      • Fortunately there exist parsers, such as the StanfordParser, which can mostly do this for us. As Mohan Ganesalingam pointed out, mathematics is a lot easier to parse than english in general. There will be issues, of course, especially with latex formulae, but we’ll figure things out. (I hope.)
    • Step 1B : Convert the dependency tree to the Controlled Language (Gadgil calls it the Naproche-like CNL, but naproche looks like English, while this stuff sure doesn’t, so I prefer to just call it Controlled Language
      • The plan is to do this via a recursive functorial framework in Scala
      • The framework exists, but not most of the translation rules, so my job will be to come up with these rules.
  • So now I have to quickly learn a bunch of skills
    • How to program in Scala
        • How to run stuff, build packages etc
        • Understand what functors and monads are.


    • How to use Github (I’m really embarrassed I don’t know this yet)
    • How to parse English
    • How to identify and describe patterns in mathematical argument.
  • And I’ve gotta read two PhD theses.

Indian Logic

Epistemic Status: Spitballing about deep philosophical questions

A couple weeks ago, at a frisbee tournament, a teammate of mine called Deepak told me about his studies in Indian logic, by which he meant a certain style of philosophical argument in the Upanishads.

Indian logic, [he said] had syllogisms, but was not completely deductive, in that you could not have a syllogism that was practically false, but logically valid. (As in: All men have five legs. Socrates is a man. Therefore Socrates has five legs.)

There has to be an inductive part, that is based on an empirical observation of the world. The example he gave was:

Where there is smoke there is fire, as we see in a kitchen. There is smoke on the hill. Therefore there is fire on the hill.

Now this syllogism is true (valid? – I’m not sure there was a distinction) and accurately describes the physical world, until somebody comes up with an empirical example of there being smoke without fire, at which point it becomes false.

The way he put it was, you *have* to accept this, unless you can come up with a counterexample.

This reminded me of stuff I was thinking about at ESPR, the question of dealing with arguments you can’t (in the moment, at any rate) logically refute. I could always resort to epistemic learned helplessness, but that’s boring and means I never get to change my mind.

In the conversation, though, the first thing that came to my mind was, to quote my friend Andrew, “That’s bullshit!” . More precisely, “My prior on logical or empirical arguments of the kind you are making being undone by later findings is pretty damn high.”

Then Deepak pointed out that this is pretty much how science works. We do believe the current paradigm, until such time as it is overthrown. So now I’m not sure. There should be a way of distinguishing the two cases (when you should believe and when you shouldn’t), and I can vaguely sense it, but I’m not sure. Plausibility? Our intuitions get worse and worse the further we move from direct experience. (eg, quantum mechanics) Track record? What actions and consequences are implied for us ? That seems a little…. how do you say… biased. Meta level considerations? What somebody else stands to gain based on persuading you?

How do you figure out what is true?

A Perfect Moment

December 2048, a city in the Northern Hemisphere

“Eeek! You’re bleeding, Jonathan!” she said, tottering into the living room, fumbling the door shut against the chill and gloom.

No No NO there was blood dripping down both his wrists. The knife was still quivering gently, stuck in the wood of the rocking chair’s bottom.

“You promised,” she muttered, bandaging the limp wrists and spraying on the Liquid Band-Aid, “that you wouldn’t off yourself when I wasn’t around.”

Two well-deserved slaps had brought him round soon enough.

“I’m sorry, Rashi,” said Jonathan, eyelids half closed. He tried to shrug and gave up, wedged shrunken into the chair.

“I was…. Honestly, what else was I to do…”

Rashi looked into the distance. The winter gale hissed at the dark window, then abruptly died down.

“Yeah, I know,” she said, heavily. “I go out for my first walk in ages, wearing the thick plastic raincoat, my feet are blue with the chill and Mrs Simla trills at me — ‘What darling weather it is, my dear!’ I wanted to claw her face off, the bitch.”

“Was she wearing a disco?” asked Jonathan, still slumped in his chair, like always.

“Of course. That’s what I meant, Jon. Claw her face off, eww,” said Rashi, falling into her chair, also facing the door, next to her husband’s.

Almost involuntarily, her eyes flicked over to the two dusty disco masks on the table, shoved against the wall, nearly covered by the piles of scorched-looking paper flung there after Jonathan’s abortive attempt to interest himself in charcoal sketching. She had long given up trying to find a hobby for herself. Hobbies in the general population were all but replaced by the disco version, better in almost every aspect.

The name came from “like having your own private discotheque”, a breathless sentence in one of the earliest reviews of the virtual reality tech. Disco masks weren’t virtual reality, not quite. You could still move around in the real world, but your perceptions were altered. A biting wind might feel like a cooling breeze, the silence of the city was replaced by whatever music the software thought you liked best at the moment, and the thin, sharp electrodes in the innermost layer, touching the skin of your face, told the circuitry what kind of surroundings your neurons wanted to feel. It never itched, of course. Soon you’d stop noticing it ever existed. The most important effect, of course, was that you couldn’t feel bored or depressed. Something new popped up in front of you every thirty seconds if necessary, and your dopamine levels were carefully adjusted by the electrodes.

They were banned in orbit, where every sense needed to be on high alert. On Earth, however, they were perfectly legal and almost universally worn. Why wouldn’t you? After all, people who had taken off and damaged their discos in remote areas had ‘succumbed to ennui’, as the monthly obit newsflash put it.

Without disco, what was there to do?  There were no jobs to have on Earth, not that anyone wanted one. A hobby? One got bored, of course. She pictured Maz, dear son, walking in, a year from now, to two slowly decomposing bodies in identical rocking chairs and red wrists…She shuddered. But Jonathan had already tried to kill himself twice. She herself wobbled frantically on the edge of that precipice. Could they endure that long?

“Stop looking at those things,” said Jonathan, vaguely circling his finger in the direction of the table. He had drawn circles, and far more exotic shapes, as a Pre-Millennium artist, before disco made everyone into a critic who found their own scribbles the most fantastic art in the universe. Now his work lay scrunched on the table. Nobody cared. They could all do better themselves.

“It’s a year before Maz gets back from his asteroid mission,” said Rashi. “I know they have the regen medicine, but still, what if a pirate miner gets a lucky shot into his brain? I warned him never to take off that ridiculous helmet of his. But he likes the danger. What did he say last time? ‘the adrenaline rush of chasing matter thieves halfway across geosync orbit’ ? But does it have to be him? Can’t somebody else keep the economy running?”

Rashi tossed her raincoat vaguely at the table. It hit the floor with a slight squelch. She collapsed back into the chair, her old bones tired from the short walk, plodding through the mud.

“You know, Jon, I could manage when Mrs Simla at least didn’t wear disco and I could talk to her sometimes. Now…. there’s nothing to do at all… If you’d gone out for that walk, I might have been the one playing with the knife.”

“So what do you want to do? Give up? Say goodbye to the real world, only somewhat less permanently than I tried, and put on a disco? Live in a castle in the air?”

“It’s a lot better than life down here.”

“Pah. You’re right about that. I don’t have a single friend left who wouldn’t smile at me beatifically. If they saw me at all.”

“At least they’re smiling. When was the last time you smiled?”

“You think those fake smiles count? Mrs Simla thought it was her birthday again last week. Said to thank you for the beautiful cake. She grinned happily at me while I told her you haven’t baked in years.”

“So if it’s not connected to something in the real world it’s not meaningful? What about happy dreams?”

“The smile on your face when you wake up counts. Not the ones before. It’s only a dream if you can wake up from it. Otherwise it’s a nightmare.”

“I don’t know, Jon. I know it’s addictive. I know the joy isn’t real. But it’s been a while since I’ve experienced real joy. I think I might have forgotten how to tell the difference.”

“If- no, no when Maz comes back you’ll tell the difference all right. If you’re awake and alert to see him walk through that door.”

“That’s the only reason I haven’t used that knife. But Jon, that’s a year away. Dunno if I can hold out for a year- without disco.”

“And when he comes home to see us grinning and staring past his face like zombies?”

“We would take it off then!” said Rashi, sitting up straight.

“What, like Salvator, that guy on the flash who dropped his disco overboard while on a ship and decided the only true excitement in life was swimming with the sharkies?”

“Oh no, it’s really juggling with the knifies, of course,” snapped Rashi.

Jonathan inhaled and said nothing.

It started to drizzle. Small raindrops rolled lethargically down the window pane and pooled on the icy frame.

The rocking-chair creaked as Jonathan sat up with effort.

“Gimme that disco.”

“I swear to you, Rashi, the minute Maz walks through that door, that thing is off my face and in the bin.”

Rashi picked up the discos, shuffled over, pecked Jonathan on the cheek and passed him a disco mask.

“Ready, darling? One year. And then we won’t use it, ever again.”

She slipped hers on.

Aaaaaah. Mrs Simla was right. It really was darling weather. The window opened and a perfect summer breeze swirled around the room. Rashi wriggled her toes with delight. Maz didn’t know what he was missing. Anyway, it was time to think about that superb tapestry on the wall. When had she gotten so good at embroidery?

Perhaps she would start again. Why had she ever dropped it? Such fun…..




…..Rashi was playing chess against Magnus Carlson, world champion since 2036. And losing. Or so he thought. She smirked, tapping a rook against her teeth. Tap. Taptap. A bit boring, playing Carlsen. She’d be world champion next, of course. Tap. Taptap. Tappity-tap tap tappitytap tap tap. Wait. Tap taptap tappitytap?

Carlson disappeared. Something stirred at the back of Rashi’s mind. I know that sound….

Yes! That’s Maz! Mazmazmaz! Maz is back! She ripped the odd plastic off her face and spun out of her chair. Across the room was a door. Almost tripping over her feet she yanked it open. Maz stood, silhouetted in the bright sunlight.

Rashi sighed, joy soaking into her bones deeper than ever before. Maz is here.

Maz, alive-and-safe Maz, fell into her arms. The perfect summer breeze gently swirled the door shut.