===================
== Andrew's Site ==
===================

Can't Stop Them All

security side channels

Here’s a security belief: we can mitigate side channels in cryptographic software. I don’t hold this belief.

The story that repeats over and over goes something like this: engineers want fast crypto algorithms[1]. Eventually, some researcher finds that these algorithms leak state via software or hardware optimizations, rendering the crypto vulnerable. Since hardware is basically fixed for the near future, we have to rewrite the algorithms to avoid leakage, which we do. We now have safe crypto.

This isn’t working. At the very least, I don’t think we get any guarantees, just “we think this is safe”. To me, that isn’t acceptable for crypto. For the sake of this post, we’ll only consider “passive software attackers”, so we’ll exclude physical attacks (e.g. measuring electromagnetic signals with probes), as well as software attacks that directly cause bit changes in the victim (e.g. Rowhammer or undervolting).

Let’s recall some prior leakages. Crypto can’t leak (much)[2], or it’s useless. Sometimes the leak is obviously a software construction: we don’t want to calculate things we don’t need. Your favorite “square and multiply” implementation is an example: we don’t multiply when the product won’t be used in the result. Even “safer” variants using windows (pick either direction) still leak enough. The solution is to eliminate branching: branching “footprints” won’t be left in cache, so an attacker won’t be able to infer bits at the cache level. So, we should write algorithms that avoid branching and avoid data-dependent accesses to cache.
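
To make the leak concrete, here’s a minimal sketch in C (my sketch, not any particular library’s code) of branching square-and-multiply, with a toy mulmod standing in for a real modular multiply:

    #include <stdint.h>

    /* Toy modular multiply; a real implementation avoids the overflow
     * in (a * b). It's only here so the sketch is self-contained. */
    static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
        return (a * b) % m;
    }

    /* Left-to-right square-and-multiply. The branch on each secret
     * exponent bit executes extra code only when the bit is 1, leaving
     * a cache/timing footprint: exactly the leak described above. */
    uint64_t modexp_leaky(uint64_t base, uint64_t exp, uint64_t m) {
        uint64_t r = 1;
        for (int i = 63; i >= 0; i--) {
            r = mulmod(r, r, m);         /* always square */
            if ((exp >> i) & 1)          /* branch on a secret bit! */
                r = mulmod(r, base, m);  /* multiply only when bit = 1 */
        }
        return r;
    }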

You can even get a “formal proof” of avoiding cache-dependent behavior using some research tools, but I’d bet reality doesn’t fit the formal specification (i.e. your verification is meaningless). The authors of this paper claim their side channel squeaks past formal tools. I actually have doubts about this claim (because the code they attack definitely does data-dependent cache accesses), but the side channel is still interesting, and their point about cache accesses that aren’t made directly by the code but are “triggered by the code” (e.g. hardware prefetching) is something a formal proof would need to consider to really prove something “cache side channel secure”.

Here’s the quick algorithm background for the code they attack. You don’t need to know much about elliptic curves, but basically, in the group (defined by the elliptic curve), you calculate “scalar multiples” of a public “base” point P to get a point Q=kP. Scalar multiplication is really just repeated group addition[3]. Calculating k given Q and P is the discrete log problem, which is believed to be hard in some groups (consult your local cryptographer for which groups). If you don’t want to read about elliptic curves, just think of it as ElGamal over a different group.

Anyway, you can write the algorithm to branch on the secret k. Obviously, this is insecure, which we learned from RSA implementations. It’s computationally cheap to fix, because it turns out that the “Montgomery ladder” has some antisymmetry we can take advantage of: the calculation just switches the arguments to its subroutines depending on whether the secret bit is 1 or 0, so we can instead use conditional moves and call the same function[4]. As long as the subroutines are “cache safe”, we are good. (In reality, cmov could have problems in the future, as pointed out here. There are also good examples of other issues I’m not complaining about in this post.)
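
Here’s a minimal sketch of that conditional-swap trick (again my sketch, not the attacked code): the same instructions execute whether the secret bit is 0 or 1, and only the data flow changes. Real libraries do this per limb on big integers; a single uint64_t keeps it short:

    #include <stdint.h>

    /* Branchless conditional swap: swaps *a and *b iff bit == 1.
     * mask is all-zeros or all-ones, so the same loads, XORs, and
     * stores run either way. In a Montgomery ladder you'd cswap the
     * two running points before (and after) the shared ladder step. */
    void cswap(uint64_t *a, uint64_t *b, uint64_t bit) {
        uint64_t mask = (uint64_t)0 - bit;  /* bit=1 -> 0xFF...F */
        uint64_t t = mask & (*a ^ *b);
        *a ^= t;
        *b ^= t;
    }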

One thing that sticks out like a “sore thumb” if you are familiar with cache attacks is that, for efficiency, there is indexing into a “squaring” table[5]. If you can monitor the accesses to this table (which is undoubtedly cached), you can leak the indices, and if the indices leak info about the secret bit, game over. This can be “safe” in some instances: if your table is small enough to fit into a single cache line, an attacker won’t be able to distinguish values. Even if it spans two cache lines (which looks to be the case), if the memory accesses are dense, the attacker may have trouble distinguishing between cache lines, which is a question of cache attack resolution[6]. To summarize, if you can’t view the sequence of accesses, you don’t get any bit information.
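
For intuition about what “monitoring the accesses” means, here’s a minimal Flush+Reload probe sketch, assuming x86 intrinsics, memory shared with the victim, and a per-machine THRESHOLD I’m leaving as a placeholder:

    #include <stdint.h>
    #include <x86intrin.h>

    #define THRESHOLD 100  /* placeholder; calibrate per machine */

    /* One Flush+Reload round on a single cache line of the table.
     * A fast reload means the victim touched the line since the last
     * flush. Repeating this across the table's lines recovers the
     * (coarse) access sequence, subject to the resolution limits
     * mentioned above. */
    int probe(const void *line) {
        unsigned int aux;
        uint64_t t0 = __rdtscp(&aux);
        (void)*(volatile const char *)line;  /* reload the line */
        uint64_t t1 = __rdtscp(&aux);
        _mm_clflush(line);                   /* evict for the next round */
        return (t1 - t0) < THRESHOLD;        /* fast => victim touched it */
    }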

But what if there is an oracle that gives you some information about the sequence of accesses? Normal cache attacks would do this, but remember, this sequence is too dense. The oracle we are looking for turns out to be a hardware prefetcher. If you read the paper, the hardware prefetcher fetches different (distinguishable) cache lines based on the access sequence to the table[7]. So the oracle tells us something about the sequence, and we only have to monitor a few separate cache lines during the dense sequence of accesses, instead of trying to monitor the sequence directly. It’s not clear from reading the code exactly how this leaks bits, but there are prior techniques (templating) for mapping inputs to cache-line distinguishers, and the authors make them work for cache-line sequences. Again, this is not perfect information, but it’s enough!

This is already crazy to me, because, “engineering wise”, it seemed safe: get rid of branches so code can’t be traced via cache, precompute (we need some speed, especially considering that past side channel mitigations usually cost performance), and make sure the precomputed table is tiny enough that cache monitoring doesn’t work. If the table only spans two lines, then bits leak only if an attacker can reliably distinguish between sequences of accesses to those two lines, which should be hard, because noise and lack of resolution will probably merge “separate sequences” together, hiding bit differences. But this turned out to be false.

Also, do we just give up prefetching? I am not sure. I know that Linux kernel developers have complained about prefetchers hurting performance, so disabling them may be a good enough solution. But as architects propose more exotic prefetchers (maybe some will be useful!), I suppose you’ll always have a distinguisher lurking.

Here’s the last example, MemJam. The authors attack a few crypto implementations, but particularly Intel’s implementation of AES. If anyone should get constant-cache crypto correct for Intel processors, it’s a combination of cryptographers and engineers who work for Intel. Anyway, they (Intel) avoid the T-table, which is known to be vulnerable. Intel’s implementation still uses an S-box, which spans 4 cache lines, but importantly, every time the code accesses the S-box, it loads a word from each cache line. Thus, cache attacks won’t work here: the code always accesses the whole S-box (4 cache line accesses). Then, for each cache line, it extracts the chosen entry (a byte) at the same in-line offset (so it vertically slices the S-box table) and loads these bytes into a cache-aligned buffer (so no cache side channel at this point either).
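
Here’s how I’d sketch that access pattern (reconstructed from the description above, not Intel’s actual code; the table values and alignment are stand-ins). All four lines get touched on every lookup, so line-granular cache attacks see a constant pattern, but the offset within the line still depends on the secret index:

    #include <stdint.h>

    /* Stand-in table; real code uses the AES S-box values and keeps
     * the table 64-byte aligned so it spans exactly 4 cache lines. */
    static _Alignas(64) const uint8_t sbox[256] = {0};

    uint8_t sbox_lookup(uint8_t idx) {
        uint8_t off  = idx & 63;  /* offset within a line (still secret!) */
        uint8_t line = idx >> 6;  /* which of the 4 lines holds the entry */
        uint8_t out  = 0;
        for (uint8_t j = 0; j < 4; j++) {
            uint8_t v = sbox[64u * j + off];           /* touch every line */
            uint8_t mask = (uint8_t)(0 - (j == line)); /* 0xFF if chosen */
            out |= v & mask;
        }
        return out;
    }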

However, the authors can abuse 4K aliasing. Specifically, an attacker on a sibling hyperthread can cause false read-after-write (RaW) dependencies that slow down the victim during execution (if the victim accesses 4K-aliased S-box entries), which gives the attacker enough timing info to do a correlation attack (I need to read more about which correlation attack, but the authors have a journal paper). I think if you understand Tromer’s attack, it should be enough to get an idea of what happens.
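
As I understand the mechanism (treat the exact bit range as my assumption), the memory disambiguator only compares low address bits, so the aliasing condition looks roughly like this:

    #include <stdbool.h>
    #include <stdint.h>

    /* True when a store to p and a load from q look dependent to the
     * disambiguator even if they are 4 KiB apart: the addresses agree
     * in (roughly) bits 11:2. An attacker hyperthread spamming stores
     * to such a p slows the victim's loads from q, and that slowdown
     * is the timing signal for the correlation attack. */
    static bool aliases_4k(const void *p, const void *q) {
        const uintptr_t mask = 0xFFC;  /* bits 11:2 */
        return (((uintptr_t)p ^ (uintptr_t)q) & mask) == 0;
    }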

Ok, so let’s check what you need to write a secure application:

  1. Don’t write code that branches on secret state.
  2. Don’t accidentally train a prefetcher. So if you access a table depending on a secret, always access each possible cache line in the table, and in the same order.
  3. Don’t be vulnerable to 4K aliasing, lol[8].

This is just a short list, but given that I haven’t touched most side channels, it seems unreasonable to expect even experts at Intel to be aware of all possible side channels, let alone other devs.

The amazing thing is that, at least for AES, D. J. Bernstein seems to have predicted lots of cache timing issues that weren’t seen as leaks until much later (I could be wrong about this, but it seems so if you pay attention to security/architecture conferences). Go read his slides or the paper. Maybe crypto implementers should take his advice, like avoiding S-boxes.

Anyway, secure crypto implementation should be assumed impossible, in my view. Even if we one day have ASICs that securely marshal data into the crypto hardware, we can still suffer, because lots of algorithms are fragile. Maybe I’ll post about this in the future.


  1. There are also algorithms outside of crypto that can’t be allowed to leak, such as oblivious algorithms. ↩︎

  2. Not all bit leakages are equivalent. See https://eprint.iacr.org/2020/1506 for a bunch of examples of why this is true. ↩︎

  3. If you want to know exactly what the operations do, go read https://eprint.iacr.org/2017/293 as a primer. If you don’t understand the group addition, go read https://jiggerwit.wordpress.com/2016/10/18/elliptic-curve-addition-without-tears/ or https://arxiv.org/abs/1610.05278 ↩︎

  4. If you don’t have a cmov, or you have an algorithm that doesn’t have the correct symmetry requirements, you can always convert to a branchless algorithm by calculating all paths and using polynomial interpolation: interpolate the set of points (path, path_f(input)), so your answer is p(path, input) = path_f(input). ↩︎

  5. Elliptic curves are defined over an underlying field. Squaring is more expensive once you leave the integers. ↩︎

  6. It would help if someone found a really high resolution cache attack, or an order-of-magnitude improvement to the orthogonal state of the art (check the Acknowledgment lol). ↩︎

  7. If you like math, here’s a thought: even though we don’t get the sequence back, the oracle identifies the equivalence class of the sequence (where the relation is prefetching a particular line). The equivalence relation just happens to be able to distinguish enough bits. ↩︎

  8. How do you even do this? I guess demand that everyone turn off hyperthreading, or just avoid S-boxes. ↩︎