Categories
Cryptography

Kyber512’s security level

Sigh. I really didn’t want to have to write this blog post. There is a story going around, claiming that the NSA somehow unduly influenced NIST to choose Kyber over NTRU, despite Kyber not being secure enough. The story is based on this blog post by famous cryptographer Daniel J. Bernstein (also known as djb), who happens to be one of the authors of the NTRU variant NTRU-prime, which was also competing in the NIST competition. This has confused a bunch of people, leading some to ask for a writeup and explanation of what happened. I tried to procrastinate until the problem went away, but people keep asking, so here we go.

Summary

Kyber512 is not less secure than required for NIST security level 1. In an attack model that significantly favors symmetric schemes over asymmetric ones, there is a small discrepancy of less than 6 bits, or a factor of less than 64, by which Kyber512 appears weaker than AES-128. When changing the model to be more realistic, the problem goes away. NIST did not make basic mathematical errors in their analysis.

Defining Security Levels

There are two pieces to this discussion, and to understand what’s going on, we need to first talk about security levels.

Level 1: The applied definition:
A scheme has a security level of n bits, if it is as hard to break as an ideal block cipher with an n bit key on a hypothetical computer that can evaluate the block cipher once every cycle. In other words, security level n means an adversary has to use at least 2^n cycles to break the scheme.
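
To get a feeling for what 2^n cycles means in practice, here is a quick back of the envelope sketch with some illustrative numbers of my own choosing (a 3 GHz clock and a billion cores; this is not a serious cost model):

```python
# Rough feasibility check for brute forcing a 128 bit key on the
# hypothetical "one evaluation per cycle" machine from the definition.
cycles_per_second = 3e9            # assumed clock frequency
cores = 1e9                        # assumed number of parallel cores
seconds_per_year = 3600 * 24 * 365

work_per_year = cycles_per_second * cores * seconds_per_year
keyspace = 2 ** 128

print(f"{keyspace / work_per_year:.1e} years")   # roughly 3.6e12 years
```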

This is a very useful definition, and it’s usually what we think of when talking about security levels. You can use it to make some back of the envelope calculations (like the sketch above), by sticking in some clock frequencies and core counts, or maybe some energy costs and solar outputs, to convince yourself that 128 bit ciphers are probably good enough in many scenarios. It is also obviously deeply flawed: Ideal block ciphers don’t exist, no computer can evaluate a full AES block encryption in a single cycle, and no adversary would use a Turing complete, general purpose CPU to brute force, when special purpose hardware would be far more powerful, not wasting time and energy on the control flow of a problem that doesn’t really need such a thing. Enter the gate model:

Level 2: NIST’s definition (paraphrased)
A scheme has a security level of n bits if it is as hard to break as AES-n

NIST’s actual definition is a bit more complicated, in that it also mentions SHA-2 in various levels, but since the discussion is only about NIST’s level 1, i.e. at least as hard as AES-128, we don’t really need to get into that.

At first, this definition doesn’t seem to be all that different from the first one, only replacing the ideal block cipher with AES. But this change makes a huge difference, since now we do not need a hypothetical computer, but can actually concretely compute the gate cost of the brute forcing attack. Which is what people did. The main paper people are arguing about is this one by Matzov. It comes with a convenient table:

Table with gate counts by Matzov

As you can see, when counting gates, AES-128 requires 2^143 gates, while the paper got the number for Kyber-512 down to 2^137.5. 137.5 is smaller than 143, so Kyber512 falls 5.5 bits, or a factor of ~50 short of what NIST required. This isn’t a lot.

Data locality

Now is the time to introduce the second concept I hinted at above: data locality. If you have any experience with distributed computing, or even just micro-optimizations, you have likely run into this: even though on paper algorithm A should be faster than algorithm B, in practice algorithm B wins almost all of the time. A linear scan outperforms a tree map, which in turn outperforms a hash map, for example. There are usually two reasons for such behavior: constant overhead, and data locality. I will focus here on the latter, although the former has also been cited as a counterargument to the claimed attack.

The reason a linear scan can outperform a tree or a hash on small to medium size lookups is that the CPU can stream in data from RAM for the scan, so the data to be compared is constantly available, and we don’t have to waste cycles waiting for data to arrive from RAM. One good way of visualizing this wait time is to calculate that at 3 GHz, data traveling at the speed of light will only make it 10 cm (~62 micromiles) per cycle. It doesn’t matter how fast your RAM is: if you cannot stream your data, you will be wasting several cycles waiting for data to arrive, which in the case of a tree map or a hash collision can add up quickly. Of course this effect will not save the linear scan forever, and if you add enough elements, the asymptotic complexity will take over, but note that this is not merely a constant term effect either, as the required space of a circuit grows when scaling up.

How we access data matters, and there are tons of tradeoffs to be had, for example whether to force your circuit to synchronize, which allows for things like streaming data, but also forces you onto the same cycle based machine we wanted to avoid above for extra efficiency. Since the length of wires and the cost of having a synchronization signal travel through our circuit is not accounted for in the gate model, it makes sense that we might need to adjust values a bit for that, even more so when we start distributing algorithms over multiple machines, where these effects start to dominate in almost all practically relevant computations; it is usually more important where your data is than how many compute nodes you have to process it.
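
To see the effect in isolation, here is a small numpy sketch (my own illustration, not from the original analysis): the exact same additions are performed in both cases, only the order in which memory is touched differs.

```python
# Data locality demo: summing the same array via sequential vs. random
# gathers. The arithmetic is identical; only the memory access pattern
# changes, and the random-order gather pays for its cache misses.
import numpy as np
import timeit

data = np.arange(20_000_000, dtype=np.int64)
seq_idx = np.arange(data.size)                 # sequential access order
rnd_idx = np.random.permutation(data.size)     # random access order

t_seq = timeit.timeit(lambda: data[seq_idx].sum(), number=5)
t_rnd = timeit.timeit(lambda: data[rnd_idx].sum(), number=5)
print(f"sequential: {t_seq:.2f}s  random: {t_rnd:.2f}s")  # random is noticeably slower
```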

Now let’s quickly look at the data dependencies of our brute force algorithm. Quickly, because there simply aren’t any. Brute force computations are embarrassingly parallel: you simply divide the search space up arithmetically, and have completely independent brute forcing units happily brute forcing over their assigned chunk of work, until one of them is successful. In other words, the gate model will overestimate the cost of a brute force attack compared to any attack that requires data dependencies.
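
To illustrate just how independent those workers are, here is a toy sketch (the hash preimage search below is only a stand-in for evaluating a block cipher; names and numbers are mine):

```python
# Embarrassingly parallel brute force: the search space is split
# arithmetically and each worker scans its slice with zero communication.
import hashlib
from multiprocessing import Pool

TARGET = hashlib.sha256(b"1337").hexdigest()   # pretend we only know the hash

def scan(chunk):
    start, end = chunk
    for candidate in range(start, end):
        if hashlib.sha256(str(candidate).encode()).hexdigest() == TARGET:
            return candidate
    return None

if __name__ == "__main__":
    space, workers = 100_000, 8
    step = space // workers
    chunks = [(i * step, (i + 1) * step) for i in range(workers)]
    with Pool(workers) as pool:
        print([r for r in pool.map(scan, chunks) if r is not None])  # [1337]
```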

And, as it turns out, lattice reduction attacks have data dependencies. So much so that parallelizing lattice reduction on a single machine is non-trivial, and parallelizing it over multiple machines is an area of active research. Lots of numbers have been thrown around to adjust the gate count estimates for these effects, but it seems fairly reasonable that a factor of 50 is well within the margin of error here. In the end, the difference in Matzov’s paper is small enough that NIST’s confidence in Kyber512 still meeting NIST’s definition is well placed, a judgment accepted by all but one of the cryptographers I know of. In particular, NIST’s analysis is careful, grounded in research, and does not contain basic math errors.

Closing thoughts

Lastly, my personal opinion on the technical question at stake here (I don’t care much for the personal attacks and polemic that have been thrown around). In my opinion, when you can afford it, you should be using Kyber768 instead of Kyber512. This is not due to any currently existing attacks, and I do not think NIST should change their planned standardizations and recommendations; it is purely a precautionary step, and in situations where Kyber512’s performance and overhead are required, it remains a solid choice. My reasoning for preferring Kyber768 comes from the fact that there are two types of attacks it might be facing:

  • Improvements of lattice reduction attacks
  • Algebraic attacks using the additional structure of MLWE over LWE

The former family of attacks is likely to occur, and likely to chip away small amounts of Kyber’s security level without fundamentally threatening the soundness of the algorithm itself. This is what we have seen happen to e.g. RSA in the case of the general number field sieve, where small improvements over time pushed RSA-1024 within reach of modern hardware to factor. Lattice reduction similarly has a tuning space, and while the approach is unlikely to break Kyber, it might reduce the security level of Kyber512 below what I’m comfortable with. The second type of attack is far scarier, as increasing the security level slightly is not guaranteed to help much here, and I hope that such an attack does not exist. While it is possible to construct a Kyber640 variant relatively easily, in my experience Kyber768 is pretty much always acceptable in practice, much simpler, and has a far more comfortable security margin. The same consideration holds for NTRU, which faces the exact same attacks, and I would aim for NIST level 3 there as well. Kyber scales quadratically in time, which compares favorably to the cubic scaling behavior of elliptic curves, and makes bumping up the security level less painful.

Lastly, a few thoughts on comparing MLWE (Kyber) vs. NTRU more broadly.

From my perspective as a mathematician, MLWE is the more beautiful problem. This is a fairly useless observation, as mathematical beauty is rather subjective, and it is not clear whether the more beautiful problem is more secure, but I cannot deny that I find the rather clunky feeling nature of NTRU somewhat off-putting.

From my probably more relevant perspective as an applied cryptographer, MLWE is easier to implement and promises greater performance compared to NTRU at a very comparable security level. While these might seem like “nice to haves”, these considerations tend to matter greatly in practice, and are often what makes or breaks a cryptographic algorithm. The easier the algorithm is to implement correctly, the less I have to worry about side channel attacks bypassing the entire cryptosystem. In the end, attackers rarely try to brute force keys, but quite frequently exploit much more common problems in implementations. Performance and overhead are similarly important to the security of the overall system: if algorithms are too slow, they often end up being underused and put into more fragile, but less expensive, protocol constructions. Any analysis that is based on a holistic threat model of the entire system will have to take these considerations into account, and from that perspective alone Kyber, including Kyber512, is the better algorithm choice. NIST has learned this lesson the hard way with SHA3, which, despite being the better construction, is struggling to gain ground on SHA2, due to having a far too conservative security margin and a CPU-unfriendly implementation.

Categories
Cryptography

Learning, but with Errors

Introduction

It’s now been several times that I had a conversation with some security folks that went something like this:

Them: “I wished this lattice stuff was easier, like RSA, I tried to read up on it but it was just so complex”
Me: “I think RSA is actually a bit more complicated than Learning With Errors, the didactics on it are just not as far along”
Them: *Pressing X to doubt*
Me: “Fine, I’ll explain it, against my will. I am just going to note here that I am basically forced into this and derive no joy whatsoever from explaining lattice cryptography.”

So after having done that a few times, I figured I should put some of it into a more scalable, written form.

Prerequisites: A bit of Modular Arithmetic and a bit of Linear Algebra.

So what on Earth is a lattice?

A lattice is a free, finitely generated \mathbb{Z}-module usually embedded into a finite dimensional real vector space.

That’s a bit of a mouthful, so let’s break that down a bit.

The term module is just mathematicians being fancy and using a different word for “vector space, but over a ring” (In case you don’t know what a ring is, for this blog post, just replace that term with \mathbb{Z}, the integers. It’s a numbery thing where you can add, subtract, and multiply, but not necessarily divide. In case you don’t remember vector spaces, it’s where vectors live. You can add vectors, there is a zero vector, and you can multiply vectors with elements of the base ring/field to scale them).

Finitely generated means that there is a finite number of elements that generates it, in other words every element x in our module can be written as x = \sum_{i=1}^n a_i g_i, with a_i \in \mathbb{Z} and g_i being our finite set of generators.

This leaves us with free. A module is free when it actually behaves like a vector space and doesn’t do any weird ring shit. Modules love doing weird ring shit, so this qualifier is unfortunately very necessary. More formally, it means that our module has a basis, i.e. that we can choose the g_i so that the sum above is unique. In that case, any such set will have the same size, which we call the rank of the module (because calling it the dimension would make too much sense), and we can rewrite the module to be \mathbb{Z}^n. In other words, for any ring R, all free, finitely generated R-modules are isomorphic to R^n.

So we have some integer vectors, and we want to embed them into a real vector space. To avoid some technicalities, I will only talk about so called full rank lattices, where the lattice rank is the same as the dimension of the vector space. I will also not go into the technical definition of embedding, it’s a fairly decent word as math words go and means pretty much that, the lattice is placed inside the vector space.

One thing to note is that there is more than one way to embed \mathbb{Z}^n into \mathbb{R}^n. Let’s take n=2 as example, so we can draw some pictures.

The lattice \left\langle\left(1, 0\right),\left(0, 1\right)\right\rangle

The simplest way we can embed \mathbb{Z}^2 into \mathbb{R}^2 is by just mapping things trivially and looking at all points with integer coordinates. Unsurprisingly, we get a grid.

The lattice \left\langle\left(1, 0\right),\left(\frac12,\frac{\sqrt{3}}{2} \right)\right\rangle

If we want to be a tad bit more creative, we could try this hexagonal pattern. We can immediately tell that these two lattices are materially different. For example, in the first lattice, every lattice point has four closest neighbors, while in the second lattice we have six closest neighbors. As every board game nerd will tell you, the difference matters. A lot.

There are far more ways to create 2-dimensional lattices. So many in fact that you can prove all sorts of things about elliptic curves just by looking at 2d lattices, but that is a different blog post.

So to recap:

  • A lattice is a copy of \mathbb{Z}^n, embedded into \mathbb{R}^n.
  • Lattices have bases. They behave nicely, and don’t do weird ring shit.
  • The way the lattice is embedded matters.

Hard problems in lattices

While lattices are fascinating by themselves, we want to do cryptography, so we need to have hard problems. The two hard problems we will look at are called the Shortest Vector Problem (SVP) and the Closest Vector Problem (CVP).

The Shortest Vector Problem is the question "Which vector, other than 0, is the shortest vector in this lattice?" The zero vector is in all lattices and has length 0, so if we didn’t exclude it, it would not be a very interesting problem.

The Closest Vector Problem on the other hand answers the question “Given a vector x, not necessarily in the lattice, which point in the lattice is the closest point to this vector?”

Answering (pink) the CVP for the vector \left(\frac53, \frac34\right) (red)

Good Basis, Bad Basis

At first, it might seem strange that the CVP and the SVP are considered hard problems. After all, you probably didn’t need to even get out a calculator or measuring tape to solve that example CVP, and I didn’t even bother to prove that 4 neighbors, 6 neighbors stuff which is basically the SVP, so what is supposed to be so hard about this?

The problem is, we only looked at this problem in 2 dimensions. But for lattice cryptography, we use something more like 768 dimensions. Unfortunately, graphics cards have trouble rendering those, so we will have to build some intuitions with numbers.

In the example we found that the closest vector to \left(\frac53, \frac34\right) is the lattice point (2, 1). How did my computer calculate this closest vector? Well I told it to round \frac53 to 2 and \frac34 to 1. Exciting stuff.

Let’s look at a completely, totally different lattice, the one generated by \{(2, 3), (5, 7)\}. If we now want to know what is the closest lattice point to (\frac53, \frac34), we first need to change basis to our lattice basis, round, and change back to the standard basis. We see that (\frac53, \frac34) = -\frac{95}{12}\cdot(2, 3) + \frac72\cdot(5, 7), rounding gives us -8\cdot(2,3)+4\cdot(5,7) = (4, 4). Let’s plot this:

The starting point (red) and the “closest” point (pink)

Wait, what? First of all, it turns out, this completely, totally different lattice is the same lattice as before, because (1,0)=-7\cdot(2,3) + 3\cdot(5,7) and (0, 1) = 5\cdot(2,3) -2\cdot(5,7). Turns out, I “accidentally” just used a different basis of the very same lattice. But why did our rounding trick not work? Let’s work backwards: Our solution should be (2,1) = -9\cdot(2,3)+4\cdot(5,7). So we almost got there, but rounding was still off by one in the first coordinate. In 2 dimensions, that’s not a big deal, just brute force by adding or subtracting 1 from each coordinate and see if you find a closer point. Of course, if the dimension is something like 768, brute forcing isn’t going to help.

This means that there are good bases and bad bases. In a good basis, we can just round the coefficients to solve the closest vector problem. In a bad basis however, if the dimension is high enough, you are hopelessly lost. And it turns out the bad bases outnumber the good ones, and if you have a good basis and want a bad one you just randomly base change a bit. (The example basis was literally my first try).
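
Here is a small sketch of the rounding trick in both bases from the example above (the helper is mine; this is essentially Babai’s rounding in two dimensions, done with exact fractions):

```python
from fractions import Fraction

def round_in_basis(basis, target):
    """Approximate CVP: write target in the given basis, round, map back."""
    (a, b), (c, d) = basis                  # basis vectors b1=(a,b), b2=(c,d)
    det = a * d - b * c
    x, y = target
    u = round((d * x - c * y) / det)        # coefficients of target in this basis
    v = round((a * y - b * x) / det)
    return (u * a + v * c, u * b + v * d)

target = (Fraction(5, 3), Fraction(3, 4))
good = ((1, 0), (0, 1))
bad = ((2, 3), (5, 7))                      # the same lattice, different basis
print(round_in_basis(good, target))         # (2, 1)  -- the actual closest point
print(round_in_basis(bad, target))          # (4, 4)  -- rounding goes astray
```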

Now there is a way to go from a bad basis to a good basis, with a method called lattice reduction. But this blog post is already too long, so I won’t go into that. Suffice to say it’s hard to get a good enough basis in high dimensions.

A problem that is hard, but only if you don’t know some secret? This is very promising, let’s do some cryptography!

And how do I encrypt with this?

With all that lattice knowledge, we can dive right into learning with errors.

As parameters, we need a prime q, and lattice rank n. Popular choices are primes like 3329 and ranks like 768. One somewhat sour note is that we will unfortunately only encrypt a single bit.

To generate a key, we need a matrix A, where we take the coefficients uniformly at random between 0 and q. This matrix could theoretically be a public parameter, but choosing it per key makes it impossible to attack all keys at once.

The next thing we need are two vectors, s and e, with small coefficients. And by small we usually mean something like -1, 0, or 1. Maybe the occasional 2, but not larger. We then compute t = A^Ts + e \mod q and make (A, t) the public key, keeping s for the secret key.

To encrypt, we will need two more vectors r and f, also with small coefficients, and a small integer g. We kind of do the same thing as with key generation and compute u = Ar + f \mod q as well as v = t^Tr + m\cdot\frac{q-1}{2}+g \mod q. To build some intuition, let’s look at what size those terms have. Since we multiplied with a bunch of uniformly distributed stuff, t and u are also uniformly distributed. g, as discussed, is small. m\cdot\frac{q-1}{2} is either 0, a number generally considered fairly small, or \frac{q-1}{2}, a number which, modulo q, is the largest possible (we allow negative numbers, so if we make it any bigger it just wraps around). Since t was uniformly distributed, t^Tr is as well. Which is good, since that means it masks whether our message term is big or small, and with only a single message bit, we had better protect it.

So how do we decrypt? Easy, just compute m' = v - s^Tu \mod q and output 0 if it’s small and 1 if it’s large. Since

\begin{align*}m' &= v - s^Tu\\ &= t^Tr + m\cdot\frac{q-1}{2}+g - s^T(Ar + f)\\ &= (A^Ts+e)^Tr - s^T(Ar + f) +m\cdot\frac{q-1}{2}+g\\ &= s^TAr - s^TAr + m\cdot\frac{q-1}{2} + e^Tr-s^Tf +g\end{align*}

The first two terms cancel. The last three terms are small, more precisely not larger than 2n. So m' is large if and only if m was 1, at least as long as we choose our parameters so that the rank is not too large compared to the prime.
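
To make the whole scheme concrete, here is a toy sketch in Python/numpy (my own code, not Kyber or any standardized scheme; q = 3329 as in the post, but I use a smaller rank so that even the worst-case error 2n+1 stays below q/4 and decryption always succeeds):

```python
import numpy as np

q, n = 3329, 256                     # toy parameters; 2n+1 < q/4 guarantees correctness
rng = np.random.default_rng()

def small(size=None):
    return rng.integers(-1, 2, size)         # coefficients in {-1, 0, 1}

# Key generation: t = A^T s + e  (mod q)
A = rng.integers(0, q, (n, n))
s, e = small(n), small(n)
t = (A.T @ s + e) % q                # public key (A, t), secret key s

def encrypt(A, t, m):                # m is a single bit
    r, f, g = small(n), small(n), int(small())
    u = (A @ r + f) % q
    v = (t @ r + m * (q - 1) // 2 + g) % q
    return u, v

def decrypt(s, u, v):
    m_prime = int(v - s @ u) % q
    if m_prime > q // 2:             # centre into (-q/2, q/2]
        m_prime -= q
    return 1 if abs(m_prime) > q // 4 else 0

for m in (0, 1, 1, 0):
    u, v = encrypt(A, t, m)
    assert decrypt(s, u, v) == m
print("all bits decrypted correctly")
```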

Cherchez la lattice

So you might say, that’s nice and all, but where is the freaking lattice? All that stuff about closest vectors and bad bases and so on, it doesn’t look like we actually used any of it. And you’d be right. Lattices are kind of weird in that they are everywhere in arithmetic number theory, but they like to hide a bit sometimes. So let’s find this lattice, by attempting to break this scheme.

We want to recover the secret key s from the public key A, t = A^Ts+e \mod q. Let’s do this in two steps. First, we try to find any vectors s' and e' that solve t=A^Ts'+e'\mod q, without caring whether or not they are short, and then we can try to modify them to be short.

We can rewrite our equation as (A^T, \operatorname{Id}, q\operatorname{Id})(s', e', k) = t, with k being some integer vector that we just use so we don’t have to worry about the \mod q part. This is solving a linear equation over the integers. It’s not quite as straightforward as the standard Gaussian algorithm, but a small modification due to Smith (the Smith normal form) will solve it and give us both a special solution (s', e', k) and a basis of the kernel of (A^T, \operatorname{Id}, q\operatorname{Id}). Recall that any solution, including the short s and e we are looking for, differs from (s', e') by an element of the kernel.

Hmm, looking more closely at that kernel: it is a \mathbb{Z}-module by virtue of being a kernel, it has a basis, so it is free, and the basis is finite, so it is finitely generated. We don’t really care about the last n coefficients, which we only needed for the modulo, but we can just drop them, and the result remains a free, finitely generated \mathbb{Z}-module. Given that we want to find short vectors, and our notion of shortness was just the normal Euclidean norm, we can embed this kernel trivially into \mathbb{R}^{2n}, and have found a lattice. In fact, we have found the lattice, since with a good basis we could find the vector (s_0, e_0) closest to (s', e') in the kernel lattice, which means their difference would be short, and indeed it would be (s, e) (and even if it were not, the only property needed for the decryption algorithm to work is that s and e are short, so decryption would still succeed).

Smith unfortunately gives us a very large special solution and kernel basis vectors that are very much a bad basis of the kernel, which means that we won’t be able to find this closest vector, and this property gives Learning With Errors its security.

Closing Remarks

Since we have encrypted a single bit, if we used the scheme as is, it would be incredibly wordy, since we would need to repeat all of this 128 times (or more) to actually transmit a key. Kyber (or ML-KEM in NIST terms) uses some trickery to both make keys and ciphertexts smaller and operations faster, but I won’t go into them in this blog post.

Also, just like textbook RSA, the algorithm I presented here would not be secure in interactive settings, and needs additional modifications to ensure that we cannot extract the secret key by sending several related messages and observing whether the decrypting party uses the same or a different key. But this too would need a separate blog post.

Categories
Infrastructure Security

TLDR: The Value of a Small but Visible Investment in Infrastructure Security

A short letter to our non-infrastructure colleagues

It’s reasonable for infrastructure security to be only a small investment in a consumer product business among our many priorities. But take care to create and keep a small space for raising visibility of infrastructure risk. InfraSec is among the least likely categories of risk — while also being among the most impactful types of incidents!


To oversimplify:

  • One abusive user could Rickroll everyone around them.
  • One appsec exploit could spam Rickrolls to every player everywhere.
  • One infrasec exploit could replace all game assets with Rickrolls — and delete all server-side game data, and steal personal information and payment data from players and employees, and deploy a botnet, and bury a rootkit/surveillance script/C2 in the hope that defenders will miss cleansing it with fire.

Is the infrasec exploit likely to happen frequently? No—and abusive users are inevitable, at a far higher rate.

InfraSec isn’t “the most important” or “the most impactful” or “the highest risk”. InfraSec provides a collection of critical foundation layers among our portfolio of security approaches for defense in depth.

InfraSec is the reason why a single AppSec vuln exploit doesn’t turn into a data breach — e.g., by restricting lateral movement, or by preventing unauthorized egress, or by isolating a minimally privileged container/VM to limit the blast radius of a web app vuln.

AND — Appsec is the reason why we can accept taking down an InfraSec control while we fix an issue, or why we can choose a weaker InfraSec control in a middle layer as a tradeoff for better performance, while strengthening the surrounding controls.

It’s always a challenge to balance the right investment in infrastructure, whether it’s SREs, platform engineering, infrastructure security, or the many other folks that support our infra! Once you make that investment in InfraSec, also find the balance for when and how much InfraSec is part of the conversation on risk assessment and on risk mitigation.

Categories
Cryptography Security Stupid TLS Facts

Stupid TLS Facts: TLS Resumption

DID YOU KNOW that TLS doesn’t actually have perfect forward secrecy in most circumstances?

Rather, TLS has “conditional future imperfect secrecy” because it would have had perfect forward secrecy without resumption.

“TLS Session Resumption” is the TLDR of the myth of forward secrecy in TLS.

SSL handshakes are expensive, requiring multiple round trips for key exchange and expensive computations. Enter TLS resumption. With TLS resumption, a session ticket enables both client and server to jump back into their shared session with a single round trip (plus the SYN/ACK round trip) — and the resumption key derivation is a tenth of the cost of the full key exchange computation (per Cloudflare’s resumption performance data).

However, that optimization comes at the cost of forward secrecy: compromising the stored session ticket compromises any past sessions (as well as future sessions, which is already assumed in the TLS threat model, since theft of TLS private key material will compromise any future sessions using that key).

Forward secrecy (or “perfect forward secrecy”, PFS) is the assurance that compromising a system at time T won’t cause a compromise of any sessions/messages before time T.

Once a system is compromised, forward secrecy does not provide assurances about future sessions, which are assumed to be compromised until secrets & keys are rotated (and attacker is removed, etc).

How does TLS provide forward secrecy when resumption isn’t enabled? Through key exchange! The foundation of confidentiality in TLS is through a sequence of cryptographic operations in which the client and server share public values that they each combine with their private values to produce their shared secret (which is cryptographically infeasible for observers to derive from only the public values).
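
As a concrete sketch of that forward-secret core (using ephemeral X25519 via the pyca/cryptography package; this is a toy model of the idea, not the actual TLS 1.3 key schedule):

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

# Each side generates a fresh ephemeral private value for this session...
client_priv = X25519PrivateKey.generate()
server_priv = X25519PrivateKey.generate()

# ...and only the public halves cross the wire.
client_pub, server_pub = client_priv.public_key(), server_priv.public_key()

# Both sides combine their private value with the peer's public value and
# arrive at the same shared secret; an eavesdropper only sees public keys.
shared_client = client_priv.exchange(server_pub)
shared_server = server_priv.exchange(client_pub)
assert shared_client == shared_server

# Derive the session key from the shared secret (TLS 1.3 uses HKDF for this).
session_key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
                   info=b"toy session key").derive(shared_client)

# Forward secrecy: once both ephemeral private values are deleted after the
# handshake, a later server compromise cannot reconstruct session_key.
```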

How does resumption violate forward secrecy? If the attacker, Eve, compromises the server after a TLS session, they would have the resumption ticket and the TLS private key. The TLS private key enables Eve to impersonate the server. The resumption ticket enables Eve to breach the confidentiality of past recorded TLS sessions. (Notwithstanding the caveats in the last section.)

TLS Resumption: Stateful & Stateless

TLS resumption comes in 2 flavors, stateful and stateless. The abbreviated handshake for TLS resumption is an extension to the typical ClientHello & ServerHello, enabling the abbreviated handshake to fall back to the full handshake if either side doesn’t have the resumption data.

In stateful resumption, the server designates a session ID, and both the client and the server store that ID together with the shared secret derived in the key exchange. The client can then include the session ID in the ClientHello along with the usual ClientHello contents (cipher suites, random number, etc). The ServerHello response also includes the session ID along with the usual ServerHello contents (cipher suite, random number, etc). Each side then combines those random numbers with the shared secret corresponding to the session ID to derive a new shared session secret.

Stateful resumption requires a connection with the same single “server” holding onto those session details, which are usually cached in memory for up to 24 hours.

The plaintext session information or session ticket IS the shared symmetric encryption key (often called the “master secret”) used to encrypt the first TLS session between a client and a server.

Would it be wise to store a different key so that past sessions are protected by forward secrecy? Yes.

Is that how session resumption was bolted onto SSL? Nope!

In stateless resumption, the server stores session details (protocol version, cipher suite, shared secret, and client identity) in an encrypted session ticket, and this session ticket is stored by the client. The client can resume the session by sending the session ticket in the ClientHello.

The session ticket is encrypted by the Session Ticket Encryption Key (STEK), which is only known to the server, and is typically rotated every 1-24 hours (but can be arbitrarily longer, per RFC 5077). Thus, the server only needs to hold on to the STEK, and the client holds onto the session ticket along with the shared secret.

See examples of session resumption packets.
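
A minimal sketch of the stateless mechanism (not the real TLS ticket format; names are illustrative), mostly to show why the STEK is such a tempting target: the ticket is nothing more than the session secrets, encrypted under a key the server keeps around.

```python
import json
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

stek = AESGCM.generate_key(bit_length=128)       # Session Ticket Encryption Key

def issue_ticket(stek, session_state):
    nonce = os.urandom(12)
    ct = AESGCM(stek).encrypt(nonce, json.dumps(session_state).encode(), None)
    return nonce + ct                            # stored by the client

def resume(stek, ticket):
    nonce, ct = ticket[:12], ticket[12:]
    return json.loads(AESGCM(stek).decrypt(nonce, ct, None))

ticket = issue_ticket(stek, {"cipher": "TLS_AES_128_GCM_SHA256",
                             "master_secret": os.urandom(32).hex()})

# Anyone who later obtains the STEK and has recorded the ticket on the wire
# recovers the master secret, and with it the recorded session's plaintext.
print(resume(stek, ticket)["master_secret"])
```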

Compromises Are Made

The false promise of TLS’s claim to forward secrecy was that a server compromise would not compromise the confidentiality of previous TLS sessions. However, with TLS resumption, the forward secrecy of past sessions is bound to the lifetime of the session information or Session Ticket Encryption Keys held on the server.

Compromising a server holding session information hands the shared session keys directly to the adversary, Eve. In the case of session tickets, if Eve gets hold of the STEK, they can decrypt the encrypted session ticket to get the same session information. The session information encrypted by the STEK is directly the previous session’s shared symmetric encryption key, and thus Eve can decrypt the encrypted messages of the TLS session.

Either case also requires the adversary to observe and record past sessions (from the handshake onwards) — but recall that maintaining the confidentiality of all previously recorded sessions in the event of a compromise is exactly the assurance of perfect forward secrecy.

TLS 1.3 & 0-RTT

TLS 1.3 supports a resumption mode called “0-RTT” (zero round trip time). With 0-RTT, a client can use a Pre-Shared Key (PSK) to send an encrypted HTTP request with the very first message (after SYN/ACK). This is called “early data”.

Because the PSK is used directly as the symmetric encryption key, the early data transmitted by the client is not protected by forward secrecy (much like session tickets in earlier TLS versions). However, the client may start (EC)DHE key sharing in that first message, enabling the server to complete key exchange and send a response that IS protected by forward secrecy. If an attacker breaches the STEKs and has a (recently) recorded message, they can decrypt the PSK and then decrypt the early data — but starting from the server’s response, the attacker will not be able to decrypt later messages that are encrypted using the new shared key derived from (EC)DHE.

If you’re interested in digging into the details of how the TLS 1.3 handshake achieves this, I recommend this C3 talk by Filippo Valsorda and Nick Sullivan.

This sounds great! TLS 1.3 delivers forward secrecy with TLS resumption! Rejoice! Alas, briefly — 

In this document, the keywords "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", and "MAY" are to be interpreted as described in RFC 2119.
https://datatracker.ietf.org/doc/html/rfc2119

That is, the client MAY start (EC)DHE key sharing next to the PSK-encrypted HTTP request in the TLS 1.3 ClientHello. The reality of whether this is applied in practice is…inconsistent.

This (along with systemic oppression) is why we can’t have nice things.

Implementations can choose to require that clients perform key exchange (either in resumption or as a brand new session). For example, BoringSSL requires fresh key material in TLS 1.3 resumption.

0-RTT Replay Attack

Without compromising server nor client, an attacker can replay a 0-RTT message from a client. While the message’s confidentiality and integrity are protected by the PSK, a replay attack can cause a server to perform the previously requested action again, potentially under different circumstances — e.g., “undo the last action” or “open this door” [for the attacker]. Thus, 0-RTT requests should be idempotent (which by definition means “produce the same result regardless of how many times this function is applied”).

Conclusion

You might have wondered when session resumption “officially” became part of SSL or TLS. This is hard to trace, because the practice of storing session keys and using the SessionID in ClientHello extensions for session resumption dates back earlier than the 1996 SSL 3.0 protocol, as captured for historical record by RFC 6101.

Papers analyzing SSL 3.0 refer to session resumption in SSL 2.0. E.g., in Wagner & Schneier’s “Analysis of the SSL 3.0 Protocol”, they specifically call out rollback attacks involving session resumption. The paper finishes with a specific callout around the sensitivity of the master secret:

Ensuring that the master secret remains truly secret is tremendously important to the security of SSL. All session keys are generated from the master secret, and the protection against tampering with the SSL handshake protocol relies heavily on the secrecy of the master secret. Therefore, it is important that the master secret be especially heavily guarded.

Wagner D, Schneier B (1996) Analysis of the SSL 3.0 protocol. In: The Second USENIX Workshop on Electronic Commerce Proceedings, vol 1, no 1, pp 29–40
https://www.usenix.org/legacy/publications/library/proceedings/ec96/full_papers/wagner/wagner.pdf

That being said, forward secrecy was just barely on the radar in the mid-1990s, and primarily from the perspective of compromising months and years of confidential messages after a private key is compromised. The idea of remotely (or physically) compromising a server in order to compromise the confidentiality of a handful of recent sessions was not a relevant threat at that time.

Even today, compromising TLS session resumption tickets remains a negligible threat in the overwhelming majority of threat models.

Welcome to Stupid TLS Facts! If you particularly enjoyed/hated this, why not share it with your friends?

Appendix: Nuances are made

Threat Modeling Conditional Forward Secrecy in TLS

Is the limited nature of the conditional forward secrecy provided by TLS relevant to Alice & Bob? Let’s examine a few scenarios that could be impacted by compromised TLS session information within 24 hours after these events:

  1. Alice browses Wikipedia: No confidential data.
  2. Alice receives an email from her mother: Potentially confidential information. Alice would be surprised by a breach of that confidentiality. Alice probably isn’t a target of an attacker. It’s up to Alice to evaluate her own threat model, but this is almost definitely not a priority.
  3. Alice reads Bob’s free speech and right to privacy blog: No confidential information. Alice should explain to Bob that freedom of speech is a right to expression without fear of government retaliation or censorship, give him 1984 by George Orwell, and teach him how to assess his threat model.
  4. Alice is a CEO of a major corporation who receives an email from her CFO: Likely confidential information. Alice would be surprised by a breach of that confidentiality. Alice is probably the target of an attacker. Could have moderate impact beyond Alice. Alice’s security org might evaluate how their critical services such as email providers configure TLS resumption — but any attacker that is interested in Alice’s corporation probably has more impactful targets in reach if they manage to get resumption keys (either at that corporation or at an email provider) — such as breaching the mail database itself. High effort and high risk for a “moderate” reward at best.
  5. Alice is a CEO of a major corporation whose customers have high expectations for confidential interaction with their services: Alice’s security org should definitely take care in evaluating resumption (and all) key management practices regularly, as well as configuration of TLS resumption. (E.g., BoringSSL requires fresh key material in TLS 1.3 resumption.)
  6. Alice is a leader of a Nation State who receives a confidential email from another Nation State leader, Bob: Definite breach of sensitive confidential information. Alice should definitely know about this risk and work to avoid it. Nation States are definitely high profile targets for many attackers, with potentially broad and severe impact for a breach of confidentiality. Alice is undoubtedly already using protocols supporting backward secrecy (AKA self-healing secrecy providing post-compromise confidentiality), such as Signal’s double ratchet algorithm.
  7. Alice is organizing protests against human rights violations: Alice should absolutely be relying on protocols supporting end to end encryption, full perfect forward secrecy, self-healing secrecy, and more — see the Electronic Frontier Foundation’s guide to Surveillance Self-Defense.

TLS cipher suites lacking forward secrecy

TLS 1.3 only supports cipher suites that provide forward secrecy (aside from the resumption mode without key exchange). Before TLS 1.3, some allowed cipher suites could not assure forward secrecy. These were allowed in order to preserve compatibility.

In TLS 1.2, cipher suites without forward secrecy were marked as not recommended. Notably, RSA key exchange is deprecated.

See additional detail in a recent draft RFC by John Preuß Mattsson, “NULL Encryption and Key Exchange Without Forward Secrecy are Discouraged“. This RFC has some excellent references as well, including “Pervasive Monitoring Is an Attack” (RFC 7258).

TLS 1.2 ClientHello & Random Numbers

In TLS 1.2, the client and server each send new random numbers to derive a new shared secret for each session resumption. How is this different from TLS 1.3? Because in TLS 1.2, these random numbers are hashed together with the shared secret to create the new shared secret (an inexpensive operation), whereas in TLS 1.3, these new random values are used in (EC)DHE to derive a new shared secret (a relatively more expensive operation). (c.f., RFC 4346 Security Considerations).

Modern Architecture & Key Management

The reality of modern server architecture is that you’ll have a collection of servers—and more likely, a collection of load balancers terminating TLS at the “edge” of your cloud environment. The days of deriving an encryption key held only in memory by a single server are over: more likely, a set of STEKs will need to be managed across a collection of servers, so that any given server can decrypt the session tickets that another server created for that client.

As recommended by Adam Langley, these session ticket keys should be

  1. Randomly generated and distributed (probably regionally)
  2. Rotated frequently (to limit how severely forward secrecy is compromised)
  3. Held in memory and never written to persistent storage (because the only method to delete data with high assurance is to destroy the storage medium following NIST 800-88 media sanitization guidelines)

TLS 1.3 External PSKs

Earlier, I described Pre-Shared Keys in TLS 1.3 as a value provided by the server at the end of the TLS 1.3 handshake. However, as implied by the PSK naming, TLS 1.3 also supports a client designating and using a Pre-Shared Key in exactly the same way as PSKs are used in IPsec or WPA.

RFC 9257 (“Guidance for External Pre-Shared Key (PSK) Usage in TLS“) addresses the security properties provided by PSKs and the sharp edges around designing key distribution for PSKs.

Categories
Cryptography

Invisible Salamanders in AES-GCM-SIV

By now, many people have run across the Invisible Salamander paper about the interesting property of AES-GCM, that allows an attacker to construct a ciphertext that will decrypt with a valid tag under two different keys, provided both keys are known to the attacker. On some level, finding properties like this isn’t too surprising: AES-GCM was designed to be an AEAD, and nowhere in the AEAD definition does it state anything about what attackers with access to the keys can do, since the usual assumption is that attackers don’t have that access, since any Alice-Bob-Message model would be meaningless in that scenario.

What is interesting to me is that this property comes up more often than one would think; I have run across it several times now during my work reviewing cryptographic designs, so it’s far from an obscure property for real world systems. The general situation these systems have in common is that they involve three parties: Alice, Bob, and Trent. Trent is a trusted third party for Bob, who is allowed to read messages and scan them, with details like when and why depending on the crypto system in question. While Trent and Bob agree on the ciphertext (say, because Trent hands Bob the ciphertext, or because Alice presents Trent’s signature on it), Alice has the option of giving Trent and Bob different keys. The challenge for Alice is to come up with a ciphertext that has a valid authentication tag and still decrypts to different messages for Trent and Bob.

Mitigations

Before I dive deeper into how to construct invisible salamanders for AES-GCM and AES-GCM-SIV, a few words on how to defend against these problems. The easiest option here is to add a hash of the key to the ciphertext. This technically violates indistinguishability, as the identity of the key is leaked, i.e. an attacker now knows which key was used for the message. If indistinguishability is necessary, using the IV as a salt for the hash works well; constructions like HMAC-SHA-2(key=IV, message=key) (aka HKDF-Extract) work well here, as long as attention is paid to whether or not this key hash can be used in any other context. In general, it shouldn’t be, since the key should only be used for AES-GCM/AES-GCM-SIV anyway, but real world systems sometimes have weird properties.
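
A minimal sketch of that mitigation (my own illustration; the function names are made up, and a real design would pin down the exact commitment construction and its domain separation):

```python
import hashlib
import hmac
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def key_commitment(key, iv):
    # HMAC(key=IV, message=key): binds the ciphertext to one specific key
    # without revealing which key it is to anyone who doesn't hold it.
    return hmac.new(iv, key, hashlib.sha256).digest()

def seal(key, plaintext, aad=b""):
    iv = os.urandom(12)
    ct = AESGCM(key).encrypt(iv, plaintext, aad)
    return iv, ct, key_commitment(key, iv)

def open_(key, iv, ct, commitment, aad=b""):
    if not hmac.compare_digest(key_commitment(key, iv), commitment):
        raise ValueError("wrong key: commitment mismatch")
    return AESGCM(key).decrypt(iv, ct, aad)

key = AESGCM.generate_key(bit_length=128)
iv, ct, c = seal(key, b"no salamanders here")
print(open_(key, iv, ct, c))
```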

Constructing Salamanders

With the mitigation out of the way, onto the fun part: constructing the messages. In order to understand why and how these attacks work, we first have to talk about \mathbb{F}_{2^{128}} and the way AES-GCM and AES-GCM-SIV use this field to construct their respective tags. As a finite field, \mathbb{F}_{2^{128}} supports addition, multiplication, and division, following the usual field axioms. The field has characteristic 2, which means addition is just the xor operator, and subtraction is the exact same operation as addition. Multiplication and division are somewhat more complicated and not in scope for this article; it suffices to say that multiplication can be implemented with a very fast algorithm if the hardware supports certain instruction sets (carryless multiplication). The division algorithm uses the Euclidean algorithm and will take at most 256 multiplications in a naive implementation, so while slower than the other operations, it will still be extremely fast. I will use + for the addition operation and \cdot for the multiplication operation. The most important caveat is to not confuse these operations with integer arithmetic.
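
For readers who want to play with the field, here is a toy model (slow, not constant time, and it uses a plain polynomial representation rather than GCM’s bit-reflected one, so it will not interoperate with real GHASH; the algebra is the same, though):

```python
# GF(2^128) with reduction polynomial x^128 + x^7 + x^2 + x + 1.
MOD = (1 << 128) | (1 << 7) | (1 << 2) | (1 << 1) | 1

def gf_add(a, b):
    return a ^ b                       # addition in characteristic 2 is xor

def gf_mul(a, b):
    r = 0
    while b:                           # schoolbook carry-less multiplication
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    for i in range(r.bit_length() - 1, 127, -1):   # reduce modulo MOD
        if (r >> i) & 1:
            r ^= MOD << (i - 128)
    return r

def gf_inv(a):
    r, e = 1, (1 << 128) - 2           # a^(2^128 - 2) = a^(-1), ~255 multiplications
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

x = 0x1234567890ABCDEF
assert gf_mul(x, gf_inv(x)) == 1
```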

AES-GCM

Next, on to AES-GCM. This AEAD is a relatively straightforward implementation of an AEAD that uses a UHF based MAC for authentication. Our IV is 12 bytes long, and we use a 4 byte counter and CTR mode to encrypt the message. The slightly odd feature is that we start the counter at 2, for reasons we will see later. For authentication, we first derive an authentication key H by encrypting the zero block (this is why we don’t start the counter at zero; otherwise the zero IV would be invalid). Now, using the ciphertext blocks, the additional data blocks (both padded with zeros as needed for the last block), and adding a special length block containing the size of the additional data and the ciphertext, we get a collection of blocks, all of which I will refer to as C_i. To compute the tag, we now compute the polynomial

GHASH(H, C, T) = C_0\cdot H^{n+1} + C_1\cdot H^{n}+\dots + C_{n-1}\cdot H^2+C_n\cdot H+T

The constant term, T, is the encrypted counter block associated with the counter value 1 (which is why we started at 2 for the CTR mode). Remember that in characteristic 2, + is xor, so we could equivalently say that we compute the polynomial without the constant term and then encrypt it with CTR mode as the first block.

Now, how do we get two different plaintexts to agree on both ciphertext and tag? We first choose two keys and produce the corresponding keystreams, choosing the plaintexts so that the ciphertexts agree. (If you want two plaintexts that make sense, this part is the hardest step: you first brute force the first few bytes so that they are valid in one format and a comment opening statement in the other, which lets you switch which parts of the ciphertext will appear as valid plaintext and which parts appear as commented out.) We leave one ciphertext block open for now, as a sacrificial block that we will modify in order to make the tags turn out to be the same. Next, we derive the corresponding authentication keys H_1 and H_2 and our constant terms T_1, T_2. This means we have the C_i fixed, except for a specific index, say j, and can now solve

\begin{align*}GHASH(H_1, C, T_1) &= GHASH(H_2, C, T_2)\\ \sum_{i=0}^n C_i\cdot H_1^{n+1-i}+T_1 &= \sum_{i=0}^n C_i\cdot H_2^{n+1-i}+T_2\\ C_j\cdot\left(H_1^{n+1-j}+H_2^{n+1-j}\right) &= \sum_{\substack{i=0\\i\neq j}}^n C_i\cdot \left(H_1^{n+1-i}+H_2^{n+1-i}\right)+T_1+T_2\\ C_j &= \left(H_1^{n+1-j}+H_2^{n+1-j}\right)^{-1}\cdot\left(\sum_{\substack{i=0\\i\neq j}}^n C_i\cdot \left(H_1^{n+1-i}+H_2^{n+1-i}\right)+T_1+T_2\right)\end{align*}

by solving for the sacrificial block C_j.
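
Continuing the toy field sketch from above (it assumes gf_add, gf_mul, and gf_inv are in scope), here is the balancing step in isolation; the H and T values below are arbitrary placeholders standing in for the values derived from the two AES keys:

```python
def ghash(H, blocks, T):
    acc = 0
    for c in blocks:                   # Horner evaluation of the GHASH polynomial
        acc = gf_mul(gf_add(acc, c), H)
    return gf_add(acc, T)

def h_pow(H, k):                       # H^k
    r = 1
    for _ in range(k):
        r = gf_mul(r, H)
    return r

H1, H2, T1, T2 = 0x11, 0x22, 0x33, 0x44   # placeholders for key-derived values
blocks = [0x05, 0x06, 0x07, 0x00]         # last block is the sacrificial one
n, j = len(blocks) - 1, len(blocks) - 1

# Solve for the sacrificial block C_j as in the last line of the derivation.
rhs = gf_add(T1, T2)
for i, c in enumerate(blocks):
    if i != j:
        rhs = gf_add(rhs, gf_mul(c, gf_add(h_pow(H1, n + 1 - i), h_pow(H2, n + 1 - i))))
blocks[j] = gf_mul(gf_inv(gf_add(h_pow(H1, n + 1 - j), h_pow(H2, n + 1 - j))), rhs)

assert ghash(H1, blocks, T1) == ghash(H2, blocks, T2)
```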

AES-GCM-SIV

So far so good, but what about AES-GCM-SIV? GCM is famous for having many weird properties that make it extremely fragile, like leaking the authentication key on a single IV reuse, or allowing for insecure tags smaller than 128 bits. In many ways, AES-GCM-SIV is what AES-GCM should look like for real world applications: much more robust against IV reuse, only revealing the damaging properties of a UHF with a reused IV if both IV and tag are the same. This is accomplished by using the tag as a synthetic IV, meaning the tag is computed over the plaintext and then used as the IV for CTR mode to encrypt. Even though this kind of SIV construction uses MAC-then-Encrypt, it is secure against the usual downsides, because CTR mode decryption always succeeds in constant time, independent of the plaintext. This means the receiver can decrypt the message and validate the tag without revealing information about the plaintext in case of an invalid tag. The library needs to take care that the plaintext is properly discarded and not exposed to the user in case the tag does not validate.

The actual IV for AES-GCM-SIV is primarily used to derive a per-message key. This means that if the IVs of two messages are different, both encryption and authentication keys will be unrelated and cannot be used to infer things about each other.

All in all AES-GCM-SIV works like this:

  • H, K_E = \operatorname{KDF}(K, IV)
  • T=\operatorname{AES}(K_E, P_0\cdot H^{n+1}+\dots+P_n\cdot H)
  • C=\operatorname{AES-CTR}(K_E, IV=T)

where the plaintext blocks P_i again contain additional data and length, and some extra hardening and efficiency tricks having been stripped for clarity.

Our previous approach of first creating the ciphertext and then balancing things out to get the tags to agree clearly cannot work here anymore. The keystream, and therefore the ciphertext, depend on the tag, so if we want to have any chance of finding a salamander, we have to fix the tag before we do any calculation at all. So after having chosen T, we decrypt it under each of our keys to get the result of our polynomial S_i=\operatorname{AES}^{-1}(K_{E,i}, T). What we are left with is finding plaintexts P_1, P_2 such that

S_i=\sum_{j=0}^n P_{j, i} H_i^{n+1-j}

which gives us a system of two linear equations with 2n unknowns. But these aren’t all the constraints we need to satisfy, since we still need to encrypt these plaintexts once we have the tag balanced. Here, we are lucky that everything is over characteristic 2: the CTR encryption is just an addition of the plaintext and the encrypted counter block, C_i=\operatorname{AES}(K_E, CB_i)+P_i. To say that two plaintexts result in the same ciphertext under two different keys is just fulfilling the equation

\operatorname{AES}(K_{E, 1}, CB_{j, 1})+P_{j, 1}=\operatorname{AES}(K_{E, 2}, CB_{j, 2})+P_{j, 2}.

This, like our two equations for the tag, is a linear equation. So in the end, for a plaintext that has a size of n blocks, we get n+2 linear equations with 2n variables. This means that, in almost all cases, we can construct an invisible salamander by adding only two sacrificial blocks, with the same caveat that the two plaintexts need to be partially brute forced.

Test Code

I’ve put this to the test and have written code to produce AES-GCM (Java) and AES-GCM-SIV (C++) salamanders.

Categories
Cryptography (Incomprehensible)

Cartier Divisors

As an obvious first blog post, easily understandable and very relevant to cryptography (/s), here is a description of Cartier Divisors, because Thai asked for it on Twitter.

For this, first some history: A while ago, I taught a Google internal course about the mathematics of elliptic curves. It would probably make sense to start with that content, but I’m going to assume that I’ll come back to it and fix the order later.

Anyways, the objects we are looking at are Divisors and Principal Divisors. They come up when studying curves as a way to describe the zeroes and poles of functions. Over the projective line \mathbb{P}_K^1 (also known as just the base field plus a point at infinity), a rational function (the quotient of two polynomials) can have any selection of zeroes and poles it so pleases, with the only constraint being that there must be (with multiplicity) the same number of zeroes and poles. We can see that by looking at

\frac{(X-a_1)(X-a_2)\dots (X-a_n)}{(X-b_1)(X-b_2)\dots (X-b_n)}

for a function with zeroes at a_1, a_2, \dots, a_n and poles at b_1, b_2, \dots, b_n. If a_i or b_i is \infty, then we ignore the corresponding term, and get a zero/pole at infinity.

On more general curves, we do not have this amount of freedom. The lack of freedom we have in choosing zeroes and poles is tied surprisingly deeply to the curve in question, so describing it turns out to be very useful.

A Weil divisor is a formal sum of points of the curve, that is, we assign an integer to every point of the curve, with all but finitely many points getting assigned the integer zero. The degree of a divisor is the sum of all these integers. The divisor of a function, \operatorname{div}f, is the divisor we get by assigning to each point the order of the function at that point, i.e. setting it to 1 for simple zeroes, -1 for simple poles, and so on. If a divisor is the divisor of a function, we call it a principal divisor.
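
As a quick worked example (my own, to make the definition concrete): on \mathbb{P}^1, the function f(X) = \frac{X-1}{(X-2)^2} has a simple zero at 1, a double pole at 2, and a simple zero at \infty (the denominator has higher degree), so

\operatorname{div} f = (1) - 2\cdot(2) + (\infty),

a divisor of degree 1 - 2 + 1 = 0, as required by the zero/pole counting constraint above.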

With these definitions out of the way, we can get to Thai’s question. It turns out that the thing we are interested in is the divisors of degree 0 modulo the principal divisors. This group in some sense measures how restricted we are in our choice for placing zeroes and poles. It turns out that for elliptic curves, every divisor of degree 0 is equal to a divisor of the form P - O, with O being the point at infinity (or really any fixed (“marked”) point on the curve), up to a principal divisor (equal up to a principal divisor is also called linearly equivalent). So what Thai is asking is: while we can think of principal divisors as a description of rational functions, what are the other divisors? The simple answer to that is that they are just what we said, formal sums of points, just some points with some integer weights. For elliptic curves, the degree 0 divisor classes are conveniently in a 1:1 correspondence with the points of the curve itself, which is why we usually gloss over the whole divisor thing and just pretend to add points of the curve themselves. But this answer is kind of unsatisfying, and it does not generalize well in higher dimensions or for curves with singularities in them, so a better concept is needed.

Enter Cartier Divisors. In order to explain these, we’re technically going to need sheaves, but sheaves are a bit much, so I’ll try to handwave some things. The basic idea is: since we want to describe zeroes and poles, why don’t we just use zeroes and poles for that? Of course we can’t use a full function that is defined everywhere for that, since that would only give us the principal divisors. But locally, we can use a function to describe zeroes or poles. Now what does locally mean? In algebraic geometry, the topologies we’re using are kind of weird. Here, we are using the Zariski topology, which for curves just means that when we say locally, we mean the whole curve with a finite number of points removed. We use this to remove any zeroes or poles we don’t want for our divisor from our local representative.

All in all, that means a Cartier divisor on a curve C is a covering (U_i), i.e. a collection of open sets (the curve minus a finite number of points) such that their union is the whole curve, together with a rational function f_i per U_i, defined on U_i. This function’s zeroes and poles are what we understand as the divisor. Obviously, we now need to make sure all these partial functions work well as a whole. We do that by looking at U_i \cap U_j and the functions f_i and f_j restricted to that intersection. If we want this construction to define a consistent divisor, then f_i/f_j cannot have any zeroes or poles in U_i \cap U_j. We write this as

f_i/f_j \in \mathcal{O}^\times (U_i \cap U_j)\;\;.

This now describes a consistent global object with zeroes and poles as we want them, getting quite close to describing divisors in a completely different way! We just have one problem, there are way too many functions with a specific pattern of zeroes and poles on our local neighborhood U_i, we need to get rid of all the extra behavior that isn’t just zeroes and poles! To do that, we need to look at two functions f_i and g_i on U_i that have the same pattern of zeroes and poles. What happens when we take f_i/g_i? Well we, as above, get a function without zeroes or poles on U_i. So if we want to forget all that extra structure, we need to take f_i modulo the set of functions without zeroes or poles on U_i. And that’s it.

If we write \mathcal{M}^\times(U_i) for the rational functions that are not equal to zero (so the rational functions that have a multiplicative inverse) and write \mathcal{O}^\times (U_i) for the functions without zeroes or poles on U_i, we can now describe a Cartier divisor as a covering (U_i) together with an element f_i \in \mathcal{M}^\times(U_i)/\mathcal{O}^\times(U_i) such that f_i/f_j\in\mathcal{O}^\times(U_i \cap U_j). A principal Cartier divisor is a Cartier divisor that can be described with just using just the entire curve C as the only element of the covering.

For extra bonus points (which I will not describe in detail here, because this blog post is already way too long and completely incomprehensible), we can look at what happens if we now take these Cartier divisors modulo principal Cartier divisors. It turns out, that the result can be described again with a covering U_i, but this time, instead of going through all that choosing of rational functions per set, we just use the intersections, and choose an element f_{ij}\in \mathcal{O}^\times (U_i \cap U_j), without even looking at rational functions in U_i at all, with some sheafy/cohomological rules for when two of those things are equal.