Scaling Ory Hydra to ~2bn monthly OAuth2 flows on a single PostgreSQL DB

Ory Hydra is the most popular open-source OAuth2 and OpenID Connect server that provides secure authentication and authorization for applications. One of the key challenges in building a scalable and performant OAuth2 server is managing the persistence layer, which involves storing and retrieving data from a database.

Motivation

Ory was approached by a popular service provider to optimize the performance of their auth system under high load. They were using Auth0 at the time and often struggled to cope with the huge influx of authorization grants during peak times (over 600 logins/sec). Looking for another solution, they started to evaluate Ory Hydra to find out whether it could handle this volume of grants. After they reached out to the Ory team, we assessed Ory Hydra's performance capabilities and investigated ways to make it faster and more scalable than before. The key to success was to re-engineer parts of Hydra's persistence layer to reduce write traffic to the database, moving to a transient OAuth2 flow.

Moving to a transient OAuth2 flow

One of the core parts of this work was moving a large chunk of the transient OAuth2 flow state, which is exchanged between the three parties involved in an OAuth2 flow, from the server to the client. Instead of persisting transient state to the database, the state is now passed between the parties as either AEAD-encoded cookies or AEAD-encoded query parameters in redirect URLs. AEAD stands for authenticated encryption with associated data, which means that the data is confidential and also can't be tampered with without knowing a secret (symmetric) key.

The flow is then persisted in the database only once, when the final consent is given.

This change has several benefits. First, it reduces the amount of data that needs to be stored in the database, which in turn reduces write traffic. Second, it eliminates the need for multiple indices on the flow table that were previously used during the exchange.
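To make the AEAD idea concrete, here is a minimal sketch that seals a small piece of flow state into a URL-safe token and opens it again. It is not Hydra's implementation: the `flowState` type, the `seal` and `open` helpers, and the choice of AES-256-GCM from the Go standard library are assumptions made for illustration only.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
)

// flowState is a hypothetical, heavily reduced stand-in for the real flow.
type flowState struct {
	Subject  string `json:"sub"`
	ClientID string `json:"client_id"`
}

// newGCM builds an AES-GCM AEAD from a 16-, 24-, or 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal serializes the state, encrypts it, and returns a URL-safe token.
func seal(key []byte, s flowState) (string, error) {
	plaintext, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	gcm, err := newGCM(key)
	if err != nil {
		return "", err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return "", err
	}
	// Prepend the random nonce so that open can recover it.
	sealed := gcm.Seal(nonce, nonce, plaintext, nil)
	return base64.RawURLEncoding.EncodeToString(sealed), nil
}

// open reverses seal and fails if the token was modified or the key is wrong.
func open(key []byte, token string) (flowState, error) {
	var s flowState
	raw, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil {
		return s, err
	}
	gcm, err := newGCM(key)
	if err != nil {
		return s, err
	}
	if len(raw) < gcm.NonceSize() {
		return s, errors.New("token too short")
	}
	nonce, ciphertext := raw[:gcm.NonceSize()], raw[gcm.NonceSize():]
	plaintext, err := gcm.Open(nil, nonce, ciphertext, nil)
	if err != nil {
		return s, err
	}
	err = json.Unmarshal(plaintext, &s)
	return s, err
}

func main() {
	key := make([]byte, 32) // AES-256 key; a real deployment would load a secret instead
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	token, err := seal(key, flowState{Subject: "alice", ClientID: "my-app"})
	if err != nil {
		panic(err)
	}
	state, err := open(key, token)
	fmt.Println(state.Subject, err)
}
```

Because the token is both encrypted and authenticated, the browser and the UIs can carry it around without being able to read or change it; any modification makes `open` return an error.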

The OAuth2 authorization code grant flow in detail

The relevant part of the OAuth2 flow that we wanted to optimize is an exchange between the client (acting on behalf of a user), Hydra (Ory's OAuth2 Authorization Server), and the login and consent screens. When a client requests an authorization code through the Authorization Code Grant, the user will be redirected first to the login UI to authenticate and then to the consent UI to grant access to the user's data (such as the email address or profile information).

Below is a sequence diagram of the exchange. Observe that each UI gets a CHALLENGE as part of the URL parameters (steps 3 and 12) and then uses this CHALLENGE as a parameter to retrieve more information (steps 4 and 13). Finally, both UIs either accept or reject the user request, usually based on user interaction with the UI (from steps 6 to 8 and 15 to 17). This API contract keeps Ory Hydra headless and decoupled from custom UIs.

Sequence diagram

Optimization: Passing the AEAD-encoded flow in the URL parameters

To reduce database access, we now pass an AEAD-encoded flow as the LOGIN_CHALLENGE, LOGIN_VERIFIER, CONSENT_CHALLENGE, and CONSENT_VERIFIER. This way, we rely on the parties involved in the OAuth2 flow to pass the relevant state along.

Before	After
The login and consent challenges and verifiers are random UUIDs stored in the database.	The login and consent challenges and verifiers are the AEAD-encoded flow.
Accepting or rejecting a request from the UI involves a database lookup for the specific challenge.	Accepting or rejecting a request from the UI involves decrypting the flow in the challenge and generating an updated flow as part of the verifier.

Implementation

Since Ory Hydra is open source, you can review code changes in the Ory GitHub repositories. This is the relevant commit: https://github.com/ory/hydra/commit/f29fe3af97fb72061f2d6d7a2fc454cea5e870e9.

Here is where we encode the flow in the specific challenges and verifiers:

// ToLoginChallenge converts the flow into a login challenge.
func (f *Flow) ToLoginChallenge(ctx context.Context, cipherProvider CipherProvider) (string, error) {
    return flowctx.Encode(ctx, cipherProvider.FlowCipher(), f, flowctx.AsLoginChallenge)
}

// ToLoginVerifier converts the flow into a login verifier.
func (f *Flow) ToLoginVerifier(ctx context.Context, cipherProvider CipherProvider) (string, error) {
    return flowctx.Encode(ctx, cipherProvider.FlowCipher(), f, flowctx.AsLoginVerifier)
}

// ToConsentChallenge converts the flow into a consent challenge.
func (f *Flow) ToConsentChallenge(ctx context.Context, cipherProvider CipherProvider) (string, error) {
    return flowctx.Encode(ctx, cipherProvider.FlowCipher(), f, flowctx.AsConsentChallenge)
}

// ToConsentVerifier converts the flow into a consent verifier.
func (f *Flow) ToConsentVerifier(ctx context.Context, cipherProvider CipherProvider) (string, error) {
    return flowctx.Encode(ctx, cipherProvider.FlowCipher(), f, flowctx.AsConsentVerifier)
}
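The four functions differ only in the last argument, which names what the token is for. One common way to enforce such a distinction with an AEAD is to pass the purpose as associated data, so that a token minted as a login challenge cannot be replayed as, say, a consent verifier. The sketch below illustrates that general technique with the Go standard library; the `purpose` type and the `encode` and `decode` helpers are hypothetical, and whether `flowctx` works exactly this way is an assumption.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/base64"
	"errors"
	"fmt"
)

// purpose is a hypothetical tag naming what a token may be used for.
type purpose string

const (
	asLoginChallenge  purpose = "login_challenge"
	asConsentVerifier purpose = "consent_verifier"
)

// newGCM builds an AES-GCM AEAD from a 16-, 24-, or 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// encode encrypts payload and binds the token to purpose p via the
// associated data, which is authenticated but not stored in the token.
func encode(key, payload []byte, p purpose) (string, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return "", err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return "", err
	}
	sealed := gcm.Seal(nonce, nonce, payload, []byte(p))
	return base64.RawURLEncoding.EncodeToString(sealed), nil
}

// decode succeeds only if the token was encoded with the same key and purpose.
func decode(key []byte, token string, p purpose) ([]byte, error) {
	raw, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil {
		return nil, err
	}
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(raw) < gcm.NonceSize() {
		return nil, errors.New("token too short")
	}
	nonce, ciphertext := raw[:gcm.NonceSize()], raw[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, []byte(p))
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	token, err := encode(key, []byte(`{"sub":"alice"}`), asLoginChallenge)
	if err != nil {
		panic(err)
	}
	_, errSame := decode(key, token, asLoginChallenge)
	_, errOther := decode(key, token, asConsentVerifier)
	fmt.Println(errSame == nil, errOther != nil) // prints "true true"
}
```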

In the persister (our database repository) we then decode the flow contained in the challenge. For example, here's the code for handling a consent challenge:

func (p *Persister) GetFlowByConsentChallenge(ctx context.Context, challenge string) (*flow.Flow, error) {
    ctx, span := p.r.Tracer(ctx).Tracer().Start(ctx, "persistence.sql.GetFlowByConsentChallenge")
    defer span.End()

    // challenge contains the flow.
    f, err := flowctx.Decode[flow.Flow](ctx, p.r.FlowCipher(), challenge, flowctx.AsConsentChallenge)
    if err != nil {
        return nil, errorsx.WithStack(x.ErrNotFound)
    }
    if f.NID != p.NetworkID(ctx) {
        return nil, errorsx.WithStack(x.ErrNotFound)
    }
    if f.RequestedAt.Add(p.config.ConsentRequestMaxAge(ctx)).Before(time.Now()) {
        return nil, errorsx.WithStack(fosite.ErrRequestUnauthorized.WithHint("The consent request has expired, please try again."))
    }

    return f, nil
}

Let's look at the impact of the changes when compared to the code without optimizations:

The flows are now much faster and talk less to the database.

Improved indices lead to further performance improvements

By introducing a new index on the hydra_oauth2_flow table, we were able to increase throughput and decrease CPU usage on PostgreSQL. The screenshot below shows the execution of the benchmarks without the improved indices where CPU usage spikes to 100%, and with improved indices, where CPU usage stays below 10%.

With the newly added indices, the CPU load (green bars) largely disappears, which reduces the likelihood of BufferLocks and related issues:

Results

The code and database changes reduced the total roundtrips to the database by 4-5x (depending on the amount of caching done) and reduced database writes by about 50%.

Benchmarks

We benchmarked the new implementation on Microsoft Azure with the following specifications:

Service(s)	Configuration	Total max SQL connections	Notes
Ory Hydra, Consent App, OAuth2 Client App, rakyll/hey (HTTP benchmark tool)	3x Standard_D32as_v4, South Central US; 5x Standard_D8s_v3, South Central US	512	Every VM ran all the processes mentioned.
PostgreSQL 14 in HA configuration	Memory Optimized, E64ds_v4, 64 vCores, 432 GiB RAM, 32767 GiB storage; South Central US		RAM beats CPU.

In the above configuration, Ory Hydra can perform up to 1090 logins per second at peak and sustain 800 logins per second. This is possible because the flow is now stateless and the indices used by frequent queries are optimized.
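As a rough sanity check, the sustained rate can be extrapolated to a monthly volume. The sketch below assumes a 30-day month and a constant load around the clock, both simplifications, and the `monthlyFlows` helper is introduced here only for the calculation.

```go
package main

import "fmt"

// monthlyFlows extrapolates a sustained per-second rate to a 30-day month.
func monthlyFlows(perSecond int64) int64 {
	const secondsPerDay = 24 * 60 * 60
	return perSecond * secondsPerDay * 30
}

func main() {
	fmt.Println(monthlyFlows(800)) // prints 2073600000
}
```

At 800 logins per second this comes to roughly 2.07 billion flows per month, which is in line with the figure in the title.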

Conclusion

The performance optimization work done by the Ory team has resulted in a significant improvement in Hydra's performance and scalability. By reducing write traffic to the database and improving the codebase and dependencies, Hydra is now faster and more responsive than ever before. By improving the indices, Hydra now scales much more efficiently with the number of instances.

In the future, we will continue to optimize Ory's software to handle even more traffic. We believe that it's possible to get 5x more throughput on a single PostgreSQL node with data model optimizations.

If you're building an OAuth2 server, we highly recommend giving Ory's fully certified OpenID Connect and OAuth2 implementations a try. Ory OAuth2, our fully managed service running on the global Ory Network and based on open source Ory Hydra, already uses the optimizations described in this article, and setting it up takes only a few minutes!