How we give every user SQL access to a shared ClickHouse cluster (trigger.dev)

50 points by eallam 4 days ago

zie 8 hours ago

We do the same thing, every employee can access our main financial/back office SQL database, but we just use PostgreSQL with row level security[0]. We never bothered to complicate it like the post does.

0: https://www.postgresql.org/docs/18/ddl-rowsecurity.html

orf 8 hours ago

Back office, employee access is a completely different problem to what is described in the post.

How do you enforce tenant isolation with that method, or prevent unbounded table reads?

tossandthrow 7 hours ago

They likely don't need tenant isolation and unbound table reads can be mitigated using timeouts.

We do something similar for our backoffice - just with the difference that it is Claude that has full freedom to write queries.

weird-eye-issue 6 hours ago

RLS...

lyjackal 5 hours ago

I want to build a shared postgres db with hundreds of small apps (OLTP) accessing shared tables using a RLS model model against well defined tables.

What are other limitations and mitigations folks have used or encountered to support stability and security? Things like

  - Query timeouts to prevent noisy neighbors
  - connection pooling (e.g. pgbouncer) also for noisy neighbors
  - client schema compatibility (e.g. some applications running older versions, have certain assumptions about the schema that may change over time)

zie 4 hours ago

If you have people running crappy SQL SELECT, it can be a problem. statement-timeout[0] is your friend here. You still have to be on watch, and teach your users not to write crappy SQL.

You can also limit it by creating read-only replica's and making SELECT's happen on the replica. We don't usually bother, since 99% of our users are employees of ours, we can teach them to not be stupid. Since their usage doesn't change much over time, we can usually just hand them a SQL query and say: here run this instead.

Most of our employees don't even know they have SQL access, it's not like we force people to learn SQL to get their job done. Because of RLS and views, the ones that do SQL don't have to know much SQL, even if they do happen to use it. SELECT * from employees; gets them access to basically all the employee info they could want, but only to the employees they have access to. If you are a manager with 10 people, your select returns only your 10 people.

The payroll staff runs the same query and gets all of the employees they handle payroll for. Since our payroll is done inside of PostgreSQL(thanks plpython[1]), we can do some crazy access control stuff that most systems would never even dream about. Whenever new auditors come in and see that our payroll staff is limited to seeing only the info they need to do payroll, and only for their subset of employees they actually pay, they are awestruck.

The random vendors that can't be taught, we usually hand them a nightly SQLite dump instead. I.e let them pay the CPU cost of their crappy SQL.

Around client schema compatibility. This happens with other models too(API, etc). It's not unique to PG or SQL Databases. You have to plan for it. Since most all of our users interact with views and not with the actual underlying tables, it's not usually that big of a deal. In the extreme cases, where we can't just keep around a view for them, we have to help them along(sometimes kicking and screaming) into a new version.

0: https://www.postgresql.org/docs/current/runtime-config-clien...

1: https://www.postgresql.org/docs/current/plpython.html

staticassertion 6 hours ago

I'd be so uncomfortable with this. It sounds like you're placing the full burden of access on a single boundary. I mean, maybe there's more to it that you haven't spoken about here, but "everything rests on this one postgres feature" is an unacceptably unsafe state to me.

weird-eye-issue 6 hours ago

It's not like RLS is just some random feature they are misusing. It's specifically for security and is absolutely reliable. Maybe you should do a bit more research before making comments like this.

staticassertion 6 hours ago

skeeter2020 6 hours ago

row level security is not a feature specific to Postgres, but more a pretty standard and acceptable way to control access in a multitenant or multicontext environment that pretty much every data provider supports/implements. When it comes to answering a single specific question (like the one RLS targets) I believe you DO want a single, simple answer, vs. something like "it uses these n independent things working in conjunction..."

staticassertion 6 hours ago

zie 5 hours ago

Well, I mean it's not only RLS, but yes it's only PostgreSQL doing the access control as far as if they can see a particular table or row.

Every user gets their own role in PG, so the rest of the PG access control system is also used.

We have your normal SSO system(Azure) and if Tootie employee doesn't need access to Asset Control, they don't get any access to the asset schema for instance.

What would be your method?

You would have some app that your dev team runs that handles access control, so your app gets unrestricted access to the DB. Now your app is the single boundary, and it forces everyone to go through your app. How is that better? It also complicates your queries, with a ton of extra where conditions.

A bunch of bespoke access control code you hope is reliable or a feature of the database that's well tested and been around for a long time. pgtap[0] is amazing for ensuring our access control (and the rest of the DB) works.

If some random utility wants to access data, you either have to do something special access wise, or have them also go through your app(let's hope you have an API and it allows for whatever the special is). For us, that random utility gets SQL access just like everyone else. They get RLS applied, etc. They can be naive and assume they have total control, because when they do select * from employees; they get access to only the employee column and rows we want that utility to have.

We have a bunch of tools over the decades that need access to various bits of our data for reason(s). Rather than make them all do wacky stuff with specialized API's, they just get bog standard PG SQL. We don't have to train vendor Tito how to deal with our stuff, we just hand them their auth info to PG and they can go to town. When people want Excel spreadsheets, they just launch excel, do a data query and their data just shows up magically. All from within Excel, using the standard excel data query tools, no SQL needed.

0: https://pgtap.org/

staticassertion 5 hours ago

Philip-J-Fry 6 hours ago

Conceptually that's no different to any security measures that prevent you from accessing data you're not supposed to? At the end of the day with all data that is colocated you're trusting that some permission feature somewhere is preventing you from accessing data you're not supposed to.

We trust that Amazon or Google or Microsoft are successful in protecting customer data for example. We trust that when you log into your bank account the money you see is yours, and when you deposit it we trust that the money goes into your account. But it's all just mostly logical separation.

staticassertion 6 hours ago

cjonas 6 hours ago

We just create mini data "ponds" on the fly by copying tenant isolated gold tier data to parquet in s3. The users/agent queries are executed with duckdb. We run this process when the user start a session and generate an STS token scoped to their tenant bucket path. Its extremely simple and works well (at least with our data volumes).

mattaitken 5 hours ago

This is cool. I think for our use case this wouldn’t work. We’re dealing with billions of rows for some tenants.

We’re about to introduce alerts where users can write their own TRQL queries and then define alerts from them. Which requires evaluating them regularly so effectively the data needs to be continuously up to date.

SOLAR_FIELDS 30 minutes ago

Billions still seems crunchable for DDB. It’s however much you can stuff into your RAM no? Billions is still consumer grade machine RAM depending on the data. Trillions I would start to worry. But you can have a super fat spot instance where the crunching happens and expose a light client on top of that then no?

Quadrillions, yeah go find yourself a trino spark pipeline

Waterluvian 6 hours ago

Is that why it’s called DuckDb? Because data ponds?

QuantumNomad_ 6 hours ago

The DuckDB website has the following to say about the name:

> Why call it DuckDB?

> Ducks are amazing animals. They can fly, walk and swim. They can also live off pretty much everything. They are quite resilient to environmental challenges. A duck's song will bring people back from the dead and inspires database research. They are thus the perfect mascot for a versatile and resilient data management system.

https://duckdb.org/faq#why-call-it-duckdb

cjonas 6 hours ago

Idk but I named everything in the related code "duckpond" :)

mritchie712 6 hours ago

Hannes (one of the creators) had a pet duck

otterley 5 hours ago

How large are these data volumes? How long does it take to prepare the data when a customer request comes in?

cjonas 5 hours ago

Small. We're dealing with financial accounts, holdings and transactions. So a user might have 10 accounts, thousands of holdings, 10s of thousands of transactions. Plus a handful of supplemental data tables. Then there is market data that is shared across tenants and updated on interval. This data is maybe 10-20M rows.

Just to clarify, the data is prepared when the user (agent) analytics session starts. Right now it takes 5-10s, which means it's typically ready well before the agent has actually determined it needs to run any queries. I think for larger volumes, pg_duckdb would allow this to scale to 10s of millions rows pretty efficiently.

boundlessdreamz 6 hours ago

How do you copy all the relevant data? Doesn't this create unnecessary load on your source DB?

cjonas 6 hours ago

We have various data sources (which is another benefit of this approach). Data from the application DB is currently pulled using the FE apis which handle tenant isolation and allow the application database to deal with the load. I think pg_duckdb could be a good solution here as well, but haven't gotten around to testing it. Other data come from analytics DB. Most of this is landed on an interval via pipeline scripts.

senorrib 8 hours ago

Reasons 1-3 could very well be done with ClickHouse policies (RLS) and good data warehouse design. In fact, that’s more secure than a compiler adding a where to a query ran by an all mighty user.

Reason 4 is probably an improvement, but could probably be done with CH functions.

The problem with custom DSLs like this is that tradeoff a massive ecosystem for very little benefit.

hrmtst93837 an hour ago

A DSL for access control hides the risky part, nobody audits it until someone pokes a hole and prod becomes the test env.

mattaitken 6 hours ago

You’re right RLS can go a long way here. With complex RBAC rules it can get tricky though.

The main advantages of a DSL are you can expose a nicer interface to users (table names, columns, virtual columns, automatic joins, query optimization).

We very intentionally kept the syntax as close to regular ClickHouse as possible but added some functions.

efromvt 7 hours ago

As long as you don't deviate too much from ANSI, I think the 'light sql DSL' approach has a lot of pros when you control the UX. (so UIs, in particular, are fantastic for this approach - what they seem to be targeting with queryies and dashboards). It's more of a product experience; tables are a terrible product surface to manage.

Agreed with the ecosystem cons getting much heavier as you move outside the product surface area.

skeeter2020 5 hours ago

Personally I think that's worse. SQL - which is almost ubiqutous - already suffers from a fragmentation problem because of the complex and dated standardization setup. When I learn a new DBMS the two questions I ask at the very start are: 1. what common but non-standard features are supported? 2. what new anchor-features (often cool but also often intended to lock me to the vendor) am I going to pick up?

First I need to learn a new (even easy & familiar) language, second I need to be aware of what's proprietary & locks me to the vendor platform. I'd suspect they see the second as a benefit they get IF they can convince people to accept the first.

jelder 7 hours ago

We did this with MotherDuck, and without introducing a new language. Every tenant has their own isolated storage and compute, so it’s trivial to grant internal users access to specific tenants as needed. DuckDB’s SQL dialect is mostly just Postgres’ with some nice ergonomic additions and a host of extra functionality.

raw_anon_1111 7 hours ago

This is explicitly not the problem they are trying to solve. In a single tenant database you don’t have to by definition worry about multi tenant databases

DangitBobby 7 hours ago

I guess the question then becomes, what problem does a multi-tenancy setup solve that an isolated database setup doesn't? Are they really not solving the same problem for a user perspective, or is it only from their own engineering perspective? And how do those decisions ultimately impact the product they can surface to users?

raw_anon_1111 6 hours ago

steveBK123 6 hours ago

mattaitken 6 hours ago

In a system with organizations, projects and advanced user access permissions having separate databases doesn’t full solve the problem. You still need access control inside each tenanted database. It also makes cross-cutting queries impossible which means users can’t query across all their orgs for example.

The DSL approach has other advantages too: like rewriting queries to not expose underlying tables, doing automatic performance optimizations…

bob1029 5 hours ago

> How do you let users write arbitrary SQL against a shared multi-tenant analytical database without exposing other tenants' data or letting a rogue query take down the cluster?

For query operations I would try to find a way to solve this with tools like S3 and SQLite. There are a few VFS implementations for S3 and other CDNs.

nlittlepoole 5 hours ago

Open Table Formats (Iceberg, Delta Lake, Hudi, etc) are the approach we've taken. That let's us offer a query engine but also let's the tenant bring their preferred engine (Snowflake, Spark, DuckDB, etc). It also addresses dirty reads and some other state problems that come from trying to use the file system. It scales as much as the bucket does, so we haven't found a use case we couldn't scale to yet.

We (https://prequel.co) recently started offering this as a white labeled capability so anyone can offer it without building it yourself. Its a newer capability to our export product where instead of sending the data to the tenant's data warehouse, we enable you to provision an S3/GCS/ABS/etc bucket with the data formatted. Credential management, analytics, etc is all batteries included so you don't have to do that either. The initial interest from our customers was around BI integrations but agent use is starting to pick up which is kinda interesting to see.

r1290 2 hours ago

How does it handle large tables like. 2b rows? And how does it stay updated?

elnatro 7 hours ago

New to ClickHouse here. Would you thing this kind of database has a niche when compared to usual RDBMS like MySQL and PostgreSQL?

mattaitken 5 hours ago

ClickHouse is a high performance OLAP database. It’s good for analytics and search.

We use it (I’m the author or the article) so users can search every run they do and graph all sorts of metrics.

baalimago 6 hours ago

The evolution of this is to use agents, and have users "chat with the data"

mattaitken 6 hours ago

Yes, you can actually do this already because we expose a REST API and TypeScript SDK functions to execute the queries.