How do you limit server load brought on by excessive inclusion?


#1

I understand that this isn’t really a topic regarding the spec, but I’m hoping someone might share their experience anyway.

I feel like inclusion has the potential to drastically stress a backend. While primary data can be subject to pagination, which in turn can also reduce load, no such approach seems to exist for inclusion.

This means that I can generate huge inclusion chains, even for a single resource, and cause the server to make a large number of round trips with ever-increasing resource counts. And even then, I can’t trim down the size of the payload, as included records are not subject to pagination or truncation (however undesirable that may be in practice).

Obviously, when we implement our own queries, we make sure to construct them in a way that they perform as well as possible. But the API will be open to 3rd party developers who might maliciously or carelessly (but, most importantly, needlessly) construct queries that have potential to cause considerable load on different components of the infrastructure.

Upon realizing this, my first reaction was to establish resource (for example, time) quotas for queries. If a quota is exceeded, recursion brought on by inclusion is stopped and an error is returned instead.
However, this “solution” scares me as it might cause “friendly” queries to be aborted due to system load brought on by other factors.
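To make the quota idea concrete, here is a minimal sketch in Python; all names are hypothetical, and `fetch` stands in for whatever actually resolves one include path:

```python
import time

class InclusionBudgetExceeded(Exception):
    """Raised when resolving included resources exceeds the time quota."""

class TimeBudget:
    """Tracks a wall-clock budget for one request's inclusion resolution."""
    def __init__(self, seconds):
        self.deadline = time.monotonic() + seconds

    def check(self):
        if time.monotonic() > self.deadline:
            raise InclusionBudgetExceeded("inclusion time quota exhausted")

def resolve_includes(paths, fetch, budget):
    """Resolve each include path, checking the budget before every fetch."""
    included = []
    for path in paths:
        budget.check()  # abort early instead of piling more load on
        included.extend(fetch(path))
    return included
```

A per-request budget like this at least bounds the damage, though, as noted, it can still abort friendly queries when the system is loaded for unrelated reasons.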

I’d be happy to hear your thoughts :slight_smile:


#2

Beware the dragon named Premature Optimization, many a Knight has found their end chasing that beast.

This is an API architectural question, but before you go for large complex solutions to this problem my advice would be the following two steps.

  1. Have you seen this actually happen on your system? Do you already have transaction timer limits? Have you implemented a circuit breaker pattern to handle long-running query processes? Does your system fail under a scenario you can quickly create?

If you can’t answer the above questions, your first job should probably be to set up appropriate metrics to create benchmarks to be used to answer these questions. If your answer is yes to all of these questions, I would move on to the second step.
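For reference, the circuit breaker mentioned above can be as small as this (a hypothetical Python sketch, not tied to any particular framework):

```python
class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures,
    further calls are rejected until reset() is invoked."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn, *args):
        if self.failures >= self.threshold:
            raise RuntimeError("circuit open: backend considered unhealthy")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1  # count the failure, then let it propagate
            raise
        self.failures = 0       # any success closes the circuit again
        return result

    def reset(self):
        self.failures = 0
```

A production version would add a cool-down timer and a half-open state, but even this shape stops a degraded backend from being hammered by further expensive include resolution.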

  2. Is there a simple heuristic solution you can use to prevent excess load or linkages? Will a simple relationship depth counter suffice? Are there more intricate patterns like cyclical references you can use to isolate the ‘problem’ to portions of the inclusion chain? Would a limit on the number of times a type can be included help? Are there data storage techniques outside of query management which could help (indexes, more memory, caches, etc.)?
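The depth-counter and repeated-type heuristics suggested here could look something like this (illustrative Python with made-up limits, assuming dot-separated include paths):

```python
def check_include_path(path, max_depth=4, max_repeats=2):
    """Heuristic checks on one include path: cap the relationship depth,
    and cap how often the same relationship name may appear, which
    cheaply catches cyclical chains like parent.children.parent..."""
    segments = path.split(".")
    if len(segments) > max_depth:
        return False
    return all(segments.count(s) <= max_repeats for s in set(segments))
```

The values 4 and 2 are placeholders; the point is that a few lines of validation up front can rule out the pathological chains before any query runs.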

As far as your mention of quotas is concerned, I do touch on them as potential solutions at some level, though I stress the light-touch level of their inclusion. People often build massive structures to impose quotas on their system; this may centralize control, but it frequently leads them to overlook the easy wins in optimizing other parts of the application to solve the problem they are actually experiencing. Quotas in the sense of timeshares would likely be more bother than they are worth for the consumer experience. And I don’t think a large query should directly affect the performance of a single node to the point where a noisy neighbor slows other requests down; if that is the case, there are larger architectural design issues which have been overlooked.

Another point I’ll add: your use of the word recursion terrifies me. As a computer scientist I love recursion; as an API designer and developer I hate it. Recursion breaks the normal iterative paradigm, which is highly optimized in most runtimes, and has the potential to add a lot of inconsistency to the performance of an application. You do, however, use the word as more of an effect than a cause of the problem, which eases the concern greatly.

The last point I’ll make is that, as the implementer, you ultimately have control over everything behind the curtain. It is really easy to write queries which perform well against very small datasets and relatively few joins with a tool like JPA. However, an RDBMS with appropriate indexes is actually pretty good at flattening complex relationship chains of this type, using methods like Common Table Expressions to retrieve long related chains of data far more efficiently. If that doesn’t work, you always have caching and other strategies to speed up the large arbitrary tasks, or to break them up into more parallel ones which can be executed in a timely and predictable fashion.

Almost all of the above details and questions are implementation details and solutions, and I discussed them to set the stage for the following point. Just because your product is your API doesn’t mean that all complexity needs to be addressed at the surface of the API. The easier you make it for your consumers to use your product the way they want to, even if you don’t yet know what that is, the better your product is and the more likely you will be to gain and retain customers. Putting up unnecessary limits on those interactions will have lasting implications for customer retention and organic growth. Put out a really simple API which hides as much of the complexity of creating it as you possibly can within your budget and time frame; your bottom line will thank you for it.


#3

You are making some excellent points and there is definitely lots to think about. There are a few points I wanted to touch on for clarification right now though.

We are still developing this new API, but we have an existing stack into which it needs to integrate. As you’ve already identified, there are budgets and time frames, and we can’t re-architect our entire data model to build a better API. We need a new approach to interact with our existing data, and we have to treat certain aspects and architecture components as if they were set in stone at this time.

That being said, the problem I identified is already real in development and, given the ability to infinitely include other resources, has unbounded potential to cause harm performance-wise. So I feel like this is something to think about right now, even if only to familiarize ourselves with the subject for when the time comes to apply the knowledge.

Given that any simple parent-child relationship has the potential to allow a user to request something with include=parent.children.parent.children.parent.children, not limiting something like that seems like a mistake. I’d be happy to accept that it is not a mistake and if I understand you correctly, you wouldn’t treat it like one and allow as much freedom as possible.
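To illustrate, a guard against exactly that kind of include string could be as simple as the following (hypothetical names; comma-separated paths with dot-separated segments, as in the example above):

```python
MAX_INCLUDE_DEPTH = 3  # illustrative limit, to be tuned per data model

def validate_include(param, max_depth=MAX_INCLUDE_DEPTH):
    """Split an include parameter into its paths and reject any path
    deeper than max_depth relationships."""
    paths = [p.split(".") for p in param.split(",") if p]
    for segments in paths:
        if len(segments) > max_depth:
            raise ValueError(
                f"include path {'.'.join(segments)!r} exceeds depth {max_depth}")
    return paths
```

Rejecting the request with a clear error at parse time is cheap; the open question in this thread is whether such a limit costs more in lost flexibility than it saves in load.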

You mention that RDBMS are quite efficient at dealing with these problems and I would agree. However, while implementing, I did not see how to utilize the DB properly to implement inclusion (I feel like CTEs are definitely something I have not looked into properly).

Instead, we just perform one query and then query the database again to retrieve the records that should be included. We do this recursively (hopefully the term is used appropriately :smile:) until we reach the end of all inclusion paths.
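In rough Python, the approach described above looks something like this, which makes the one-round-trip-per-level cost visible (`query_related` is a stand-in for a real database query):

```python
def fetch_included(primary_ids, include_paths, query_related):
    """Sketch of the follow-up-query approach: one extra database round
    trip per relationship level of every include path.
    query_related(ids, rel) stands in for a real query returning the
    records related to `ids` via the relationship `rel`."""
    included = []
    for path in include_paths:         # e.g. ["comments", "author"]
        current_ids = list(primary_ids)
        for rel in path:               # one DB trip per level
            rows = query_related(current_ids, rel)
            included.extend(rows)
            current_ids = [r["id"] for r in rows]
    return included
```

With this shape, a deep include path translates directly into a chain of sequential database trips, which is exactly the load concern raised here.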

Regardless, with infinite potential for inclusion (well, until you hit the string length limit the HTTP request parser accepts for a query argument), it doesn’t really matter how optimally your query is processed. It might spend minutes in processing and cause issues all over the place.

I agree that a single slow query should not have such an impact that it takes out an entire node. But 20 might. And 20 requests is nothing that our rate limiting approach would care about.
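For what it’s worth, one stopgap for the twenty-slow-queries case, independent of rate limiting, would be to cap how many expensive include-resolving requests run at once; a hypothetical sketch:

```python
import threading

class ConcurrencyCap:
    """Caps how many expensive requests run simultaneously, so twenty
    slow queries queue up instead of saturating a single node."""
    def __init__(self, max_concurrent):
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args):
        with self._sem:  # blocks while max_concurrent requests are active
            return fn(*args)
```

This trades latency for stability: the twenty-first request waits rather than contributing to overload.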

After writing this, I feel like I’m being defensive because I’m looking for a solution to a problem in the wrong place :smiley:. Thanks again for your valued input.


#4

No need to be defensive at all. I don’t have anywhere near the detail needed to give you direct solutions to your questions, and if I did, they wouldn’t help you understand why they are the right solutions. I can, however, give high-level advice which I hope helps you find your own answers.

Recursion, yikes, I can see why that gives you pause. Not only does that strategy involve many DB trips, but it’s a massive stack hog as well. Using your parent-child-parent scenario, it’s easy to write a single query using a CTE which can retrieve the entire set of rows; you can even limit the recursive depth of the join to the depth set in the include parameter. It’s reasonable to assume a mapping table for relationships in this way, but it’s important to keep in mind that the message models are probably not the same as the data model.
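Assuming a self-referencing table along the lines described, such a depth-limited recursive CTE can be sketched with SQLite (the schema here is invented purely for illustration):

```python
import sqlite3

# Hypothetical self-referencing table: each row points at its parent.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER);
    INSERT INTO node VALUES (1, NULL), (2, 1), (3, 2), (4, 3);
""")

def fetch_ancestors(conn, start_id, max_depth):
    """Single round trip: a recursive CTE walks the parent chain,
    and the depth predicate bounds the recursion."""
    rows = conn.execute("""
        WITH RECURSIVE chain(id, parent_id, depth) AS (
            SELECT id, parent_id, 0 FROM node WHERE id = ?
            UNION ALL
            SELECT n.id, n.parent_id, c.depth + 1
            FROM node n JOIN chain c ON n.id = c.parent_id
            WHERE c.depth < ?
        )
        SELECT id FROM chain ORDER BY depth
    """, (start_id, max_depth)).fetchall()
    return [r[0] for r in rows]
```

The entire chain comes back in one round trip, and the `depth < ?` predicate lets the depth requested in the include parameter bound the recursion inside the database.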

The problem in your scenario seems to be that it isn’t 20 queries, it’s 20 sets of unbounded recursive functions, each firing its own DB query at every invocation. With CTEs you should be able to compose dynamic SQL which will traverse the mapping tables very quickly, though, as you said, this could create arbitrary load on the system that you can’t account for. However, I think you’ll find the database is much less likely to be the source of your performance issues than the multiple trips to the database your algorithm makes.

I had a similar case with legacy software where some of the models composed themselves with independent queries. Instantiating the object generally took about 46 seconds in our test environment. I parallelized the execution of these queries and got it down to about 12 seconds, which was a big gain but still very long running. Then I wrote a single query to gather all the data to compose the object in one trip, and on the second and subsequent invocations of the query (after its plan was cached by the DB’s SQL planner) the object generally took less than a second to instantiate. The point is that modern databases are really good at handling these operations quickly and efficiently, and quite often the overhead of multiple database trips is the real cause of performance headaches in JPA (Hibernate) and other lazy-loading ORM frameworks and methods.

It is quite reasonable to place sane limits on interaction with the API, say an arbitrary limit of 12 relationships deep, but you want to be careful to set those limits beyond what is useful now and in the future, as you have stated you are constrained by portions of the design being immutable, and that trend is likely to continue.

Good luck again!