Why is `included` an array?

Why is `included` an array, rather than an object keyed by type and id? For example:

"included": {
    "people": {
        "834028": {
            "attributes": {
                "name": "Jane"
            }
        }
    },
    "dogs": {
        "363947": {
            "attributes": {
                "name": "Spot"
            }
        }
    }
}

I ask because, if I understand correctly, the purpose of the included field is to deduplicate relationship attributes. Is that right?

If it is, then wouldn’t you never actually want to iterate over `included` (which is what an array is good for), but rather use it as a dictionary (which is what an object is good for), so that you can quickly look things up by a unique key (typically an id scoped by a type)?
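To make the lookup difference concrete, here is a minimal Python sketch (the payload fragments and helper names are hypothetical, mirroring the two shapes under discussion, not anything prescribed by the spec):

```python
# Hypothetical fragments illustrating the two shapes of "included".
included_array = [
    {"type": "people", "id": "834028", "attributes": {"name": "Jane"}},
    {"type": "dogs", "id": "363947", "attributes": {"name": "Spot"}},
]
included_map = {
    "people": {"834028": {"attributes": {"name": "Jane"}}},
    "dogs": {"363947": {"attributes": {"name": "Spot"}}},
}

def find_in_array(resources, type_, id_):
    # Array form: a linear O(n) scan for a matching (type, id) pair.
    return next(
        (r for r in resources if r["type"] == type_ and r["id"] == id_),
        None,
    )

def find_in_map(resources, type_, id_):
    # Map form: two O(1) dictionary lookups.
    return resources.get(type_, {}).get(id_)

jane = find_in_array(included_array, "people", "834028")
spot = find_in_map(included_map, "dogs", "363947")
```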


What happens when you try to order a heterogeneous set?

If that’s an answer, I’m not sure I understand. Would you mind elaborating?

There’s an important rule for compound documents:

A compound document MUST NOT include more than one resource object for each type and id pair.

This applies to the primary data as well as included resources.

In other words, if a resource appears in data, then it MUST NOT appear in included.

The implication is that similar processing logic is required to serialize and deserialize a compound document’s primary and included data. One can’t simply rely on included to be a map in which every related resource can be looked up. It’s possible that those related resources have already been included as primary data.

Since there are similar requirements for processing primary and included data, it makes sense for them to be structured the same so that processing functions could be reused.
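As a rough illustration of that reuse (the helper names and flat record shape below are my own invention, not part of the spec), because data and included both hold arrays of resource objects, a single function can process them both:

```python
def deserialize_resource(resource):
    # Flatten one resource object into a simple record (sketch only).
    record = {"type": resource["type"], "id": resource["id"]}
    record.update(resource.get("attributes", {}))
    return record

def deserialize_document(doc):
    # The same function is reused for primary data and included resources.
    primary = [deserialize_resource(r) for r in doc.get("data", [])]
    included = [deserialize_resource(r) for r in doc.get("included", [])]
    return primary, included

doc = {
    "data": [{"type": "servers", "id": "1", "attributes": {"name": "web-1"}}],
    "included": [{"type": "events", "id": "7", "attributes": {"kind": "deploy"}}],
}
primary, included = deserialize_document(doc)
```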

Thank you @dgeb, and sorry @chrisvfritz for the drive-by / mostly non-response answer.

To expand on my question, imagine you had a primary resource which represented a server in your infrastructure, and that server had a relation to many kinds of events of differing types (e.g. load events, authentication failures, deployments, etc).

If you make the request GET /servers/1?include=events&order=events.timestamp, it cannot be represented in the form you’ve suggested.

Additionally, imagine you wanted to stream the results of the request I just mentioned to a consumer. Assuming the format you mentioned was used (and that ordering wasn’t an issue), the server would have to buffer each typed bucket before it could begin transmission. The array format allows streaming to begin instantly.

Thanks @dgeb and @tkellen. I think I’m starting to understand the rationale, but please bear with me, as there’s probably still something I’m missing - I don’t think I agree with it yet.

In the case of fetching one or more servers with included, ordered events, the ordering would always be scoped by the server, correct? So then even with a heterogeneous set, this would work. Then even in the case where you’re fetching data for 100 servers and in the response, you find a single server with events you’d like to inspect more closely, doing so is trivial. It’s just a dictionary lookup, rather than an expensive linear search through an array of possibly many thousands of events, checking each one to see if it belongs to the server.

I agree that it makes sense to process individual items in the primary and secondary data the same way (i.e. attributes should be in an attributes field, etc.), but it seems I wouldn’t want to process those collections the same way, because they’re inherently different. The primary data is an ordered set of requested resources, while included data should have no global order, but only one scoped by its relationship to each resource in the primary data. And this order can already be represented in the relationships field for that resource, so there’s no advantage to an array.
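The point about order living in the relationships field can be sketched like this (the resource shapes, ids, and timestamps below are hypothetical): the linkage array carries the ordering of a to-many relationship, so included itself needs no order:

```python
# A server whose "events" relationship carries the order in its linkage array.
server = {
    "type": "servers",
    "id": "1",
    "relationships": {
        "events": {
            "data": [
                {"type": "events", "id": "9"},
                {"type": "events", "id": "4"},
            ]
        }
    },
}

# Included resources keyed by (type, id) - no global order required.
included = {
    ("events", "4"): {"attributes": {"timestamp": "2015-06-02"}},
    ("events", "9"): {"attributes": {"timestamp": "2015-06-01"}},
}

def ordered_related(resource, name, lookup):
    # Resolve a to-many relationship in linkage order via map lookups.
    linkage = resource["relationships"][name]["data"]
    return [lookup[(ref["type"], ref["id"])] for ref in linkage]

events = ordered_related(server, "events", included)
```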

In my thinking, if I want to set a global order for included resources, I made the wrong request - because obviously, that’s the resource I really care about, so it should be the primary resource. For example, if I wanted to fetch events from several different servers and sort them by their timestamp, irrespective of their origin server, then I wouldn’t be querying for servers, but for events, with something like: GET /events/?servers=22,83,29,95,12&order=timestamp. Then streaming is no longer an issue. I can even include server data if I still care about it.

The only gotcha that I can think of would be if there were ever included data that had no relationship to the primary data, which would seem bizarre to me, but maybe there’s a use case I’m not thinking of where that would be necessary.

But as I said, there’s probably something I’m overlooking or not quite grokking yet. I imagine you all have spent a lot more time with this spec than I have.

I would never recommend looping through the payload, matching related resources to primary resources as you go. This is an inefficient deserialization strategy, and I can see why you’d be upset with this solution.

A better solution would be to load primary and related resources together into an identity map in the client, while also building up an array of references or type/id pairs for the primary data. In this way, you loop through the data and included arrays once for deserialization and you maintain the significance of the primary data.
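A rough sketch of that strategy (the names are mine, and `doc` is assumed to be an already-parsed compound document whose data is an array):

```python
def load_document(doc):
    # One pass over data and included builds an identity map keyed by
    # (type, id), while primary_refs preserves the order and significance
    # of the primary data. Illustrative sketch, not a full deserializer.
    identity_map = {}
    primary_refs = []
    for resource in doc.get("data", []):
        key = (resource["type"], resource["id"])
        identity_map[key] = resource
        primary_refs.append(key)
    for resource in doc.get("included", []):
        identity_map[(resource["type"], resource["id"])] = resource
    return identity_map, primary_refs

doc = {
    "data": [{"type": "people", "id": "1", "attributes": {"name": "Ann"}}],
    "included": [{"type": "people", "id": "2", "attributes": {"name": "Bo"}}],
}
identity_map, primary_refs = load_document(doc)
```

Any relationship reference then resolves with a single `identity_map[(type, id)]` lookup, whether the resource arrived as primary data or in included.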

I should reiterate that your solution of providing a map in included is not an adequate general solution for deserializing compound documents. It’s possible for related resources to be in the primary data as opposed to the included array. In that case, a lookup to a map in included would not find the resource.

This situation is not allowed in compound documents - there is a requirement for “full linkage” that prevents it.

As for sorting, the base spec only specifically supports sorting of primary data with the sort parameter. I agree that the order of IDs in to-many relationship data arrays is sufficient to indicate the order of a particular relationship. However, there is also the opportunity to add significance to the order of included in future versions of the spec, even if it is not significant now.

I hope that this helps clarify some of the reasoning that went into this decision.

I’m still not understanding, unfortunately, why this would be a better solution. In the example of the server with perhaps thousands of events, I have the option of iterating over it like a collection if I want to deserialize it all at once. So an array has no advantage here. But in the quite common case that I don’t care about all the included data (or want to access it lazily), only the dictionary gives me the option of not deserializing it all before I can do anything with the response as a whole. The array format still offers zero advantages that I can see.

Would you mind showing me an example of a compound document my solution would be inadequate for? I’ve never had a problem using this strategy over years of working with many compound documents, but I may not have encountered that edge case yet. I also looked at the compound document example in the spec, and there definitely wouldn’t be an issue looking up any of those relationships.

GET /people?include=friends

Let’s say this response is paginated. The response will include a page of people as primary data and other people in included (who are friends with the primary people). Some “friends” might be under data and others might be under included. people resources that are under data won’t be repeated under included.

Ahh, I see what you’re saying now. In this case, as long as the people in primary data are deserialized before attempting to deserialize relationships, there shouldn’t be a problem though, right? You’d just check to see if that person has already been deserialized and if not, check for them in included.
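That two-step check might look like this (a hedged sketch; the index shapes are assumed for illustration, not prescribed by the spec):

```python
def resolve(ref, primary_index, included_index):
    # Prefer a resource already deserialized from primary data; fall back
    # to the included map. Returns None if the linkage is dangling.
    key = (ref["type"], ref["id"])
    if key in primary_index:
        return primary_index[key]
    return included_index.get(key)

# Hypothetical indexes built from a paginated /people?include=friends response.
primary_index = {("people", "1"): {"name": "Ann"}}
included_index = {("people", "2"): {"name": "Bo"}}

friend = resolve({"type": "people", "id": "2"}, primary_index, included_index)
```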

Please recognize the tradeoffs involved in your suggested approach:

  • Separate processing logic for primary vs. included data (wherein primary data is loaded into a memory map while included data is cached “as is”)
  • Duplication of types and ids in the included map + in the included resource objects
  • Inability to ever provide ordering significance to included resources (even though ordering is not significant now)

We carefully considered these tradeoffs pre-1.0. Our decision-making process was very open as the spec was being finalized. And I’ve done my best to explain our rationale. This is the most I can do now that the spec has reached 1.0 and the document structure cannot be changed.

I’m watching your discussion, guys, because some time ago I was also asking why not use a type-id map for included. And I really want to understand. So thanks @dgeb for trying to clarify this for us :slight_smile:

Yes, you’re right, but I don’t think it’s a big deal. From an aesthetic point of view — yes, I want it to look and be used the same way. But from a practical point of view: on the server side we use that type-id map for deduplication anyway. Even more: on the client side we build that type-id map to make resolving relationships easier. I’m not sure we’re doing it right, but it works for us.

Some time ago, when json-api was in active development and we had started using some of its concepts, we actually did exactly that: included was a type-id map. And of course we removed type and id from the resources themselves, which also saved some bytes.

I must say that with the current json-api you can’t provide ordering either. The main reason: deduplication. Take your example with people and friends: we have some people in data who are also friends. Data has its own ordering, and included has its own. The only way is to use the ordering of the linkage objects. But the same can be done with what @chrisvfritz proposes.

I have nothing else to add beyond ksimka’s response. Thanks @dgeb and @tkellen for your patience in explaining this. I agree that ideally, I would have asked these questions before 1.0 - this just happens to be the time when I first got a chance to look at it. :-/

I do like a lot of the concepts in json:api but it sounds like in the end, we just disagree on which tradeoffs make sense in some key cases. Thanks for the conversation!


@ksimka and @chrisvfritz - It’s too bad that you disagree with the decision that was reached, but I’m glad for the opportunity to have this discussion. Hopefully, it’s not a showstopper for you in implementing the spec.

No, I agree, it’s good and it works. I’m just trying to fully understand the motivation behind it (and you’re really helping). I feel that without that understanding, I’ll handle it the wrong way.

Shouldn’t this example use ‘sort’ instead of ‘order’? GET /servers/1?include=events&sort=events.timestamp

Yes indeed - good catch :slight_smile:

Yup. I have no idea what I was thinking when I responded to this issue because basically everything I said was wrong. Many thanks to @dgeb for providing the clarity needed here.

Ah, no worries @tkellen. I know you’ve been crazy busy lately. Your input’s always appreciated!