Tutorial: Organizing data


A Webcom database may be seen as a single JSON document. Application developers can freely store data within this document using any tree-like structure. However, the design of this structure is critical because it directly impacts the performance of data fetching operations. This section explains the usual pitfalls to avoid when designing data schemas.

The key point is to understand precisely how the data you store is intended to be retrieved later.

Avoid nesting data

Because data can be nested up to a comfortable depth of 32 levels, it's tempting to think that nesting should be the default structure.

However, when we fetch data at a location in our database, we also retrieve all of its child nodes. Similarly when we grant someone read or write permissions at a data node in our database, we also grant them these permissions at all of its child nodes.

It's best to keep things as flat as possible, just as one would structure SQL tables.

Consider the following badly nested structure:

{
  "rooms": { // there is a lot of data to fetch as soon as you need to read this node
    "one": {
      "name": "room alpha",
      "type": "private",
      "messages": {
        "m1": { "sender": "mchen", "message": "foo" },
        "m2": { "sender": "hmadi", "message": "bar" },
        // a very long list of messages
      },
      "roomUsers": {
        "mchen": { "name": "Mary Chen" },
        "hmadi": { "name": "Hamadi Madi" }
        // a long list of members
      }
    }
    // a long list of rooms
  }
}

This design is problematic when you only need to list the names of the rooms:

  • From the data bandwidth point of view: you need to fetch the entire "rooms" data node, including the whole content of all messages as well as the profiles of all room members, all of which must be downloaded on the client.
  • From the security perspective: assume you want to restrict read and write access to a room to its members only. To do so, you must set a security rule on each child of the "rooms" data node, which requires removing read and write permissions directly on the "rooms" parent data node (otherwise, the security rules of the child nodes would never be evaluated). Consequently, the "rooms" data node can no longer be fetched as a whole, so nobody can list the rooms.

Flatten data

The right way is to split data into separate paths (a so-called "denormalization" strategy) as soon as you need to download segments of data in isolation. This strategy improves read performance and eases the design of security rules.

The previous example may be improved as follows:

{
  "rooms": { // this node now contains only metadata about each chat room...
    "one": { // ...stored under each room identifier
      "name": "room alpha",
      "type": "private"
    },
    "two": { /*...*/ },
    "three": { /*...*/ }
  },

  "roomUsers": { // this node contains separately the list of members for each chat room
    "one": { // room identifier
      // list of indices on the user identifiers
      "mchen": true,
      "hmadi": true
    },
    "two": {
      "mchen": true,
      "brinchen": true
    },
    "three": { /*...*/ }
  },

  "messages": { // this node contains separately the list of messages for each chat room
    "one": { // room identifier
      "m1": { "sender": "mchen", "message": "foo" },
      "m2": { "sender": "hmadi", "message": "bar" },
      "m3": { /*...*/ }
    },
    "two": { /*...*/ },
    "three": { /*...*/ }
  }
}

Now, when fetching the list of rooms (by reading the rooms data node), we get only the "public" metadata describing each room (name and type) without their whole content (messages and members). This makes the download much faster when showing the user the list of rooms in the UI. When displaying the content of a room, its messages can be fetched separately from the corresponding child node of the messages node.
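This read pattern can be sketched in plain JavaScript. The roomList helper below is hypothetical (not part of the Webcom SDK); it simply turns the object returned by a read of the rooms node into the list to display:

```javascript
// Hypothetical helper: build the UI room list from the plain object
// returned by a read of the "rooms" node (metadata only).
function roomList(roomsValue) {
  return Object.entries(roomsValue).map(([id, meta]) => ({
    id,
    name: meta.name,
    type: meta.type,
  }));
}

// With the flattened structure, a "rooms" read returns only the metadata:
const fetched = {
  one: { name: "room alpha", type: "private" },
  two: { name: "room beta", type: "public" },
};
console.log(roomList(fetched));
// → [ { id: 'one', name: 'room alpha', type: 'private' },
//     { id: 'two', name: 'room beta', type: 'public' } ]
```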

At the security rule level, we can grant read access to anyone at the rooms data node (public data), whereas we can now refine the read rule at the child nodes of the roomUsers and messages nodes, for example based on the room membership defined under the roomUsers node, so that only room members are granted access to messages and membership.

Scale data

Assume we now want to know which rooms a given user belongs to. Following the denormalization strategy, we simply have to add a new root data node that lists all users of the application as follows:

{
  "users": {
    "mchen": {
      "name": "Mary Chen",
      "rooms": {
        // list of indices on the room identifiers
        "one": true,
        "two": true
      }
    },
    "brinchen": {
      "name": "Byambyn Rinchen",
      "rooms": {
        "two": true
      }
    }
  }
}

This is the best practice to model a two-way relationship between users and rooms:

  • The "users towards rooms" direction (which answers the question “which rooms does this user belong to?”) is implemented by an index at the users/userId/rooms data node: a JSON object whose keys are the identifiers of the rooms the user identified by userId belongs to, and whose values are meaningless (usually the true boolean constant).
  • The "rooms towards users" direction (which answers the question “which users belong to this room?”) is implemented by an index at the roomUsers/roomId data node: a JSON object whose keys are the identifiers of the users belonging to the room identified by roomId, and whose values are meaningless (usually the true boolean constant).

As you can see, the representation of a two-way relationship in a tree-like database requires duplicating a part of the data. This redundancy is necessary to ensure both efficient fetch operations (by exposing nodes with only the required data, thus avoiding reading the whole data at each request) and fine-grained security rules (by exposing nodes that give access to only one direction of the relationship, thus avoiding granting permissions on too large a piece of data).
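Because of this duplication, both sides of the relationship must be written together when a user joins a room. A minimal plain-JavaScript sketch (the helper name is hypothetical; the paths match the structure above):

```javascript
// Hypothetical helper: builds the two index entries that must be written
// together so that both directions of the relationship stay consistent.
function joinRoomUpdate(userId, roomId) {
  return {
    [`roomUsers/${roomId}/${userId}`]: true, // rooms → users direction
    [`users/${userId}/rooms/${roomId}`]: true, // users → rooms direction
  };
}

console.log(joinRoomUpdate("mchen", "one"));
// → { 'roomUsers/one/mchen': true, 'users/mchen/rooms/one': true }
```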

If we went on with this example and needed to fetch the list of all users of the application, we would have to split the users node using the same technique. For example:

{
  "users": { // this node now contains only metadata about each user
    "mchen": { // user identifier
      "name": "Mary Chen"
    },
    "brinchen": {
      "name": "Byambyn Rinchen"
    }
    /*...*/
  },
  "userRooms": { // this node contains separately the list of rooms each user belongs to
    "mchen": { // user identifier
      // list of indices on the room identifiers
      "one": true,
      "two": true
    },
    "brinchen": {
      "two": true
    }
    /*...*/
  }
}
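With this split, answering “which rooms does this user belong to?” means reading the userRooms index and then fetching each room's metadata. A plain-JavaScript sketch over an in-memory copy of the structure above (the roomsOfUser helper is hypothetical):

```javascript
// Hypothetical helper: resolve a user's rooms via the "userRooms" index,
// then look up each room's metadata under "rooms".
function roomsOfUser(db, userId) {
  const index = db.userRooms[userId] || {};
  return Object.keys(index).map(roomId => ({ id: roomId, ...db.rooms[roomId] }));
}

// In-memory copy of the flattened structure:
const db = {
  rooms: { one: { name: "room alpha" }, two: { name: "room beta" } },
  userRooms: { mchen: { one: true, two: true }, brinchen: { two: true } },
};
console.log(roomsOfUser(db, "brinchen"));
// → [ { id: 'two', name: 'room beta' } ]
```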

Manage long lists

When managing very long lists of items in an application (for example, a list of rooms, a list of messages or a list of users), we face the Webcom back-end width limitation that prevents creating data nodes with more than 50,000 children.

In order to work around this limitation, the Webcom database provides virtual hashed data nodes, which extend the number of children to over 10^15. In return, these virtual nodes consume 4 depth levels in the tree structure (as a reminder, a Webcom tree-like database allows up to 32 levels).

How does it work?

A virtual hashed data node is automatically created when you create a data node whose name starts with "#" or ".hash.". The #foo data node actually expands to 4 nested nodes of the form XX/YY/ZZ/foo, where XXYYZZ is a balanced hash of foo (the name of the hashed node). Since each of the first three expanded levels supports up to 4,096 children, at a given data node you can store up to 4,096^3 = 68,719,476,736 virtual hashed child nodes.

Specified (virtual) database

{
  "rooms": {
    "#one": { // virtual hashed node
      "name": "room alpha",
      "type": "private"
    },
    "#two": { /*...*/ },
    "#three": { /*...*/ },
    /*...*/
  }
}

Actual hosted database

{
  "rooms": {
    "_g": {        // automatic expansion
      "W8": {      // of the virtual
        "3N": {    // hashed node
          "one": { // "#one"
            "name": "room alpha",
            "type": "private"
          }
        }
      }
    },
    "rX": {         // automatic expansion
      "gu": {       // of the virtual     
        "za": {     // hashed node        
          "two": { // "#two"
            /*...*/
          }
        }
      }
    },
    "uA": {           // automatic expansion
      "Lz": {         // of the virtual     
        "hD": {       // hashed node        
          "three": { // "#three"             
            /*...*/
          }
        }
      }
    },
    /*...*/
  }
}

Of course, usage of virtual hashed data nodes is supported in every Webcom SDK (replace “<your-app>” with your actual application identifier):

JavaScript:

const app = new Webcom("<your-app>");
app.child("rooms/#one/name").set("room alpha", () => {
    app.child("rooms").once("value", snap => console.log("Value is:", snap.child("#one/name").val()));
    // => logs "Value is: room alpha"
});

Note that the child() method of both Webcom reference nodes and the DataSnapshot objects passed to read callbacks accommodates hash-prefixed path segments to refer to virtual hashed nodes.

Kotlin:

val app = WebcomApplication.default
val manager = app.datasyncService.createManager()
val nameNode = manager / "rooms/#one/name"

nameNode.set("room alpha") {
    if (it is WebcomResult.Success) {
        nameNode.get {
            println("value is: ${it.get().asNative}")
        }
    }
}

Swift:

let app = Webcom.defaultApplication
let manager = app.datasyncService.createManager()
manager.node(for: "rooms/#one/name")?.set("room alpha") { result in
    switch result {
    case let .failure(error):
        print("write error:", error)
    case .success:
        manager.node(for: "rooms/#one/name")?.value { result in
            print("Value is: ", result.value)
            // => logs "Value is: room alpha"
         }
    }
}

REST:

curl -X PUT "https://io.datasync.orange.com/datasync/v2/<your-app>/data/rooms/%23one/name" \
     -H "Content-Type: application/json" \
     -d '"room alpha"'

Note that you must URL-encode the "#" prefix.

curl "http://<your-pf>/datasync/v2/<your-app>/data/rooms/%23one/name"
# => returns "room alpha"

As virtual hashed nodes are intended to support very large lists, their parent nodes are likely to hold very large data that cannot be fetched in a single chunk, because their size exceeds the maximum weight limitation of 10 MiB per read request.
Consequently, read operations are usually requested on virtual hashed data nodes (or their children) rather than on their parent nodes.

List of independent items

An application may contain a list of items that are intended to be accessed separately, for both reading and writing. In the address book example, the contact cards are independent of each other: reading (or writing) them as a whole in a single operation is of no interest.

In such cases, Webcom can be asked to distribute the elements of this list over several processes and/or servers (how this distribution is actually implemented and works is out of the scope of this documentation). As a result, the data node representing this list is handled as a "fragmented" list: it cannot be accessed as a whole, but each of its items can independently and individually be accessed with higher performance in terms of frequency and concurrency.

Requiring the fragmentation of a data node is simply done by prefixing its key (or name) with .list..

Here is an example on an address book:

JavaScript:

const ref = new Webcom("addressBook");
ref.child(".list.contacts").set(/*...*/) // set an initial value as a whole
ref.child(".list.contacts/John").get()   // returns John's details
ref.child(".list.contacts/Paul").set(/*...*/) // update (or add) Paul's contact card

// a push is not really meaningful in this example but would be allowed 
ref.child(".list.contacts").push(/*...*/)

// Execution of the following code would lead to an error from the Webcom back end:
ref.child(".list.contacts").get() // not allowed because a fragmented node cannot be read as a whole
ref.child(".list.contacts").merge(/*...*/) // not allowed because a fragmented node cannot be accessed as a whole

Kotlin:

val app = WebcomApplication.default
val manager = app.datasyncService.createManager()

@Serializable
data class Contact(/*...*/)
val paul = Contact(/*...*/)

val contactNode = manager / ".list.contacts"
contactNode.set(/*...*/) {/*...*/}          // set an initial value as a whole
(contactNode / "John").get() {/*...*/}      // returns John's details
(contactNode / "Paul").set(paul) {/*...*/}  // update (or add) Paul's contact card

// a push is not really meaningful in this example but would be allowed 
contactNode.push(/*...*/) {/*...*/}

// Execution of the following code would lead to an error from the Webcom back end:
contactNode.get() {/*...*/}  // not allowed because a fragmented node cannot be read as a whole
contactNode.merge(/*...*/) {/*...*/} // not allowed because a fragmented node cannot be accessed as a whole

Swift:

let app = Webcom.defaultApplication
let manager = app.datasyncService.createManager()

struct Contact: Codable {/*...*/}
let paul = Contact(/*...*/)

manager.node(for: ".list.contacts")?.set(/*...*/) {/*...*/}    // set an initial value as a whole
manager.node(for: ".list.contacts/John")?.get() {/*...*/}      // returns John's details
manager.node(for: ".list.contacts/Paul")?.set(paul) {/*...*/}  // update (or add) Paul's contact card

// a push is not really meaningful in this example but would be allowed 
manager.node(for: ".list.contacts")?.push() {/*...*/}

// Execution of the following code would lead to an error from the Webcom back end:
manager.node(for: ".list.contacts")?.get() {/*...*/}   // not allowed because a fragmented node cannot be read as a whole
manager.node(for: ".list.contacts")?.merge() {/*...*/} // not allowed because a fragmented node cannot be accessed as a whole

As illustrated in this example, trying to get, set or update such a list of items as a whole is not allowed and leads to an error.

A "fragmented" node representing a list of items can hold very large data exceeding the maximum weight limitation of 10 MiB per request, but it cannot exceed the width limitation of 50,000 children. In order to overcome this limitation, you have to use "hashed" nodes.

Non-fragmentable nodes

A very large node, in terms of data size or operation throughput, may be considered by the Webcom back end as a "list of items" in a very similar way to the one explained above, so that it may decide to seamlessly fragment it in order to optimise the frequency and concurrency of accesses to its children. The difference, however, is that such a node remains accessible as a whole (i.e. it can be get or set in a single operation); in that case, the back end seamlessly "unfragments" the node, that is, hosts all its children back under the same process or server.

Most of the time, this optimisation is very welcome, as long as the node is really used as a list of items accessed separately, but it can severely degrade performance when the large node needs to be accessed in a single request (the unfragmentation operation is costly).

When needed, a node can be prevented from being fragmented simply by prefixing its key (or name) with .item.. Here is an example:

JavaScript:

const ref = new Webcom("myApplication");
ref.child(".item.indicators").set({/*a large JSON object*/});
// this node is created as a "whole": no technical optimisation may fragment it.

Kotlin:

val app = WebcomApplication.default
val manager = app.datasyncService.createManager()

@Serializable
data class IndicatorValues(/*...*/)
val indicators = IndicatorValues(/*a large data object */)

val node = manager / ".item.indicators"
node.set(indicators) {/*...*/}
// this node is created as a "whole": no technical optimisation may fragment it.

Swift:

let app = Webcom.defaultApplication
let manager = app.datasyncService.createManager()

struct IndicatorValues: Codable {/*...*/}
let indicators = IndicatorValues(/*a large data object */)

manager.node(for: ".item.indicators")?.set(indicators) {/*...*/}
// this node is created as a "whole": no technical optimisation may fragment it.