Tutorial: Data architecture

Step by step Data architecture

Building properly structured NoSQL stores requires quite a bit of forethought. Most importantly, we need to understand how the data will be read back later, and how to make that process as easy as possible.

Avoid building nests

Because we can nest data up to infinite levels deep, it's tempting to think that this should be the default structure. However, when we fetch data at a location in our database, we also retrieve all of its child nodes. Therefore, in practice, it's best to keep things as flat as possible, just as one would structure SQL tables. Consider the following badly nested structure:

{
    // a poorly nested data architecture, because
    // iterating "rooms" to get a list of names requires
    // potentially downloading hundreds of megabytes of messages
    "rooms": {
      "one": {
        "name": "room alpha",
        "type": "private",
        "messages": {
          "m1": { "sender": "mchen", "message": "foo" },
          "m2": { ... },
          // a very long list of messages
        }
      }
    }
}

With this nested design, iterating the data becomes problematic. Even a simple operation like listing the names of rooms requires that the entire rooms tree, including all members and groups, be downloaded to the client.

Flatten your data

If the data were instead split into separate paths (i.e. denormalized), it could be effeciently downloaded in segments, as it is needed. Consider this flattened architecture:

{
    // rooms contains only meta info about each room
    // stored under the room's unique ID
    "rooms": {
      "one": {
        "name": "room alpha",
        "type": "private"
      },
      "two": { ... },
      "three": { ... }
    },

    // room members are easily accessible (or restricted)
    // we also store these by room ID
    "members": {
      // we'll talk about indices like this below
      "one": {
        "mchen": true,
        "hmadi": true
      },
      "two": { ... },
      "three": { ... }
    },

    // messages are separate from data we may want to iterate quickly
    // but still easily paginated and queried, and organized by room ID
    "messages": {
      "one": {
        "m1": { "sender": "mchen", "message": "foo" },
        "m2": { ... },
        "m3": { ... }
      },
      "two": { ... },
      "three": { ... }
    }
}

It's now possible to iterate the list of rooms by only downloading a few bytes per room, quickly fetching meta data for listing or displaying rooms in a UI. Messages can be fetched separately and displayed as they arrive, allowing the UI to stay responsive and fast.

Creating data that scales

A lot of times in building apps, it's preferable to download a subset of a list. This is particularly common if the list contains thousands of records or more. When this relationship is static, and one-directional, we can simply nest the child objects under the parent:

{
    "users": {
      "john": {
         "todoList": {
            "rec1": "Walk the dog",
            "rec2": "Buy milk",
            "rec3": "Win a gold medal in the Olympics"
         }
      }
    }
  }

But often, this relationship is more dynamic, or it may be necessary to denormalize this data (John could have a more realistic todo list with a few thousand entries).

Consider, for example, a two-way relationship between users and groups. Users can belong to a group and groups comprise a list of users. Since they cannot be nested both ways, a first attempt at this data structure would probably look this:

// A first attempt at a two-way relationship
  {
    "users": {
      "mchen": { "name": "Mary Chen" },
      "brinchen": { "name": "Byambyn Rinchen" },
      "hmadi": { "name": "Hamadi Madi" }
    },
    "groups": {
      "alpha": {
         "name": "Alpha Tango",
         "members": {
            "m1": "mchen",
            "m2": "brinchen",
            "m3": "hamadi"
         }
      },
      "bravo": { ... },
      "charlie": { ... }
    }
  }

Great start! But when it comes time to decide which groups a user belongs to, things get complicated. We can monitor all the groups and iterate them every time there is a change, but this is costly and slow. Even worse, what if Mary isn't allowed to see all of those groups? When we try to fetch the entire list we'll get an error telling us the operation wasn't allowed.

What we would like instead is an elegant way to list the groups Mary belongs to and fetch only data for those groups. An index of Mary's groups can help a great deal here:

// An index to track Mary's memberships
  {
    "users": {
      "mchen": {
        "name": "Mary Chen",
        // index Mary's groups in her profile
        "groups": {
           // the value here doesn't matter, just that the key exists
           "alpha": true,
           "charlie": true
        }
      },
      ...
    },
    "groups": { ... }
  }

Didn't we just duplicate some data by storing the relationship under both Mary's record and under the group? We now have mchen indexed under a group and alpha listed in Mary's profile. So in order to delete Mary from the group, it has to be updated in two places?

Yes. This is a necessary redundancy for two-way relationships. It allows us to quickly and efficiently fetch Mary's memberships, even when the list of users or groups scales into the millions, or when Security and Firebase Rules would prevent access to some of the records.

Why do we invert the data by listing the ids as keys and setting the value to true? There are a few good reasons for this approach. It makes checking for a key very easy since we can just read /users/mchen/groups/$group_id and see if it is null.

// see if Mary is in the 'alpha' group
var ref = new Webcom("[[baseUrl]]/base/web/org/users/mchen/groups/alpha");
ref.once('value', function(snap) {
  var result = snap.val() === null? 'is not' : 'is';
  console.log('Mary ' + result + ' a member of alpha group');
});
// see if Mary is in the 'alpha' group
Webcom ref = new Webcom("[[baseUrl]]/base/web/org/users/mchen/groups/alpha");
ref.once('value', Query.Event.VALUE, new OnQuery(){
  @Override
  public void onComplete(DataSnapshot snapData, @Nullable String prevName) {
    // code to handle current data
    Log.i("VALUE", String.format("Mary %s a member of alpha group", snapData.val() == null ? "is not" : "is"));
  }
  @Override
  public void onCancel(WebcomError error) {
    // event subscription was canceled because of permission reason
  }
  @Override
  public void onError(WebcomError error) {
    // an error occured
  }
});
// see if Mary is in the 'alpha' group
var ref : WCWebcom = WCWebcom(url:"file://io./base/web/org/users/mchen/groups/alpha")!
ref.onceEventType(.value, withCallback: { (snapData, prevName) in
            if (snapData != nil){
                // code to handle current data
                print("Mary \(snapData?.exportVal() == nil ? "is not" : "is") a member of alpha group")

            } else {
                // an error occured
            }
}) { (error) in
    if (error != nil){
    // event subscription was canceled because of permission reason
    }
}

Consider the CPU cycles and bandwidth to perform this check without an index:

// see if Mary is in the 'alpha' group without an index (i.e. the painful way)
var ref = new Webcom("[[baseUrl]]/base/web/org/users/mchen/groups/");

// download the entire group list :(
ref.once('value', function(snap) {
    var result = 'is not';

    // iterate all the elements :((
    snap.forEach(function(ss) {
       if( ss.val() === 'alpha' ) {
          result = 'is';
          return true;
       }
    });

    console.log('Mary ' + result + ' a member of alpha group');
});
// see if Mary is in the 'alpha' group without an index (i.e. the painful way)
Webcom ref = new Webcom("[[baseUrl]]/base/web/org/users/mchen/groups/");
ref.once(Query.Event.VALUE, new OnQuery(){
  @Override
  public void onComplete(DataSnapshot snapData, @Nullable String prevName) {
    String result = "is not";

    // iterate all the elements :((
    for(DataSnapshot ss : snapData.children()){
      if(ss.val() == "alpha") {
        result = "is";
        break;
      }
    }
    Log.i("VALUE", String.format("Mary %s a member of alpha group", result));
  }
  @Override
  public void onCancel(WebcomError error) {
    // event subscription was canceled because of permission reason
  }
  @Override
  public void onError(WebcomError error) {
    // an error occured
  }
});
// see if Mary is in the 'alpha' group without an index (i.e. the painful way)
let ref : WCWebcom = WCWebcom(url:"file://io./base/web/org/users/mchen/groups/")!
ref.onceEventType(.value, withCallback: { (snapData, prevName) in
            if (snapData != nil){
                var result : String = "is not"
                // code to handle current data
                // iterate all the elements :
                let valueData = snapData?.exportVal() as!   [String : Any]
                for ( key , value ) in valueData {
                    if ( value as? String == "alpha"){
                        result = "is"
                        break;
                    }
                }
                print(" Mary \(result) a member of alpha group")
            } else {
                // an error occured
            }
}) { (error) in
            if (error != nil){
                // event subscription was canceled because of permission reason
            }
}

Thus, the index is faster and a good deal more efficient.