Schema-Less Schema's – Enterprise-grade no-SQL Schema Design for Dummies

Posted by:

Date: 5 June 2012

No comments

LinkedInTwitterFacebookGoogle+Email

Most end-to-end business management software, the kind of tools that companies of all sizes run their entire business on, or ERP systems for short, contain 5,000 to 25,000 tables. Some of this is plain silly (if you are Oracle or Infor and acquired close to 100 business software companies, you can be fairly sure the same business data is stored 20 which-ways) and some of this is a necessary evil of trying to shoe-horn today’s complex business models into something as inflexible as an SQL RDBMS. The team at iBE.net is building a new ERP system for companies too big for Quickbooks, but who can’t afford SAP or Oracle, with just 500 collections. How can 25,000 go into 500? Surely the team at iBE.net is smoking something? Read on, and I will share some of their secrets.

First of all you’ve got to unlearn everything you know about SQL, and brush-up on your object-orientation. MongoDB, like other no-SQL databases, does not use flat and inflexible “tables,” but is built from rich and flexible “collections.” Each collection can store anything from deep objects with multiple layers of sub-documents and included reference data to variable documents with different fields from one record to the next. In theory, you could put every piece of data into one collection. So the first thing in enterprise grade schema design is to decide:  what is a collection? Common sense prevails; the best thing you can do is put one business object into one collection, where a business object is what an end user understands as a document they will interact with. So an employee, with all their recruitment, performance appraisal, grading, benefits, work assignment, and balances would be one object/collection. A warehouse bin, complete with stock summaries and cycle count history might be another object. Each of these collections would require 10-20 separate SQL tables to model, so now you can immediately see 25,000 tables is reduced to 1,500 objects.

The second thing is enterprise grade schema design is normalization. Do you follow the classic SQL route and avoid duplicating any data across tables, using references (foreign keys) instead? Or do you just copy everything to where it is needed, even if this means the data is in five places at once? Let’s take an example to illustrate. My name is Richard Minney and my contact record is a key object within any business management software. I will be literally all over the database, responsible for this project, assigned to that task, approving the other, supporting a quote, attending a meeting. How am I identified? As “Richard Minney” or via the mongo ID of my contact document? Well it turns out that there are pros and cons, and the answer is not so easy. Basically, if I am identified via my mongo ID then you need a reference or join, and you need to read two documents each time to work out the name of the person responsible, assigned, approving etc. Twice (or more when you add in time to merge the two datasets) the time to read, ugh! But if I am identified via my name then what happens if I get married? Okay, I’m a guy, so my name probably won’t change, but a woman’s might. You’d have to go find all the occurrences of the name “Richard Minney” and change it, including within some documents that might have been closed ages ago. Plus the risk of such updates not working properly, or taking a long time to execute…

So it turns out that the decision to normalize vs. de-normalize rests on the ratio of reads vs. updates. If my name is changing every few weeks then you’d be crazy to de-normalize that data and spread wrong or outdated information throughout the system. But if it changes once in a blue-moon then it is not only okay to de-normalize, but will simplify and speed up reads. The catch is that there is no one-size-fits-all, you have to make this decision on a case by case basis, element by element, object by object. If you considered just the full name (or knownAs) to be important the decision here would fall in favor of copying the data, but you have to consider elements in tandem. Often-times users don’t just want to know the name of who is responsible for their business process, but that person’s status, contact information, network, and so on. Depending on how often users want to see this additional information you might be doing two reads anyway, so you might as well bite the bullet here and store mongo ID for each person where it is assigned, with a contacts collection to hold their (rich and deep) information.

The third (and final) thing is the lack of a schema. This is both a blessing and a curse. Schemas constrain what you put into the database, something MongoDB skips around. So avoiding a schema means, for example, in a multi-tenant cloud-based business management software product clients can individually add custom fields, to the same database/collection, without impacting any other clients. Wow! SAP and Oracle can’t do that, except via some clunky field name/value pair classification concept which stores the data inefficiently, or even worse, by providing “four string, two number, two date and two checkbox fields on every table for clients to rename!! But having no schema means meta-data about your business model has to be defined in the client or server-side business logic, and you will inevitably end up building a customized schema-of-sorts either in the client or the server. Let’s illustrate with another example…I want to automatically convert userID’s to upper case (so richardminney and RichardMinney are not two distinct users), and I want age to max out at 100, and I want two decimals for currency fields (actually two decimals for most currencies, no decimals for silly-money like Italian Lira or Japanese Yen). This is meta-data of sorts and you have fundamentally three options:

1. In an SQL database some of this meta-data is modeled in the database, but it is woefully inadequate to control some user interface properties, or

2. In a no-SQL database you can let developers decide how to manage it in a chaotic sort of way…this is fine for small social apps or point solutions, but won’t cut the mustard for enterprise grade business management software, or

3. You can define your own schema, something much more powerful than an SQL schema because it can extend to user interface properties and is entirely in meta-data, so it still allows individual multi-tenanted clients to add an unlimited number of custom fields.

So that is what the team at iBE.net did, they defined a custom schema. The hard part is knowing when to tell the schema to get the hell out of the way, and when to use it. But it turns out once you invest some real brain-power into the schema design it is not that complicated, not that expensive to build, and it is really useful. In fact it shoots a torpedo through the hull of any SQL database claiming object stores like MongoDB lack robustness for enterprise-grade mission critical data. That, and combining multi-collection updates into a single virtual commit with roll-back/roll-forward (also called “multi-phase commit” or “ACID across more than one object”). But that’s a topic for another blog.

So how do you get down from 1,500 tables to the final 500 collections? The answer to that is through re-use. It turns out that large software companies have lots of teams of developers often all over the world duplicating, competing, and replicating each other’s work. By taking a “clean sheet approach” to designing an ERP system for today’s world, with mobile, cloud, and social collaboration at its core, you can avoid duplication and cut the number of actual objects required to one third. That’s what the folks at iBE.net think!

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>