SQL Server Database Model Optimization for Developers
It's my assessment that most developers have no idea how much a poor database model implementation or implementation by a DBA unfamiliar with the data semantics can affect a system. Furthermore, most developers whom I have worked don't really understand the internals of SQL Server enough to be able to make informed decisions for their project. Suggestions concerning the internals of SQL Server are often met with extremely reluctance from developers. This is unfortunate, because it is only when we understand a system’s mechanics that we can fully optimize our usage of it. Some familiar with the history of Physics will recall the story of when Einstein "broke" space by his special theory of relativity. Before Einstein was able to "fix" space, he had to spend nearly a deciding trying to decipher how space worked. Thus was born the general theory of relativity.
It's not a universal rule, but I would have to say that the database model is the heart of any technical solution. Yet, in reality, the database implementation often seems to be one of the biggest bottle necks of a solution. Sometimes it’s a matter of poorly maintained databases, but from my experience it seems to mostly be a matter of a poorly designed implementation. More times than not, the SQL Server database model implementation has been designed by someone with either only a cursory knowledge of database modeling or by someone who is an expert in MySQL or Oracle, not SQL Server.
Database modeling does not end with defining entities and their respective relations. Rather, it extends completely into the implementation. What good is an amazing plan, if it is going to be implemented poorly? The implementation phase to which I am referring comes before the actual implementation, yet after what most people refer to as “modeling”. It’s actually not even a phase with a temporal separation, but is rather a phase that requires continual thought and input from the start about the semantic understanding of the real world solution. This phase includes things like data-type management, index design, and security. This phase is the job of the resident architect or senior level developer, not the job of the DBA. It needs to be overseen by someone who deeply understanding both SQL Server and the semantics of the solution. Most of the time the DBA needs to completely stay away from the original data model and focus more on the server specific tasks like monitoring backups and tweaking existing data models based on the specifications that an architect has previously documented. Having said this, I often find that it's not only not the architect or senior developer optimizing a project, often nobody even cares!
Developers need to start understanding that designing a proper data model based on the real world representation includes minimizing data usage, optimizing performance, and increasing usability (for the solution’s O/R mapping). These are not jobs for a DBA. Someone with close knowledge to the project needs to make these decisions. More times than not, a DBA simply does not have the understanding of the project required to make these important decisions. They should stay away from the requirements of the system, leaving this to the architect and senior-level developers. Despite what many well intentioned DBAs think, they do not own the data. They are merely assistants to the development team leaders.
Let's start off by looking at storage optimization. Developers should be able to look at their data model and notice certain somewhat obvious flaws. For example, suppose you have a table with a few million rows with each row containing multiple char(36) columns (a guid), two datatime columns (8-bytes each), six int columns (4-bytes each)-- two of which are foreign keys to reference/look-up/enumeration tables, and an int (4-bytes) column which is also table's primary key and identity. To optimize this table, you absolutely must know the semantics of the solution. For example, if we don't care about recording the seconds of a time, then the two datetime columns should be set to be smalldatetime columns (4-bytes each). Also, how many possible values could there be in the non-foreign key int columns? Under 32,727? If so, then these could easily be smallint columns (2-bytes each).
What about the primary key? The architect or senior-level developer should have a fairly good estimate on how large a table will ever become. If this table is simply a list of categories, then what should be do? Often the common response is to convert it to a tinyint (1-byte). In reality, however, we shouldn't even care about size of the primary key. It’s completely negligible; even if there were only 100 rows, switching it to a tinyint could cause all kinds of problems. The table would only be marginally smaller and all your O/R mappers are now using an Int16 instead of an Int32, which could potentially cause casting problems in your solution. However, if this table tracks transactions, then perhaps you need to make it a bigint (8-bytes). In this case, you need to put force a strong effort to making sure that you have optimized this table down to its absolutely raw core as those bigint values can add up.
Now, what about the foreign keys? If they are simply for reference, then the range of values probably isn't really that wide. Perhaps there are only 5 different values against which the data is to be constrained. In this case, the column could probably be a tinyint (1-byte). Since a primary key and foreign key must be the same data type, the primary key must also become a tinyint (1-byte). This small change alone could cut your database size by a few hundred MB. It wasn't just the foreign key table that dropped in size, but the references between the two tables are now smaller as well (-- I hope every now understand why you need to have a very good reason before you even think about using a bigint foreign key!) There's something else to notice here as well. Reference tables are very helpful for the developer to look at the raw data, but does there really need to be a constraint in the database? If the table simply contains an Id and Text column with only 8 possible values, then, while the table may be tremendously helpful for documentation purposes, you could potentially break the foreign key constraint and put the constraint logic in your application. However, keep in mind that this is for millions or possibility billions of rows. If the referencing table contains only a few thousand rows or if space doesn’t have a high opportunity cost, which may be the case if the solution is important enough to actually have that many rows in the first place, then this optimization could cause more problems than it solves. First off, your O/R mapper wouldn’t be able to detect the relation. Secondly, obviously, you wouldn’t have the database level constraint for applications not using the solution’s logic.
Another optimization that’s important is performance optimization. Sometimes a table will be used in many joins and will be used heavily by each of the CRUD (Create, Retrieve, Update, Delete) operations. Depending on how important the column is, you may be able to switch a varchar(10) to a char(10) . The column will allocate more space, but your operations may be more efficient. Also, try to avoid using variable length columns (varchar) as foreign keys. In fact, try to keep your keys as the smallest integer type you possibly can. This is both a space and performance optimization. It's also important to think very carefully about how the database will be accessed. Perhaps certain columns need extra indexes and others need less. Yes, less. Indexes are great for speeding up read access, but slow down insert operations. If you add too many indexes, your database inserts could run your system to a crawl and any index defragmentation could leave you with a painfully enormous transaction log or a non-functioning SQL Server.
This is exactly what happened to a company I worked with in 2005. Every Saturday night for several weeks in a row, the IT team would get an automated page from their monitoring service telling them that all their e-commerce web sites were down. Having received the phone call about 2AM, I looked into a few things and noticed that the transaction log had reached over 80GB for the 60GB database. Being a scientist who refuses fall into the post hoc ergo proctor hoc fallacy, I needed measurements and evidence. The first thing I did was write a stored procedure that would do some basic analysis on the transaction log by pulling data from the fn_dblog( ) function and doing a simple cube and save the results into a table for later review. Then I told them that the next time the problem occurred they were to run the stored procedure and call me the next Monday (a polite way of telling them that I’m sleeping at 2AM on Saturdays). Exactly one week later the same thing happened and the IT department ran the stored procedure as instructed (and, yes, waited to Monday to call me, for which I am grateful). Looking over the stored analysis data, I noticed that there were a tremendous number of operations on various table indexes. That gave me the evidence that I needed to look more closely at the indexes of each of the 5,000+ tables (yes, that’s four digits—now you know why I needed more information). After looking at the indexes, I realized that the database was implemented by someone who didn’t really understand the purpose of indexing and who probably had an itchy trigger finger on the Index Tuning Wizard. There were anywhere from 6 to 24 indexes on each table. This explained everything. When the weekly (Saturday at 2AM) SQL Server maintenance plan would run, each of the indexes were defragmented to clean up the work done by high volume of sales that occurred during the week. This, therefore, caused a very large number of index optimizations to occur. Each index defragmentation operation would be documented in the transaction log, filling the transaction log’s 80GB hard drive, thereby functionally disabling SQL Server.
In your index design, be sure to also optimize your index fill factors. Too full and you will cause a page split and bring your system to a crawl. Too empty and you're wasting space. Do not let a DBA do this. Every bit of optimization requires that a person deeply knowledgeable about the system to implement a complete database design. After the specifications have been written, then the DBA can get involved so that he or she can then run routine maintenance. It is for this reason that DBAs exist. For more information on the internals of SQL Server, see the book Inside SQL Server 2005: The Storage Engine by Kalen Delaney (see also Inside SQL Server 2000). This is a book which should be close to everyone who works with SQL Server at all times. Buy it. Read it. Internalized it.
There’s still more to database modeling. You want to also be sure to optimize for usability. Different O/R mapping solutions will have different specific guidelines, but some of the guidelines are rather general. One such guideline is fairly well known: use singular table names. It's so incredibly annoying to see code like "Entries entry = new Entries( );" The grammar just doesn't agree. Furthermore, LINQ automatically inflects certain tables. For example, a table called “BlogEntry” will be related to the LINQ entity “BlogEntry” as well as “BlogEntries” in the LINQ data context. Also, be sure to keep in mind that your O/R mapper may have special properties that you’ll want to work around. For example, if your O/R mapper creates an entity for every table and in each created entity there is a special "Code" property for internal O/R mapper use, then you want to make sure to avoid having any columns named "Code". O/R mappers will often work around this for you, but "p.Code_" can get really confusing. You should also consider using Latin-style database naming (where you prefix each column with its table name—so-named because Latin words are typically inflected with their sentence meaning thereby allowing isolated subject/object identification), this is not only a world of help in straight SQL joins, but just think about how Intellisense works: alphabetically. If you prefix all your columns with the table name, then when you hit Ctrl-J you’ll have all your mapped properties grouped together. Otherwise, you’ll see the "Id" property and it could be be 10-15 internal O/R mapper properties before you find the next actual entity column. Doing this prefixing also usually alleviates conflicts with existing O/R mapper internal properties. This isn't quite as important for LINQ, but not having table-name prefixed columns in LLBLGen Pro can lead to some headaches.
Clearly, there's more to think about for database design than the entity relation diagramming we learn in Databases 101. It should also be clear that your design will become more optimized the more specific it becomes. A data model designed for any database server probably won’t be as efficient as a well thought out design for your specific database server and for your specific solution requirements. Just keep in mind that your database model is your applications view of reality and is often the heart of the system and it therefore should be optimized for space, speed, and usability. If you don't do this, you could be wasting many gigabytes of space (and therefore also hundreds of dollars of needless backups), have a painfully inefficient system, and have a hard time figuring out how to access the data in your application.