Most likely the service you are providing is heavily relying on
persistent data, some (or most) user-generated, often quite structured so that relationships can be drawn between them in a fairly automated fashion. When the raw datum is not structured you can structure its meta-data. For example, consider the various multimedia repositories (
flickr,
youtube,
dailymotion, etc.) The raw datum (photo, video) does not have much in the way of structure hence the emphasis its meta-data: tags, location, user ownership, etc.
Isolating the right amount of structure to drive your data model is key to correctness, performance and scalability because it is really a translation of your analysis of the problem at hand. Fortunately that's really the hard part of the whole process. Once you have a data model that seems to fit your use cases, you can delegate most of the implementation details to a database.
I am going to consider relational databases here because they cover a good number of use cases by virtue of being built on top of set theory. They require a discrete (and generally low) number of types in your model, each type being possibly represented by a large number of identifiable instances. To be sure some problems are not easily represented as relations (e.g. highly heterogeneous data, such as data commonly stored as files on a computer) or are not easy to optimize in the relational model (e.g. graphs). But the common stuff lends itself well to using relational databases.
You will find that time spent in learning relational databases is time well invested. A relational database
- frees your mind from worrying too much about storage specifics
- lets you code your application in a portable way
- offers excellent performance
So if you find yourself implementing sorting, filtering of large data sets in your application code, you are not properly using your database, at least until you have reached a respectable volume of data (more on this in later posts).
The typical misuse of the database involves something like this:
- load items in your business logic layer with minimal filtering, one by one
- for each entity, traverse the rest to recreate relationships
- discard all when finished
- observe rapidly degrading runtime performance as the number of items grows
Object-relational mappers (ORM) tend to favour this kind of excess, by re-inventing a watered-down, clumsy query language that is half as expressive as good ole' SQL. Except for the simplest queries of your data set I recommend going the other way round:
start from your relational queries and build objects from them.
Your throughput is going to scale much more easily with the number of items than if you recreate the same steps in your business logic code. Not too mention the correctness that a normalized model buys you for free, or the ability to operate on data in a transactional manner. There is a limit to this linear growth of course but until you outgrow fast and (reasonably) cheap machines such as the
Sun x4600 M2, your performance will scale. Why is a relational database likely to scale better than your code? Because a large number of smart people have spent the last 30-some years optimizing the very same operations that you perform on your data (projections, filters, updates, creations).
If the excellent
SQL Cookbook from my friend Anthony does not convince you that SQL can be an elegant approach to data problems, I don't know what will.