Tuesday, September 10, 2013

Polyglot persistence

Polyglot persistence is a term that comes up often in articles. Simply put, it means being aware of the different persistence models and technologies, and choosing the model that best fits the data that needs to be stored. The definition distinguishes between traditional relational database models, object-based database models, and the more intangible NoSQL concept.

I have a lot of experience with relational data storage, as do many others, from using Oracle, MySQL and similar databases, and one could argue that this is still the most common way to store data. However, not so long ago the NoSQL concept received a lot of attention, and I have long been interested in it, so I’ll devote a post to the topic.

NoSQL is not as simple to define as RDBMS, which is essentially SQL. When I first read about it I didn’t get it, and I’m still not sure I see the big picture. I say that so you know not to expect a write-up with much detail. This post is a way for me to condense and gather some of my early insights into NoSQL, from a bird’s-eye view.

One thing I realized quickly is that NoSQL is tied to large clusters of databases, in contrast to relational databases. That is not to say RDBMS technologies can’t be clustered, but such systems are often built around central data storage, and growing to a cluster of relational databases can be very problematic. NoSQL is supposed to handle this much better.

NoSQL comes in three major forms, and each form is closely tied to the type of data you need to store. Choosing the form is very important and is the whole point of polyglot persistence: finding the better storage model. The three models are:

·         Key/Value-pairs
·         Graph
·         Document

They are each different, and all of them fall under the umbrella of NoSQL. All of them are very different from RDBMS, and that’s why NoSQL has had such a strong impact on new technology. When designing enterprise apps or systems today, it would not be fair to assume that your persistence should be relational. It may be the preferred option, but awareness of and familiarity with the other options will reinforce the reasons for choosing relational persistence after considering NoSQL. Choosing the correct one may reduce development time and result in a better application, so due diligence on the different options should be a priority.

Data being stored    Persistence technology           Applications
-------------------  -------------------------------  ------------------------------------------------------
Financial data       Oracle, MySQL, SQL Server etc.   Transactional updates, ACID
Reporting            Oracle, MySQL, SQL Server etc.
User sessions        Redis                            Rapid read/write access
Shopping cart        Riak, DynamoDB                   High availability in multiple locations
Recommendations      Neo4j                            Traverse links between friends, products and purchases
Product catalog      MongoDB                          Many reads, infrequent writes
User activity logs   Cassandra                        Large cluster, many writes on many nodes

The table is based on a post by Martin Fowler that I read recently, and it suggests scenarios in which each persistence technology may be applicable. For example, a global e-commerce store may use Riak or DynamoDB to serve customers in different parts of the planet. This is obviously not enough to decide which technology to use for a specific application or scenario, so I wanted to go a little deeper and find out more about the characteristics of the different technologies and the scenarios in which they can be considered.
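To make the idea of the table concrete for myself, here is a minimal sketch that encodes it as a lookup from kind of data to candidate stores. The names STORE_FOR and candidate_stores are my own inventions for illustration; nothing here comes from Fowler's post beyond the pairings already listed in the table.

```python
# Hypothetical encoding of the table: kind of data -> suggested stores.
# STORE_FOR and candidate_stores are illustrative names, not a real API.
STORE_FOR = {
    "financial data": ["Oracle", "MySQL", "SQL Server"],
    "reporting": ["Oracle", "MySQL", "SQL Server"],
    "user sessions": ["Redis"],
    "shopping cart": ["Riak", "DynamoDB"],
    "recommendations": ["Neo4j"],
    "product catalog": ["MongoDB"],
    "user activity logs": ["Cassandra"],
}


def candidate_stores(data_kind):
    """Return the stores the table suggests for a kind of data."""
    key = data_kind.strip().lower()
    if key not in STORE_FOR:
        raise ValueError(f"no suggestion for {data_kind!r}")
    return STORE_FOR[key]


cart_stores = candidate_stores("Shopping cart")
```

A lookup like this is only a starting point for discussion; the real decision still depends on the characteristics described in the rest of this post.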

Redis (Remote Dictionary Server) is basically a key-value store held in RAM with built-in persistence. Since the data is in RAM, it’s extremely fast and suitable for quick reads and writes. Redis supports data types such as strings, hashes, lists, and sets. The string is the most basic type, and the other types are essentially containers of strings, so the characteristics of the string are important. A string may be up to 512MB, it is binary safe, and it can hold images or serialized objects. Keys are strings, while values may be any of the types; there’s a recommendation not to make keys too large.
I’ve read or heard of several applications using Redis in one way or another, for example a Twitter-like feed, an authentication store, a leaderboard, a roster with online/offline status, and a note-keeping app.
Another thing about Redis is that it is cross-platform and has clients in numerous languages, so it seems to be a good choice if the application is intended for multiple platforms and different programming languages are used.
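To get a feel for the data types, I find it helpful to model them in plain Python. The sketch below is a toy in-memory stand-in, not a Redis client: the class ToyRedis is hypothetical, and only the method names are borrowed from the Redis commands SET, GET, HSET, HGET, LPUSH, LRANGE, SADD and SMEMBERS. It ignores persistence, expiry and everything else a real server does.

```python
class ToyRedis:
    """In-memory stand-in that mimics a few Redis commands (not a real client)."""

    def __init__(self):
        self._data = {}  # key (string) -> string, dict, list or set

    # Strings: the most basic type
    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    # Hashes: a map of field -> string stored under one key
    def hset(self, key, field, value):
        self._data.setdefault(key, {})[field] = value

    def hget(self, key, field):
        return self._data.get(key, {}).get(field)

    # Lists: LPUSH adds to the head of the list
    def lpush(self, key, value):
        self._data.setdefault(key, []).insert(0, value)

    def lrange(self, key, start, stop):
        items = self._data.get(key, [])
        # The stop index is inclusive; -1 means the last element.
        end = len(items) if stop == -1 else stop + 1
        return items[start:end]

    # Sets: unordered collections of unique strings
    def sadd(self, key, member):
        self._data.setdefault(key, set()).add(member)

    def smembers(self, key):
        return set(self._data.get(key, set()))


r = ToyRedis()
r.set("user:1:name", "alice")            # plain string value
r.hset("session:abc", "user_id", "1")    # authentication/session store
r.lpush("feed:1", "first post")          # Twitter-like feed, newest first
r.lpush("feed:1", "second post")
r.sadd("online", "alice")                # roster of online users
latest = r.lrange("feed:1", 0, -1)
```

The feed and roster examples correspond to the applications mentioned above: a list pushed at the head naturally keeps the newest entry first, and a set answers "who is online" without duplicates.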

Riak is also a key/value store. According to its website, the technology is used by many well-known internet brands, for example the online retailer Best Buy. Its main purpose is real-time systems where availability and scalability are high priorities. It has a full-text search engine and some advanced indexing features that help keep latency low. Another use case is mobile apps; for example, the find-a-taxi app Flywheel is said to use Riak.

DynamoDB is a database from Amazon and is part of the AWS (Amazon Web Services) suite. Being part of AWS means it’s hosted in the Amazon cloud and integrates well with the many other services in the AWS infrastructure. Data is stored on SSDs and is replicated across availability zones within an AWS region. DynamoDB is a table-based data store, but the tables have no schema except for the fact that each table has a primary key. The three concepts of tables, items and attributes are central to the data model.
To read data, a Query or a Scan is used: a Query looks items up by primary key, whereas a Scan examines the entire table. Requests and results are in JSON format.
The scalability of DynamoDB makes it suitable for online portals with a large number of users. With it comes the full AWS infrastructure, including the powerful AWS Management Console.
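The difference between a Query and a Scan described above can be sketched in a few lines of plain Python. This is a toy in-memory table under my own assumptions, not the AWS SDK: the class ToyTable and its methods put_item, query and scan are hypothetical names chosen to echo the DynamoDB concepts of tables, items and attributes.

```python
class ToyTable:
    """Toy schemaless table: only the primary key attribute is required."""

    def __init__(self, key_name):
        self.key_name = key_name
        self._items = {}  # primary key value -> item (dict of attributes)

    def put_item(self, item):
        self._items[item[self.key_name]] = item

    def query(self, key_value):
        """Look up by primary key: touches only the matching item."""
        item = self._items.get(key_value)
        return [item] if item is not None else []

    def scan(self, predicate):
        """Examine every item in the table and keep those that match."""
        return [item for item in self._items.values() if predicate(item)]


carts = ToyTable("user_id")
carts.put_item({"user_id": "u1", "items": ["book"], "country": "SE"})
carts.put_item({"user_id": "u2", "items": ["pen", "ink"]})  # attributes may differ per item

by_key = carts.query("u1")
swedish = carts.scan(lambda item: item.get("country") == "SE")
```

The point of the sketch is the cost difference: query goes straight to one item, while scan has to visit every item, which is why a Scan becomes expensive as a table grows.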

Neo4j is a graph database. It is based on connections between nodes in a web, where the connections, or edges, of the graph contain data. It supports transactions and is ACID compliant. Conceptually it may be thought of as a web of relations between people in the form of a graph, and therefore I would expect it to see heavy use in social media. Recommendations based on your social network would be a suitable application, since a graph query is efficient there. Neo4j uses the Cypher query language, which takes some keywords from SQL but in general looks, and is used, quite differently.
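To illustrate the kind of traversal a recommendation needs, here is a minimal sketch in plain Python that suggests friends-of-friends ranked by the number of mutual friends. It is not Neo4j or Cypher; the FRIENDS data and the function recommend_friends are made up for this example, and a real graph database would run such a traversal over its own storage.

```python
from collections import Counter

# Hypothetical sample graph: each person maps to the set of their friends.
FRIENDS = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}


def recommend_friends(graph, person):
    """Rank friends-of-friends by how many mutual friends they share."""
    direct = graph.get(person, set())
    counts = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            # Skip the person themselves and people they already know.
            if candidate != person and candidate not in direct:
                counts[candidate] += 1
    # Highest mutual-friend count first, then by name for a stable order.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


suggestions = recommend_friends(FRIENDS, "ann")
# suggestions == [("dan", 2), ("eve", 1)]
```

Here dan is suggested first because both of ann's friends know him. In a relational database the same question typically means joining a friendship table with itself, which is where a graph model is said to pay off.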

MongoDB is a document database. Documents are stored in a JSON-like format, and it is rumored to be the most popular NoSQL database. Perhaps this is because its development started in 2007 (with a first public release in 2009), which is very early in the history of NoSQL technology. Search queries can be made on fields, ranges or regular expressions. Master-slave replication is one of the features of MongoDB. Suitable applications for MongoDB and other document databases are those where the concept of a document is central, such as in a news agency; I believe that many well-known news providers use MongoDB or other document databases.
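The three kinds of search mentioned above (by field, by range, and by regular expression) can be sketched over plain Python dictionaries. This is only an illustration of the idea, not MongoDB's query API: the ARTICLES data and the function find with its parameters are hypothetical.

```python
import re

# Hypothetical news articles stored as documents (plain dicts here).
ARTICLES = [
    {"title": "Election results announced", "section": "politics", "words": 850},
    {"title": "Local team wins final", "section": "sports", "words": 400},
    {"title": "Election turnout analysis", "section": "politics", "words": 1500},
]


def find(docs, field, equals=None, at_least=None, at_most=None, pattern=None):
    """Return documents whose `field` passes every condition that is given."""
    results = []
    for doc in docs:
        if field not in doc:
            continue  # documents need not share the same fields
        value = doc[field]
        if equals is not None and value != equals:
            continue
        if at_least is not None and value < at_least:
            continue
        if at_most is not None and value > at_most:
            continue
        if pattern is not None and not re.search(pattern, value):
            continue
        results.append(doc)
    return results


politics = find(ARTICLES, "section", equals="politics")           # field match
long_reads = find(ARTICLES, "words", at_least=800, at_most=2000)  # range
elections = find(ARTICLES, "title", pattern=r"^Election")         # regular expression
```

A real document database would use indexes rather than looping over every document, but the shape of the questions you can ask is the same.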

Cassandra is a distributed database system with a key-value storage model (more precisely, a column-family model in which each row key maps to a set of columns). It is used for storing large amounts of data, as it scales horizontally with little effort. It has been tested with huge data volumes, up to hundreds of terabytes over hundreds of machines. If you are storing and consuming large volumes of data, then Cassandra may be the model of choice.
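Cassandra spreads rows over the cluster by hashing each row key onto a ring of tokens owned by the nodes. The sketch below is my own toy version of such a ring, meant only to show why adding a machine is cheap. It is not Cassandra's actual partitioner: the class ToyRing, the use of MD5, the node names and the number of positions per node are all assumptions made for illustration.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a ring position (MD5 is used here only to spread keys)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ToyRing:
    def __init__(self, nodes, vnodes=8):
        self.vnodes = vnodes
        self._tokens = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node claims several positions so the load spreads more evenly.
        for i in range(self.vnodes):
            t = token(f"{node}#{i}")
            bisect.insort(self._tokens, t)
            self._owner[t] = node

    def node_for(self, key):
        # The first position at or after the key's token owns it; wrap at the end.
        i = bisect.bisect_left(self._tokens, token(key))
        return self._owner[self._tokens[i % len(self._tokens)]]


ring = ToyRing(["node-a", "node-b", "node-c"])
keys = [f"user-{n}" for n in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # scale out by one machine
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in moved ends up on node-d, and all other keys stay where they were. Only the new node's share of the data has to be copied, which is the property that makes this style of horizontal scaling manageable.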



There are many more implementations of NoSQL databases available, and if you are pondering using one, the next step after deciding between key/value, graph and document storage would be to try as many as possible. Things were a lot easier when RDBMS was the only option, weren’t they?

As a final note, I’d like to credit Scott Leberknight, who is believed by many to be the first person to use the term Polyglot Persistence, in this article.
