In recent years, the amount of data that needs to be processed has grown
significantly. A decade ago, a typical database contained up to a few million
records; with the growth and spread of the Internet, it became necessary to
build databases with hundreds of millions or even billions of records.
With the increase in database volume, a processing problem arose. When a large
database is distributed across servers and many tables, the time needed to find
a given record increases sharply. In addition, since large databases are used
on sites with millions of visitors, the number of simultaneous requests can
reach several thousand. Each new request arrives before the previous one has
been processed. The database servers are therefore quickly overloaded, which
leads to denial of service. For such workloads it becomes necessary to move
away from relational databases to another approach to data storage. One such
approach is NoSQL.
NoSQL is a concept that involves the use of non-relational data models and the
ability to scale horizontally (distributing a database across a very large
number of domains should not affect processing speed). The term NoSQL was first
used in 1998 by the Italian software developer Carlo Strozzi, and at that time
it referred to an open-source relational database that did not use the SQL
language. The term has been used in its modern sense since 2009.
There are two main reasons to consider NoSQL:
Application development. Developing many applications requires considerable
effort to map data structures that live, for example, in RAM onto the database.
A solution is needed that addresses this phenomenon, known as impedance
mismatch. NoSQL databases offer a data model that better fits the needs of the
application, which makes interaction with the database easier. As a result,
there is less code to write and debug, and fewer changes are needed to the
database.
Large amounts of data. Organizations find it valuable to have as much
information as possible available and to process it quickly, which is expensive
with a relational database, if it is possible at all. The main reason is that
relational databases are designed to run on a single machine, whereas it is
more economical to handle a large database by distributing the load across
clusters of many smaller, cheaper machines. Most NoSQL databases are designed
to run on clusters, so they are better suited to working with large amounts of
data.
NoSQL databases are a promising technology that makes it possible to work with
huge amounts of data distributed among servers.
What Is a Key-Value Store
Key-value stores are the simplest NoSQL data stores to use from an API
perspective. The main idea of the key-value (KV) approach is pairing a key with
a value. The value is stored as a blob, and the data store does not know what
is inside it. The client can get the value for a key, put a value for a key, or
delete a key from the data store. It is the application's responsibility to
understand what is stored in the value. Since key-value stores always use
primary-key access, they generally have great performance and can be easily
scaled.
Figure 1 – Typical example of a key-value store
The key-value model is one of the simplest non-trivial data models, and more
complex data models are often implemented as extensions of it. The KV model can
be extended to a discretely ordered model that maintains keys in lexicographic
order. This computationally powerful extension makes it possible to retrieve
selective key ranges efficiently.
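The get, put, and delete operations and the ordered extension described above can be illustrated with a minimal in-memory sketch. The class name `OrderedKVStore`, its method names, and the sample keys are assumptions made for illustration; a real store would persist data and distribute it across nodes.

```python
import bisect


class OrderedKVStore:
    """In-memory key-value store that keeps keys in lexicographic order."""

    def __init__(self):
        self._keys = []    # sorted list of keys
        self._values = {}  # key -> opaque value (the store never inspects it)

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def delete(self, key):
        if key in self._values:
            del self._values[key]
            self._keys.pop(bisect.bisect_left(self._keys, key))

    def range(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = OrderedKVStore()
store.put("user:2", b"bob")
store.put("user:1", b"alice")
store.put("cart:1", b"book")
users = store.range("user:", "user;")  # ";" sorts right after ":"
```

Because the keys are kept sorted, a range lookup needs only two binary searches and a slice, rather than a scan over every key.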
Comparison of characteristics between traditional RDBMS and Key-Value Store
Relational databases and key-value stores differ radically and are used to
solve different problems. Comparing their characteristics helps clarify the
difference between them:
RDBMS: A database consists of tables, tables contain columns and rows, and rows
consist of column values. All rows in one table have the same structure.
Key-value store: Domains are analogous to tables, but unlike a table, a domain
has no predefined data structure. A domain is a box into which you can put
anything you like. Records within the same domain can have different
structures.

RDBMS: The data model is defined in advance. It is strongly typed and contains
constraints and relations to ensure data integrity.
Key-value store: Records are identified by key, and each record has a dynamic
set of attributes associated with it.

RDBMS: The data model is based on the natural representation of the contained
data, not on the functionality of the application.
Key-value store: In some implementations, attributes can only be strings. In
other implementations, attributes have simple data types that reflect the types
used in programming: integers, arrays of strings, and lists.

RDBMS: The data model is normalized to avoid data duplication. Normalization
creates relationships between tables, and these relationships connect the data
in them.
Key-value store: Between domains, as well as within the same domain,
relationships are not explicitly defined.
Comparison of data access between traditional RDBMS and Key-Value Store
RDBMS: Data is created, updated, deleted, and queried using Structured Query
Language (SQL).
Key-value store: Data is created, updated, deleted, and queried using calls to
API methods.

RDBMS: SQL queries can extract data from a single table or from multiple tables
using joins.
Key-value store: Some implementations provide a SQL-like syntax to specify
filter conditions.

RDBMS: SQL queries can include aggregation and complex filters.
Key-value store: Often only the basic comparison operators can be used
(=, !=, <, >).

RDBMS: A relational database usually contains built-in logic, such as triggers,
stored procedures, and functions.
Key-value store: All business logic and all logic that maintains data integrity
is contained in the application code.
Comparison of interaction with applications between traditional RDBMS and
Key-Value Store
RDBMS: Proprietary APIs are most commonly used, or generic ones such as OLE DB
or ODBC.
Key-value store: SOAP and/or REST APIs are most commonly used to access the
data.

RDBMS: Data is stored in a format that reflects its natural structure, so
application structures have to be mapped to relational database structures.
Key-value store: Data can be mapped more directly to application structures;
only the code that writes the data into those structures is needed.
Advantages of Key-Value storage
There are two distinct advantages of such systems over relational databases:
They are well suited to cloud services. The first advantage of key-value stores
is that they are simpler, and therefore more scalable, than relational
databases. If you are building your own system and plan to deploy dozens or
hundreds of servers to cope with a growing workload on your data store, then
key-value stores are the natural choice. Because such stores expand easily and
dynamically, they are also useful for vendors that provide multi-tenant web
storage platforms. Such a platform is a relatively low-cost means of storing
data with great potential for scalability. Users typically pay only for what
they use, but their needs may grow. The vendor can expand the platform
dynamically, and with virtually no restrictions, based on the load.
They integrate more naturally with code. The relational data model and the
object model of the code are usually constructed differently, which leads to
some incompatibilities. Developers solve this problem by writing code that maps
the relational model to the object model. This work has no clear, quickly
achievable value and can take a lot of time that could be spent on developing
the application itself. Meanwhile, many key-value stores keep data in a
structure that maps to objects more naturally. This can significantly reduce
development time.
Disadvantages of Key-Value storage (the advantages of Relational DB)
Constraints in a relational database ensure data integrity at the lowest level.
Data that does not satisfy the constraints physically cannot get into the
database. Key-value stores have no such constraints, so enforcing data
integrity falls entirely on the application. However, any code has bugs. While
errors in a properly designed relational database usually do not lead to data
integrity issues, errors in a key-value store usually do.
Another advantage of relational databases is that they force you to go through
the process of developing a data model. If you have a well-developed model, the
database will contain a logical structure that fully reflects the structure of
the stored data rather than the structure of the application. Thus, the data
becomes independent of the application. This means that another application can
use the same data, and the application logic can be changed without any changes
to the database model. To achieve the same with a key-value store, you need to
replace the relational modeling process with the design of classes: general
classes based on the natural structure of the data.
Compared with relational databases, stores targeted at the "cloud" have far
fewer common standards. Although conceptually they do not differ, they all have
different APIs, query interfaces, and specifics. You therefore have to trust
your vendor, because if something happens, it will not be easy to switch to
another service provider. And given that almost all modern key-value stores are
in beta, that trust is even riskier than in the case of relational databases.
Key-Value Store Features on the Example of Riak
Using NoSQL data stores requires an understanding of how their features compare
with those of the standard RDBMS data stores that we also use. The main point
is to understand which features NoSQL stores lack and what changes must be made
to the application architecture to use a key-value data store and its features
more effectively. The common features of NoSQL data stores that we discuss here
are consistency, transactions, query features, structure of the data, and
scaling.
Consistency applies only to single-key operations: a get, put, or delete on a
single key. Optimistic writes are very expensive to implement, because the data
store itself cannot determine whether a value has changed.
Distributed key-value stores (Riak, for example) implement the eventually
consistent model of consistency. Since the value may have already been
replicated to other nodes, Riak has two ways of resolving update conflicts:
either the newest write wins and older writes lose, or both (all) values are
returned, allowing the client to resolve the conflict.
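The two conflict-resolution strategies can be sketched as follows. This is a simplified illustration, not Riak's implementation: the `Version` type and its client-supplied `timestamp` field are assumptions, whereas Riak itself tracks causality with vector clocks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp: float  # hypothetical write time, used only in this sketch


def last_write_wins(versions):
    """Keep only the version with the newest timestamp; older writes lose."""
    return max(versions, key=lambda v: v.timestamp)


def return_siblings(versions):
    """Return every conflicting version so the client can merge them."""
    return sorted(versions, key=lambda v: v.timestamp)


conflict = [Version("cart=[book]", 100.0), Version("cart=[book, pen]", 105.0)]
winner = last_write_wins(conflict)
siblings = return_siblings(conflict)
```

Last-write-wins is simpler for the client but silently discards the older update; returning siblings keeps all the data at the cost of merge logic in the application.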
In Riak, these options can be set up during bucket creation. Buckets are just
a way to namespace keys so that key collisions can be reduced. Let's assume
that all customer keys reside in the customer bucket. When creating a bucket,
we can provide default consistency values, such as "a write is considered good
only when the data is consistent across all the nodes where the data is stored."
Bucket bucket = connection
To guarantee that data on every node is consistent, we can increase the
numberOfNodesToRespondToWrite set by w to be the same as nVal. Of course, doing
that will decrease the cluster's write performance. To improve on write or read
conflicts, we can change the allowSiblings flag during bucket creation: if the
flag is set to false, the store lets the last write win and does not create
siblings.
Different products have different specifications of transactions, and in
general there are no guarantees on writes. Many data stores implement
transactions in different ways. Riak uses the concept of a quorum, implemented
by using the replication factor during the write API call.
Let's assume we have a Riak cluster with a replication factor of 5 and we
supply a numberOfNodesToRespondToWrite (W) value of 3. This means that Riak has
a write tolerance of N - W = 2. So up to two nodes can be down and the write
operation will still succeed, though the data will be missing on those two
nodes for reads.
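The arithmetic of this example can be checked with a small simulation. The function `quorum_write` and the boolean lists describing which replicas are up are hypothetical names used only for this sketch.

```python
def quorum_write(node_up, w):
    """Return True if at least w replicas acknowledged the write.

    node_up holds one boolean per replica (its length is N); in this sketch
    a replica acknowledges the write only if it is up.
    """
    acks = sum(1 for up in node_up if up)
    return acks >= w


n, w = 5, 3
write_tolerance = n - w  # number of replicas that may be down

two_down = [True, True, True, False, False]
three_down = [True, True, False, False, False]
ok_with_two_down = quorum_write(two_down, w)
ok_with_three_down = quorum_write(three_down, w)
```

With two replicas down the write still reaches three nodes and succeeds; with three down it cannot reach the quorum and fails.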
As the name implies, all key-value stores can be queried by key. When a query
depends on some attribute of the value, the database alone cannot answer it;
the application must read the value and check it itself.
There is an interesting side effect: most data stores will not return a list
of all their primary keys. Even if they did, retrieving lists of keys and then
querying for the values would be very expensive. Some key-value databases
compensate for this by allowing searches inside the value, as implemented in
Riak Search. This allows the user to query the data much as when using indexes.
When using key-value stores, a lot of thought must be given to the design of
the key. Can the key be generated using some algorithm? Can the key be provided
by the user (user ID, email, etc.)? Or can it be derived from timestamps or
other data available outside of the database?
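The three sources of keys mentioned above can each be sketched in a few lines. The function names, the `prefix:` convention, and the choice of SHA-1 and UUIDs are assumptions for illustration, not requirements of any particular store.

```python
import hashlib
import time
import uuid


def key_from_user(email):
    """Key provided by the user: normalize, then hash to a fixed length."""
    return hashlib.sha1(email.strip().lower().encode("utf-8")).hexdigest()


def key_from_algorithm():
    """Key generated by an algorithm: here, a random UUID."""
    return uuid.uuid4().hex


def key_from_timestamp(prefix, ts=None):
    """Key derived from a timestamp; zero-padding keeps the keys sortable."""
    ts = time.time() if ts is None else ts
    return f"{prefix}:{int(ts * 1000):015d}"


same_key = key_from_user(" Bob@Example.com ") == key_from_user("bob@example.com")
session_key = key_from_timestamp("session", 1.5)
```

Whichever scheme is chosen, the application must be able to rebuild the key later from information it already holds, since the store cannot be searched by value.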
These query characteristics make key-value stores likely candidates for storing
session data (with the session ID as the key), shopping cart data, user
profiles, and so on. The expiry_secs property can be used to expire keys after
a certain time interval, especially for session/shopping cart objects.
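Key expiry can be emulated on the client side with a small wrapper. This is a sketch under stated assumptions: the `ExpiringStore` class and the injectable `clock` are hypothetical, and only the parameter name `expiry_secs` is borrowed from the text; how a given store applies that property on the server is not shown here.

```python
import time


class ExpiringStore:
    """Key-value store whose entries expire a fixed number of seconds after a put."""

    def __init__(self, expiry_secs, clock=time.monotonic):
        self._expiry_secs = expiry_secs
        self._clock = clock  # injectable so that expiry can be tested
        self._data = {}      # key -> (value, time of write)

    def put(self, key, value):
        self._data[key] = (value, self._clock())

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, written_at = entry
        if self._clock() - written_at >= self._expiry_secs:
            del self._data[key]  # lazily drop the expired entry
            return None
        return value


now = [0.0]  # fake clock for the demonstration
sessions = ExpiringStore(expiry_secs=30, clock=lambda: now[0])
sessions.put("a7e618d9db25", {"user": "buyer"})
fresh = sessions.get("a7e618d9db25")
now[0] = 31.0
stale = sessions.get("a7e618d9db25")
```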
When writing to the Riak
bucket using the store API, the object is stored for the key provided.
Similarly, we can get the value stored for the key using the fetch API.
Riak provides an HTTP-based interface, so all operations can be performed from
a web browser or on the command line using curl. Let's save this data to Riak:
Use the curl command to POST the data, storing it in the session bucket with
the key a7e618d9db25 (we have to provide this key):
Key-value databases don't care what is stored in the value part of the
key-value pair. The value can be a blob, text, JSON, XML, and so on. In Riak,
we can use the Content-Type in the POST request to specify the data type.
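A store request with an explicit Content-Type can be built as in the sketch below, which constructs the request without sending it. The local address `http://localhost:8098`, the `/buckets/<bucket>/keys/<key>` path layout, and the sample payload are assumptions for illustration and should be checked against the Riak version in use; the bucket and key are the ones named in the text.

```python
import json
import urllib.request


def build_store_request(base_url, bucket, key, value):
    """Build (but do not send) an HTTP request that stores a JSON value."""
    url = f"{base_url}/buckets/{bucket}/keys/{key}"  # assumed path layout
    body = json.dumps(value).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical payload and node address, for illustration only.
request = build_store_request(
    "http://localhost:8098", "session", "a7e618d9db25", {"lastVisit": 0}
)
# urllib.request.urlopen(request) would send it to a running node.
```

Changing the Content-Type header (for example to `text/plain` or `application/xml`) is all that is needed to store a different kind of value, since the store treats the body as opaque.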
Sharding is a method of partitioning data across separate storage nodes
(shards). Most KV stores can be scaled with sharding. The value of the key
determines which node the key is stored on. So, assuming we shard by the first
character of the key, a key that starts with z will be sent to a different node
than a key that starts with b. This kind of sharding increases performance as
more nodes are added to the cluster.
Sharding has some downsides, though: if the node used to store z-keys goes
down, all z-keyed data becomes unavailable, and no new data can be written with
keys that start with z.
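Sharding by the first character of the key can be sketched as below, next to a hash-based variant that is a common alternative and is not described in the text. Both function names and the five-node cluster are assumptions for illustration.

```python
import hashlib


def shard_by_first_char(key, num_nodes):
    """Pick a node from the first character of the key, as in the text."""
    return ord(key[0]) % num_nodes


def shard_by_hash(key, num_nodes):
    """Pick a node from a hash of the whole key, which spreads keys more evenly."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_nodes


NUM_NODES = 5  # hypothetical cluster size
z_node = shard_by_first_char("zebra", NUM_NODES)
b_node = shard_by_first_char("bear", NUM_NODES)
```

With first-character sharding every key that starts with z lands on the same node, which is exactly why losing that node makes all z-keyed data unavailable.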
Data stores such as Riak allow you to control aspects of the CAP theorem: N is
the number of nodes that store replicas of the key-value pair, R is the number
of nodes that must return the data for a read to be considered successful, and
W is the number of nodes that must be written to before a write is considered
successful.
Let's assume we have a 5-node Riak cluster. Setting N=3 means that all data is
replicated to at least three nodes. R=2 means any two nodes must reply to a GET
request for it to be considered successful. W=2 ensures that the PUT request is
written to two nodes before the write is considered successful.
These settings allow us to fine-tune node failure tolerance for read or write
operations. Depending on the specific data store, these values can be changed
to optimize for read availability or write availability.
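What a given N/R/W combination tolerates can be summarized with a short helper. The function `replica_settings` and its result keys are hypothetical; the overlap condition R + W > N is the standard quorum rule and is added here as background, not taken from the text.

```python
def replica_settings(n, r, w):
    """Summarize what an N/R/W configuration tolerates."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return {
        "read_tolerance": n - r,   # replicas that may be down for a GET
        "write_tolerance": n - w,  # replicas that may be down for a PUT
        # If r + w > n, every set of r replicas shares at least one node
        # with every set of w replicas.
        "quorums_overlap": r + w > n,
    }


settings = replica_settings(n=3, r=2, w=2)
```

Lowering R favors read availability and lowering W favors write availability, but once R + W is no greater than N a read may be served entirely by replicas that missed the latest write.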
Let's discuss some of the problems where key-value stores are a good fit.
Storing Session Information
Generally, every web session is unique and is assigned a unique sessionID
value. Applications that store the sessionID on disk or in an RDBMS will
greatly benefit from moving to a key-value store, since everything about the
session can be stored by a single PUT request or retrieved using GET. This
single-request operation makes it very fast, as everything about the session is
stored in a single object.
User Profiles, Preferences
Almost every user has a unique userId, username, or some other attribute, as
well as preferences such as language, color, timezone, which products the user
has access to, and so on. This can all be put into an object, so getting the
preferences of a user takes a single GET operation. Product profiles can be
stored in the same way.
Shopping Cart Data
E-commerce websites have shopping carts tied to the user. As we want the
shopping carts to be available all the time, across browsers, machines, and
sessions, all the shopping information can be put into the value where the key
is the userID.
When not to use
There are problem spaces where key-value stores are not the best solution.
1. Relationships among Data
If you need to have relationships between different sets of data, or correlate
the data between different sets of keys, key-value stores are not the best
solution to use, even though some key-value stores provide link-walking
features.
2. Multioperation Transactions
If you're saving multiple keys and there is a failure to save any one of them,
and you want to revert or roll back the rest of the operations, key-value
stores are not the best solution to use.
3. Query by Data
If you need to search the keys based on something found in the value part of
the key-value pairs, then key-value stores are not going to perform well for
you. There is no way to inspect the value on the database side, except for some
products like Riak Search or indexing engines like Lucene or Solr.
4. Operations by Sets
Since operations are limited to one key at a time, there is no way to operate
upon multiple keys at the same time. If you need to operate upon multiple keys,
you must handle this from the client side.
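Handling several keys, or a query by data, from the client side can look like the sketch below. A plain dict stands in for the remote store, and the helper names `multi_get` and `filter_by_value` are assumptions for illustration.

```python
def multi_get(store, keys):
    """Fetch several keys one at a time, as a client of a KV store must."""
    return {k: store[k] for k in keys if k in store}


def filter_by_value(store, keys, predicate):
    """Query by data: read every candidate value and test it in the client."""
    return [k for k, v in multi_get(store, keys).items() if predicate(v)]


# A plain dict stands in for the remote store in this sketch.
store = {
    "user:1": {"name": "alice", "lang": "en"},
    "user:2": {"name": "bob", "lang": "de"},
}
english = filter_by_value(
    store, ["user:1", "user:2", "user:3"], lambda v: v["lang"] == "en"
)
```

Against a real store each lookup is a separate network round trip, so the cost grows with the number of candidate keys; this is why such workloads fit key-value stores poorly.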
Key-value stores are most suitable for storing large amounts of poorly
structured data that is to be distributed among several domains. That is, such
stores are suitable for sites with a very large number of visitors. Such a data
store should also be chosen if the data is object-oriented, or can have
Although the main disadvantage of such stores is the lack of a NoSQL standard,
such a standard may be adopted in the near future, which will make it more
convenient to work with this type of data store.