The unstructured data hype of the last few years seems to be abating and we are are increasingly finding clients wanting to implement structured datasets. This is primarily because media data does fit relatively simply into a taxonomy: the consumer has various attributes and is consuming media which also has many different attributes.
The previous approach of dumping large volumes of unstructured data into a Hadoop cluster because “it might be useful later” is providing both expensive, inefficient, and provides a significant security risk.
This blog covers the current state of the art options
SQL database options
Although there are many options available for large-scale data warehouses, the most common implemented at scale are:
- SQL Server
Amazon’s Redshift provides a simple-to-deploy, secure, and fast data warehouse infrastructure. It is highly cost effective when deployed correctly but it is hard to find experienced staff who know how to correctly deploy it. Particularly troublesome areas are in establishing distribution and sort keys and choosing the correct architecture of the different nodes in the product. Redshift supports storage and querying of JSON object data within referential tables enabling the most efficient data structures.
Hive is a very popular addition to provide SQL functionality in a Hadoop environment and it is also available from AWS as part of the AMR product. It is fast and cost effective when Hadoop infrastructure is already available or used for other purposes but in a pure AWS environment is probably less effective than Redshift.
Hive supports storage and querying of JSON object data within referential tables.
Oracle supports storage and querying of JSON object data within referential tables and the Exadata product provides a columnar database with strong performance against both Redshift and Hive in an on-premises model. We would only really recommend Oracle as a platform for a data lake for an organization that already has a strong internal DBA function familiar with the Oracle platform.
SQL server is Microsoft’s relational database and can be deployed both on premises and within the cloud. It is very much a traditional database and supports columnar storage.
We would only really recommend SQL server for an organization that has strongly aligned their technology roadmap to the Microsoft platform.
SQL Server does not support storage and querying of JSON object data within referential tables
NOSQL database options
NOSQL databases provide massive data stores with alternative query languages for accessing the data and are typically used when developing applications rather than for a data analytics data store.
Key examples are:
- Google BigTable – the database behind google search, analytics, and gmail can handle massive workloads in realtime and is provided by Google as a service through their cloud
- Hadoop HBase – provide high performance read/write of large amounts of data but minimal queries of data
- Cassandra – provides scalability and high availability with good performance with excellent cross-regional replication and query performance comparable with a relational database.