Best Practice Data Pipeline Architecture on AWS in 2018

Clive Skinner, Fri 06 July 2018

Last year I wrote about how Deductive makes the best technology choices for their clients from an ever-increasing number of options available for data processing and three highly competitive cloud platform vendors. As an Amazon Web Services Consulting Partner, Deductive has been recognized by AWS for their ability to design, architect, build, migrate, and manage data services on the Amazon cloud. Therefore, in this post I want to examine more deeply the AWS services we deploy most often for clients looking to migrate their data workloads to the cloud.

The role of data architecture is to gather and prepare information in a form that allows data scientists to perform their tasks quickly and efficiently. As the data sets our clients use are often extremely large, scalability and performance of the data architecture are critical. For this reason we tend to favor ‘serverless’ deployments. Not only does this provide scalability and resilience but it also significantly reduces system administration costs over the lifetime of the deployment.

A fully-featured AWS data pipeline architecture deployed by us might look something like this:

[Figure: Standard data pipeline architecture]

Simpler deployments might only include a subset of the above features, but in each case the resources used will have been carefully chosen for their intended application:

Amazon S3
Highly durable and available cloud object store.
  • Import/export
  • Interim storage
  • Data lake
Often used for data delivery (import and export) as well as for storing interim data during data processing. Can be used as a data lake for storing large amounts of structured or unstructured data.
AWS Lambda
Serverless environment to run code without provisioning or managing servers.
  • Data validation
  • Enrichment
  • Extraction and transformation
  • Cleansing and tokenizing
  • Delivery
We typically use Python as the runtime environment. Execution time is limited to a maximum of 5 minutes (at the time of writing), although compute power and memory can be increased significantly if required. Long-running or asynchronous processing is usually better done with AWS Batch.

We also use our Data Pipeline API for cleansing and tokenizing.
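
As a rough illustration of the validation and cleansing uses listed above, a Python Lambda handler might look something like the sketch below. The field names (`user_id`, `email`, `timestamp`), the event shape (a `records` list) and the cleansing rules are assumptions made for this example only; they are not taken from our Data Pipeline API.

```python
# Hypothetical schema: fields every incoming record is assumed to need.
REQUIRED_FIELDS = ("user_id", "email", "timestamp")


def validate_record(record):
    """Return a list of problems found in one record (empty list means valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append("missing " + field)
    email = record.get("email")
    if email and "@" not in email:
        problems.append("malformed email")
    return problems


def cleanse_record(record):
    """Trim whitespace from string values and lower-case the email address."""
    cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
    if isinstance(cleaned.get("email"), str):
        cleaned["email"] = cleaned["email"].lower()
    return cleaned


def handler(event, context):
    """Lambda entry point: split the incoming records into valid and rejected."""
    valid, rejected = [], []
    for raw in event.get("records", []):
        record = cleanse_record(raw)
        problems = validate_record(record)
        if problems:
            rejected.append({"record": record, "problems": problems})
        else:
            valid.append(record)
    return {"valid": valid, "rejected": rejected}
```

Keeping the validation and cleansing logic in plain functions, separate from the handler, makes it easy to unit test outside Lambda.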
AWS Batch
Dynamically provisioned compute resources for running large numbers of batch jobs.
  • Batch processing
  • Machine Learning
  • Data enrichment
Whilst not ‘serverless’ in the generally understood sense, AWS Batch provisions and scales compute resources automatically, allowing jobs to be scheduled and executed efficiently with minimal administration. Using a Lambda-like container, we schedule jobs in much the same way as the Lambda service does, with the advantage that they can run for as long as we like.
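
To give a feel for how a job is handed to AWS Batch, here is a minimal sketch using the boto3 `submit_job` call. The job, queue and job definition names are placeholders, and the helper functions are our own illustration rather than the wrapper we actually deploy.

```python
def build_batch_job(job_name, job_queue, job_definition, env=None):
    """Build keyword arguments for the AWS Batch SubmitJob call."""
    params = {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
    }
    if env:
        # Environment variables reach the container as name/value string pairs.
        params["containerOverrides"] = {
            "environment": [
                {"name": key, "value": str(value)}
                for key, value in sorted(env.items())
            ]
        }
    return params


def submit(params):
    """Submit the job and return its ID (needs boto3 and AWS credentials)."""
    import boto3

    return boto3.client("batch").submit_job(**params)["jobId"]
```

For example, `build_batch_job("enrich", "pipeline-queue", "enrich-job:1", {"INPUT": "s3://example-bucket/in/"})` produces the arguments for a job whose container reads its input location from an environment variable.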
AWS Glue
Managed extract, transform and load service.
  • Data discovery and cataloging
  • Pre-processing
  • Loading
Data catalogs generated by Glue can be used by Amazon Athena. Glue jobs can prepare and load data to S3 or Redshift on a scheduled or manual basis. The underlying technology is Spark, and the generated ETL code is customizable, which gives us the flexibility to invoke Lambda functions or other external services.
Amazon Athena
Interactive query service allowing analysis of data in S3 using standard SQL.
  • Extraction and transformation
  • Data lake
  • Analytics
Integrated with AWS Glue catalogs. It is serverless and needs no data warehouse, so no ETL is required. We use this a lot for one-off analysis of large or small data sets that would otherwise require a lot more time and infrastructure to analyse using more conventional means.
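
A one-off Athena query of the kind described above can be run from Python roughly as follows. This is a sketch only: the database name, SQL and S3 output location are supplied by the caller and are entirely hypothetical, and the polling loop is deliberately simple.

```python
import time


def build_athena_query(sql, database, output_location):
    """Build keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(params, poll_seconds=2):
    """Start the query and poll until it finishes (needs boto3 and credentials)."""
    import boto3

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(**params)["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)
```

Athena writes the result set to the S3 location given in `OutputLocation`, so the results can be picked up by any later stage of the pipeline.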
Amazon Redshift
Managed data warehouse allowing complex analytic queries to be run using standard SQL and Business Intelligence tools.
  • Data warehouse
  • Transformation
  • Analytics
Redshift’s big attraction is its fast and consistent query performance, even across extremely large data sets, as data load scales linearly with cluster size. An added bonus is the ability to create temporary Spectrum tables to query data in S3, allowing us to easily perform transformation and loading of data within Redshift itself.
Amazon Redshift Spectrum
Direct running of SQL queries against large amounts of unstructured data in S3 with no loading or transformation.
  • Extraction and transformation
  • Data lake
  • Analytics
As we’ve shared in a previous article, we sometimes use Redshift Spectrum in place of staging tables, avoiding the need to physically load data into Redshift before transformation. Also, as with Athena, Redshift Spectrum is an efficient way of quickly performing one-off analysis of large or small data sets without the need to load them.
API Gateway
Easy to create, maintain, monitor and secure APIs at any scale.
  • Import/export
  • Data delivery
  • Analytics
API Gateway can be used to provide a RESTful API into a data lake or warehouse. It can query data in real time using Lambda or Athena. For longer queries (API Gateway has a 30-second timeout) we use asynchronously invoked Lambda calls or AWS Batch behind the API. Our Pipeline API cleansing and tokenizing service is implemented in this way.
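
The asynchronous pattern mentioned above can be sketched as a small front-door Lambda that hands the work to a second function and replies immediately. The worker name `pipeline-worker`, the payload shape and the injectable `invoke` argument are assumptions made for illustration; this is not the code behind our Pipeline API.

```python
import json
import uuid


def build_async_invocation(function_name, payload):
    """Build keyword arguments for an asynchronous Lambda Invoke call."""
    return {
        "FunctionName": function_name,
        "InvocationType": "Event",  # "Event" = asynchronous, fire-and-forget
        "Payload": json.dumps(payload).encode("utf-8"),
    }


def api_handler(event, context, invoke=None):
    """Queue the work and answer with 202 well inside the API Gateway timeout."""
    if invoke is None:
        import boto3  # only needed when running inside AWS

        invoke = boto3.client("lambda").invoke
    job_id = str(uuid.uuid4())
    body = json.loads(event.get("body") or "{}")
    # "pipeline-worker" is a hypothetical name for the long-running function.
    invoke(**build_async_invocation("pipeline-worker", {"job_id": job_id, "request": body}))
    return {"statusCode": 202, "body": json.dumps({"job_id": job_id})}
```

The caller receives a job ID straight away and can collect the result later, for instance from S3 or via a notification.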
Amazon Simple Queue Service
Managed message queueing service.
  • Batching
  • Decoupling
  • Storage
The real pipe of the data pipeline. Used primarily to decouple different services to improve scalability and reliability. Also useful for batching streamed data. SQS is a polled service, as opposed to SNS, which is subscription-based.
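
As an example of batching, the sketch below groups records into SQS batch requests. SQS accepts at most 10 messages per `SendMessageBatch` call, so records are chunked accordingly. The record format and the helper names are assumptions for illustration.

```python
import json

MAX_BATCH = 10  # SendMessageBatch accepts at most 10 entries per request


def to_batches(records, size=MAX_BATCH):
    """Group records into lists of SendMessageBatch entries."""
    batches = []
    for start in range(0, len(records), size):
        chunk = records[start:start + size]
        batches.append([
            # Each entry needs an Id that is unique within its batch.
            {"Id": str(start + i), "MessageBody": json.dumps(rec)}
            for i, rec in enumerate(chunk)
        ])
    return batches


def send_all(queue_url, records):
    """Send every batch to the queue (needs boto3 and AWS credentials)."""
    import boto3

    sqs = boto3.client("sqs")
    for entries in to_batches(records):
        sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
```

A production version would also inspect the `Failed` list in each response and retry those entries; that is omitted here for brevity.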
Amazon Simple Notification Service
Managed publish/subscribe notification service.
  • Data delivery
  • Alarms
  • Decoupling
As well as the obvious use of distributing alarms and other status notifications, this service can also be used to deliver data to clients (e.g. using an HTTPS subscription) and to send batches of data for processing by Lambda functions.
Amazon DynamoDB
Fast, flexible, non-relational database with low latency and automatic scaling.
  • Logs
  • Audit
Depending on the application we sometimes find that storing logs and status in DynamoDB gives us greater flexibility for report generation and service monitoring. Its low latency and autoscaling mean there’s practically no administrative overhead and the Time-To-Live feature means we can expire (or archive) old data automatically.
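
To show how the Time-To-Live feature fits in, here is a sketch of a log item carrying an expiry timestamp. DynamoDB TTL expects a numeric attribute holding the expiry time in epoch seconds. The attribute names (`service`, `logged_at`, `expires_at`) and the 30-day default are assumptions, and the table must have TTL enabled on the chosen attribute.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def build_log_item(service, message, retain_days=30, now=None):
    """Build a log item whose 'expires_at' attribute drives DynamoDB TTL."""
    now = int(time.time()) if now is None else int(now)
    return {
        "service": service,      # hypothetical partition key
        "logged_at": now,        # hypothetical sort key, epoch seconds
        "message": message,
        "expires_at": now + retain_days * SECONDS_PER_DAY,
    }


def write_log(table_name, item):
    """Store the item (needs boto3, credentials and an existing table)."""
    import boto3

    boto3.resource("dynamodb").Table(table_name).put_item(Item=item)
```

Passing `now` explicitly keeps the function deterministic in tests; in normal use it defaults to the current time.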
Amazon CloudWatch
Monitoring service for AWS cloud resources and applications.
  • Alarms
  • Metrics
  • Logs
Custom metrics allow us to monitor performance of all parts of our data architecture with alarms generated as appropriate for significant events. Log files allow for easy debugging and basic monitoring of services where a little more detail is required.
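
Publishing a custom metric takes very little code. The sketch below builds the arguments for the boto3 `put_metric_data` call; the namespace, metric name and dimension shown in the usage note are made-up examples.

```python
def build_metric(namespace, name, value, unit="Count", dimensions=None):
    """Build keyword arguments for CloudWatch's PutMetricData call."""
    datum = {"MetricName": name, "Value": float(value), "Unit": unit}
    if dimensions:
        # Dimensions let the same metric be split by, say, pipeline stage.
        datum["Dimensions"] = [
            {"Name": key, "Value": val} for key, val in sorted(dimensions.items())
        ]
    return {"Namespace": namespace, "MetricData": [datum]}


def publish(params):
    """Send the metric to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3

    boto3.client("cloudwatch").put_metric_data(**params)
```

For instance, `publish(build_metric("DataPipeline", "RecordsRejected", 3, dimensions={"Stage": "validate"}))` would record three rejected records against a validation stage, and an alarm could then be set on that metric.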
AWS CloudTrail
Allows governance, compliance, and auditing of AWS resources.
  • Audit
A much more rigorous form of audit than CloudWatch, this service allows logging, continuous monitoring, and retention of all account activity across your AWS infrastructure.
Amazon Glacier
Low-cost, durable and secure storage service for data archiving.
  • Archive
We use this service for long-term archiving of logs and data, usually for audit purposes. Objects in S3 can be configured to be automatically archived to Glacier after a set period of time, significantly reducing storage costs.
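
The automatic archiving described above is configured as an S3 lifecycle rule. The following sketch builds such a rule and applies it with boto3; the prefix, the 90-day default and the rule naming scheme are assumptions chosen for the example.

```python
def build_archive_rule(prefix, days=90):
    """Build a lifecycle configuration that moves objects to Glacier."""
    label = prefix.strip("/").replace("/", "-") or "all"
    return {
        "Rules": [{
            "ID": "archive-" + label,
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            # Objects under the prefix transition 'days' after creation.
            "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
        }]
    }


def apply_rule(bucket, config):
    """Apply the configuration (needs boto3 and AWS credentials)."""
    import boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
```

Note that this call replaces the bucket's whole lifecycle configuration, so any existing rules need to be included in the same request.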
AWS CloudFormation
A common format to describe and provision all AWS cloud infrastructure.
  • Provisioning
  • Configuration
We love CloudFormation! Imagine being able to define and deploy an entire data centre from a single text file - that’s what it can do for you. The most powerful aspect is having your cloud infrastructure under version control and being able to deploy and update it from your CI/CD pipeline. It allows us to automatically spin up development, test and staging instances of all of our deployments and run thorough automated test cycles before we even consider going live with any changes to our production deployments.
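
To make the "single text file" idea concrete, here is a deliberately tiny sketch that generates a CloudFormation template as JSON and creates a stack from it. The resource names (`LandingBucket`, `WorkQueue`), the stack naming scheme and the choice to build the template in Python are all assumptions for illustration; real templates are usually kept as YAML or JSON files under version control.

```python
import json


def build_template(env):
    """Return a minimal CloudFormation template as a Python dict."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Pipeline resources for the %s environment" % env,
        "Resources": {
            "LandingBucket": {"Type": "AWS::S3::Bucket"},
            "WorkQueue": {
                "Type": "AWS::SQS::Queue",
                "Properties": {"VisibilityTimeout": 300},
            },
        },
        "Outputs": {
            # Ref on an SQS queue resolves to the queue URL.
            "QueueUrl": {"Value": {"Ref": "WorkQueue"}},
        },
    }


def deploy(env):
    """Create the stack (needs boto3 and AWS credentials)."""
    import boto3

    boto3.client("cloudformation").create_stack(
        StackName="data-pipeline-" + env,
        TemplateBody=json.dumps(build_template(env)),
    )
```

Calling `deploy("test")` and `deploy("staging")` would create two independent copies of the same resources, which is the property that makes automated test cycles against throwaway environments practical.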
