With increasing demand for data engineers, it is becoming harder to recruit staff who can manage and support a data pipeline. Although cloud providers like AWS have simplified technical operations such as managing servers and hosting facilities, the day-to-day operations of managing data remain. These include verifying data receipt, checking for data issues, escalating to data suppliers where needed, optimizing data storage, and ensuring users can query the data efficiently.
These tasks are non-core to most organizations, but they are a critical hygiene factor to building a modern data-driven organization. At Deductive, we run data operations for companies at scale. We have developed a specialist team and toolset for monitoring and managing data infrastructure on AWS, and in this article, we explain our approach to data operations.
What is data operations?
AWS provides fully managed infrastructure for data. From data collection with Kinesis to data warehousing on Redshift and data lakes on S3 and Amazon Athena, it is an impressively comprehensive toolset. However, although Amazon fully manages the infrastructure, it does not provide the data operations needed to ensure that data flows smoothly and is usable by staff. Achieving that requires management across four areas: data ingest, data health, data storage, and data security.
Monitoring data ingest
Data ingestion on AWS is typically implemented using Lambda functions, bespoke Python scripts, cron jobs on EC2, AWS Batch, and direct access to S3 buckets by third parties. To verify the flow of data into these platforms, we have developed a tool, the Pipeline API, that confirms all incoming data and applies data filtering, alerting, tokenization, and monitoring. These checks include whether the data is arriving on time, whether it is correct and well formed, and whether it is of the right size.
If data is stored in S3, the Pipeline API can be deployed simply by granting it access to the S3 bucket. Once monitoring is in place, our team watches the incoming data for the first few weeks and builds rules for how the data should be validated. Once these rules are set up and agreed, the team can investigate and escalate whenever data does not meet the specification.
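To illustrate the kinds of rules involved (the Pipeline API itself is proprietary, so the names and thresholds here are hypothetical), a freshness, size, and format check over an S3-style object listing might look like this sketch:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical rules of the kind described above: data must arrive on
# time, be of a plausible size, and be in the expected format.
RULES = {
    "max_age": timedelta(hours=24),   # data must be no older than this
    "min_bytes": 1_000,               # reject suspiciously small files
    "suffix": ".gz",                  # expected file format
}

def check_objects(objects, rules, now=None):
    """Return (key, problem) pairs for objects that break the rules.

    `objects` is an iterable of dicts shaped like an S3 listing entry:
    {"Key": str, "Size": int, "LastModified": datetime}.
    """
    now = now or datetime.now(timezone.utc)
    problems = []
    for obj in objects:
        if not obj["Key"].endswith(rules["suffix"]):
            problems.append((obj["Key"], "unexpected format"))
        if obj["Size"] < rules["min_bytes"]:
            problems.append((obj["Key"], "file too small"))
        if now - obj["LastModified"] > rules["max_age"]:
            problems.append((obj["Key"], "data is late"))
    return problems
```

In production, the listing would come from an S3 API call such as boto3's `list_objects_v2`, and failures would feed an alerting channel rather than a return value.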
The level of activity at this stage can vary from characterizing the problem and escalating to the in-house team through to end-to-end debugging and working with third-party suppliers to resolve the issue.
Monitoring data health
Few data platforms have perfect data at all times, and it is essential for data analysts and data scientists to know where there might be issues. As part of a data operations service, we will typically provide a data health dashboard that can be embedded in reporting tools like Tableau to give the entire organization a single view of the health of the data in the data lake.
Such a dashboard covers how recently each data source was updated, whether there were outages or days with missing data, and whether there are issues with specific fields from a data provider. This granular view of data health is critical to building trust in data across the organization. Particularly where datasets have been unreliable in the past, it can be slow to establish credibility amongst data scientists, who will often stick to data they know rather than adopt new datasets they fear might not be correct.
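The freshness and missing-days figures behind such a dashboard can be sketched as follows (assuming, for illustration, one delivery per source per day):

```python
from datetime import date, timedelta

def missing_days(received, start, end):
    """Given the set of dates on which a source delivered data,
    return the dates in [start, end] with no delivery at all."""
    day, gaps = start, []
    while day <= end:
        if day not in received:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps

def health_summary(deliveries, start, end):
    """`deliveries` maps source name -> set of dates with data.
    Returns source -> (most recent update, missing days) for the window."""
    return {
        src: (max(days) if days else None, missing_days(days, start, end))
        for src, days in deliveries.items()
    }
```

A reporting tool like Tableau would then render this per-source summary as freshness indicators and gap calendars.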
Optimizing Data Storage
Although AWS storage is fully managed, both S3 and Redshift involve complexities and significant trade-offs between performance, availability, and cost, and each of these services requires proactive management at the data level to ensure performance at scale.
For S3, we ensure that the most cost-effective storage classes are used as the data grows. We set up appropriate transformations for incoming data that needs a different file size or format for querying or security reasons. For example, some data is best stored as many small gzip files, while other data performs better as larger Parquet files. We also optimize Athena catalogs for query performance so that internal data science teams can focus on data science rather than fiddling with data structures.
For Redshift, we provide a full DBA service. This service includes checking and optimizing your users' queries daily. Our standard is that they all run in under 5 minutes, and we find this is realistic for over 95% of queries on most datasets, even when using the lower-cost Redshift options. Our data engineering team will propose and implement schema updates to optimize cost and performance, including offloading data to S3 and accessing it through Redshift Spectrum where necessary. We also handle all regular maintenance jobs such as VACUUM and encoding updates.
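The daily check against the 5-minute target can be sketched as below; in practice the runtimes would come from Redshift's system tables (e.g. STL_QUERY) rather than an in-memory list:

```python
SLA_SECONDS = 5 * 60  # the 5-minute target described above

def sla_report(query_runtimes, sla=SLA_SECONDS):
    """`query_runtimes` is an iterable of (query_id, seconds).
    Returns the queries breaching the SLA and the fraction meeting it."""
    runs = list(query_runtimes)
    offenders = [(qid, secs) for qid, secs in runs if secs > sla]
    compliance = 1 - len(offenders) / len(runs) if runs else 1.0
    return offenders, compliance
```

Queries flagged here would then be candidates for rewriting, sort/distribution key changes, or offloading to Redshift Spectrum.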
Keeping data secure
Ensuring data is secure requires regular review to confirm that only necessary data is loaded into the platform, that it is retained only for as long as it is needed, that only authorized people can access it, and that they access it only for purposes to which the owners of the data have consented.
The services we include under data operations are deliberately limited to a small subset of data security, but they cover three areas that are typically not well served by broader data security initiatives. Firstly, we audit data access. AWS provides the tools to record all data access, but the results can be hard to interpret. We assemble this into a profile for each user and report on it at a summary level. Unusual behavior is flagged for investigation.
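A minimal sketch of that summarization step (the three-sigma threshold is purely illustrative): aggregate raw access events, such as parsed S3 server access logs or CloudTrail entries, into a per-user count and flag users far above the group's typical level.

```python
from collections import Counter
from statistics import mean, pstdev

def access_profile(events):
    """`events` is an iterable of (user, object_key) access records.
    Returns a per-user access count -- a crude summary-level profile."""
    return Counter(user for user, _ in events)

def flag_unusual(profile, sigmas=3):
    """Flag users whose access count is more than `sigmas` standard
    deviations above the mean: an illustrative anomaly rule, not a
    substitute for a real behavioral model."""
    counts = list(profile.values())
    if len(counts) < 2:
        return []
    mu, sd = mean(counts), pstdev(counts)
    return [u for u, c in profile.items() if sd and c > mu + sigmas * sd]
```

Flagged users would go to a human investigator; the point of the profile is to make a large raw log interpretable at a glance.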
Secondly, we audit personally identifiable information (PII) to ensure that it is suitably tokenized or anonymized at all stages and that only authorized users have access to the untokenized data. If a data provider starts including PII where they had not previously provided it, your data lake can quickly become a liability.
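One common tokenization approach (a sketch of the general technique, not our specific scheme) is keyed hashing: the same PII value always maps to the same opaque token, so joins still work, but the mapping cannot be reversed or brute-forced without the key.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; in practice this would live
# in a secrets manager (e.g. AWS Secrets Manager), never in code.
TOKEN_KEY = b"replace-with-a-real-secret"

def tokenize(value, key=TOKEN_KEY):
    """Deterministically replace a PII value (e.g. an email address)
    with an opaque token. HMAC-SHA256 rather than a bare hash, so the
    token cannot be brute-forced from guessed inputs without the key."""
    return hmac.new(key, value.lower().encode(), hashlib.sha256).hexdigest()
```

Because the output is deterministic, analysts can still count or join on the token; only holders of the key can link tokens back to individuals.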
Finally, we can implement the “right to be forgotten” process mandated under GDPR. For this, all records relating to an individual need to be deleted from the data platform and from any third parties to which data has been sent. With extensive data stores across S3 and Redshift, this can be a lengthy task and is well suited to outsourcing.
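The core of the deletion pass can be sketched as a filter applied to every store; here it runs over in-memory records, while in practice the same logic drives S3 objects being rewritten in place and DELETE statements in Redshift (the `user_id` field name is illustrative).

```python
def forget_user(records, user_id, id_field="user_id"):
    """Return the records with every row belonging to `user_id`
    removed, plus a count of deletions for the audit trail that a
    GDPR erasure request requires."""
    kept = [r for r in records if r.get(id_field) != user_id]
    return kept, len(records) - len(kept)
```

The deletion count per store matters as much as the deletion itself: it is the evidence that the erasure request was actually fulfilled.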
In-house or outsource?
On AWS, it is straightforward to build out a data platform that can scale rapidly and efficiently, but it does need proactive management. Rather than dedicating valuable internal resources to data operations, many companies are looking to outsource them.
This is partly because it is the kind of activity that benefits from a diversity of experience. At Deductive, we process many different data types, and our data engineering team has more experience with many of the operational issues than most in-house teams. Data operations also benefit from economies of scale: we can afford a team that is continuously learning the new tools and services available from AWS, which can be hard for many internal teams to keep up with.
Finally, there are few off-the-shelf tools available for data operations. Our investment in the Pipeline API has been driven by our need for tools we can use when we run data pipelines for our clients. Only the right combination of skills, tools, and experience can make data operations efficient, and that either requires enough scale to build a strong data operations team in-house or an experienced partner to outsource to.
Increasingly, we are finding companies are turning to us to help.