Amazon Web Services (AWS) has made its fully managed Lake Formation service generally available. The platform is designed to help organizations build, secure, and manage their data lakes.

Lake Formation, which was initially announced at the AWS re:Invent show late last year, is built on AWS’ Glue extract, transform, and load (ETL) service. It automates the provisioning and configuring of storage; crawls the data to extract schema and metadata tags; automatically optimizes the partitioning of the data; and transforms the data into formats like Apache Parquet and ORC for easier analytics.
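
To give a rough sense of the Glue underpinnings, that schema-discovery step can be driven through the AWS SDK. The sketch below uses Python's boto3; the crawler name, IAM role, database, and S3 path are all hypothetical placeholders.

    import boto3

    glue = boto3.client("glue")

    # Define a crawler that scans a raw S3 prefix and records the discovered
    # schema and metadata in the Glue Data Catalog (hypothetical names).
    glue.create_crawler(
        Name="raw-logs-crawler",
        Role="arn:aws:iam::123456789012:role/GlueServiceRole",
        DatabaseName="datalake_raw",
        Targets={"S3Targets": [{"Path": "s3://example-data-lake/raw/logs/"}]},
    )

    # Kick off the crawl; the resulting catalog tables are what Lake Formation
    # and the downstream analytics engines query against.
    glue.start_crawler(Name="raw-logs-crawler")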

Data can be ingested from different sources using predefined templates. The service then automatically classifies and prepares the data, applying an organization's data access policies to govern who can reach it.
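
To illustrate how those access policies are expressed, Lake Formation exposes a grant API in the AWS SDK. Here is a minimal sketch in boto3, with a made-up IAM role, database, and table, that gives a single analyst role read access to one catalog table.

    import boto3

    lf = boto3.client("lakeformation")

    # Grant SELECT on one catalog table to an analyst role
    # (the role ARN, database, and table names are hypothetical).
    lf.grant_permissions(
        Principal={
            "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"
        },
        Resource={"Table": {"DatabaseName": "sales", "Name": "orders"}},
        Permissions=["SELECT"],
    )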

It also provides a centralized point to set up and manage those data access policies, governance, and auditing across AWS' Simple Storage Service (S3) and multiple analytics engines. Those engines include the Amazon Redshift cloud data warehouse and the Athena interactive query service. Support is planned for the AWS EMR big data framework, the QuickSight business intelligence service, and the SageMaker machine learning platform.
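
Once grants are in place, the integrated engines consult them automatically. The sketch below, again with hypothetical bucket and table names, runs an Athena query against a governed catalog table; Athena returns only what the caller's Lake Formation permissions allow.

    import boto3

    athena = boto3.client("athena")

    # Query a Lake Formation-governed table; results land in an S3 bucket
    # the caller controls (bucket, database, and table names are hypothetical).
    resp = athena.start_query_execution(
        QueryString="SELECT customer_id, SUM(total) AS spend "
                    "FROM orders GROUP BY customer_id",
        QueryExecutionContext={"Database": "sales"},
        ResultConfiguration={"OutputLocation": "s3://example-query-results/"},
    )
    print(resp["QueryExecutionId"])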

Data lakes act as central repositories for structured and unstructured data. Ideally, they remove silos between data sources so an organization can access all of them in one place. The challenge, however, is categorizing that mass of data so it stays easy to find and use.

AWS notes that Lake Formation allows an organization to interact with that data using analytics tools and machine learning. It also cleans and deduplicates data using machine learning to improve data consistency and quality. AWS claims that the platform can reduce the time it takes to tap into data lakes from months to just a few days.
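
That deduplication capability surfaces through AWS Glue's FindMatches ML transform. A minimal sketch of creating such a transform via boto3 follows; the table, role, and tuning values are placeholders, and training the transform on labeled example pairs is a separate step not shown here.

    import boto3

    glue = boto3.client("glue")

    # Create a FindMatches ML transform that learns to flag duplicate
    # records in a catalog table (all names and values are hypothetical).
    glue.create_ml_transform(
        Name="dedupe-customers",
        Role="arn:aws:iam::123456789012:role/GlueServiceRole",
        InputRecordTables=[
            {"DatabaseName": "sales", "TableName": "customers"}
        ],
        Parameters={
            "TransformType": "FIND_MATCHES",
            "FindMatchesParameters": {
                "PrimaryKeyColumnName": "customer_id",
                # Bias toward precision: fewer false merges.
                "PrecisionRecallTradeoff": 0.9,
            },
        },
    )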

Customers can access the service for free, only paying for the underlying AWS services being used. It’s available today in a handful of AWS regions, specifically US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland). Additional regions will be “coming soon.”

Great Data Lakes

The use of data lakes has grown alongside the increased use of cloud platforms for storage. Gartner released a report last month predicting that 75 percent of all databases will be deployed on or migrated to the cloud by 2022.

“According to inquiries with Gartner clients, organizations are developing and deploying new applications in the cloud and moving existing assets at an increasing rate, and we believe this will continue to increase,” said Donald Feinberg, distinguished research vice president at Gartner, in the report. “We also believe this begins with systems for data management solutions for analytics (DMSA) use cases — such as data warehousing, data lakes, and other use cases where data is used for analytics, artificial intelligence (AI) and machine learning (ML). Increasingly, operational systems are also moving to the cloud, especially with conversion to the SaaS application model.”

That report noted that AWS and Microsoft accounted for most of the cloud database management services growth last year.

Microsoft has a number of initiatives in this space, including the Open Data Initiative (ODI), a work in progress with SAP and Adobe. That program provides a blueprint for sharing data between the companies’ respective applications and platforms. Its main goal is to lower the barriers between customer experience management silos.

The initiative lets organizations retain ownership and control of their data resources. They can also use artificial intelligence (AI)-driven business processes to glean insight and intelligence from that data. The open concept invites an ecosystem approach, allowing other vendors to tap into and extend the data model.

Specific to the vendors involved, it targets interoperability and data exchange between Microsoft’s Dynamics 365, SAP’s C/4HANA and S/4HANA platforms, and Adobe’s Experience Cloud. The data model uses a common data lake service running on Microsoft’s Azure cloud platform, though the latest update indicates support for “a customer-chosen data lake.”