High-level overview of AWS Glue
Glue is a managed extract, transform, and load service, commonly called an ETL service.
It’s very useful to prepare and transform data for analytics.
It’s a fully serverless service: you just submit the ETL job you want to run, and Glue runs it for you.
For example, say you had data in an S3 bucket or an Amazon RDS database and you wanted to load this into a Redshift data warehouse.
So you could extract it using Glue, then transform it if need be, for example to filter some data or add some columns.
And then you would load the final output data into a Redshift data warehouse.
So all of this happens from within the Glue ETL service.
You just have to write some code, launch your ETL job and off you go.
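To make this concrete, here’s a minimal sketch of what such a Glue ETL script could look like in PySpark. The catalog database, table, connection names, and bucket are hypothetical placeholders, not names from this lecture:

```python
import sys
from awsglue.transforms import Filter
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue job boilerplate
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Extract: read the source table from the Glue Data Catalog (hypothetical names)
orders = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders")

# Transform: keep only completed orders
completed = Filter.apply(frame=orders, f=lambda row: row["status"] == "COMPLETED")

# Load: write the result into Redshift through a pre-defined Glue connection
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=completed,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "orders", "database": "analytics"},
    redshift_tmp_dir="s3://my-temp-bucket/redshift/")

job.commit()
```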
Another example is converting data into the Parquet format.
So why would you do this?
Because Parquet is a columnar data format and, therefore, it performs much better with services like Athena.
So say, for example, that you are inserting files into your S3 bucket in the CSV format.
Then you would use the Glue ETL service to import the CSV files and convert them into the Parquet format from within the Glue service.
Then you would send it into an output S3 bucket.
And once the data is in Parquet format, Amazon Athena is going to analyze these files in a much more efficient fashion.
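Here’s a minimal sketch of that CSV-to-Parquet conversion inside a Glue PySpark script, assuming hypothetical input and output bucket names:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Extract: read the CSV files that were inserted into the input bucket
csv_frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-input-bucket/raw/"]},
    format="csv",
    format_options={"withHeader": True})

# Load: write the same data back out as Parquet into the output bucket
glueContext.write_dynamic_frame.from_options(
    frame=csv_frame,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/parquet/"},
    format="parquet")
```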
Now, to automate this entire process, anytime a file is inserted into the S3 bucket you can send an event notification to a Lambda function, which will then trigger a Glue ETL job.
But you could replace the Lambda function with Amazon EventBridge as well; this would work as an alternative trigger.
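As a sketch, the Lambda function could look like this, assuming a hypothetical Glue job named csv-to-parquet and a hypothetical custom --source_path job argument:

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # One record per object created in the source bucket
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Start the ETL job, passing the new file as a job argument
        glue.start_job_run(
            JobName="csv-to-parquet",
            Arguments={"--source_path": f"s3://{bucket}/{key}"})
```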
There’s another feature of Glue called the Glue Data Catalog, which is used to catalog datasets.
So the Glue Data Catalog will run Glue Data Crawlers, which connect to various data sources such as Amazon S3, Amazon RDS, Amazon DynamoDB, or a compatible JDBC database that you own on-premises, for example.
The Glue Data Crawler is going to crawl these data sources and write all the metadata of your tables, your columns, your data types, and so on into the Glue Data Catalog.
And so it will have all the databases, the tables, and the metadata, and that will be leveraged by the Glue jobs to perform ETL.
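As a sketch, here’s how you could define and run such a crawler with boto3; the crawler name, IAM role, catalog database, and S3 path are all hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Define a crawler that scans an S3 path and writes table metadata
# (schemas, columns, data types) into a Glue Data Catalog database
glue.create_crawler(
    Name="sales-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-input-bucket/raw/"}]})

# Run it; discovered tables land in the Glue Data Catalog
glue.start_crawler(Name="sales-crawler")
```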
Now, when you use Amazon Athena to do data discovery and run SQL queries, behind the scenes Athena is going to leverage the AWS Glue Data Catalog.
So will Amazon Redshift Spectrum, and so will Amazon EMR.
So as you can see, the Glue Data Catalog service is central to many other AWS services.
So, other features that you should know at a high level: first, Glue Job Bookmarks.
This prevents you from reprocessing old data when you run a new ETL job; only the data added since the last run is processed.
So this is very important.
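As a sketch, bookmarks are enabled through a special job argument, shown here when starting a run with boto3 (the job name is hypothetical). Note that the job script itself also needs a transformation_ctx on its reads and a job.commit() call so Glue can record progress:

```python
import boto3

glue = boto3.client("glue")

# Enable bookmarks so only data added since the last successful run is processed
glue.start_job_run(
    JobName="csv-to-parquet",
    Arguments={"--job-bookmark-option": "job-bookmark-enable"})
```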
Then you have Glue Elastic Views, which lets you combine and replicate data across multiple data stores using SQL.
So you would, for example, create a view across an RDS database, an Aurora database, and Amazon S3.
There’s no custom code for this; Glue is going to monitor for changes in the source data, and it’s serverless.
You have Glue DataBrew, which is used to clean and normalize data using pre-built transformations.
You have Glue Studio which is a GUI to create, run and monitor ETL jobs in Glue.
And then you have Glue Streaming ETL, which is built on top of Apache Spark Structured Streaming; instead of running your ETL jobs as batch jobs, you can run them as streaming jobs.
And so, therefore, you can read data with Glue Streaming ETL from Kinesis Data Streams, Apache Kafka, or MSK, which, as we’ll see, is managed Kafka on AWS.
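Here’s a minimal sketch of a Glue Streaming ETL read from Kinesis Data Streams; the stream ARN, bucket paths, and window size are hypothetical:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read a streaming frame from a Kinesis Data Stream
kinesis_frame = glueContext.create_data_frame.from_options(
    connection_type="kinesis",
    connection_options={
        "streamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
        "startingPosition": "TRIM_HORIZON",
        "classification": "json"})

def process_batch(data_frame, batch_id):
    # Transform and write each micro-batch, e.g. as Parquet to S3
    if data_frame.count() > 0:
        data_frame.write.mode("append").parquet("s3://my-output-bucket/stream/")

# Process the stream in 100-second micro-batches, checkpointing to S3
glueContext.forEachBatch(
    frame=kinesis_frame,
    batch_function=process_batch,
    options={"windowSize": "100 seconds",
             "checkpointLocation": "s3://my-temp-bucket/checkpoints/"})
```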