Q41 — AWS DEA-C01 Ch.1
A company needs to set up a data catalog and metadata management for data sources that run in the AWS Cloud. The company will use the data catalog to maintain the metadata of all the objects that are in a set of data stores. The data stores include structured sources such as Amazon RDS and Amazon Redshift. The data stores also include semistructured sources such as JSON files and .xml files that are stored in Amazon S3. The company needs a solution that will update the data catalog on a regular basis. The solution also must detect changes to the source metadata. Which solution will meet these requirements with the LEAST operational overhead?
- A. Use Amazon Aurora as the data catalog. Create AWS Lambda functions that will connect to the data catalog. Configure the Lambda functions to gather the metadata information from multiple sources and to update the Aurora data catalog. Schedule the Lambda functions to run periodically.
- B. Use the AWS Glue Data Catalog as the central metadata repository. Use AWS Glue crawlers to connect to multiple data stores and to update the Data Catalog with metadata changes. Schedule the crawlers to run periodically to update the metadata catalog. ✓
- C. Use Amazon DynamoDB as the data catalog. Create AWS Lambda functions that will connect to the data catalog. Configure the Lambda functions to gather the metadata information from multiple sources and to update the DynamoDB data catalog. Schedule the Lambda functions to run periodically.
- D. Use the AWS Glue Data Catalog as the central metadata repository. Extract the schema for Amazon RDS and Amazon Redshift sources, and build the Data Catalog. Use AWS Glue crawlers for data that is in Amazon S3 to infer the schema and to automatically update the Data Catalog.
Correct Answer: B. Use the AWS Glue Data Catalog as the central metadata repository. Use AWS Glue crawlers to connect to multiple data stores and to update the Data Catalog with metadata changes. Schedule the crawlers to run periodically to update the metadata catalog.
Explanation
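Option B is the best choice. The AWS Glue Data Catalog is built for metadata management across multiple data sources, both structured and semistructured. Glue crawlers connect to each data store automatically and, when run on a schedule, detect metadata changes and update the catalog. This removes most manual configuration and custom code, which lowers operational overhead. Option A (Aurora as the catalog, updated by Lambda functions) and option C (DynamoDB as the catalog, updated by Lambda functions) both require you to write and maintain the metadata-gathering code yourself, so they are less convenient and less efficient than the Glue Data Catalog with crawlers. Option D uses crawlers only for the data in Amazon S3 and relies on separately extracting the schemas for Amazon RDS and Amazon Redshift, so it is less complete than B. B is therefore the answer.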
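To make option B concrete, here is a minimal Python sketch of how a single scheduled crawler covering both S3 and JDBC sources might be defined with the boto3 Glue client. The crawler name, role ARN, database name, connection names, paths, and the daily schedule are all hypothetical placeholders, not values from the question. The `DEPRECATE_IN_DATABASE` delete behavior is just one possible setting.

```python
def build_crawler_request(name, role_arn, database, s3_paths, jdbc_targets,
                          schedule="cron(0 2 * * ? *)"):
    """Build keyword arguments for the Glue create_crawler call.

    jdbc_targets is a list of (connection_name, include_path) pairs that refer
    to Glue connections assumed to exist already (for example, one for Amazon
    RDS and one for Amazon Redshift).
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            # Semistructured files (JSON, XML) stored in Amazon S3.
            "S3Targets": [{"Path": path} for path in s3_paths],
            # Structured sources reached through JDBC connections.
            "JdbcTargets": [
                {"ConnectionName": conn, "Path": path}
                for conn, path in jdbc_targets
            ],
        },
        # Run periodically; this assumed default is daily at 02:00 UTC.
        "Schedule": schedule,
        # Write detected schema changes back to the Data Catalog.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "DEPRECATE_IN_DATABASE",
        },
    }


def create_crawler(request):
    """Send the request to AWS Glue (requires boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    request = build_crawler_request(
        name="catalog-crawler",
        role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
        database="company_catalog",
        s3_paths=["s3://example-bucket/json/", "s3://example-bucket/xml/"],
        jdbc_targets=[("rds-connection", "salesdb/%"),
                      ("redshift-connection", "dev/public/%")],
    )
    print(request["Schedule"])
```

Separating the request builder from the API call keeps the configuration easy to inspect without touching AWS. The schedule and schema change policy are the parts that correspond to the "update on a regular basis" and "detect changes to the source metadata" requirements.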