Q38 — AWS DEA-C01 Ch.1
A data engineer has a one-time task to read data from objects that are in Apache Parquet format in an Amazon S3 bucket. The data engineer needs to query only one column of the data. Which solution will meet these requirements with the LEAST operational overhead?
- A. Configure an AWS Lambda function to load data from the S3 bucket into a pandas DataFrame. Write a SQL SELECT statement on the DataFrame to query the required column.
- B. Use S3 Select to write a SQL SELECT statement to retrieve the required column from the S3 objects. ✓
- C. Prepare an AWS Glue DataBrew project to consume the S3 objects and to query the required column.
- D. Run an AWS Glue crawler on the S3 objects. Use a SQL SELECT statement in Amazon Athena to query the required column.
Correct Answer: B. Use S3 Select to write a SQL SELECT statement to retrieve the required column from the S3 objects.
Explanation
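Option B is the best choice. S3 Select is designed to query only the required columns directly from objects in an S3 bucket (such as data in Apache Parquet format) without loading the entire dataset, so it carries the least operational overhead. Option A uses an AWS Lambda function and loads the data into a pandas DataFrame before querying, which is comparatively complex and may add significant overhead. Option C requires a fair amount of preparation to set up an AWS Glue DataBrew project. Option D runs an AWS Glue crawler first and then queries in Amazon Athena, which involves more steps and more overhead. Overall, option B meets the requirement with the least operational overhead.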
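As a rough illustration of option B, the sketch below shows how a single column might be pulled from a Parquet object with the boto3 `select_object_content` call. It assumes S3 Select is available in the account and that the caller passes in an S3 client; the helper names `build_select_request` and `read_column`, and any bucket, key, or column names used with them, are hypothetical and not part of the question.

```python
import json


def build_select_request(bucket, key, column):
    """Build keyword arguments for the S3 SelectObjectContent call."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        # Double quotes keep the column name case-sensitive.
        "Expression": f'SELECT s."{column}" FROM S3Object s',
        # The object is read as Parquet; results come back as JSON lines.
        "InputSerialization": {"Parquet": {}},
        "OutputSerialization": {"JSON": {"RecordDelimiter": "\n"}},
    }


def read_column(s3_client, bucket, key, column):
    """Return the values of one column from a Parquet object via S3 Select.

    s3_client is expected to behave like boto3.client("s3").
    """
    response = s3_client.select_object_content(
        **build_select_request(bucket, key, column)
    )
    # The response payload is a stream of events. Record chunks can split
    # a JSON line in the middle, so join all chunks before parsing.
    chunks = [
        event["Records"]["Payload"]
        for event in response["Payload"]
        if "Records" in event
    ]
    text = b"".join(chunks).decode("utf-8")
    # Assumes each output record is keyed by the selected column name.
    return [json.loads(line)[column] for line in text.splitlines() if line]
```

With boto3 installed and credentials that allow `s3:GetObject`, a call such as `read_column(boto3.client("s3"), "example-bucket", "data/file.parquet", "customer_id")` would be the whole task: no crawler, catalog, Lambda deployment, or DataBrew project has to be set up first, which is the point the explanation makes about operational overhead.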