Q32 — AWS DEA-C01 Ch.1

A data engineer must ingest a source of structured data that is in .csv format into an Amazon S3 data lake. The .csv files contain 15 columns. Data analysts need to run Amazon Athena queries on one or two columns of the dataset. The data analysts rarely query the entire file. Which solution will meet these requirements MOST cost-effectively?

Correct Answer: D. Create an AWS Glue extract, transform, and load (ETL) job to read from the .csv structured data source. Configure the job to write the data into the data lake in Apache Parquet format.

Explanation

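The data analysts usually query only one or two of the 15 columns and rarely read the entire file, so the storage format is the main lever for reducing cost. Athena bills on-demand queries by the amount of data scanned, and a query against a .csv file has to scan every row in full even when it needs only one column. Apache Parquet is a columnar storage format that compresses well and is optimized for read-only queries that touch a subset of columns, so Athena reads only the columns a query references and scans far less data. Creating an AWS Glue ETL job that reads the .csv source and writes it to the data lake in Apache Parquet format is therefore the most cost-effective solution. The answer is D.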
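The scan-size argument can be illustrated with a small standard-library sketch. It is a toy model, not real Parquet: the synthetic values, the helper names (`build_rows`, `csv_bytes_scanned`, `columnar_bytes_scanned`), and the one-chunk-per-column layout are all assumptions made for illustration, and it ignores compression, encoding, and file metadata.

```python
import csv
import io

NUM_COLUMNS = 15   # matches the 15 columns in the question
NUM_ROWS = 1000    # arbitrary size, chosen only for illustration


def build_rows(num_rows=NUM_ROWS, num_columns=NUM_COLUMNS):
    """Generate synthetic rows; the values are made up for this sketch."""
    return [[f"r{r}c{c}" for c in range(num_columns)] for r in range(num_rows)]


def csv_bytes_scanned(rows):
    """Row-oriented layout: a query must read every byte of the file,
    no matter how few columns it needs."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return len(buf.getvalue().encode("utf-8"))


def columnar_bytes_scanned(rows, wanted_columns):
    """Toy columnar layout: each column is stored as its own chunk,
    so only the chunks for the requested columns are read."""
    total = 0
    for c in wanted_columns:
        chunk = "\n".join(row[c] for row in rows)
        total += len(chunk.encode("utf-8"))
    return total


rows = build_rows()
full_scan = csv_bytes_scanned(rows)
two_column_scan = columnar_bytes_scanned(rows, wanted_columns=[0, 1])

print(f"CSV scan: {full_scan} bytes")
print(f"Columnar scan of 2 of {NUM_COLUMNS} columns: {two_column_scan} bytes")
print(f"Fraction read: {two_column_scan / full_scan:.2%}")
```

With columns of similar width, a two-column query reads roughly 2/15 of the data in the columnar layout. Real Parquet files typically shrink the scan further through per-column compression and encoding.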