Q45 — AWS SAP-C02 Ch.3


Q270. A company is collecting a large amount of data from a fleet of IoT devices. Data is stored as Optimized Row Columnar (ORC) files in the Hadoop Distributed File System (HDFS) on a persistent Amazon EMR cluster. The company's data analytics team queries the data by using SQL in Apache Presto deployed on the same EMR cluster. Queries scan large amounts of data, always run for less than 15 minutes, and run only between 5 PM and 10 PM. The company is concerned about the high cost associated with the current solution. A solutions architect must propose the most cost-effective solution that will allow SQL data queries. Which solution will meet these requirements?

Correct Answer: B. Store data in Amazon S3. Use the AWS Glue Data Catalog and Amazon Athena to query data.
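
As a rough illustration of what option B involves, the sketch below builds the kind of DDL statement that registers ORC files in S3 as an external table, which Athena can then query with SQL. The database, table, column, and bucket names are hypothetical placeholders, and the helper function is only an illustration, not part of any AWS SDK.

```python
# Sketch only: every name below (database, table, columns, bucket) is a
# hypothetical placeholder chosen for illustration.

def build_orc_table_ddl(database: str, table: str, columns: dict, s3_location: str) -> str:
    """Return a CREATE EXTERNAL TABLE statement for ORC files stored in S3."""
    # One "`name` type" entry per column, joined with commas.
    column_lines = ",\n".join(
        f"  `{name}` {sql_type}" for name, sql_type in columns.items()
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"{column_lines}\n"
        ")\n"
        "STORED AS ORC\n"
        f"LOCATION '{s3_location}'"
    )


ddl = build_orc_table_ddl(
    database="iot_analytics",                      # hypothetical database name
    table="device_readings",                       # hypothetical table name
    columns={
        "device_id": "string",
        "reading_time": "timestamp",
        "temperature": "double",
    },
    s3_location="s3://example-iot-data/readings/",  # hypothetical bucket and prefix
)
print(ddl)
```

Running a statement like this in Athena stores the table definition in the AWS Glue Data Catalog. The ORC files themselves stay in S3 and are read only when a query runs.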

Explanation

Storing the data in Amazon S3 and querying it through the AWS Glue Data Catalog and Amazon Athena is the most cost-effective approach because it removes the always-on cluster entirely.

Amazon Athena is serverless and runs standard SQL directly against data in S3, including ORC files. It is billed per query based on the amount of data scanned, so nothing is charged for compute while no queries are running.

The AWS Glue Data Catalog provides a central repository for table metadata, so Athena and other services can share the same schema definitions, which keeps queries consistent.

Amazon S3 offers scalable, low-cost storage along with strong security features such as encryption and access control.

Since queries run only between 5 PM and 10 PM, the persistent EMR cluster sits idle for the other 19 hours of each day while still incurring cost. With Athena there is no cluster to start, stop, or maintain.

Option A proposes storing the data in Amazon S3 and querying it with Amazon Redshift Spectrum. Redshift Spectrum is a feature of Amazon Redshift, so it still requires paying for Redshift compute in addition to the data scanned. For short queries confined to a five-hour window, that is not the most affordable option.

Option C proposes storing the data in EMRFS and using Presto to query the data in Amazon EMR. This moves storage to S3 but keeps the EMR cluster and its operating costs, which are the main expense the company wants to reduce.

Option D recommends using Amazon Redshift to store the data in a columnar database. This means loading the data into a Redshift cluster and paying to maintain it, which is costly for a workload that runs only five hours a day.
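
The cost reasoning above can be sketched with some simple arithmetic. The 5-hour query window comes from the question. Every price and volume in the example calls is a hypothetical placeholder, not a quoted AWS rate, so check current pricing before drawing conclusions.

```python
# Sketch only: all rates and volumes passed in below are hypothetical
# placeholders, not actual AWS prices.

def idle_fraction(window_hours: float, hours_per_day: float = 24) -> float:
    """Fraction of each day an always-on cluster runs outside the query window."""
    return 1 - window_hours / hours_per_day


def persistent_cluster_monthly_cost(hourly_rate: float, days: int = 30) -> float:
    """An always-on cluster is billed for every hour, whether or not it is queried."""
    return hourly_rate * 24 * days


def per_scan_monthly_cost(tb_scanned_per_query: float, queries_per_day: int,
                          price_per_tb: float, days: int = 30) -> float:
    """A pay-per-scan service is billed only for the data its queries read."""
    return tb_scanned_per_query * queries_per_day * days * price_per_tb


# Queries run only between 5 PM and 10 PM, i.e. a 5-hour window.
print(f"Cluster idle share of the day: {idle_fraction(5):.0%}")

# Hypothetical inputs: a cluster costing 4.0 per hour, versus 20 queries a day
# that each scan 0.5 TB at an assumed 5.0 per TB scanned.
cluster_cost = persistent_cluster_monthly_cost(hourly_rate=4.0)
scan_cost = per_scan_monthly_cost(tb_scanned_per_query=0.5, queries_per_day=20,
                                  price_per_tb=5.0)
print(f"Always-on cluster: {cluster_cost:.2f} per month")
print(f"Pay-per-scan:      {scan_cost:.2f} per month")
```

The comparison depends heavily on how much data each query scans. If scan volume is very high, per-scan billing can exceed the cost of a cluster, which is why columnar formats such as ORC and partitioned data matter for keeping the scanned volume down.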