AWS Certified Machine Learning Engineer - Associate

Question 1 of 10

A machine learning team receives 3 TB of clickstream logs each day from an on-premises data center. The raw files arrive hourly as compressed JSON. Data scientists repeatedly query the data in Amazon Athena, filtering by event_date and region, and also run Amazon SageMaker Processing jobs for feature engineering. The team wants to minimize query cost and avoid repeatedly scanning the full raw JSON files.

Which ingestion and storage design best meets these requirements?

Stream the logs into Amazon SQS and configure Athena to query messages directly from the queue.

Land the raw JSON in Amazon S3, convert it to Apache Parquet using AWS Glue or an Amazon EMR/SageMaker Processing job, and write it to an S3 curated zone partitioned by event_date and region.

Upload the compressed JSON files to Amazon EFS and mount the file system from SageMaker notebooks for all analysis.

Store all incoming JSON records directly in Amazon RDS for PostgreSQL and create indexes on event_date and region.
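
For reference, the partitioned layout mentioned in the options (an S3 curated zone partitioned by event_date and region) can be sketched as Hive-style key=value prefixes, which Athena can use to skip partitions that a query's filters exclude. This is only an illustrative sketch: the root prefix `curated/clickstream`, the helper names, and the sample records are assumptions, and the code only computes key prefixes in memory rather than writing Parquet or calling any AWS API.

```python
from collections import defaultdict

# Hypothetical root prefix for the curated zone; not taken from the question.
CURATED_ROOT = "curated/clickstream"


def partition_prefix(event_date: str, region: str, root: str = CURATED_ROOT) -> str:
    """Build a Hive-style partition prefix (key=value path segments)."""
    return f"{root}/event_date={event_date}/region={region}/"


def group_by_partition(records):
    """Group records by the partition prefix they would be written under."""
    groups = defaultdict(list)
    for rec in records:
        groups[partition_prefix(rec["event_date"], rec["region"])].append(rec)
    return dict(groups)


# Made-up sample records, for illustration only.
records = [
    {"event_date": "2024-05-01", "region": "us-east-1", "event": "click"},
    {"event_date": "2024-05-01", "region": "eu-west-1", "event": "view"},
    {"event_date": "2024-05-02", "region": "us-east-1", "event": "click"},
]

partitions = group_by_partition(records)
for prefix in sorted(partitions):
    print(prefix, len(partitions[prefix]))
```

A query that filters on one event_date and one region would then only need to read objects under a single prefix, instead of scanning every raw file.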