A machine learning team receives 3 TB of clickstream logs each day from an on-premises data center. The raw files arrive hourly as compressed JSON. Data scientists repeatedly query the data by event_date and region in Amazon Athena and also run Amazon SageMaker Processing jobs for feature engineering. The team wants to minimize query cost and avoid repeatedly scanning the full raw JSON files.
Which ingestion and storage design best meets these requirements?