    Optimize Amazon EMR runtime for Apache Spark with EMR S3A

    By big tee tech hub | September 25, 2025 | 9 min read


    With the Amazon EMR 7.10 runtime, Amazon EMR has introduced EMR S3A, an improved implementation of the open source S3A file system connector. This enhanced connector is now automatically set as the default S3 file system connector for Amazon EMR deployment options, including Amazon EMR on EC2, Amazon EMR Serverless, Amazon EMR on Amazon EKS, and Amazon EMR on AWS Outposts, maintaining complete API compatibility with open source Apache Spark.

    In the Amazon EMR 7.10 runtime for Apache Spark, the EMR S3A connector performs comparably to EMRFS for read workloads, as demonstrated by the TPC-DS query benchmark. The connector's most significant gains are in write operations, with a 7% improvement for static partition overwrites and a 215% improvement for dynamic partition overwrites compared to EMRFS. In this post, we showcase the read and write performance of the Amazon EMR 7.10.0 runtime for Apache Spark with EMR S3A compared to EMRFS and the open source S3A file system connector.

    Read workload performance comparison

    To evaluate read performance, we used a test environment based on Amazon EMR runtime version 7.10.0 running Spark 3.5.5 and Hadoop 3.4.1. Our testing infrastructure was an Amazon Elastic Compute Cloud (Amazon EC2) cluster of nine r5d.4xlarge instances. The primary node has 16 vCPUs and 128 GB of memory, and the eight core nodes have a total of 128 vCPUs and 1024 GB of memory.

    The performance evaluation was conducted using a comprehensive testing methodology designed to provide accurate and meaningful results. For the source data, we chose the 3 TB scale factor, which contains 17.7 billion records, approximately 924 GB of compressed data partitioned in Parquet file format. The setup instructions and technical details can be found in the GitHub repository. We used Spark’s in-memory data catalog to store metadata for TPC-DS databases and tables.

    To produce a fair and accurate comparison between the EMR S3A, EMRFS, and open source S3A implementations, we used a three-phase testing approach:

    • Phase 1: Baseline performance:
      • Established a baseline using default Amazon EMR configuration with EMR’s S3A connector
      • Created a reference point for subsequent comparisons
    • Phase 2: EMRFS analysis:
      • Maintained the default file system as EMRFS
      • Preserved other configuration settings
    • Phase 3: Open source S3A testing:
      • Modified only the hadoop-aws.jar file by replacing it with the open source Hadoop S3A 3.4.1 version
      • Maintained identical configurations across other components

    This controlled testing environment was crucial for our evaluation for the following reasons:

    • We could isolate the performance impact specifically to the S3A connector implementation
    • It removed potential variables that could skew the results
    • It provided accurate measurements of performance improvements between Amazon’s S3A implementation and the open source alternative

    Test execution and results

    Throughout the testing process, we maintained consistency in test conditions and configurations, making sure any observed performance differences could be directly attributed to the S3A connector implementation variations. A total of 104 SparkSQL queries were run in 10 iterations sequentially, and an average of each query’s runtime in these 10 iterations was used for comparison. The average of the 10 iterations’ runtime on the Amazon EMR 7.10 runtime for Apache Spark with EMR S3A was 1116.87 seconds, which is 1.08 times faster than open source S3A and comparable with EMRFS. The following figure illustrates the total runtime in seconds.

    [Figure: total runtime in seconds for OSS S3A, EMRFS, and EMR S3A]

    The following table summarizes the metrics.

    Metric                                 | OSS S3A | EMRFS   | EMR S3A
    Average runtime in seconds             | 1208.26 | 1129.64 | 1116.87
    Geometric mean over queries in seconds | 7.63    | 7.09    | 6.99
    Total cost *                           | $6.53   | $6.40   | $6.15

    *Detailed cost estimates are discussed later in this post.
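The two runtime metrics in the table can be derived from per-query timings roughly as in the following Python sketch. It assumes each query is first averaged over its iterations, which is one reading of the methodology described above. The function name `summarize` and the sample timings are hypothetical and are not benchmark data. Only the 1208.26 and 1116.87 second figures come from the table.

```python
import math
from statistics import mean

def summarize(runs):
    """runs maps a query name to its runtimes in seconds, one per iteration."""
    # Average each query over its iterations first.
    per_query = {q: mean(times) for q, times in runs.items()}
    # Summing the per-query averages gives the average total runtime.
    total = sum(per_query.values())
    # Geometric mean over queries: exp of the mean of the logs.
    geomean = math.exp(mean([math.log(t) for t in per_query.values()]))
    return total, geomean

# Hypothetical timings for two queries over two iterations.
total, geomean = summarize({"q1": [2.0, 2.0], "q2": [7.0, 9.0]})
print(total, geomean)

# Speedup of EMR S3A over open source S3A, from the table's average runtimes.
speedup = 1208.26 / 1116.87
print(round(speedup, 2))  # 1.08
```

The geometric mean is less sensitive to a few long-running queries than the total, which is why benchmarks often report both.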

    The following chart shows the per-query performance improvement of EMR S3A relative to open source S3A on the Amazon EMR 7.10 runtime for Apache Spark. The speedup varies from one query to another, reaching up to 1.51 times for q3. The horizontal axis arranges the TPC-DS 3 TB benchmark queries in descending order of the performance improvement seen with EMR S3A, and the vertical axis shows the magnitude of this speedup as a ratio.

    [Figure: per-query speedup of EMR S3A relative to open source S3A, TPC-DS 3 TB]
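As a rough illustration of how such a chart can be prepared, the sketch below computes a speedup ratio per query and sorts the queries in descending order of that ratio. The helper `speedups`, the query names, and the timings are all hypothetical. The benchmark's per-query data is not reproduced in this post.

```python
def speedups(baseline, candidate):
    """Return (query, baseline / candidate) pairs, largest speedup first."""
    ratios = {q: baseline[q] / candidate[q] for q in baseline}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)

# Hypothetical per-query averages in seconds; not benchmark data.
oss_s3a = {"qA": 12.0, "qB": 9.0, "qC": 5.0}
emr_s3a = {"qA": 10.0, "qB": 6.0, "qC": 5.0}

ranked = speedups(oss_s3a, emr_s3a)
for query, ratio in ranked:
    print(f"{query}: {ratio:.2f}x")
```

A ratio above 1 means the candidate connector was faster for that query, and a ratio of 1 means no difference.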

    Read cost comparison

    Our benchmark outputs the total runtime and geometric mean figures to measure the Spark runtime performance. The cost metric can provide us with additional insights. Cost estimates are computed using the following formulas. They factor in Amazon EC2, Amazon Elastic Block Store (Amazon EBS), and Amazon EMR costs, but don’t include Amazon Simple Storage Service (Amazon S3) GET and PUT costs.

    • Amazon EC2 cost (includes SSD cost) = number of instances * r5d.4xlarge hourly rate * job runtime in hours
      • r5d.4xlarge hourly rate = $1.152 per hour
    • Root Amazon EBS cost = number of instances * Amazon EBS per GB-hourly rate * root EBS volume size * job runtime in hours
    • Amazon EMR cost = number of instances * r5d.4xlarge Amazon EMR cost * job runtime in hours
      • r5d.4xlarge Amazon EMR cost = $0.27 per hour
    • Total cost = Amazon EC2 cost + root Amazon EBS cost + Amazon EMR cost

    The following table summarizes these costs.

    Metric                  | EMRFS    | EMR S3A | OSS S3A
    Runtime in hours        | 0.5      | 0.48    | 0.51
    Number of EC2 instances | 9        | 9       | 9
    Amazon EBS size         | 0 GB     | 0 GB    | 0 GB
    Amazon EC2 cost         | $5.18    | $4.98   | $5.29
    Amazon EBS cost         | $0.00    | $0.00   | $0.00
    Amazon EMR cost         | $1.22    | $1.17   | $1.24
    Total cost              | $6.40    | $6.15   | $6.53
    Cost savings            | Baseline | EMR S3A is 1.04 times better than EMRFS | EMR S3A is 1.06 times better than OSS S3A
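The cost formulas can be checked against the table with a short Python sketch. The hourly rates, instance count, and runtimes are taken from this post. The function names, the choice to round each component to cents before summing, and the zero default for the Amazon EBS rate (the post does not state one, and the volume size is 0 GB) are assumptions made for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

EC2_RATE = Decimal("1.152")  # r5d.4xlarge hourly rate in USD
EMR_RATE = Decimal("0.27")   # r5d.4xlarge Amazon EMR hourly cost in USD
INSTANCES = 9

def cents(amount):
    """Round a USD amount to two decimals, with halves rounding up."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

def job_cost(hours, ebs_gb=0, ebs_gb_hour_rate=Decimal("0")):
    """Return (EC2, EBS, EMR, total) costs, each component rounded to cents."""
    hours = Decimal(str(hours))
    ec2 = cents(INSTANCES * EC2_RATE * hours)
    ebs = cents(INSTANCES * ebs_gb_hour_rate * ebs_gb * hours)
    emr = cents(INSTANCES * EMR_RATE * hours)
    return ec2, ebs, emr, ec2 + ebs + emr

# Runtimes in hours, as listed in the table.
costs = {name: job_cost(h) for name, h in
         [("EMRFS", "0.5"), ("EMR S3A", "0.48"), ("OSS S3A", "0.51")]}
for name, (ec2, ebs, emr, total) in costs.items():
    print(name, ec2, ebs, emr, total)

# Cost savings ratios relative to EMR S3A.
base = costs["EMR S3A"][3]
print(cents(costs["EMRFS"][3] / base), cents(costs["OSS S3A"][3] / base))
```

Using `Decimal` instead of floating point keeps the half-cent cases (such as 9 * 0.27 * 0.5 = 1.215) rounding the way the table does.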

    Write workload performance comparison

    We conducted benchmark tests to assess the write performance of the Amazon EMR 7.10 runtime for Apache Spark.

    Static table/partition overwrite

    We evaluated the static table/partition overwrite performance of the different file systems by executing the following INSERT OVERWRITE Spark SQL query. The SELECT * FROM range(...) clause generated data at execution time. This produced approximately 15 GB of data across exactly 100 Parquet files in Amazon S3.

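    -- Number of rows to generate (4 billion)
    SET rows=4e9;
    -- Number of output partitions, one Parquet file each
    SET partitions=100;
    -- Overwrite a per-trial S3 prefix with the generated rows
    INSERT OVERWRITE DIRECTORY 's3://${bucket}/perf-test/${trial_id}'
    USING PARQUET SELECT * FROM range(0, ${rows}, 1, ${partitions});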

    The test environment was configured as follows:

    • EMR cluster with emr-7.10.0 release label
    • Single m5d.2xlarge instance (primary group)
    • Eight m5d.2xlarge instances (core group)
    • S3 bucket in the same AWS Region as the EMR cluster
    • The trial_id property used a UUID generator to avoid conflict between test runs
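The post does not say how the bucket and trial_id variables were supplied to the query. One possible approach, sketched below in Python, renders the statement on the client with a fresh UUID per trial so that no two trials write to the same S3 prefix. The function `render_trial` and the bucket name `example-bucket` are hypothetical, and the row count is written as an integer rather than 4e9.

```python
import uuid
from string import Template

# Same statement as above, with ${...} placeholders filled in client-side.
STATEMENT = Template(
    "INSERT OVERWRITE DIRECTORY 's3://${bucket}/perf-test/${trial_id}' "
    "USING PARQUET SELECT * FROM range(0, ${rows}, 1, ${partitions})"
)

def render_trial(bucket, rows=4_000_000_000, partitions=100):
    """Return (trial_id, sql) with a fresh UUID so trials never share a prefix."""
    trial_id = str(uuid.uuid4())
    sql = STATEMENT.substitute(
        bucket=bucket, trial_id=trial_id, rows=rows, partitions=partitions
    )
    return trial_id, sql

# "example-bucket" is a placeholder bucket name.
trial_a, sql_a = render_trial("example-bucket")
trial_b, sql_b = render_trial("example-bucket")
print(sql_a)
```

The rendered string could then be submitted through whatever Spark SQL entry point the test harness uses.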

    Results

    After running 10 trials for each file system, we captured and summarized query runtimes in the following chart. EMR S3A averaged 26.4 seconds, whereas EMRFS and open source S3A averaged 28.4 seconds and 31.4 seconds, making EMR S3A 1.07 times and 1.19 times faster, respectively.

    [Figure: average static table/partition overwrite runtime in seconds by file system]

    Dynamic partition overwrite

    We also evaluated write performance by executing the following INSERT OVERWRITE dynamic partition Spark SQL query, which joins the web_sales and date_dim tables from the TPC-DS 3 TB partitioned Parquet data. The query inserts approximately 2,100 partitions, each containing one Parquet file, with a combined size of approximately 31.2 GB in Amazon S3.

    SET spark.sql.sources.partitionOverwriteMode=DYNAMIC;
    -- Replace <target_table> with the partitioned destination table.
    INSERT OVERWRITE TABLE <target_table> PARTITION(wsdt_year, wsdt_month, wsdt_day)
    SELECT ws_order_number, ws_quantity, ws_list_price, ws_sales_price,
           ws_net_paid_inc_ship_tax, ws_net_profit,
           dt.d_year AS wsdt_year, dt.d_moy AS wsdt_month, dt.d_dom AS wsdt_day
    FROM web_sales, date_dim dt
    WHERE ws_sold_date_sk = d_date_sk;

    The test environment was configured as follows:

    • EMR cluster with emr-7.10.0 release label
    • Single r5d.4xlarge instance (primary group)
    • Five r5d.4xlarge instances (core group)
    • Approximately 2,100 partitions with one Parquet file each
    • Combined size of approximately 31.2 GB in Amazon S3

    Results

    After running 10 trials for each file system, we captured and summarized query runtimes in the following chart. EMR S3A averaged 90.9 seconds, whereas EMRFS and open source S3A averaged 286.4 seconds and 1,438.5 seconds, making EMR S3A 3.15 times and 15.82 times faster, respectively.

    [Figure: average dynamic partition overwrite runtime in seconds by file system]
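This post quotes write gains both as ratios (3.15 times) and as percentages (215%). The Python sketch below shows how the two forms relate, using the average runtimes reported for the dynamic partition overwrite. The function names are illustrative. Because the inputs are already rounded averages, ratios recomputed this way can differ in the last digit from the figures reported in the post.

```python
def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s

def improvement_pct(baseline_s, candidate_s):
    """The same speedup as a percentage improvement: (ratio - 1) * 100."""
    return (speedup(baseline_s, candidate_s) - 1) * 100

# Average dynamic partition overwrite runtimes in seconds, from the results above.
emr_s3a, emrfs = 90.9, 286.4

print(f"{speedup(emrfs, emr_s3a):.2f}x")         # 3.15x
print(f"{improvement_pct(emrfs, emr_s3a):.0f}%")  # 215%
```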

    Summary

    Amazon EMR continues to enhance its Apache Spark runtime and S3A connector, delivering performance improvements that help big data customers run analytics workloads more cost-effectively. Beyond performance gains, the shift to S3A brings enhanced standardization, improved cross-platform portability, and community-driven support, while matching or surpassing the performance of the previous EMRFS implementation.

    We recommend that you stay up-to-date with the latest Amazon EMR release to take advantage of the latest performance and feature benefits. Subscribe to the AWS Big Data Blog’s RSS feed to learn more about the Amazon EMR runtime for Apache Spark, configuration best practices, and tuning advice.


    About the authors

    Giovanni Matteo Fumarola

    Giovanni is the Senior Manager for the Amazon EMR Spark and Iceberg group. He is an Apache Hadoop Committer and PMC member. He has focused on the big data analytics space since 2013.

    Sushil Kumar Shivashankar

    Sushil is the Engineering Manager for the Amazon EMR Hadoop and Flink team at Amazon Web Services. With a focus on big data analytics since 2014, he leads development, optimizations, and growth strategies for Hadoop and Flink business in Amazon EMR.

    Narayanan Venkateswaran

    Narayanan is a Senior Software Development Engineer in the Amazon EMR group. He works on developing Hadoop components in Amazon EMR. He has over 20 years of work experience in the industry across several companies, including Sun Microsystems, Microsoft, Amazon, and Oracle. Narayanan also holds a PhD in databases with a focus on horizontal scalability in relational stores.

    Syed Shameerur Rahman

    Syed is a Software Development Engineer at Amazon EMR. He is interested in highly scalable, distributed computing. He is an active contributor to open source projects like Apache Hive, Apache Tez, Apache ORC, and Apache Hadoop, and has contributed important features and optimizations. During his free time, he enjoys exploring new places and trying new foods.

    Rajarshi Sarkar

    Rajarshi is a Software Development Engineer at Amazon EMR. He works on cutting-edge features of Amazon EMR and is also involved in open source projects such as Apache Hive, Iceberg, Trino, and Hadoop. In his spare time, he likes to travel, watch movies, and hang out with friends.


