    Apache Spark encryption performance improvement with Amazon EMR 7.9

    By big tee tech hub | December 1, 2025 | 11 Mins Read

    The Amazon EMR runtime for Apache Spark is a performance-optimized runtime for Apache Spark that is 100% API compatible with open source Apache Spark. With Amazon EMR release 7.9.0, the EMR runtime for Apache Spark introduces significant performance improvements for encrypted workloads, supporting Spark version 3.5.5.

    For compliance and security requirements, many customers need to enable Apache Spark’s local storage encryption (spark.io.encryption.enabled = true) in addition to Amazon Simple Storage Service (Amazon S3) encryption (such as server-side encryption (SSE) or AWS Key Management Service (AWS KMS)). This feature encrypts shuffle files, cached data, and other intermediate data written to local disk during Spark operations, protecting sensitive data at rest on Amazon EMR cluster instances.

    Industries subject to regulations such as the Health Insurance Portability and Accountability Act (HIPAA) for healthcare, Payment Card Industry Data Security Standard (PCI-DSS) for financial services, General Data Protection Regulation (GDPR) for personal data, and Federal Risk and Authorization Management Program (FedRAMP) for government often require encryption of all data at rest, including temporary files on local storage. While Amazon S3 encryption protects data in object storage, Spark’s I/O encryption secures the intermediate shuffle and spill data that Spark writes to local disk during distributed processing—data that never reaches Amazon S3 but might contain sensitive information extracted from source datasets. Generally, encrypted operations require additional computational overhead that can impact overall job performance.

    With the built-in encryption optimizations of Amazon EMR 7.9.0, customers might see significant performance improvements in their Apache Spark applications without requiring any application changes. In our performance benchmark tests, derived from TPC-DS performance tests at 3 TB scale, we observed up to 20% faster performance with the EMR 7.9 optimized Spark runtime compared to Spark without these optimizations. Individual results may vary depending on specific workloads and configurations.

    In this post, we analyze the results from our benchmark tests comparing the Amazon EMR 7.9 optimized Spark runtime against Spark 3.5.5 without encryption optimizations. We walk through a detailed cost analysis and provide step-by-step instructions to reproduce the benchmark.

    Results observed

    To evaluate the performance improvements, we used an open source Spark performance test utility derived from the TPC-DS performance test toolkit. We ran the tests on two nine-node (eight core nodes and one primary node) r5d.4xlarge Amazon EMR 7.9.0 clusters, comparing two configurations:

    • Baseline: EMR 7.9.0 cluster with a bootstrap action installing Spark 3.5.5 without encryption optimizations
    • Optimized: EMR 7.9.0 cluster using the EMR Spark 3.5.5 runtime with encryption optimizations

    Both tests used data stored in Amazon Simple Storage Service (Amazon S3). All data processing was configured identically except for the Spark runtime version.

    To keep the comparison consistent and equivalent, we disabled Dynamic Resource Allocation (DRA) in both test configurations. This eliminates variability from dynamic scaling, so that we can measure pure computational performance improvements.
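The post does not show how DRA was turned off in the original runs. One possible approach, sketched below, is an EMR configuration classification: it assumes the standard spark-defaults classification and the open source Spark property spark.dynamicAllocation.enabled. The helper name build_spark_defaults is hypothetical.

```python
import json


def build_spark_defaults(disable_dra: bool = True) -> list:
    """Build an EMR configurations list for the spark-defaults classification.

    Assumption: DRA is controlled through spark.dynamicAllocation.enabled;
    the original benchmark may have disabled it by other means.
    """
    properties = {}
    if disable_dra:
        properties["spark.dynamicAllocation.enabled"] = "false"
    return [{"Classification": "spark-defaults", "Properties": properties}]


# Print the JSON so it can be redirected to a file of your choice.
print(json.dumps(build_spark_defaults(), indent=2))
```

If the output is saved to a file, it could be supplied to aws emr create-cluster through the --configurations option (for example, --configurations file://./spark-dra-off.json, where the file name is illustrative). Applying the same file to both clusters keeps the two configurations identical in this respect.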

    The following table shows the total job runtime for all queries (in seconds) in the 3 TB query dataset between the baseline and Amazon EMR 7.9 optimized configurations:

    Configuration | Total runtime (seconds) | Geometric mean (seconds) | Performance improvement
    Baseline (Spark 3.5.5 without optimization) | 1,485 | 10.24 | n/a
    EMR 7.9 (with encryption optimization) | 1,176 | 8.15 | 20% faster

    We observed that our TPC-DS tests with the Amazon EMR 7.9 optimized Spark runtime completed about 20% faster based on total runtime and 20% faster based on geometric mean compared to the baseline configuration.

    The encryption optimizations in Amazon EMR 7.9 deliver performance benefits through:

    • Improved shuffle and decryption operations reducing overhead during data exchange without compromising security
    • Better memory management for intermediate results

    Cost analysis

    The performance improvements of the Amazon EMR 7.9 optimized Spark runtime directly translate to lower costs. We realized an approximately 20% cost savings running the benchmark application with encryption optimizations compared to the baseline configuration, because the shorter runtime reduces the billed hours for Amazon EMR, Amazon Elastic Compute Cloud (Amazon EC2), and Amazon Elastic Block Store (Amazon EBS) General Purpose SSD (gp2) volumes.

    The following table summarizes the cost comparison in the us-east-1 AWS Region:

    Configuration | Runtime (hours) | Estimated cost | Total EC2 instances | Total vCPU | Total memory (GiB) | Root device (EBS)
    Baseline: Spark 3.5.5 without optimization, 1 primary and 8 core nodes | 0.41 | $5.28 | 9 | 144 | 1152 | 64 GiB gp2
    Amazon EMR 7.9 with optimization, 1 primary and 8 core nodes | 0.33 | $4.25 | 9 | 144 | 1152 | 64 GiB gp2

    Cost breakdown

    Formulas used:

    • Amazon EMR cost – Number of instances × EMR hourly rate × Runtime hours
    • Amazon EC2 cost – Number of instances × EC2 hourly rate × Runtime hours
    • Amazon EBS cost – (EBS cost per GB per month ÷ hours in a month) × EBS volume size × number of instances × runtime hours

    Note: EBS is priced monthly ($0.1 per GB per month), so we divide by 730 hours to convert to an hourly rate. EMR and EC2 are already priced hourly, so no conversion is needed.

    Baseline configuration (0.41 hours):

    • Amazon EMR cost – 9 × $0.27 × 0.41 = $1.00
    • Amazon EC2 cost – 9 × $1.152 × 0.41 = $4.25
    • Amazon EBS cost – ($0.1/730 × 64 × 9 × 0.41) = $0.032
    • Total cost – $5.28

    EMR 7.9 optimized configuration (0.33 hours):

    • Amazon EMR cost – 9 × $0.27 × 0.33 = $0.80
    • Amazon EC2 cost – 9 × $1.152 × 0.33 = $3.42
    • Amazon EBS cost – ($0.1/730 × 64 × 9 × 0.33) = $0.026
    • Total cost – $4.25

    Total cost savings: 20% per benchmark run, which scales linearly with your production workload frequency.
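The arithmetic above can be reproduced with a short script. This is a minimal sketch that plugs the rates and runtimes quoted in this post into the three formulas; the function and variable names are illustrative, and the rates are this post's us-east-1 figures rather than live pricing.

```python
HOURS_PER_MONTH = 730  # conversion the post uses for monthly EBS pricing


def run_cost(hours, instances=9, emr_rate=0.27, ec2_rate=1.152,
             ebs_gb_month=0.10, ebs_gib=64):
    """Return the per-service and total cost (USD) of one benchmark run."""
    emr = instances * emr_rate * hours
    ec2 = instances * ec2_rate * hours
    ebs = ebs_gb_month / HOURS_PER_MONTH * ebs_gib * instances * hours
    return {"emr": emr, "ec2": ec2, "ebs": ebs, "total": emr + ec2 + ebs}


baseline = run_cost(0.41)
optimized = run_cost(0.33)
savings_pct = (baseline["total"] - optimized["total"]) / baseline["total"] * 100

print(f"Baseline total:  ${baseline['total']:.2f}")   # $5.28
print(f"Optimized total: ${optimized['total']:.2f}")  # $4.25
print(f"Savings:         {savings_pct:.1f}%")         # 19.5%
```

Because every term is linear in runtime, the savings percentage equals the relative runtime reduction (1 - 0.33/0.41, about 19.5%), which the post rounds to 20%.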

    Set up EMR benchmarking

    For detailed instructions and scripts, see the companion GitHub repository.

    Prerequisites

    To set up Amazon EMR benchmarking, start by completing the following prerequisite steps:

    1. Configure your AWS Command Line Interface (AWS CLI) by running aws configure to point to your benchmarking account.
    2. Create an S3 bucket for test data and results.
    3. Copy the TPC-DS 3TB source data from a publicly available dataset to your S3 bucket using the following command:
      aws s3 cp s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned s3:///BLOG_TPCDS-TEST-3T-partitioned --recursive

      In the destination path, insert the name of the S3 bucket you created in step 2 immediately after s3://.

    4. Build or download the benchmark application JAR file (spark-benchmark-assembly-3.3.0.jar).
    5. Ensure you have appropriate AWS Identity and Access Management (IAM) roles for EMR cluster creation and Amazon S3 access.

    Deploy the baseline EMR cluster (without optimization)

    Step 1: Launch EMR 7.9.0 cluster with bootstrap action

    The baseline configuration uses a bootstrap action to install Spark 3.5.5 without encryption optimizations. We have made the bootstrap script publicly available in an S3 bucket for your convenience.

    Create the default Amazon EMR roles:

    aws emr create-default-roles

    Now create the cluster:

    aws emr create-cluster \
      --name "EMR-7.9-Baseline-Spark-3.5.5" \
      --release-label emr-7.9.0 \
      --applications Name=Spark \
      --ec2-attributes SubnetId=,InstanceProfile=EMR_EC2_DefaultRole \
      --service-role EMR_DefaultRole \
      --instance-groups \
        InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r5d.4xlarge \
        InstanceGroupType=CORE,InstanceCount=8,InstanceType=r5d.4xlarge \
      --bootstrap-actions \
        Path=s3://spark-ba/install-spark-3-5-5-no-encryption.sh,Name="install spark 3.5.5 without encryption optimization" \
      --log-uri s3:///logs/baseline/

    Note: The bootstrap script is available in a public S3 bucket at s3://spark-ba/install-spark-3-5-5-no-encryption.sh. This script installs Apache Spark 3.5.5 without the encryption optimizations present in the Amazon EMR runtime.

    Step 2: Submit the benchmark job to the baseline cluster

    Next, submit the Spark job using the following command:

    aws emr add-steps \
      --cluster-id  \
      --steps 'Type=Spark,Name="EMR-7.9-Baseline-Spark-3.5.5 Step",ActionOnFailure=CONTINUE,Args=["--deploy-mode","client","--conf","spark.io.encryption.enabled=true","--class","com.amazonaws.eks.tpcds.BenchmarkSQL","s3:///jar/spark-benchmark-assembly-3.3.0.jar","s3:///blog/BLOG_TPCDS-TEST-3T-partitioned","s3:///blog/BASELINE_TPCDS-TEST-3T-RESULT","/opt/tpcds-kit/tools","parquet","3000","3","false","q1-v2.4,q10-v2.4,q11-v2.4,q12-v2.4,q13-v2.4,q14a-v2.4,q14b-v2.4,q15-v2.4,q16-v2.4,q17-v2.4,q18-v2.4,q19-v2.4,q2-v2.4,q20-v2.4,q21-v2.4,q22-v2.4,q23a-v2.4,q23b-v2.4,q24a-v2.4,q24b-v2.4,q25-v2.4,q26-v2.4,q27-v2.4,q28-v2.4,q29-v2.4,q3-v2.4,q30-v2.4,q31-v2.4,q32-v2.4,q33-v2.4,q34-v2.4,q35-v2.4,q36-v2.4,q37-v2.4,q38-v2.4,q39a-v2.4,q39b-v2.4,q4-v2.4,q40-v2.4,q41-v2.4,q42-v2.4,q43-v2.4,q44-v2.4,q45-v2.4,q46-v2.4,q47-v2.4,q48-v2.4,q49-v2.4,q5-v2.4,q50-v2.4,q51-v2.4,q52-v2.4,q53-v2.4,q54-v2.4,q55-v2.4,q56-v2.4,q57-v2.4,q58-v2.4,q59-v2.4,q6-v2.4,q60-v2.4,q61-v2.4,q62-v2.4,q63-v2.4,q64-v2.4,q65-v2.4,q66-v2.4,q67-v2.4,q68-v2.4,q69-v2.4,q7-v2.4,q70-v2.4,q71-v2.4,q72-v2.4,q73-v2.4,q74-v2.4,q75-v2.4,q76-v2.4,q77-v2.4,q78-v2.4,q79-v2.4,q8-v2.4,q80-v2.4,q81-v2.4,q82-v2.4,q83-v2.4,q84-v2.4,q85-v2.4,q86-v2.4,q87-v2.4,q88-v2.4,q89-v2.4,q9-v2.4,q90-v2.4,q91-v2.4,q92-v2.4,q93-v2.4,q94-v2.4,q95-v2.4,q96-v2.4,q97-v2.4,q98-v2.4,q99-v2.4,ss_max-v2.4","true"]'

    Deploy the optimized EMR cluster (with encryption optimization)

    Step 1: Launch EMR 7.9.0 cluster with Spark runtime

    The optimized configuration uses the EMR 7.9.0 Spark runtime without any bootstrap actions:

    aws emr create-cluster \
      --name "EMR-7.9-Optimized-Native-Spark" \
      --release-label emr-7.9.0 \
      --applications Name=Spark \
      --ec2-attributes SubnetId=,InstanceProfile=EMR_EC2_DefaultRole \
      --service-role EMR_DefaultRole \
      --instance-groups \
        InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r5d.4xlarge \
        InstanceGroupType=CORE,InstanceCount=8,InstanceType=r5d.4xlarge \
      --log-uri s3:///logs/optimized/

    Example (using the default roles and no bootstrap action):

    aws emr create-cluster \
      --name "EMR-7.9-Optimized-Native-Spark" \
      --release-label emr-7.9.0 \
      --applications Name=Spark \
      --ec2-attributes SubnetId=subnet-08a5f71f92bc8a801 \
      --instance-groups \
        InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r5d.4xlarge \
        InstanceGroupType=CORE,InstanceCount=8,InstanceType=r5d.4xlarge \
      --use-default-roles \
      --log-uri s3://aws-logs-123456789012-us-west-2/elasticmapreduce/

    Step 2: Submit the benchmark job to optimized cluster

    Next, submit the Spark job using the following command:

    aws emr add-steps \
      --cluster-id  \
      --steps 'Type=Spark,Name="EMR-7.9-Optimized-Native-Spark Step",ActionOnFailure=CONTINUE,Args=["--deploy-mode","client","--conf","spark.io.encryption.enabled=true","--class","com.amazonaws.eks.tpcds.BenchmarkSQL","s3:///jar/spark-benchmark-assembly-3.3.0.jar","s3:///blog/BLOG_TPCDS-TEST-3T-partitioned","s3:///blog/OPTIMIZED_TPCDS-TEST-3T-RESULT","/opt/tpcds-kit/tools","parquet","3000","3","false","q1-v2.4,q10-v2.4,q11-v2.4,q12-v2.4,q13-v2.4,q14a-v2.4,q14b-v2.4,q15-v2.4,q16-v2.4,q17-v2.4,q18-v2.4,q19-v2.4,q2-v2.4,q20-v2.4,q21-v2.4,q22-v2.4,q23a-v2.4,q23b-v2.4,q24a-v2.4,q24b-v2.4,q25-v2.4,q26-v2.4,q27-v2.4,q28-v2.4,q29-v2.4,q3-v2.4,q30-v2.4,q31-v2.4,q32-v2.4,q33-v2.4,q34-v2.4,q35-v2.4,q36-v2.4,q37-v2.4,q38-v2.4,q39a-v2.4,q39b-v2.4,q4-v2.4,q40-v2.4,q41-v2.4,q42-v2.4,q43-v2.4,q44-v2.4,q45-v2.4,q46-v2.4,q47-v2.4,q48-v2.4,q49-v2.4,q5-v2.4,q50-v2.4,q51-v2.4,q52-v2.4,q53-v2.4,q54-v2.4,q55-v2.4,q56-v2.4,q57-v2.4,q58-v2.4,q59-v2.4,q6-v2.4,q60-v2.4,q61-v2.4,q62-v2.4,q63-v2.4,q64-v2.4,q65-v2.4,q66-v2.4,q67-v2.4,q68-v2.4,q69-v2.4,q7-v2.4,q70-v2.4,q71-v2.4,q72-v2.4,q73-v2.4,q74-v2.4,q75-v2.4,q76-v2.4,q77-v2.4,q78-v2.4,q79-v2.4,q8-v2.4,q80-v2.4,q81-v2.4,q82-v2.4,q83-v2.4,q84-v2.4,q85-v2.4,q86-v2.4,q87-v2.4,q88-v2.4,q89-v2.4,q9-v2.4,q90-v2.4,q91-v2.4,q92-v2.4,q93-v2.4,q94-v2.4,q95-v2.4,q96-v2.4,q97-v2.4,q98-v2.4,q99-v2.4,ss_max-v2.4","true"]'

    Benchmark command parameters explained

    The Amazon EMR Spark step uses the following parameters:

    • EMR step configuration:
      • Type=Spark: Specifies this is a Spark application step
      • Name="EMR-7.9-Baseline-Spark-3.5.5 Step": Human-readable name for the step
      • ActionOnFailure=CONTINUE: Continue with other steps if this one fails
    • Spark submit arguments:
      • --deploy-mode client: Run the driver on the primary node (not cluster mode)
      • --conf spark.io.encryption.enabled: Controls Spark's local storage encryption for the run
      • --class com.amazonaws.eks.tpcds.BenchmarkSQL: Main class for the TPC-DS benchmark
    • Application parameters:
      • JAR file: s3:///jar/spark-benchmark-assembly-3.3.0.jar
      • Input data: s3:///blog/BLOG_TPCDS-TEST-3T-partitioned (3 TB TPC-DS dataset)
      • Output location: s3:///blog/BASELINE_TPCDS-TEST-3T-RESULT (S3 path for results)
      • TPC-DS tools path: /opt/tpcds-kit/tools (local path on EMR nodes)
      • Format: parquet (output format)
      • Scale factor: 3000 (3 TB dataset size)
      • Iterations: 3 (run each query 3 times for averaging)
      • Collect results: false (don’t collect results to driver)
      • Query list: "q1-v2.4,q10-v2.4,...,ss_max-v2.4" (104 queries in total)
      • Final parameter: true (enable detailed logging and metrics)
    • Query coverage:
      • 103 standard TPC-DS benchmark queries (q1-v2.4 through q99-v2.4, with a and b variants for q14, q23, q24, and q39)
      • Plus the ss_max-v2.4 query for additional testing, for 104 queries in total
      • Each query runs 3 times to calculate average performance
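The positional parameters above are easy to misorder inside the long one-line --steps string. The sketch below assembles the same 16 arguments programmatically and prints them as JSON. The bucket name, the helper names, and the numeric query ordering are illustrative assumptions (the commands in this post list queries in lexicographic order), and whether your AWS CLI version accepts this JSON through --steps file://... should be verified before relying on it.

```python
import json


def tpcds_query_names(suffix="v2.4"):
    """q1..q99 (with a/b variants for q14, q23, q24, q39) plus ss_max."""
    split = {14, 23, 24, 39}
    names = []
    for n in range(1, 100):
        variants = ("a", "b") if n in split else ("",)
        names.extend(f"q{n}{v}-{suffix}" for v in variants)
    names.append(f"ss_max-{suffix}")
    return names


def build_step(name, bucket, result_prefix, queries, encryption=True):
    """Return a one-element step list mirroring the add-steps arguments."""
    args = [
        "--deploy-mode", "client",
        "--conf", f"spark.io.encryption.enabled={'true' if encryption else 'false'}",
        "--class", "com.amazonaws.eks.tpcds.BenchmarkSQL",
        f"s3://{bucket}/jar/spark-benchmark-assembly-3.3.0.jar",
        f"s3://{bucket}/blog/BLOG_TPCDS-TEST-3T-partitioned",  # input data
        f"s3://{bucket}/blog/{result_prefix}",                 # output location
        "/opt/tpcds-kit/tools", "parquet", "3000", "3", "false",
        ",".join(queries), "true",
    ]
    return [{"Type": "Spark", "Name": name,
             "ActionOnFailure": "CONTINUE", "Args": args}]


queries = tpcds_query_names()
# "my-benchmark-bucket" is a hypothetical bucket name; substitute your own.
step = build_step("EMR-7.9-Optimized-Native-Spark Step", "my-benchmark-bucket",
                  "OPTIMIZED_TPCDS-TEST-3T-RESULT", queries)
print(len(queries))  # 104
print(json.dumps(step, indent=2))
```

Generating the step for both clusters from one function makes it harder for the two runs to drift apart in anything other than the step name and output prefix.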

    Summarize the results

    1. Download the test result files from both output S3 locations:
      # Baseline results
      aws s3 cp s3:///blog/BASELINE_TPCDS-TEST-3T-RESULT/timestamp=xxxx/summary.csv/xxx.csv ./baseline-results.csv
         
      # Optimized results
      aws s3 cp s3:///blog/OPTIMIZED_TPCDS-TEST-3T-RESULT/timestamp=xxxx/summary.csv/xxx.csv ./optimized-results.csv

    2. The CSV files contain four columns (without headers):
      • Query name
      • Median time (seconds)
      • Minimum time (seconds)
      • Maximum time (seconds)
    3. Calculate performance metrics for comparison:
      • Average time per query: AVERAGE(median, min, max) for each query
      • Total runtime: Sum of all median times
      • Geometric mean: GEOMEAN(average times) across all queries
      • Speedup: Calculate the ratio between baseline and optimized for each query
    4. Create the comparison analysis: Performance improvement (%) = (Baseline time - Optimized time) / Baseline time × 100
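The metrics in steps 3 and 4 can be computed with a few lines of Python instead of a spreadsheet. This is a minimal sketch that assumes the headerless four-column layout described in step 2; the function names are illustrative and are not part of the benchmark utility.

```python
import csv
import math


def load_results(path):
    """Read a headerless summary CSV: query, median, min, max (seconds)."""
    with open(path, newline="") as fh:
        return {row[0]: tuple(float(x) for x in row[1:4])
                for row in csv.reader(fh) if row}


def summarize(results):
    """Return (total runtime, geometric mean) as defined in step 3."""
    medians = [times[0] for times in results.values()]
    averages = [sum(times) / 3 for times in results.values()]
    total = sum(medians)
    geomean = math.exp(sum(math.log(a) for a in averages) / len(averages))
    return total, geomean


def improvement_pct(baseline, optimized):
    """Percentage reduction in time relative to the baseline."""
    return (baseline - optimized) / baseline * 100


# Sanity check against the totals published in the results table.
print(f"{improvement_pct(1485, 1176):.1f}%")  # 20.8%
```

With the downloaded files in place, summarize(load_results("baseline-results.csv")) and summarize(load_results("optimized-results.csv")) return the two totals and geometric means, which can then be passed to improvement_pct.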

    Testing configuration details

    The following table summarizes the test environment used for this post:

    Parameter | Value
    EMR release | emr-7.9.0 (both configurations)
    Baseline Spark version | 3.5.5 (installed through bootstrap action)
    Baseline bootstrap script | s3://spark-ba/install-spark-3-5-5-no-encryption.sh (public)
    Optimized Spark version | Amazon EMR Spark runtime
    Cluster size | 9 nodes (1 primary and 8 core)
    Instance type | r5d.4xlarge
    vCPUs per node | 16
    Memory per node | 128 GiB
    Instance storage | 600 GB SSD
    EBS volume | 64 GiB gp2 (2 volumes per instance)
    Total vCPUs | 144 (9 × 16)
    Total memory | 1152 GiB (9 × 128)
    Dataset | TPC-DS 3 TB (Parquet format)
    Queries | 104 queries (TPC-DS v2.4)
    Iterations | 3 runs per query
    DRA | Disabled for consistent benchmarking

    Clean up

    To avoid incurring future charges, delete the resources you created:

    1. Terminate both EMR clusters:
      aws emr terminate-clusters --cluster-ids  

    2. Delete S3 test results if no longer needed:
      aws s3 rm s3:///blog/BASELINE_TPCDS-TEST-3T-RESULT/ --recursive
      aws s3 rm s3:///blog/OPTIMIZED_TPCDS-TEST-3T-RESULT/ --recursive
      aws s3 rm s3:///logs/ --recursive

    3. Remove IAM roles if created specifically for testing

    Key findings

    • Up to 20% performance improvement using the Amazon EMR 7.9 Spark runtime with no code changes required
    • 20% cost savings because of reduced runtime
    • Significant gains for shuffle-heavy, join-intensive workloads
    • 100% API compatibility with open source Apache Spark
    • Simple migration from custom Spark builds to EMR runtime
    • Easy benchmarking using publicly available bootstrap scripts

    Conclusion

    You can run your Apache Spark workloads up to 20% faster and at lower cost without making any changes to your applications by using the Amazon EMR 7.9.0 optimized Spark runtime. This improvement is achieved through numerous optimizations in the EMR Spark runtime, including enhanced encryption handling, improved data serialization, and optimized shuffle operations.

    To learn more about Amazon EMR 7.9 and best practices, see the EMR documentation. For configuration guidance and tuning advice, subscribe to the AWS Big Data Blog.

    If you’re running Spark workloads on Amazon EMR today, we encourage you to test the EMR 7.9 Spark runtime with your production workloads and measure the improvements specific to your use case.


    About the authors

    Sonu Kumar Singh

    Sonu is a Senior Solutions Architect with more than 13 years of experience, with a specialization in Analytics and Healthcare domain. He has been instrumental in catalyzing transformative shifts in organizations by enabling data-driven decision-making thereby fueling innovation and growth. He enjoys it when something he designed or created brings a positive impact.

    Roshin Babu

    Roshin is a Sr. Specialist Solutions Architect at AWS, where he collaborates with the sales team to support public sector clients. His role focuses on developing innovative solutions that solve complex business challenges while driving increased adoption of AWS analytics services. When he’s not working, Roshin is passionate about exploring new destinations, discovering great food, and enjoying soccer both as a player and fan.

    Polaris Jhandi

    Polaris is a Cloud Application Architect with AWS Professional Services. He has a background in AI/ML and big data. He is currently working with customers to migrate their legacy mainframe applications to the AWS Cloud.

    Zheng Yuan

    Zheng is a Software Engineer on the Amazon EMR Spark team, where he focuses on improving the performance of the Spark execution engine across various use cases.


