Big Data

Modernize Apache Spark workflows using Spark Connect on Amazon EMR on Amazon EC2

By Big Tee Tech Hub | December 21, 2025 | 13 Mins Read

    Apache Spark Connect, introduced in Spark 3.4, enhances the Spark ecosystem by offering a client-server architecture that separates the Spark runtime from the client application. Spark Connect enables more flexible and efficient interactions with Spark clusters, particularly in scenarios where direct access to cluster resources is limited or impractical.

    A key use case for Spark Connect on Amazon EMR is connecting directly from your local development environment to Amazon EMR clusters. With this decoupled approach, you can write and test Spark code on your laptop while using Amazon EMR clusters for execution, reducing development time and simplifying data processing with Spark on Amazon EMR.

    In this post, we demonstrate how to implement Apache Spark Connect on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) to build decoupled data processing applications. We show how to set up and configure Spark Connect securely, so you can develop and test Spark applications locally while executing them on remote Amazon EMR clusters.

    Solution architecture

    The architecture centers on an Amazon EMR cluster with two node types. The primary node hosts both the Spark Connect API endpoint and Spark Core components, serving as the gateway for client connections. The core node provides additional compute capacity for distributed processing. Although this solution demonstrates the architecture with two nodes for simplicity, it scales to support multiple core and task nodes based on workload requirements.

    In Apache Spark Connect version 4.x, TLS/SSL network encryption is not inherently supported. We show you how to implement secure communications by deploying an Amazon EMR cluster with Spark Connect on Amazon EC2 using an Application Load Balancer (ALB) with TLS termination as the secure interface. This approach enables encrypted data transmission between Spark Connect clients and Amazon Virtual Private Cloud (Amazon VPC) resources.

    The operational flow is as follows:

    1. Bootstrap script – During Amazon EMR initialization, the primary node fetches and executes the start-spark-connect.sh file from Amazon Simple Storage Service (Amazon S3). This script starts the Spark Connect server.
    2. Server availability – When the bootstrap process is complete, the Spark Connect server enters a waiting state, ready to accept incoming connections. The Spark Connect API endpoint becomes available on the configured port (typically 15002), listening for gRPC connections from remote clients.
    3. Client interaction – Spark Connect clients can establish secure connections to an Application Load Balancer. These clients translate DataFrame operations into unresolved logical query plans, encode these plans using protocol buffers, and send them to the Spark Connect API using gRPC.
    4. Encryption in transit – The Application Load Balancer receives incoming gRPC or HTTPS traffic, performs TLS termination (decrypting the traffic), and forwards the requests to the primary node. The certificate is stored in AWS Certificate Manager (ACM).
    5. Request processing – The Spark Connect API receives the unresolved logical plans, translates them into Spark’s built-in logical plan operators, passes them to Spark Core for optimization and execution, and streams results back to the client as Apache Arrow-encoded row batches.
    6. (Optional) Operational access – Administrators can securely connect to both primary and core nodes through Session Manager, a capability of AWS Systems Manager, enabling troubleshooting and maintenance without exposing SSH ports or managing key pairs.
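As a quick sanity check for step 2, a small Python sketch (an illustration of ours, not part of the post's scripts) can probe whether a host is accepting TCP connections on the Spark Connect port; the hostname below is a placeholder:

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (hypothetical host): probe the Spark Connect default port
# is_port_open("primary-node.internal", 15002)
```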

    The following diagram depicts the architecture of this post’s demonstration for submitting Spark unresolved logical plans to EMR clusters using Spark Connect.

    Apache Spark Connect on Amazon EMR solution architecture diagram

    Prerequisites

    To proceed with this post, ensure you have the following:

      • An AWS account with the AWS CLI installed and configured with sufficient permissions for IAM, Amazon EMR, Amazon EC2, Elastic Load Balancing, ACM, and Amazon S3
      • A VPC with at least one private subnet for the EMR cluster and two public subnets in different Availability Zones for the Application Load Balancer
      • Python 3 installed on your local development machine for the test client

    Implementation steps

    In this recipe, through AWS CLI commands, you will:

    1. Prepare the bootstrap script, a bash script that starts Spark Connect on Amazon EMR.
    2. Set up the permissions for Amazon EMR to provision resources and perform service-level actions with other AWS services.
    3. Create the Amazon EMR cluster with the associated roles and permissions, attaching the prepared script as a bootstrap action.
    4. Deploy the Application Load Balancer and an ACM certificate to secure data in transit over the internet.
    5. Modify the primary node’s security group to allow Spark Connect clients to connect.
    6. Verify the setup with a test application that connects a client to the Spark Connect server.

    Prepare the bootstrap script

    To prepare the bootstrap script, follow these steps:

    1. Create an Amazon S3 bucket to host the bootstrap bash script:
      REGION=
      BUCKET_NAME=
      aws s3api create-bucket \
         --bucket $BUCKET_NAME \
         --region $REGION \
         --create-bucket-configuration LocationConstraint=$REGION
      # Note: if REGION is us-east-1, omit the --create-bucket-configuration argument.

    2. Open your preferred text editor and add the following commands to a new file with a name such as start-spark-connect.sh. If the script runs on the primary node, it starts the Spark Connect server. If it runs on a task or core node, it does nothing:
      #!/bin/bash
      if grep isMaster /mnt/var/lib/info/instance.json | grep false;
      then
          echo "This is not master node, do nothing."
          exit 0
      fi
      echo "This is master, continuing to execute script"
      SPARK_HOME=/usr/lib/spark
      SPARK_VERSION=$(spark-submit --version 2>&1 | grep "version" | head -1 | awk '{print $NF}' | grep -oE '[0-9]+\.[0-9]+\.[0-9]+')
      SCALA_VERSION=$(spark-submit --version 2>&1 | grep -o "Scala version [0-9.]*" | awk '{print $3}' | grep -oE '[0-9]+\.[0-9]+')
      echo "Spark version ${SPARK_VERSION} is installed under ${SPARK_HOME} running with scala version ${SCALA_VERSION}"
      sudo "${SPARK_HOME}"/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_"${SCALA_VERSION}:${SPARK_VERSION}"

    3. Upload the script into the bucket created in step 1:
      aws s3 cp start-spark-connect.sh s3://$BUCKET_NAME
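The grep/awk pipeline in the bootstrap script extracts version numbers from `spark-submit --version` output. A Python equivalent of that parsing logic, for illustration only (the cluster itself runs the bash version above, and the sample banner text is hypothetical):

```python
import re

def parse_spark_versions(version_output: str) -> tuple:
    """Extract (spark_version, scala_version) from `spark-submit --version` output."""
    spark = re.search(r"version\s+(\d+\.\d+\.\d+)", version_output)
    scala = re.search(r"Scala version (\d+\.\d+)", version_output)
    if not spark or not scala:
        raise ValueError("could not parse Spark/Scala versions")
    return spark.group(1), scala.group(1)

# Hypothetical sample of the banner spark-submit prints:
sample = """Welcome to Spark version 3.5.5
Using Scala version 2.12.18, OpenJDK 64-Bit Server VM"""
print(parse_spark_versions(sample))  # ('3.5.5', '2.12')
```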
      

    Set up the permissions

    Before creating the cluster, you must create the service role and instance profile. A service role is an IAM role that Amazon EMR assumes to provision resources and perform service-level actions with other AWS services. An EC2 instance profile for Amazon EMR assigns a role to every EC2 instance in a cluster. The instance profile must specify a role that can access the resources for your bootstrap action.

    1. Create the IAM role:
      aws iam create-role \
      --role-name AmazonEMR-ServiceRole-SparkConnectDemo \
      --assume-role-policy-document '{
      	"Version": "2012-10-17",
      	"Statement": [{
      		"Effect": "Allow",
      		"Principal": {"Service": "elasticmapreduce.amazonaws.com"},
      		"Action": "sts:AssumeRole"
      		}]
      }'
      

    2. Attach the necessary managed policies to the service role to allow Amazon EMR to manage the underlying Amazon EC2 and Amazon S3 services on your behalf, and optionally allow instances to interact with Systems Manager:
      aws iam attach-role-policy \
      --role-name AmazonEMR-ServiceRole-SparkConnectDemo \
      --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEMRServicePolicy_v2
      
      aws iam attach-role-policy \
      --role-name AmazonEMR-ServiceRole-SparkConnectDemo \
      --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      
      aws iam attach-role-policy \
      --role-name AmazonEMR-ServiceRole-SparkConnectDemo \
      --policy-arn arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole
      

    3. Create an Amazon EMR instance role to grant permissions to EC2 instances to interact with Amazon S3 or other AWS services:
      aws iam create-role \
      --role-name EMR_EC2_SparkClusterNodesRole \
      --assume-role-policy-document '{
      "Version": "2012-10-17",
      "Statement": [{
         "Effect": "Allow",
         "Principal": {"Service": "ec2.amazonaws.com"},
         "Action": "sts:AssumeRole"
         }]
      }'
      

    4. To allow the primary instance to read from Amazon S3, attach the AmazonS3ReadOnlyAccess policy to the Amazon EMR instance role. For production environments, this access policy should be reviewed and replaced with a custom policy following the principle of least privilege, granting only the specific permissions needed for your use case:
      aws iam attach-role-policy \
      --role-name EMR_EC2_SparkClusterNodesRole \
      --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
      

    5. Attaching AmazonSSMManagedInstanceCore policy enables the instances to use core Systems Manager features, such as Session Manager, and Amazon CloudWatch:
      aws iam attach-role-policy \
      --role-name EMR_EC2_SparkClusterNodesRole \
      --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      

    6. To pass the IAM role information to the EC2 instances when they start, create the Amazon EMR EC2 instance profile named EMR_EC2_SparkClusterInstanceProfile:
      aws iam create-instance-profile \
      --instance-profile-name EMR_EC2_SparkClusterInstanceProfile
      

    7. Attach the role EMR_EC2_SparkClusterNodesRole created in step 3 to the newly created instance profile:
      aws iam add-role-to-instance-profile \
      --instance-profile-name EMR_EC2_SparkClusterInstanceProfile \
      --role-name EMR_EC2_SparkClusterNodesRole
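The trust policies above are plain JSON, so a quick local sanity check before handing them to the CLI can catch quoting mistakes. A small sketch of ours (the policy text mirrors step 3):

```python
import json

trust_policy = """{
  "Version": "2012-10-17",
  "Statement": [{
     "Effect": "Allow",
     "Principal": {"Service": "ec2.amazonaws.com"},
     "Action": "sts:AssumeRole"
  }]
}"""

doc = json.loads(trust_policy)  # raises ValueError if the JSON is malformed
services = [s["Principal"]["Service"] for s in doc["Statement"]]
print(services)  # ['ec2.amazonaws.com']
```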
      

    Create the Amazon EMR cluster

    To create the Amazon EMR cluster, follow these steps:

    1. Set the environment variables identifying the VPC and subnets where your EMR cluster and load balancer will be deployed:
      VPC_ID=
      EMR_PRI_SB_ID_1=
      ALB_PUB_SB_ID_1=
      ALB_PUB_SB_ID_2=
      

    2. Create the EMR cluster with the latest Amazon EMR release. Replace the placeholder value with your actual S3 bucket name where the bootstrap action script is stored:
      CLUSTER_ID=$(aws emr create-cluster \
      --name "Spark Connect cluster demo" \
      --applications Name=Spark \
      --release-label emr-7.9.0 \
      --service-role AmazonEMR-ServiceRole-SparkConnectDemo \
      --ec2-attributes InstanceProfile=EMR_EC2_SparkClusterInstanceProfile,SubnetId=$EMR_PRI_SB_ID_1 \
      --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge InstanceGroupType=CORE,InstanceCount=1,InstanceType=m5.xlarge \
      --bootstrap-actions Path="s3://$BUCKET_NAME/start-spark-connect.sh" \
      --query 'ClusterId' --output text)
      echo CLUSTER_ID="$CLUSTER_ID"
      

      The next steps modify the primary node’s security group to allow SSH access to the node through EC2 Instance Connect.

    3. Get the primary node’s security group identifier and record it, because you’ll need it for subsequent configuration steps:
      PRIMARY_NODE_SG=$(aws emr describe-cluster \
      --cluster-id $CLUSTER_ID \
      --query 'Cluster.Ec2InstanceAttributes.EmrManagedMasterSecurityGroup' \
      --output text)
      echo PRIMARY_NODE_SG=$PRIMARY_NODE_SG
      

    4. Find the EC2 Instance Connect prefix list ID for your Region. You can use a prefix-list-name filter with the describe-managed-prefix-lists command. Using a managed prefix list provides a dynamic security configuration that authorizes the EC2 Instance Connect service to reach the primary and core nodes over SSH:
      IC_PREFIX_LIST=$(aws ec2 describe-managed-prefix-lists \
      --filters Name=prefix-list-name,Values=com.amazonaws.$REGION.ec2-instance-connect \
      --query 'PrefixLists[0].PrefixListId' \
      --output text)
      echo IC_PREFIX_LIST=$IC_PREFIX_LIST
      

    5. Modify the primary node security group inbound rules to allow SSH access (port 22) to the EMR cluster’s primary node from resources that are part of the specified Instance Connect service contained in the prefix list:
      aws ec2 authorize-security-group-ingress \
      --region $REGION \
      --group-id $PRIMARY_NODE_SG \
      --ip-permissions "[{\"IpProtocol\":\"tcp\",\"FromPort\":22,\"ToPort\":22,\"PrefixListIds\":[{\"PrefixListId\":\"$IC_PREFIX_LIST\"}]}]"
      

    Optionally, you can repeat steps 3–5 for the core (and task) nodes to allow Amazon EC2 Instance Connect to access those instances through SSH.
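The `--ip-permissions` argument in step 5 embeds JSON inside shell escaping, which is easy to get wrong. Generating it programmatically avoids quoting mistakes; a sketch of ours (the function name and sample prefix list ID are illustrative):

```python
import json

def ssh_from_prefix_list(prefix_list_id: str) -> str:
    """Build the --ip-permissions JSON for SSH ingress from a managed prefix list."""
    permissions = [{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "PrefixListIds": [{"PrefixListId": prefix_list_id}],
    }]
    return json.dumps(permissions)

print(ssh_from_prefix_list("pl-0123456789abcdef0"))
```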

    Deploy the Application Load Balancer and certificate

    To deploy the Application Load Balancer and certificate, follow these steps:

    1. Create a load balancer’s security group:
      ALB_SG_ID=$(aws ec2 create-security-group \
      --group-name spark-connect-alb-sg \
      --description "Security group for Spark Connect ALB" \
      --region $REGION \
      --vpc-id $VPC_ID \
      --query 'GroupId' \
      --output text)
      

    2. Add a rule to accept TCP traffic on port 443 from a trusted IP address. We recommend using the local development machine’s IP address; you can check your current public IP address at https://checkip.amazonaws.com:
      aws ec2 authorize-security-group-ingress \
      --group-id $ALB_SG_ID \
      --protocol tcp \
      --port 443 \
      --cidr /32
      

    3. Create a new target group with gRPC protocol, which targets the Spark Connect server instance and the port the server is listening to:
      ALB_TG_ARN=$(aws elbv2 create-target-group \
      --name spark-connect-tg \
      --protocol HTTP \
      --protocol-version GRPC \
      --port 15002 \
      --target-type instance \
      --health-check-enabled \
      --health-check-protocol HTTP \
      --health-check-path / \
      --vpc-id $VPC_ID \
      --query 'TargetGroups[0].TargetGroupArn' \
      --output text)
      echo "ALB TG created (ARN)=$ALB_TG_ARN"
      

    4. Create the Application Load Balancer:
      ALB_ARN=$(aws elbv2 create-load-balancer \
      --name spark-connect-alb \
      --type application \
      --scheme internet-facing \
      --subnets $ALB_PUB_SB_ID_1 $ALB_PUB_SB_ID_2 \
      --security-groups $ALB_SG_ID \
      --query 'LoadBalancers[0].LoadBalancerArn' \
      --output text)
      echo "ALB created (ARN)=$ALB_ARN"
      

    5. Get the load balancer DNS name:
      ALB_DNS=$(aws elbv2 describe-load-balancers \
      --load-balancer-arns $ALB_ARN \
      --query 'LoadBalancers[0].DNSName' \
      --output text)
      echo "ALB DNS=$ALB_DNS"
      

    6. Retrieve the Amazon EMR primary node ID:
      PRIMARY_NODE_ID=$(aws emr list-instances --cluster-id $CLUSTER_ID --instance-group-types MASTER --query 'Instances[0].Ec2InstanceId' --output text)
      echo PRIMARY_NODE_ID=$PRIMARY_NODE_ID
      

    7. (Optional) To encrypt and decrypt the traffic, the load balancer needs a certificate. You can skip this step if you already have a trusted certificate in ACM. Otherwise, create a self-signed certificate:
      PRIVATE_KEY_PATH=./sc-private-key.key
      CERTIFICATE_PATH=./sc-certificate.cert
      openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout $PRIVATE_KEY_PATH -out $CERTIFICATE_PATH -subj "/CN=$ALB_DNS"
      

    8. Upload to ACM:
      ACM_CERT_ARN=$(aws acm import-certificate \
      --certificate fileb://$CERTIFICATE_PATH \
      --private-key fileb://$PRIVATE_KEY_PATH \
      --region $REGION \
      --query CertificateArn \
      --output text)
      echo "Certificate created (ARN)=$ACM_CERT_ARN"
      

    9. Create the load balancer listener:
      ALB_LISTENER_ARN=$(aws elbv2 create-listener \
      --load-balancer-arn $ALB_ARN \
      --protocol HTTPS \
      --port 443 \
      --certificates CertificateArn=$ACM_CERT_ARN \
      --ssl-policy ELBSecurityPolicy-TLS13-1-2-2021-06 \
      --default-actions Type=forward,TargetGroupArn=$ALB_TG_ARN \
      --region $REGION \
      --query 'Listeners[0].ListenerArn' \
      --output text)
      echo "ALB listener created (ARN)=$ALB_LISTENER_ARN"
      

    10. After the listener has been provisioned, register the primary node to the target group:
      aws elbv2 register-targets \
      --target-group-arn $ALB_TG_ARN \
      --targets Id=$PRIMARY_NODE_ID
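Step 2 above restricts ingress to a single trusted IP expressed as a /32 CIDR block. Python's standard library can validate that string before it reaches the CLI; a small illustrative sketch (the sample address is from the documentation range):

```python
import ipaddress

def to_single_host_cidr(ip: str) -> str:
    """Validate an IPv4 address and return it as a /32 CIDR block."""
    return str(ipaddress.ip_network(f"{ip}/32"))

print(to_single_host_cidr("203.0.113.7"))  # 203.0.113.7/32
```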
      

    Modify the primary node’s security group to allow Spark Connect clients to connect

    To connect to Spark Connect, amend only the primary security group. Add an inbound rule to the primary node’s security group to accept Spark Connect TCP connections on port 15002 from the Application Load Balancer’s security group:

    aws ec2 authorize-security-group-ingress \
    --group-id $PRIMARY_NODE_SG \
    --protocol tcp \
    --port 15002 \
    --source-group $ALB_SG_ID
    

    Connect with a test application

    This example demonstrates that a client running a newer Spark version (4.0.1) can successfully connect to an older Spark version on the Amazon EMR cluster (3.5.5), showcasing Spark Connect’s version compatibility feature. This version combination is for demonstration only. Running older versions might pose security risks in production environments.

    To test the client-to-server connection, we provide the following test Python application. We recommend that you create and activate a Python virtual environment (venv) before installing the packages. This helps isolate the dependencies for this specific project and prevents conflicts with other Python projects. To install packages, run the following command:

    pip install pyspark-client==4.0.1
    

    In your integrated development environment (IDE), copy and paste the following code, replace the placeholder host in the connection string with your load balancer’s DNS name, and run it. The code creates a Spark DataFrame containing two rows and displays its data:

    from pyspark.sql import SparkSession
    import os
    os.environ['GRPC_DEFAULT_SSL_ROOTS_FILE_PATH'] = os.path.expanduser('sc-certificate.cert')
    spark = SparkSession.builder \
        .remote("sc://:443/;use_ssl=true") \
        .config('spark.sql.execution.pandas.inferPandasDictAsMap', True) \
        .config('spark.sql.pyspark.legacy.inferMapTypeFromFirstPair.enabled', True) \
        .getOrCreate()
    spark.createDataFrame([("sue", 32),("li", 3)],["first_name", "age"]).show()
    

    The following shows the application output:

    +----------+---+
    |first_name|age|
    +----------+---+
    |       sue| 32|
    |        li|  3|
    +----------+---+
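The `remote()` string in the client above follows Spark Connect's `sc://host:port/;param=value` connection scheme. A small helper of ours to compose it (the function name and hostname are illustrative, not part of PySpark):

```python
def spark_connect_url(host: str, port: int = 443, use_ssl: bool = True) -> str:
    """Compose a Spark Connect connection string like sc://host:443/;use_ssl=true."""
    ssl_param = "true" if use_ssl else "false"
    return f"sc://{host}:{port}/;use_ssl={ssl_param}"

print(spark_connect_url("my-alb.elb.amazonaws.com"))
# sc://my-alb.elb.amazonaws.com:443/;use_ssl=true
```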
    

    Clean up

    When you no longer need the cluster, release the following resources to stop incurring charges:

    1. Delete the Application Load Balancer listener, target group, and the load balancer.
    2. Delete the ACM certificate.
    3. Delete the load balancer security group and the Amazon EMR node security groups.
    4. Terminate the EMR cluster.
    5. Empty the Amazon S3 bucket and delete it.
    6. Remove AmazonEMR-ServiceRole-SparkConnectDemo and EMR_EC2_SparkClusterNodesRole roles and EMR_EC2_SparkClusterInstanceProfile instance profile.

    Considerations

    Consider the following security measures when deploying Spark Connect:

    • Private subnet deployment – Keep EMR clusters in private subnets with no direct internet access, using NAT gateways for outbound connectivity only.
    • Access logging and monitoring – Enable VPC Flow Logs, AWS CloudTrail, and bastion host access logs for audit trails and security monitoring.
    • Security group restrictions – Configure security groups to allow Spark Connect port (15002) access only from bastion host or specific IP ranges.

    Conclusion

    In this post, we showed how you can adopt modern development workflows and debug Spark applications from local IDEs or notebooks, so you can step through code execution. With Spark Connect’s client-server architecture, the Spark cluster can run on a different version than the client applications, so operations teams can perform infrastructure upgrades and patches independently.

    As the cluster operators gain experience, they can customize the bootstrap actions and add steps to process data. Consider exploring Amazon Managed Workflows for Apache Airflow (MWAA) for orchestrating your data pipeline.


    About the authors

    Philippe Wanner

    Philippe is EMEA Tech Lead at AWS. His role is to accelerate the digital transformation for large organizations. His current focus is in a multidisciplinary area involving business transformation, technical strategy, and distributed systems.

    Ege Oguzman

    Ege is a Software Development Engineer at AWS, and previously he was a Solutions Architect in the public sector. As a builder and cloud enthusiast, he specializes in distributed systems and dedicates his time to infrastructure development and helping organizations build solutions on AWS.


