localstack for local glue dev

diary of a codelovingyogi
Oct 13, 2022

trying out localstack to see if we can use it for local dev

i created a venv: python3 -m venv venv

activated it: source venv/bin/activate

install localstack per the instructions here; i decided to use the cli since they recommend it as the easiest way to install: python3 -m pip install localstack

running localstack --help should bring up usage information

to run localstack in docker container run: localstack start

in venv you should be able to see logs: localstack logs

using awslocal to run the aws cli against the local cloud; it's a wrapper around awscli so we don't need to specify the endpoint url on every command: pip install awscli-local

create local bucket:

awslocal s3api create-bucket --bucket <local bucket name>

list buckets:

awslocal s3api list-buckets

upload file to local bucket:

awslocal s3api put-object --bucket <local bucket name> --key <prefix/keyname> --body <path/to/file>
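
(if you'd rather do these s3 steps from python instead of the cli, here's a quick boto3 sketch; it assumes localstack is listening on its default edge port 4566 with the dummy test credentials, and the bucket/file names are just placeholders)

import boto3

# point boto3 at the localstack edge port
s3 = boto3.client(
    's3',
    endpoint_url='http://localhost:4566',
    aws_access_key_id='test',
    aws_secret_access_key='test',
    region_name='us-east-1'
)

s3.create_bucket(Bucket='my-local-bucket')
s3.upload_file('path/to/file.json', 'my-local-bucket', 'prefix/file.json')
print([obj['Key'] for obj in s3.list_objects_v2(Bucket='my-local-bucket').get('Contents', [])])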

next i try to run glue in a container, following the aws instructions:

docker pull amazon/aws-glue-libs:glue_libs_2.0.0_image_01

run container

docker run -itd -p 8888:8888 -p 4040:4040 -e AWS_ACCESS_KEY_ID=test -e AWS_SECRET_ACCESS_KEY=test -e AWS_REGION=us-east-1 -e S3_ENDPOINT=http://localstack:4566 --name glue_jupyter amazon/aws-glue-libs:glue_libs_2.0.0_image_01 /home/jupyter/jupyter_start.sh

i get an id returned but don't see the container in docker ps

i try docker ps -a and see the container exited

i look at logs

docker logs -t glue_jupyter

and see

2022-10-04T19:41:35.142125900Z bash: /home/jupyter/jupyter_start.sh: No such file or directory

i run the container without the start script for now and exec into it; i'm able to navigate to the jupyter_start.sh script and run jupyter manually. i navigate to

http://127.0.0.1:8888/lab

but seeing errors:

[W 2022-10-04 20:53:31.237 ServerApp] SSL Error on 9 ('172.17.0.1', 63838): [SSL: HTTP_REQUEST] http request (_ssl.c:1091)

tried a couple of things, including mounting my local project directory into the container via the -v parameter

-v <local directory>:/home/jupyter/jupyter_default_dir

realized there’s an issue w/ SSL in the more recent glue image, so i run with disabled ssl for now:

docker run -itd -p 8888:8888 -p 4040:4040 -e DISABLE_SSL=true -e AWS_ACCESS_KEY_ID=test -e AWS_SECRET_ACCESS_KEY=test -e AWS_REGION=us-east-1 -e S3_ENDPOINT=http://localstack:4566 -v <local directory>:/home/jupyter/jupyter_default_dir --name glue_jupyter amazon/aws-glue-libs:glue_libs_2.0.0_image_01

also, running pwd in my container, it looks like the jupyter_start.sh script is at a different path:

/home/glue_user/jupyter

running below works:

docker run -itd -p 8888:8888 -p 4040:4040 -e DISABLE_SSL=true -e AWS_ACCESS_KEY_ID=test -e AWS_SECRET_ACCESS_KEY=test -e AWS_REGION=us-east-1 -e S3_ENDPOINT=http://localstack:4566 -v <local directory>:/home/jupyter/jupyter_default_dir --name glue_jupyter amazon/aws-glue-libs:glue_libs_2.0.0_image_01 /home/glue_user/jupyter/jupyter_start.sh

verify container is running: docker ps

looking at the logs, we see jupyter started: docker logs -t glue_jupyter

navigate to: http://127.0.0.1:8888/lab

trying to run the following:

from pyspark import SparkContext as sc
from awsglue.context import GlueContext

glueContext = GlueContext(sc.getOrCreate())
inputDF = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://<local-bucket>/<file-name.json>"]},
    format="json",
)
inputDF.toDF().show()

i get an error:

Py4JJavaError: An error occurred while calling o88.getDynamicFrame.
: java.nio.file.AccessDeniedException: s3://<local-bucket>/<file-name>.json: getFileStatus on s3://<local-bucket>/<file-name>.json: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden

i use boto3 to confirm i can connect to my local s3 backend running in the localstack container:

import boto3

s3 = boto3.client(
    's3',
    endpoint_url="http://host.docker.internal:4566",
    use_ssl=False,
    aws_access_key_id='test',
    aws_secret_access_key='test',
    region_name='us-east-1'
)
for key in s3.list_objects(Bucket='bucket-name')['Contents']:
    print(key['Key'])

i just need to modify my glueContext to use a custom endpoint, or find some other way to connect to this local backend.

i find this:

sc.hadoopConfiguration.set("fs.s3a.endpoint", "<<ENDPOINT>>");
sc.hadoopConfiguration.set("fs.s3a.access.key","<<ACCESS_KEY>>");
sc.hadoopConfiguration.set("fs.s3a.secret.key","<<SECRET_KEY>>");

but i get an attribute error:

AttributeError: type object 'SparkContext' has no attribute 'hadoopConfiguration'

i realized this is for scala and the python equivalent is:

sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "http://host.docker.internal:4566");
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key","test");
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key","test");

apparently _jsc isn't available here either (note that sc above is the SparkContext class, not an instance, which is what the 'type object' in the earlier error is pointing at)

you can set it this way:

import pyspark

conf = pyspark.SparkConf().setAll([
    ('fs.s3a.endpoint', 'http://host.docker.internal:4566'),
    ('fs.s3a.access.key', 'test'),
    ('fs.s3a.secret.key', 'test'),
])
sc = pyspark.SparkContext(conf=conf)

and run this to confirm:

sc.getConf().getAll()

but i was still getting access denied; i ended up doing a few things:

  • instead of plain fs.s3a.access.key etc., the keys need the `spark.hadoop` prefix, e.g. `spark.hadoop.fs.s3a.access.key`
  • needed to add the `spark.hadoop.fs.s3a.path.style.access` setting as well, since localstack buckets are addressed by path rather than by virtual-hosted bucket hostnames

what i finally used to get my local glue container working with a localstack s3 backend:

import pyspark

conf = pyspark.SparkConf().setAll([
    ('spark.hadoop.fs.s3a.endpoint', 'http://host.docker.internal:4566'),
    ('spark.hadoop.fs.s3a.access.key', 'test'),
    ('spark.hadoop.fs.s3a.secret.key', 'test'),
    ('spark.hadoop.fs.s3a.path.style.access', 'true'),
])
sc = pyspark.SparkContext(conf=conf)
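
with that in place, the earlier read can be retried against the localstack bucket; this is just the same snippet from above, now passing in the configured SparkContext (see the update below if you still hit a credentials error):

from awsglue.context import GlueContext

glueContext = GlueContext(sc)
inputDF = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://<local-bucket>/<file-name.json>"]},
    format="json",
)
inputDF.toDF().show()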

Update: December 19, 2023

  • an update if you are using a localstack s3 backend with the aws glue image in a local container to run glue jobs.
  • when reading files into a glue dynamic frame, if you are getting this error:
An error occurred while calling o62.getDynamicFrame.
: java.nio.file.AccessDeniedException: s3://<bucket/prefix>: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by DefaultAWSCredentialsProviderChain : com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [EnvironmentVariableCredentialsProvider: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY)), SystemPropertiesCredentialsProvider: Unable to load AWS credentials from Java system properties (aws.accessKeyId and aws.secretKey), WebIdentityTokenCredentialsProvider: You must specify a value for roleArn and roleSessionName, com.amazonaws.auth.profile.ProfileCredentialsProvider@5731e4ca: profile file cannot be null, com.amazonaws.auth.EC2ContainerCredentialsProviderWrapper@7a5310ef: Failed to connect to service endpoint: ]

the solution is to add this spark setting:

("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
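
for context, that setting just slots into the same setAll list from above; a sketch of the full config with the same localstack endpoint and test credentials:

import pyspark

conf = pyspark.SparkConf().setAll([
    ('spark.hadoop.fs.s3a.endpoint', 'http://host.docker.internal:4566'),
    ('spark.hadoop.fs.s3a.access.key', 'test'),
    ('spark.hadoop.fs.s3a.secret.key', 'test'),
    ('spark.hadoop.fs.s3a.path.style.access', 'true'),
    ('spark.hadoop.fs.s3a.aws.credentials.provider',
     'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider'),
])
sc = pyspark.SparkContext(conf=conf)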
