localstack for local glue dev
trying out localstack to see if we can use it for local dev
i created a venv: python3 -m venv venv
activated it: source venv/bin/activate
installed localstack per the install instructions; i went with the cli since they recommend it as the easiest way: python3 -m pip install localstack
running localstack --help should bring up usage information
to run localstack in a docker container: localstack start
from the venv you can tail the logs: localstack logs
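a quick way to confirm localstack is actually up from code is its health endpoint (a sketch; it assumes the default edge port 4566 and localstack's /_localstack/health path):

```python
import json
import urllib.request
from urllib.error import URLError

def localstack_health(url="http://localhost:4566/_localstack/health"):
    """Return localstack's health JSON, or None if it isn't reachable."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return json.loads(resp.read())
    except (URLError, OSError):
        return None

status = localstack_health()
print("localstack up:", status is not None)
```

the returned JSON lists each emulated service and its state, which is handy for checking that s3 is running before pointing glue at it.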
using awslocal
to use the aws cli against the local cloud, install awslocal, a wrapper around the aws cli that saves us from specifying the endpoint url on every call: pip install awscli-local
create local bucket:
awslocal s3api create-bucket --bucket <local bucket name>
list buckets:
awslocal s3api list-buckets
upload file to local bucket:
awslocal s3api put-object --bucket <local bucket name> --key <prefix/keyname> --body <path/to/file>
next i try to run glue in a container, via aws instructions:
docker pull amazon/aws-glue-libs:glue_libs_2.0.0_image_01
run container
docker run -itd -p 8888:8888 -p 4040:4040 -e AWS_ACCESS_KEY_ID=test -e AWS_SECRET_ACCESS_KEY=test -e AWS_REGION=us-east-1 -e S3_ENDPOINT=http://localstack:4566 --name glue_jupyter amazon/aws-glue-libs:glue_libs_2.0.0_image_01 /home/jupyter/jupyter_start.sh
i get an id back, but the container doesn't show up in docker ps. with docker ps -a i can see the container exited. i check the logs:
docker logs -t glue_jupyter
and see:
2022-10-04T19:41:35.142125900Z bash: /home/jupyter/jupyter_start.sh: No such file or directory
i run the container without the start command for now, exec into it, and i'm able to navigate to the jupyter_start.sh script and start jupyter. i then browse to
http://127.0.0.1:8888/lab
but i'm seeing errors:
[W 2022-10-04 20:53:31.237 ServerApp] SSL Error on 9 ('172.17.0.1', 63838): [SSL: HTTP_REQUEST] http request (_ssl.c:1091)
i tried a couple of things, including mounting my local project directory into the container via the -v parameter:
-v <local directory>:/home/jupyter/jupyter_default_dir
i realized there's an ssl issue in the more recent glue image, so i run with ssl disabled for now:
docker run -itd -p 8888:8888 -p 4040:4040 -e DISABLE_SSL=true -e AWS_ACCESS_KEY_ID=test -e AWS_SECRET_ACCESS_KEY=test -e AWS_REGION=us-east-1 -e S3_ENDPOINT=http://localstack:4566 -v <local directory>:/home/jupyter/jupyter_default_dir --name glue_jupyter amazon/aws-glue-libs:glue_libs_2.0.0_image_01
also, running pwd inside my container, it turns out the jupyter_start.sh script lives at a different path:
/home/glue_user/jupyter
the command below works:
docker run -itd -p 8888:8888 -p 4040:4040 -e DISABLE_SSL=true -e AWS_ACCESS_KEY_ID=test -e AWS_SECRET_ACCESS_KEY=test -e AWS_REGION=us-east-1 -e S3_ENDPOINT=http://localstack:4566 -v <local directory>:/home/jupyter/jupyter_default_dir --name glue_jupyter amazon/aws-glue-libs:glue_libs_2.0.0_image_01 /home/glue_user/jupyter/jupyter_start.sh
verify the container is running: docker ps
looking at the logs we see jupyter started: docker logs -t glue_jupyter
navigate to: http://127.0.0.1:8888/lab
trying to run the following:
from pyspark import SparkContext as sc
from awsglue.context import GlueContext

glueContext = GlueContext(sc.getOrCreate())
inputDF = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://<local-bucket>/<file-name.json>"]},
    format="json")
inputDF.toDF().show()
i get error:
Py4JJavaError: An error occurred while calling o88.getDynamicFrame.
: java.nio.file.AccessDeniedException: s3://<local-bucket>/<file-name>.json: getFileStatus on s3://<local-bucket>/<file-name>.json: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden
i use boto3 to confirm i can connect to my local s3 backend running in the container:
import boto3

s3 = boto3.client(
    's3',
    endpoint_url="http://host.docker.internal:4566",
    use_ssl=False,
    aws_access_key_id='test',
    aws_secret_access_key='test',
    region_name='us-east-1'
)
for key in s3.list_objects(Bucket='bucket-name')['Contents']:
    print(key['Key'])
i just need to modify my glueContext to use a custom endpoint, or find some other way to connect to this local backend.
i find this:
sc.hadoopConfiguration.set("fs.s3a.endpoint", "<<ENDPOINT>>");
sc.hadoopConfiguration.set("fs.s3a.access.key","<<ACCESS_KEY>>");
sc.hadoopConfiguration.set("fs.s3a.secret.key","<<SECRET_KEY>>");
but i get an attribute error:
AttributeError: type object 'SparkContext' has no attribute 'hadoopConfiguration'
i realized this is for scala; the python equivalent is:
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "http://host.docker.internal:4566");
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key","test");
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key","test");
apparently _jsc is no longer available; you can set it this way instead:
import pyspark
conf = pyspark.SparkConf().setAll(
    [('fs.s3a.endpoint', 'http://host.docker.internal:4566'),
     ('fs.s3a.access.key', 'test'),
     ('fs.s3a.secret.key', 'test')])
sc = pyspark.SparkContext(conf=conf)
and run this to confirm:
sc.getConf().getAll()
but i was still getting access denied, and ended up doing a few things:
- the fs.s3a.* keys need the `spark.hadoop.` prefix, e.g. `spark.hadoop.fs.s3a.access.key` instead of `fs.s3a.access.key`
- needed to add the `spark.hadoop.fs.s3a.path.style.access` setting as well
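those two fixes can be sketched as a small helper that adds the spark.hadoop. prefix and forces path-style access (the function name is mine, purely for illustration):

```python
def s3a_conf(endpoint, access_key, secret_key):
    """Build SparkConf pairs for an s3a endpoint.

    Keys are prefixed with 'spark.hadoop.' so Spark passes them through to
    the Hadoop configuration; path-style access is enabled because localstack
    buckets aren't addressable as virtual-hosted subdomains.
    """
    settings = {
        'fs.s3a.endpoint': endpoint,
        'fs.s3a.access.key': access_key,
        'fs.s3a.secret.key': secret_key,
        'fs.s3a.path.style.access': 'true',
    }
    return [('spark.hadoop.' + k, v) for k, v in settings.items()]

pairs = s3a_conf('http://host.docker.internal:4566', 'test', 'test')
```

the resulting list can be handed straight to pyspark.SparkConf().setAll(...).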
what i used finally to get my local glue container working with a localstack s3 backend:
import pyspark
conf = pyspark.SparkConf().setAll(
    [('spark.hadoop.fs.s3a.endpoint', 'http://host.docker.internal:4566'),
     ('spark.hadoop.fs.s3a.access.key', 'test'),
     ('spark.hadoop.fs.s3a.secret.key', 'test'),
     ('spark.hadoop.fs.s3a.path.style.access', 'true')])
sc = pyspark.SparkContext(conf=conf)
Update: December 19, 2023
- an update if you are using a localstack s3 backend with the aws glue image in a local container to run glue jobs.
- when reading files into a glue dynamic frame, if you are getting this error:
An error occurred while calling o62.getDynamicFrame.
: java.nio.file.AccessDeniedException: s3://<bucket/prefix>: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by DefaultAWSCredentialsProviderChain : com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [EnvironmentVariableCredentialsProvider: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY)), SystemPropertiesCredentialsProvider: Unable to load AWS credentials from Java system properties (aws.accessKeyId and aws.secretKey), WebIdentityTokenCredentialsProvider: You must specify a value for roleArn and roleSessionName, com.amazonaws.auth.profile.ProfileCredentialsProvider@5731e4ca: profile file cannot be null, com.amazonaws.auth.EC2ContainerCredentialsProviderWrapper@7a5310ef: Failed to connect to service endpoint: ]
the solution is to add this spark setting:
('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
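for reference, here's the full set of settings i'd end up with after this fix, written as a plain list to hand to pyspark.SparkConf().setAll(...) (a sketch; the endpoint and dummy credentials assume the localstack setup above):

```python
# SparkConf pairs for glue-in-docker against a localstack s3 backend
S3A_LOCALSTACK_CONF = [
    ('spark.hadoop.fs.s3a.endpoint', 'http://host.docker.internal:4566'),
    ('spark.hadoop.fs.s3a.access.key', 'test'),
    ('spark.hadoop.fs.s3a.secret.key', 'test'),
    ('spark.hadoop.fs.s3a.path.style.access', 'true'),
    # without this, the s3a connector falls back to the default provider
    # chain and fails with NoAuthWithAWSException, as in the error above
    ('spark.hadoop.fs.s3a.aws.credentials.provider',
     'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider'),
]
```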