pyspark connect to aws s3a filesystem

diary of a codelovingyogi
May 22, 2020

jar dependencies are very finicky

if you can’t run pyspark from the cli without errors in your venv (or wherever pyspark is installed), you’ll likely hit the same errors in your code.
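as a quick sanity check, something like this minimal session (my own sketch, not part of the original setup) should run cleanly from that same venv before you go any further:

from pyspark.sql import SparkSession

# if creating a session or running a trivial job fails here,
# fix the install before debugging any s3a code
spark = SparkSession.builder.appName("smoke_test").getOrCreate()
spark.range(5).show()
spark.stop()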

trying to connect via:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("my_app") \
    .getOrCreate()

# enable v4 request signing and point s3a:// at the S3AFileSystem implementation
spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.InstanceProfileCredentialsProvider,com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
spark._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")
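as an alternative to managing jar files by hand, spark can also resolve them from maven at session startup through the spark.jars.packages config. a sketch, with illustrative coordinates that would need to match your hadoop version:

from pyspark.sql import SparkSession

# hadoop-aws pulls in its matching aws-java-sdk transitively;
# the version here must line up with the hadoop jars in your install
spark = SparkSession.builder \
    .appName("my_app") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3") \
    .getOrCreate()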

then when reading an s3 file:

# file_path_name should be an s3a:// URI, e.g. "s3a://my-bucket/path/data.json"
df = spark.read.json(file_path_name)

first error encountered:

Py4JJavaError: An error occurred while calling o31.json.
: java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities

the StreamCapabilities class doesn’t exist in older hadoop-common jars, so updated the jar to:

hadoop-common-3.1.1.jar

then got this error:

Py4JJavaError: An error occurred while calling o31.json.
: java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.getTimeDuration(Ljava/lang/String;Ljava/lang/String;Ljava/util/concurrent/TimeUnit;)J

this NoSuchMethodError usually means the hadoop jars disagree on versions. on mvnrepository, hadoop-aws-3.1.1.jar pins specific aws-java-sdk dependencies, which didn’t match the versions I had installed.

you can browse all the aws sdk artifacts and their versions here:

https://mvnrepository.com/artifact/com.amazonaws

It’s possible that most of the hadoop jars in my pyspark install are 2.7.3, so I need to find an aws java sdk that is compatible with that version.
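One quick way to confirm which hadoop jars a pip-installed pyspark actually bundles (a sketch assuming the standard wheel layout, where the jars ship inside the pyspark package):

import glob
import os

import pyspark

# in a wheel install, the bundled jars live under the pyspark package's jars/ directory
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
for jar in sorted(glob.glob(os.path.join(jars_dir, "hadoop-*.jar"))):
    print(os.path.basename(jar))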

Looks like the matching sdk for hadoop 2.7.x is aws-java-sdk 1.7.4.

Updated to the latest compatible versions I could find on mvnrepository.

Now getting message:

Py4JJavaError: An error occurred while calling o31.json.
: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)

According to AWS docs:

When you initialize a new service client without supplying any arguments, the AWS SDK for Java attempts to find AWS credentials by using the default credential provider chain implemented by the DefaultAWSCredentialsProviderChain class. 

Ways to store aws credentials:

https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html
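For reference, the shared credentials file that the default chain checks lives at ~/.aws/credentials and follows this layout:

[default]
aws_access_key_id = <your access key id>
aws_secret_access_key = <your secret access key>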

Even though I have aws credentials in both the ~/.aws/credentials and ~/.aws/config profile files, I was still getting the AWSCredentialsProviderChain error.

If you see this warning in the terminal:

20/05/12 20:15:48 WARN BasicProfileConfigLoader: Your profile name includes a 'profile ' prefix. This is considered part of the profile name in the Java SDK, so you will need to include this prefix in your profile name when you reference this profile from your Java code.

Try exporting an environment variable AWS_PROFILE=<profile name stored in config file>

That didn’t work for me, so as a workaround for local development/testing, I read the credentials in from AWS Secrets Manager.
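A minimal sketch of that workaround with boto3 (the secret name and its JSON field names here are hypothetical):

import json

import boto3

# hypothetical secret: an access key pair stored as JSON in Secrets Manager
client = boto3.client("secretsmanager", region_name="us-east-1")
response = client.get_secret_value(SecretId="my_app/aws_keys")
secret = json.loads(response["SecretString"])

aws_key = secret["aws_access_key_id"]
aws_secret_key = secret["aws_secret_access_key"]

These aws_key / aws_secret_key values are what feed the hadoop configuration in the settings below.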

Update 7/21/2020:

These are the settings we currently use (the above was troubleshooting aws s3a against an older version of AWS EMR):

spark = SparkSession.builder \
    .appName("my_app") \
    .config("spark.sql.codegen.wholeStage", False) \
    .getOrCreate()

# aws_key / aws_secret_key are loaded ahead of time (e.g. from Secrets Manager);
# note the standard s3a property names are fs.s3a.access.key / fs.s3a.secret.key
spark._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", aws_key)
spark._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", aws_secret_key)

# only one value can win for fs.s3a.impl; keep S3AFileSystem
# (NativeS3FileSystem is the s3n:// implementation, not s3a)
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.us-east-1.amazonaws.com")

sc = spark.sparkContext

With dependencies:

jets3t-0.9.3.jar
hadoop-aws-2.7.2.jar
guava-11.0.2.jar
hadoop-client-2.7.3.jar
hadoop-lzo-0.4.20.jar
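If you manage local jar files like these yourself, one way to wire them in is the spark.jars config, which takes a comma-separated list of paths (the paths below are illustrative):

from pyspark.sql import SparkSession

# ship these local jars to the driver and executors at session startup
spark = SparkSession.builder \
    .appName("my_app") \
    .config("spark.jars", "/path/to/hadoop-aws-2.7.2.jar,/path/to/jets3t-0.9.3.jar") \
    .getOrCreate()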

With the following EMR version:

Release label: emr-5.23.0
Hadoop distribution: Amazon
Applications: Spark 2.4.0, JupyterHub 0.9.4
