Read parquet files from Azure Data Lake with Apache Spark
Azure Blob Storage - and especially Azure Data Lake Storage Gen2, which is built on top of it and adds features such as a hierarchical namespace - is a great place to store data in all kinds of formats: a data lake.
Here, I explain how to connect to such a data lake with Apache Spark and read Parquet files from it.
- Retrieve the access key from the Azure portal. Go to the storage account you want to connect to, choose “Access keys” under “Security + networking” and copy one of the keys.
- When creating your SparkSession in your application or notebook, you need to add the hadoop-azure package, which will be automatically downloaded and included in the session:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-azure:3.3.1') \
    .getOrCreate()
- Hand over the key from the first step to the Spark config. Note that the key grants access to your data and is therefore a secret! Handle it properly, for example with a local env-file that is not checked into version control (a sketch follows below), or better with something like a key vault.
spark.conf.set(
    "fs.azure.account.key.<storage account name>.dfs.core.windows.net",
    "<storage account key>"
)
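As a minimal sketch of the env-file approach - assuming the python-dotenv package and a hypothetical environment variable named AZURE_STORAGE_KEY - the key can be loaded at runtime instead of being hard-coded:

import os
from dotenv import load_dotenv

# Read variables from a local .env file that is excluded from version control
load_dotenv()

# AZURE_STORAGE_KEY is a hypothetical name; use whatever you put in your .env file
storage_key = os.environ["AZURE_STORAGE_KEY"]

spark.conf.set(
    "fs.azure.account.key.<storage account name>.dfs.core.windows.net",
    storage_key
)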
- You can now read in parquet files:
df = spark.read.format("parquet") \
    .load("abfss://<container name>@<storage account name>.dfs.core.windows.net/<file path>")

Note that the container name comes before the "@" sign but does not appear within the file path.
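To quickly verify that the connection works - assuming there are readable Parquet files under the given path - you can inspect the schema and a few rows:

# Print the inferred schema and show the first rows of the DataFrame
df.printSchema()
df.show(5)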