
Defining Spark Schemas with Strings in DDL


Defining a schema in Spark can sometimes be tedious and confusing. There are also some use cases (for example when writing tests, article TBD) where it is advantageous to define the schema in a single string.

Schema definitions as a string follow the idea of SQL's Data Definition Language (DDL). DDL expressions are used, for example, to create tables in SQL databases:

CREATE TABLE employees (id INTEGER not null, name VARCHAR(50) not null);

Here the expression

id INTEGER not null, name VARCHAR(50) not null

defines the schema of the table.

This notation can also be used to define schemas in Spark. For example, instead of defining:

from pyspark.sql import types as T

schema = T.StructType(
    [
        T.StructField("id", T.IntegerType(), False),
        T.StructField("name", T.StringType(), True),
    ]
)

you can simply write

schema = "id int not null, name string"

To convert the string into an actual schema object, you can use the function _parse_datatype_string from pyspark.sql.types. However, this is not needed when creating a DataFrame: spark.createDataFrame accepts the DDL string directly as its schema argument.
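
As a minimal sketch, assuming a running SparkSession named spark (the parser delegates to the JVM, so an active session is required), both variants look like this:

from pyspark.sql.types import _parse_datatype_string

ddl = "id int not null, name string"

# Parse the DDL string into a StructType explicitly ...
schema = _parse_datatype_string(ddl)
print(schema)  # StructType with a non-nullable int column and a nullable string column

# ... or skip the parsing step and pass the string straight to createDataFrame
df = spark.createDataFrame([(1, "Alice"), (2, None)], schema=ddl)
df.printSchema()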

Note that this notation also works for more complicated cases. However, when using nested schemas, you need to add a colon between each column name and its data type. For example, the following nested schema:

from pyspark.sql import types as T

schema = T.StructType(
    [
        T.StructField("id", T.StringType(), False),
        T.StructField("person", T.ArrayType(
            T.StructType([
                T.StructField("name", T.StringType(), False),
                T.StructField("age", T.IntegerType(), True),
            ])), False),
    ]
)

becomes:

id: string not null, person: array<struct<name: string not null, age: int>> not null
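
Again as a sketch, assuming a running SparkSession named spark, the nested string can be passed directly to createDataFrame:

nested_ddl = "id: string not null, person: array<struct<name: string not null, age: int>> not null"

data = [
    ("1", [("Alice", 30), ("Bob", None)]),
    ("2", []),
]

df = spark.createDataFrame(data, schema=nested_ddl)
df.printSchema()  # shows person as an array of structs, with the nullability flags from above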