Creating a GraphQL API from a Data Lake: Data Engineering (Part 1)

Kenzo Castañeda
10 min read · Jul 18, 2023


I’m a Solution Architect and Data Engineer on our Integrations team at work, and I own one of our API products, a Data Lake GraphQL API. In this article, I want to share how I built it, starting from the data side.

This article assumes you’re familiar with Azure services such as Data Lake Storage, Data Factory, and Databricks, and have some general knowledge of GraphQL, Python/PySpark, and SQL.

Problem

Data lakes are designed to store large amounts of data, usually for analytics and reporting, but one of our use cases at work was to turn that data lake into a source for a GraphQL API. It’s a bit like reverse ETL, except that applications (both internal and SaaS) call our API to get the data they need.

Why? Well, it’s a central data source that contains everything our users need, and the data is highly relational, which is a perfect fit for GraphQL.

There are a few issues though, primarily with the data:

  1. We have no control over the data lake or the data that goes into it, since another group owns it. So we take what they give us.
  2. Data is stored as Parquet files in the lake. We need to move it to an SQL database to be able to use an ORM and take advantage of that relational goodness.
  3. In our Data Lake, there’s no enforcement of primary key uniqueness, meaning that data is sometimes ingested without deduplication.
  4. There are no fields that signify whether a record is new or updated (such as an updated_at column), so there’s no easy way to get the delta.
  5. Our data lake uses the Star Schema model, and some fact tables can get really huge (~70 million rows), so ingesting them from scratch every time would take almost 8 hours.

Design

Our API’s high-level architecture looks like this:

ADLS2 (Data Lake Blob Storage) → Data Factory/Databricks → SQL Server → Azure Function/GraphQL Server

In Part 1 of this series, I’ll discuss the first three stages and tackle the data engineering aspect first.

Let’s go through every issue above and see what we can do about it.

  1. Data Quality Control — we can create our own pipelines to have more control over the data we get from the lake.
  2. Parquet to SQL Transformation — we can do this in Databricks.
  3. Deduplication — we can also do this in Databricks.
  4. Change Data Capture — I needed to implement our own CDC mechanism because our source has no versioning, update timestamps, or change logs. What worked for us was creating a hash of all the columns and storing it as a separate column, essentially a checksum column: if any single column’s value changes, the hash changes too (a minimal sketch follows this list).
  5. Large Data Volume — I needed to create a delta load pipeline that can detect and extract only the new and updated data since the last load to minimize the volume of data written to our database. This pipeline should include a mechanism for identifying changes in the data (inserts, updates, and deletes) and reflect that in our SQL database.
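To make the checksum idea in item 4 concrete, here is a minimal sketch using Spark SQL’s built-in hash function. The columns here are made up for illustration; the dynamically generated version we actually run appears later in this article.

# Minimal sketch of checksum-based change detection (hypothetical columns;
# the production version is generated dynamically later in this article).
from pyspark.sql import SparkSession
from pyspark.sql.functions import hash as spark_hash

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("APPR", "Approved"), ("REJ", "Rejected")],
    ["workflow_status", "workflow_status_description"],
)

# Hash every column into a single check_sum column. If any value in a row
# changes, its hash changes too, so comparing checksums between loads tells
# us which rows were updated without needing an updated_at field.
df_with_checksum = df.withColumn("check_sum", spark_hash(*df.columns))
df_with_checksum.show(truncate=False)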

Let’s break the above into just three concrete steps for our pipeline, which is the result of months of testing and development:

  1. Sanitation
  2. Deduplication
  3. Delta Loading

Sanitation

We need to sanitize and modify the source tables to ensure data quality before ingesting them into the SQL database. This includes filtering records, filling in null values, and removing known data quality issues.

Below are snippets of code from our Databricks notebook (which runs on Apache Spark):
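These snippets reference a few notebook-level variables (pSchemaName, pTableName, adls2sourcepath, jdbcUrl, dbUser, dbPass). As a rough sketch of how they might be defined at the top of the notebook, assuming Databricks widgets for parameters and a secret scope for credentials (the names and paths below are illustrative, not our exact setup):

# Hypothetical notebook setup; parameter, path, and secret names are assumptions.
pSchemaName = dbutils.widgets.get("pSchemaName")  # e.g. "dbo"
pTableName = dbutils.widgets.get("pTableName")    # e.g. "workflow_status_dim"

# Path to the table's Parquet files in ADLS2 (placeholder storage account/container)
adls2sourcepath = f"abfss://datalake@<storage_account>.dfs.core.windows.net/curated/{pTableName}"

# SQL database connection details, with credentials pulled from a secret scope
jdbcUrl = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>"
dbUser = dbutils.secrets.get(scope="sql-db", key="db-user")
dbPass = dbutils.secrets.get(scope="sql-db", key="db-pass")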

The source table is read from ADLS2 and returned as a DataFrame.

# Read the source table
dfSource = (spark.read
    .format('parquet')
    .option("mode", "FAILFAST")
    .load(adls2sourcepath))

The DataFrame is sanitized depending on what table it is.

# do source table modifications
import datetime

if pTableName == 'fact_table_1':
    dfSource = dfSource.filter(f"fiscal_year_num >= '{datetime.datetime.now().year - 1}'")  # filter records to current year - 1
    dfSource = dfSource.fillna(value='', subset=["tdcval_code"])  # tdcval_code is part of the primary key and cannot be null
elif pTableName == 'fact_table_2':
    dfSource = dfSource.filter("contract_id is not null")  # remove empty contract_id's that were data issues
    dfSource = dfSource.fillna(value=0, subset=["allocation_id"])  # allocation_id is part of the primary key and cannot be null
elif pTableName == 'fact_table_3':
    dfSource = dfSource.filter(f"doc_post_date >= '{datetime.datetime.now().year - 1}'")  # filter records to current year - 1
elif pTableName == 'fact_table_4':
    dfSource = dfSource.filter("partition_key > '201712'")  # data before this partition has known quality issues

Deduplication

To deduplicate, we need to know the primary keys. To do this dynamically, I created a function in SQL Server that outputs the table definition.

This returns each column’s name, type, length, precision, scale, and if it’s nullable or a primary key.

CREATE FUNCTION [config].[func_get_table_definition]
(
    @SchemaName [NVARCHAR](128),
    @TableName [NVARCHAR](225)
)
RETURNS TABLE
AS
RETURN
(
    SELECT
        c.name 'column_name',
        t.Name 'data_type',
        c.max_length / 2 'length',
        c.precision,
        c.scale,
        c.is_nullable,
        ISNULL(i.is_primary_key, 0) 'is_pk'
    FROM
        sys.columns c
    INNER JOIN
        sys.types t ON c.user_type_id = t.user_type_id
    -- OUTER APPLY selects a single row that matches each row from the left table.
    OUTER APPLY
    (
        SELECT TOP 1 *
        FROM sys.index_columns M
        WHERE M.object_id = c.object_id AND M.column_id = c.column_id
    ) m
    LEFT OUTER JOIN
        sys.indexes i ON m.object_id = i.object_id AND m.index_id = i.index_id
    WHERE
        c.object_id = OBJECT_ID(@SchemaName + '.' + @TableName)
)

Reference: https://stackoverflow.com/questions/2418527/sql-server-query-to-get-the-list-of-columns-in-a-table-along-with-data-types-no

Example output for a workflow_status_dim table, which I’ll be using in the rest of the examples here:

| column_name                 | data_type | length | precision | scale | is_nullable | is_pk |
| --------------------------- | --------- | ------ | --------- | ----- | ----------- | ----- |
| workflow_status | nvarchar | 10 | 0 | 0 | FALSE | TRUE |
| workflow_status_description | nvarchar | 35 | 0 | 0 | TRUE | FALSE |

To run the function in Databricks, we just read it like any other table via a SELECT statement:

# Read target table definition
dfTargetDef = spark.read \
    .format("jdbc") \
    .option("url", jdbcUrl) \
    .option("query", "select * from config.func_get_table_definition('{0}', '{1}')".format(pSchemaName, pTableName)) \
    .option("user", dbUser) \
    .option("password", dbPass).load()

Once we have the primary keys, we can deduplicate using PySpark’s dropDuplicates method:

# dedup source data according to primary key columns
from pyspark.sql.functions import col, count, trim

pkColumns = []

pkColumnsDf = dfTargetDef.filter(dfTargetDef.is_pk == 'true')
for row in pkColumnsDf.collect():
    pkColumns.append(f"{row.column_name}")
    if 'char' in row.data_type:
        # trim string primary key columns; leading/trailing spaces cause issues in merging/primary keys
        dfSource = dfSource.withColumn(row.column_name, trim(col(row.column_name)))

# display the duplicate records
display(dfSource.groupBy(pkColumns)
    .agg(count('*').alias("count_duplicates"))
    .filter(col('count_duplicates') >= 2))

dfSourceCount = dfSource.count()
dfSource = dfSource.dropDuplicates(pkColumns)
dfSourceDedupCount = dfSource.count()
dupCount = dfSourceCount - dfSourceDedupCount

print(dupCount)

# Create a temp view of the source table
dfSource.createOrReplaceTempView("SourceTable")

Aside from deduplication, we also do a minor but important sanitation step here: trimming the char-type primary key columns. Rogue spaces in keys can cause issues when merging the data (unfortunately, we learned this from experience).

Delta Loading

Generating the Checksum

The first step in the delta-loading process is creating the checksum column in the source.

The code below dynamically generates an SQL statement that creates a delta table named {pTableName}_src within the Databricks HMS (Hive Metastore). This contains the source table columns plus additional columns such as:

  • check_sum, which contains a hash of all the columns, generated with Spark SQL’s built-in HASH function
  • date columns (modified and created date)
  • flag columns (is_modified and is_deleted) so we can filter for the changes

# Create source table with check_sum hash
spark.sql("REFRESH TABLE SourceTable")  # Invalidates the cached entries
spark.sql(f"DROP TABLE IF EXISTS {pTableName}_src")

entries = [f"CREATE TABLE {pTableName}_src AS SELECT"]
for row in dfTargetDef.collect():
    entries.append(f"SourceTable.{row.column_name},")

entries.append("HASH (")
dfTargetDefPd = dfTargetDef.toPandas()
for idx, row in dfTargetDefPd.iterrows():
    entries.append(f"SourceTable.{row.column_name}{'' if idx == len(dfTargetDefPd.index) - 1 else ','}")

entries.append(") as check_sum,")
entries.append("current_timestamp() as created_date,\ncurrent_timestamp() as modified_date,\nfalse as is_modified,\nfalse as is_deleted")
entries.append("FROM SourceTable;")

sql_comm = "\n".join(str(x) for x in entries)
spark.sql(sql_comm)

This generates an SQL statement like below:

CREATE TABLE workflow_status_dim_src AS SELECT
SourceTable.workflow_status,
SourceTable.workflow_status_description,
HASH (
SourceTable.workflow_status,
SourceTable.workflow_status_description
) as check_sum,
current_timestamp() as created_date,
current_timestamp() as modified_date,
false as is_modified,
false as is_deleted
FROM SourceTable;

Merging the Data

We first read the target table in our SQL database (our working tables are in the api schema). We store this as {pTableName}_dlt in Databricks.

dfTarget = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", jdbcUrl) \
    .option("dbtable", "api." + pTableName) \
    .option("user", dbUser) \
    .option("password", dbPass).load()

dfTarget.createOrReplaceTempView("TargetTable")

# Create the delta/managed table. This initially loads it with the rows from the target table.
spark.sql(f"CREATE TABLE IF NOT EXISTS {pTableName}_dlt USING delta AS SELECT * FROM TargetTable;")

Next, we merge the source into the target table. The code below dynamically generates a MERGE statement that merges {pTableName}_src into {pTableName}_dlt.

# Merge source into target table

entries = [f"MERGE INTO {pTableName}_dlt tgt USING {pTableName}_src src ON ("]

# Get primary keys
df_pk = dfTargetDef.filter(dfTargetDef.is_pk == 'true').toPandas()
for idx, row in df_pk.iterrows():
    entries.append(f"src.{row.column_name} = tgt.{row.column_name}{'' if idx == len(df_pk.index) - 1 else ' AND'}")

entries.append(') WHEN MATCHED AND tgt.check_sum <> src.check_sum THEN UPDATE SET')

# https://www.geeksforgeeks.org/how-to-iterate-over-rows-and-columns-in-pyspark-dataframe/
for row in dfTargetDef.collect():
    entries.append(f"tgt.{row.column_name} = src.{row.column_name},")

entries.append('tgt.check_sum = src.check_sum,\ntgt.modified_date = src.modified_date,\ntgt.is_modified = true,\ntgt.is_deleted = false')

entries.append('WHEN NOT MATCHED BY TARGET THEN INSERT (')
for row in dfTargetDef.collect():
    entries.append(f"{row.column_name},")

entries.append('check_sum,\ncreated_date,\nmodified_date,\nis_modified,\nis_deleted')
entries.append(') VALUES (')

for row in dfTargetDef.collect():
    entries.append(f"src.{row.column_name},")

entries.append('src.check_sum,\nsrc.created_date,\nsrc.modified_date,\ntrue,\nfalse')
entries.append(') WHEN NOT MATCHED BY SOURCE THEN UPDATE SET tgt.is_deleted = true;')

sql_comm = "\n".join(str(x) for x in entries)
spark.sql(sql_comm)

This generates an SQL statement like below:

MERGE INTO workflow_status_dim_dlt tgt USING workflow_status_dim_src src ON (
src.workflow_status = tgt.workflow_status
) WHEN MATCHED AND tgt.check_sum <> src.check_sum THEN UPDATE SET
tgt.workflow_status = src.workflow_status,
tgt.workflow_status_description = src.workflow_status_description,
tgt.check_sum = src.check_sum,
tgt.modified_date = src.modified_date,
tgt.is_modified = true,
tgt.is_deleted = false
WHEN NOT MATCHED BY TARGET THEN INSERT (
workflow_status,
workflow_status_description,
check_sum,
created_date,
modified_date,
is_modified,
is_deleted
) VALUES (
src.workflow_status,
src.workflow_status_description,
src.check_sum,
src.created_date,
src.modified_date,
true,
false
) WHEN NOT MATCHED BY SOURCE THEN UPDATE SET tgt.is_deleted = true;

  • If the primary keys match and the checksums don’t, that means the record has been updated. Set is_modified to True.
  • When no primary keys match, that means it’s a new record. Insert and set is_modified to True.
  • When the primary key isn’t present in the source, then it’s a deleted record. Set is_deleted to True.

Writing the Changes

Once the delta tables have been merged, we write the changes (rows where is_modified or is_deleted is True) to our SQL database under another schema (we use delta).

We use Microsoft’s custom Spark connector for SQL Server to take advantage of the BULK INSERT functionality and make our writes a lot quicker.

dfWrite = spark.sql(f"SELECT * FROM {pTableName}_dlt WHERE is_modified = True OR is_deleted = True;")

dfWrite.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("overwrite") \
    .option("url", jdbcUrl) \
    .option("dbtable", "delta." + pTableName) \
    .option("reliabilityLevel", "BEST_EFFORT") \
    .option("batchsize", 100000) \
    .option("tableLock", "true") \
    .option("user", dbUser) \
    .option("password", dbPass) \
    .option("schemaCheckEnabled", "false") \
    .save()

Since we write the changes to another table, we have to merge them into our main table in the api schema. This time, the MERGE statement runs in a stored procedure in the SQL database, not in Databricks.

Since columns in our data lake tables change from time to time and there are multiple tables in the lake, I had to make the stored procedure dynamic. To do that, I created the procedure via Databricks.

I tapped directly into the database connection object to be able to create a stored procedure in the SQL database.

# Create stored procedure code to merge delta table to api table in database
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
con = driver_manager.getConnection(jdbcUrl, dbUser, dbPass)

exec_statement = con.prepareCall(f"IF (OBJECT_ID('config.delta_load_{pTableName}', 'P') IS NOT NULL) DROP PROCEDURE [config].[delta_load_{pTableName}];")
exec_statement.execute()

entries = [f"CREATE PROCEDURE [config].[delta_load_{pTableName}] AS BEGIN"]

entries.append(f"MERGE INTO [api].[{pTableName}] AS tgt USING [delta].[{pTableName}] src ON (")  # https://stackoverflow.com/questions/10724348/merge-violation-of-primary-key-constraint

# Get primary keys
df_pk = dfTargetDef.filter(dfTargetDef.is_pk == 'true').toPandas()
for idx, row in df_pk.iterrows():
    entries.append(f"src.{row.column_name} = tgt.{row.column_name}{'' if idx == len(df_pk.index) - 1 else ' AND'}")

entries.append(') WHEN MATCHED AND src.is_modified = 1 THEN UPDATE SET')

for row in dfTargetDef.collect():
    entries.append(f"tgt.{row.column_name} = src.{row.column_name},")

entries.append('tgt.check_sum = src.check_sum,\ntgt.modified_date = src.modified_date')
entries.append('WHEN MATCHED AND src.is_deleted = 1 THEN DELETE WHEN NOT MATCHED THEN INSERT (')

for row in dfTargetDef.collect():
    entries.append(f"{row.column_name},")

entries.append('check_sum,\ncreated_date,\nmodified_date')
entries.append(') VALUES (')

for row in dfTargetDef.collect():
    entries.append(f"src.{row.column_name},")

entries.append('src.check_sum,\nsrc.created_date,\nsrc.modified_date')
entries.append(f'); DROP TABLE [delta].{pTableName}; END')

sql_proc = "\n".join(str(x) for x in entries)

# create stored procedure in database
exec_statement = con.prepareCall(sql_proc)
exec_statement.execute()

# Close connections
exec_statement.close()
con.close()

This generates an SQL statement like below:

CREATE PROCEDURE [config].[delta_load_workflow_status_dim] AS BEGIN
MERGE INTO [api].[workflow_status_dim] tgt USING [delta].[workflow_status_dim] src ON (
src.workflow_status = tgt.workflow_status
) WHEN MATCHED AND src.is_modified = 1 THEN UPDATE SET
tgt.workflow_status = src.workflow_status,
tgt.workflow_status_description = src.workflow_status_description,
tgt.check_sum = src.check_sum,
tgt.modified_date = src.modified_date
WHEN MATCHED AND src.is_deleted = 1 THEN DELETE WHEN NOT MATCHED THEN INSERT (
workflow_status,
workflow_status_description,
check_sum,
created_date,
modified_date
) VALUES (
src.workflow_status,
src.workflow_status_description,
src.check_sum,
src.created_date,
src.modified_date
); DROP TABLE [delta].workflow_status_dim; END

  • If the primary keys match and is_modified is True, that means the record has been updated → UPDATE.
  • If the primary keys match and is_deleted is True, that means the record has been deleted → DELETE.
  • When no primary keys match, that means it’s a new record → INSERT.
  • Once the merge is done, we drop the table in the delta schema.

Orchestration

Last but not least is orchestrating the whole process. For this, we use Data Factory.

We use the following flow for each table in the data lake:

  1. Record the start date
  2. Run the Databricks notebook to load the changes to the SQL DB
  3. If there are changes, run the stored procedure to merge the changes in the SQL DB
  4. If steps 2 or 3 fail, roll back the tables in Databricks using RESTORE TABLE to make sure their states match the tables in the SQL DB (a rough sketch of steps 3 and 4 follows this list).
  5. Finally, run a stored procedure to save metadata for debugging & analytics purposes (errors, number of rows added/deleted/changed, etc.). We maintain this in another table in the SQL DB.
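As a rough sketch of steps 3 and 4, using the same JDBC connection approach shown earlier: in our pipeline these are separate Data Factory activities, so the cell below is an illustration, not our production code, and pStartDate stands in for the start date recorded in step 1.

# Hypothetical illustration of steps 3 and 4; in production these run as
# Data Factory activities rather than a single notebook cell.
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
con = driver_manager.getConnection(jdbcUrl, dbUser, dbPass)

try:
    # Step 3: run the stored procedure that merges the delta schema table
    # into the api schema table in the SQL database.
    exec_statement = con.prepareCall(f"EXEC [config].[delta_load_{pTableName}];")
    exec_statement.execute()
    exec_statement.close()
except Exception:
    # Step 4: on failure, roll back the Databricks managed table to its state
    # at the start of the run (pStartDate is the timestamp recorded in step 1)
    # so it stays in sync with the tables in the SQL DB.
    spark.sql(f"RESTORE TABLE {pTableName}_dlt TO TIMESTAMP AS OF '{pStartDate}'")
    raise
finally:
    con.close()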

Results

With our delta load implementation, we’ve seen a 70% decrease in data load times compared to our very first iteration, which loaded each table from the lake into our SQL database from scratch every time.

Before, our data pipelines took 7–10 hours to complete each day; now they take just 1–2 hours. This depends heavily on your compute capacity, but in our case (a Standard_DS3_v2 Databricks cluster with 2–8 workers and a GP_S_Gen5_12 SQL database), assuming a change of 10% or less in the datasets:

  • A table in the tens of thousands of rows just takes a minute to load the new data.
  • A table with a million rows takes about 10 minutes or less.
  • A table with roughly 80 million rows takes around an hour.

Let me know if you have comments, questions, or suggestions below, or connect with me at hello@kenzojrc.com or on my socials. If this article helped you in any way, I’d love to hear it!

Stay tuned for the next one.

Originally published at https://kenzojrc.com on July 18, 2023.
