Exploring Data Modelling with ChatGPT: Part 2 Model Population
Part 2: Model Population with ChatGPT/Microsoft Fabric

Welcome back to Part 2 of our data modelling journey! In Part 1: Manual Experiments with ChatGPT, we delved into the foundations of data modelling, exploring its significance and how it is impacted by the sudden rise of AI. With ChatGPT's help, we took a hands-on approach, working through manual tasks to understand the subject more comprehensively.
In Part 2, we're ready to take our data modelling skills to the next level (this is data engineering, after all). Our focus shifts to implementing the physical data model, bringing our concepts and theories to life. We'll navigate the intricacies of data integration, transformation, and loading, and we'll start to uncover the power of automation along the way.
Throughout this phase, we'll bridge the gap between theory and practice: transforming data from our source, making meaningful changes, and leveraging a degree of automation to streamline our processes.
Whether you’re an experienced data professional or new to the field, this journey promises exciting insights and practical knowledge that you can apply to your projects. So, let’s dive in and embark on this transformative phase of our data modelling adventure together!
Getting started:
Catching Up on the Path to the Logical Data Model
It’s important to read this article to understand how we got to the dimensional data model that forms the foundation of this experiment.

I would have preferred to conduct the entire experiment on Kaggle since our dataset originates from there. However, Kaggle doesn’t offer the storage capability to materialise our dimensional model. Fortunately, Microsoft has recently launched Microsoft Fabric, a new platform we can explore in this experiment. Taking advantage of the (Preview) trial is a bonus as I enjoy using freebies.
Start the Fabric (Preview) trial
Go to https://learn.microsoft.com/ and follow the steps there to start your Fabric (Preview) trial.
As a word of caution, I would advise against using trial software with your corporate data. It’s important to prioritise the security and integrity of your data, and relying on trial versions may not provide the level of reliability and support required for such sensitive information.
Get the data required.
We will reuse the dataset from our previous experiment. If you still have a copy of it, that's great! If not, no worries; we can download it again from this Kaggle Open Dataset.
For this experiment, we will be working with notebooks. So, from your Fabric home screen, open a new notebook. You may see a pop-up suggesting an upgrade to your trial; don't worry, it's still free, so go with the flow and proceed.

After the upgrade, it’s time to create a brand new Lakehouse. I decided to give mine a creative name: LKH_HRCaseStudy. This Lakehouse will serve as our centralised data repository, housing the data obtained from this Kaggle Open Dataset.

Let's navigate to your new lakehouse and upload the Kaggle Open Dataset file. Once the file is successfully uploaded, proceed to create a new notebook. I named mine “NBK_HRCaseStudy” (very creative, I know!).

To confirm the uploaded file is accessible, run the code snippet below in your notebook.
df = spark.read.format("csv").option("header","true").load("Files/HR Case Study.csv")
# df now is a Spark DataFrame containing CSV data from "Files/HR Case Study.csv".
display(df)
Upon running it, you should see the data returned from your DataFrame displayed below.
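Before we bring ChatGPT into the picture, a quick optional check (my own addition, not strictly required) is worth running: because we only set the header option, Spark reads every column as a string, which is why the generated code later casts Age, Annual Salary, and the other numeric columns explicitly.
# Optional sanity check: confirm the inferred schema and row count
df.printSchema()
print(f"Rows loaded: {df.count()}")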

Sophisticated prompt engineering.
Now that we have successfully loaded our data into a DataFrame, it's time to unleash the magic of ChatGPT. I present to you the highly sophisticated prompt below.

Results of sophisticated prompt engineering.

from pyspark.sql import functions as F
# Helper to generate a hash-based surrogate key from one or more columns
def generate_hash(col_values):
    return F.abs(F.hash(F.concat_ws("_", *col_values)))
# Set default values
yesterday = F.date_sub(F.current_date(), 1)
today = F.current_date()
# Create the Employee Dimension DataFrame
dim_employee = df.select(
    generate_hash([F.col("Name"), F.col("Gender"), F.col("Age")]).alias("EmployeeID"),
    F.col("Name").alias("Name"),
    F.col("Gender").alias("Gender"),
    F.col("Age").cast("integer").alias("Age"),
    F.col("Tenure in Company").cast("decimal(5,2)").alias("TenureInCompany"),
    yesterday.alias("ValidFrom"),
    today.alias("ValidTo"),
    F.lit(True).alias("CurrentFlag")
)
# Create the Department Dimension DataFrame
dim_department = df.select(
    generate_hash([F.col("Department")]).alias("DepartmentID"),
    F.col("Department").alias("Department"),
    yesterday.alias("ValidFrom"),
    today.alias("ValidTo"),
    F.lit(True).alias("CurrentFlag")
)
# Create the Salary Dimension DataFrame
dim_salary = df.select(
    generate_hash([F.col("Annual Salary")]).alias("SalaryID"),
    F.col("Annual Salary").cast("decimal(10,2)").alias("AnnualSalary"),
    yesterday.alias("ValidFrom"),
    today.alias("ValidTo"),
    F.lit(True).alias("CurrentFlag")
)
# Create the Location Dimension DataFrame
dim_location = df.select(
    generate_hash([F.col("Location")]).alias("LocationID"),
    F.col("Location").alias("Location"),
    yesterday.alias("ValidFrom"),
    today.alias("ValidTo"),
    F.lit(True).alias("CurrentFlag")
)
# Create the Rating Dimension DataFrame
dim_rating = df.select(
    generate_hash([F.col("Rating")]).alias("RatingID"),
    F.col("Rating").alias("Rating"),
    yesterday.alias("ValidFrom"),
    today.alias("ValidTo"),
    F.lit(True).alias("CurrentFlag"),
    F.current_date().alias("EffectiveDate")
)
# Create the Date Dimension DataFrame
dim_date = df.select(
    F.monotonically_increasing_id().alias("DateKey"),
    F.to_date(F.lit("1900-01-01")).alias("Date"),
    F.dayofmonth(F.lit("1900-01-01")).alias("Day"),
    F.month(F.lit("1900-01-01")).alias("Month"),
    F.year(F.lit("1900-01-01")).alias("Year"),
    F.lit(False).alias("IsBankHoliday")
)
# Create the Employee Fact DataFrame
fact_employee = df.select(
    generate_hash([F.col("Name"), F.col("Department"), F.col("Annual Salary"), F.col("Location"), F.col("Rating")]).alias("EmployeeID"),
    generate_hash([F.col("Department")]).alias("DepartmentID"),
    generate_hash([F.col("Annual Salary")]).alias("SalaryID"),
    generate_hash([F.col("Location")]).alias("LocationID"),
    generate_hash([F.col("Rating")]).alias("RatingID"),
    F.col("Distance to Office").cast("decimal(10,2)").alias("DistanceToOffice"),
    yesterday.alias("ValidFrom"),
    today.alias("ValidTo")
)
# Display the created DataFrames
print("Employee Dimension:")
dim_employee.show(truncate=False)
print("Department Dimension:")
dim_department.show(truncate=False)
print("Salary Dimension:")
dim_salary.show(truncate=False)
print("Location Dimension:")
dim_location.show(truncate=False)
print("Rating Dimension:")
dim_rating.show(truncate=False)
print("Date Dimension:")
dim_date.show(truncate=False)
print("Employee Fact:")
fact_employee.show(truncate=False)
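One caveat worth flagging before we persist anything (my own observation, not part of the ChatGPT output): each dimension above is built with a plain select over the source DataFrame, so it holds one row per source row, duplicates included. If you want a single row per business key, a minimal fix is to deduplicate before writing, for example:
# Optional: keep one row per surrogate key before the tables are written out
dim_department = dim_department.dropDuplicates(["DepartmentID"])
dim_location = dim_location.dropDuplicates(["LocationID"])
dim_salary = dim_salary.dropDuplicates(["SalaryID"])
dim_rating = dim_rating.dropDuplicates(["RatingID"])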
Create managed tables
Managed tables are tables for which both the schema metadata and the data files are managed by Fabric.
#create("Employee Dimension:")
dim_employee.write.format("delta").saveAsTable("dim_employee")
#create("Department Dimension:")
dim_department.write.format("delta").saveAsTable("dim_department")
#create("Salary Dimension:")
dim_salary.write.format("delta").saveAsTable("dim_salary")
#create("Location Dimension:")
dim_location.write.format("delta").saveAsTable("dim_location")
#create("Rating Dimension:")
dim_rating.write.format("delta").saveAsTable("dim_rating")
#create("Date Dimension:")
dim_date.write.format("delta").saveAsTable("dim_date")
#create("Employee Fact:")
fact_employee.write.format("delta").saveAsTable("fact_employee")
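One practical note: by default, saveAsTable raises an error if the table already exists, so re-running the notebook requires either dropping the tables first or setting an explicit write mode. A minimal sketch of the latter (adjust to your own needs):
# Overwrite mode replaces the existing table contents on each run
fact_employee.write.format("delta").mode("overwrite").saveAsTable("fact_employee")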

Testing Managed tables
Once your managed tables are created, you can run a simple SELECT statement to confirm that the tables have been populated properly.
%%sql
SELECT * FROM LKH_HRCaseStudy.fact_employee LIMIT 1000
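Beyond a plain SELECT, a quick referential check confirms that the hash keys in the fact table actually line up with the dimensions. A rough sketch (assuming the lakehouse and table names used above), run from a PySpark cell:
# Count fact rows whose DepartmentID has no matching row in dim_department
orphans = spark.sql("""
    SELECT COUNT(*) AS unmatched_rows
    FROM LKH_HRCaseStudy.fact_employee f
    LEFT ANTI JOIN LKH_HRCaseStudy.dim_department d
        ON f.DepartmentID = d.DepartmentID
""")
display(orphans)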
Conclusion
This data modelling journey has been a remarkable experience so far, with ChatGPT proving to be a valuable resource. It has pleasantly surprised me with its ability to delve into the foundations of data modelling and help me comprehend the subject more effectively. Additionally, I have found Microsoft Fabric to be an incredibly user-friendly platform, despite its relatively new status. Its ease of use has made exploring and implementing the physical data model seamless.
I plan to embark on a side quest to delve deeper into Microsoft Fabric, its capabilities, and some of the new data engineering terminology and concepts introduced in this phase, such as notebooks and the Lakehouse. This exploration promises to enhance my understanding and expand my data modelling skills.
Furthermore, in my next experiment (Exploring Data Modelling with ChatGPT Part 3: Heading to follow), I will explore analytics and reporting from my data model using the new lakehouse I have created.
Overall, I am thrilled with the findings from ChatGPT and pleasantly surprised by the usability of Microsoft Fabric.