
Exploring Data Modelling with ChatGPT: Part 2 Model Population

Tshepiso Mogoswane · Published in GoPenAI · 6 min read · Jun 13, 2023


Part 2: Model Population with ChatGPT/Microsoft Fabric
Pop-up suggesting an upgrade to your trial
Create a brand new Lakehouse
df = spark.read.format("csv").option("header","true").load("Files/HR Case Study.csv")
# df now is a Spark DataFrame containing CSV data from "Files/HR Case Study.csv".
display(df)
Result of the DataFrame visualised
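Because the CSV reader is used without schema inference, every column arrives as a string. A quick sanity check before modelling (a small sketch, using the same df loaded above):

# Confirm the column names and the (string) types the reader produced
df.printSchema()

# Peek at a handful of rows
df.limit(5).show(truncate=False)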
Sophisticated prompt engineering
Results of sophisticated prompt engineering
from pyspark.sql import functions as F

# Function to generate hash numbers
def generate_hash(col_values):
    return F.abs(F.hash(F.concat_ws("_", *col_values)))

# Set default values
yesterday = F.date_sub(F.current_date(), 1)
today = F.current_date()

# Create the Employee Dimension DataFrame
dim_employee = df.select(
    generate_hash([F.col("Name"), F.col("Gender"), F.col("Age")]).alias("EmployeeID"),
    F.col("Name").alias("Name"),
    F.col("Gender").alias("Gender"),
    F.col("Age").cast("integer").alias("Age"),
    F.col("Tenure in Company").cast("decimal(5,2)").alias("TenureInCompany"),
    yesterday.alias("ValidFrom"),
    today.alias("ValidTo"),
    F.lit(True).alias("CurrentFlag")
)

# Create the Department Dimension DataFrame
dim_department = df.select(
    generate_hash([F.col("Department")]).alias("DepartmentID"),
    F.col("Department").alias("Department"),
    yesterday.alias("ValidFrom"),
    today.alias("ValidTo"),
    F.lit(True).alias("CurrentFlag")
)

# Create the Salary Dimension DataFrame
dim_salary = df.select(
    generate_hash([F.col("Annual Salary")]).alias("SalaryID"),
    F.col("Annual Salary").cast("decimal(10,2)").alias("AnnualSalary"),
    yesterday.alias("ValidFrom"),
    today.alias("ValidTo"),
    F.lit(True).alias("CurrentFlag")
)

# Create the Location Dimension DataFrame
dim_location = df.select(
    generate_hash([F.col("Location")]).alias("LocationID"),
    F.col("Location").alias("Location"),
    yesterday.alias("ValidFrom"),
    today.alias("ValidTo"),
    F.lit(True).alias("CurrentFlag")
)

# Create the Rating Dimension DataFrame
dim_rating = df.select(
    generate_hash([F.col("Rating")]).alias("RatingID"),
    F.col("Rating").alias("Rating"),
    yesterday.alias("ValidFrom"),
    today.alias("ValidTo"),
    F.lit(True).alias("CurrentFlag"),
    F.current_date().alias("EffectiveDate")
)

# Create the Date Dimension DataFrame
dim_date = df.select(
    F.monotonically_increasing_id().alias("DateKey"),
    F.to_date(F.lit("1900-01-01")).alias("Date"),
    F.dayofmonth(F.lit("1900-01-01")).alias("Day"),
    F.month(F.lit("1900-01-01")).alias("Month"),
    F.year(F.lit("1900-01-01")).alias("Year"),
    F.lit(False).alias("IsBankHoliday")
)

# Create the Employee Fact DataFrame
fact_employee = df.select(
    generate_hash([F.col("Name"), F.col("Department"), F.col("Annual Salary"), F.col("Location"), F.col("Rating")]).alias("EmployeeID"),
    generate_hash([F.col("Department")]).alias("DepartmentID"),
    generate_hash([F.col("Annual Salary")]).alias("SalaryID"),
    generate_hash([F.col("Location")]).alias("LocationID"),
    generate_hash([F.col("Rating")]).alias("RatingID"),
    F.col("Distance to Office").cast("decimal(10,2)").alias("DistanceToOffice"),
    yesterday.alias("ValidFrom"),
    today.alias("ValidTo")
)

# Display the created DataFrames
print("Employee Dimension:")
dim_employee.show(truncate=False)

print("Department Dimension:")
dim_department.show(truncate=False)

print("Salary Dimension:")
dim_salary.show(truncate=False)

print("Location Dimension:")
dim_location.show(truncate=False)

print("Rating Dimension:")
dim_rating.show(truncate=False)

print("Date Dimension:")
dim_date.show(truncate=False)

print("Employee Fact:")
fact_employee.show(truncate=False)
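The generate_hash helper builds each surrogate key by joining the selected column values with an underscore and taking the absolute value of Spark's hash of that string. A minimal standalone illustration, using the same helper but made-up row values:

# Hypothetical single-row DataFrame showing what one surrogate key looks like
sample = spark.createDataFrame([("Alice", "F", "30")], ["Name", "Gender", "Age"])
sample.select(
    generate_hash([F.col("Name"), F.col("Gender"), F.col("Age")]).alias("EmployeeID")
).show()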
#create("Employee Dimension:")
dim_employee.write.format("delta").saveAsTable("dim_employee")

#create("Department Dimension:")
dim_department.write.format("delta").saveAsTable("dim_department")

#create("Salary Dimension:")
dim_salary.write.format("delta").saveAsTable("dim_salary")

#create("Location Dimension:")
dim_location.write.format("delta").saveAsTable("dim_location")

#create("Rating Dimension:")
dim_rating.write.format("delta").saveAsTable("dim_rating")

#create("Date Dimension:")
dim_date.write.format("delta").saveAsTable("dim_date")

#create("Employee Fact:")
fact_employee.write.format("delta").saveAsTable("fact_employee")
Managed tables
%%sql
SELECT * FROM LKH_HRCaseStudy.fact_employee LIMIT 1000
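The same check can be run from a PySpark cell instead of a SQL one (a small sketch, assuming the Lakehouse is attached as LKH_HRCaseStudy, as in the SQL cell above):

# Query the managed fact table back through the Spark SQL API
spark.sql("SELECT * FROM LKH_HRCaseStudy.fact_employee LIMIT 10").show(truncate=False)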
