Fraud Analytics ETL Pipeline – PaySim
End-to-end PySpark + Databricks pipeline with dashboards to uncover fraud patterns in financial transactions.
📖 Overview
The PaySim dataset simulates mobile money transactions based on real financial data. This project demonstrates how to turn raw transactions into actionable fraud insights using PySpark and Databricks, with curated dashboards for decision-makers.
🎯 Problem
- Detect fraudulent transactions
- Identify high-risk customers & accounts
- Visualize fraud trends for proactive monitoring
🛠️ Solution
- Bronze: Ingest the raw PaySim CSV into a Delta table
- Silver: Clean and standardize records, then add fraud flags
- Gold: Aggregate fraud rate, customer lifetime value (CLV), and fraud hotspots
- Dashboards: Fraud by Type, Trends, Risk Profiling, Hotspots, CLV vs Exposure
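The layering above can be sketched in plain Python to show what each stage is responsible for. This is a minimal illustration, not the project's PySpark code: the function names (`bronze`, `silver`, `gold_fraud_rate_by_type`) are hypothetical, the sample rows are made up, and only the column names (`step`, `type`, `amount`, `nameOrig`, `nameDest`, `isFraud`) come from the PaySim schema. The cleaning rule (drop rows with a missing amount) is an assumption chosen for the example.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up sample that reuses PaySim's column names; not real data.
RAW_CSV = """step,type,amount,nameOrig,nameDest,isFraud
1,TRANSFER,181.0,C1,C9,1
1,PAYMENT,9839.64,C2,M1,0
2,TRANSFER,500.0,C3,C8,0
2,CASH_OUT,181.0,C4,C9,1
3,CASH_OUT,,C5,C7,0
3,CASH_OUT,250.0,C6,C7,0
3,CASH_OUT,90.0,C7,C8,0
"""

def bronze(csv_text):
    """Ingest: keep every row as raw strings, with no cleaning yet."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def silver(rows):
    """Clean: drop rows with a missing amount, cast types, add a boolean flag."""
    cleaned = []
    for r in rows:
        if not r["amount"]:
            continue  # assumed rule: a transaction without an amount is unusable
        cleaned.append({
            "step": int(r["step"]),
            "type": r["type"].strip().upper(),
            "amount": float(r["amount"]),
            "nameOrig": r["nameOrig"],
            "nameDest": r["nameDest"],
            "is_fraud": r["isFraud"] == "1",
        })
    return cleaned

def gold_fraud_rate_by_type(rows):
    """Aggregate: fraction of transactions flagged as fraud, per transaction type."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for r in rows:
        totals[r["type"]] += 1
        frauds[r["type"]] += r["is_fraud"]  # True counts as 1
    return {t: frauds[t] / totals[t] for t in totals}

rates = gold_fraud_rate_by_type(silver(bronze(RAW_CSV)))
print(rates)
```

On Databricks the same three steps would typically be DataFrame reads, column transformations, and a `groupBy(...).agg(...)`, with each layer written to its own Delta table.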
🔍 Key Insights
- Transfers show the highest fraud rate, a signal for tighter controls on that transaction type.
- High-net-value customers have greater fraud exposure.
- Top accounts concentrate a large share of fraud losses.
- Detectable spikes appear in time series (anomaly windows).
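One way to quantify the concentration claim is the share of total fraud losses that falls on the top k accounts. The sketch below is illustrative only: `top_k_loss_share` is a hypothetical helper, the sample rows are made up, and grouping by the destination account (`nameDest`) is an assumption, since the same calculation works for originating accounts.

```python
from collections import defaultdict

def top_k_loss_share(transactions, k):
    """Share of total fraud amount held by the k accounts with the largest fraud totals."""
    loss_by_account = defaultdict(float)
    for t in transactions:
        if t["is_fraud"]:
            loss_by_account[t["nameDest"]] += t["amount"]
    total = sum(loss_by_account.values())
    if total == 0:
        return 0.0  # no fraud in the input, so there is nothing to concentrate
    top = sorted(loss_by_account.values(), reverse=True)[:k]
    return sum(top) / total

# Made-up sample rows for illustration.
sample = [
    {"nameDest": "C9", "amount": 600.0, "is_fraud": True},
    {"nameDest": "C9", "amount": 200.0, "is_fraud": True},
    {"nameDest": "C8", "amount": 150.0, "is_fraud": True},
    {"nameDest": "C7", "amount": 50.0, "is_fraud": True},
    {"nameDest": "C6", "amount": 900.0, "is_fraud": False},
]

share = top_k_loss_share(sample, k=1)
print(f"Top-1 account share of fraud losses: {share:.0%}")
```

A share close to 1 for a small k means a handful of accounts explain most losses, which is the case for account-level monitoring.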

Fraud by Transaction Type — risk concentrated in Transfers.

Fraud Trend — daily counts with 7-day moving averages.
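The trend series can be sketched as follows. PaySim records time in `step`, where one step is one hour, so the example assumes steps start at 1 and maps them to zero-based days with `(step - 1) // 24`. The helper names are hypothetical, and the choice to average over fewer days at the start of the series (instead of leaving those days empty) is an assumption made for the example.

```python
from collections import Counter

HOURS_PER_DAY = 24  # PaySim: one step is one hour

def daily_fraud_counts(transactions, n_days):
    """Count fraudulent transactions per zero-based day, filling missing days with 0."""
    counts = Counter(
        (t["step"] - 1) // HOURS_PER_DAY
        for t in transactions
        if t["is_fraud"]
    )
    return [counts.get(day, 0) for day in range(n_days)]

def trailing_moving_average(values, window=7):
    """Trailing mean over the last `window` values; early points use the days available."""
    averages = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        averages.append(running / min(i + 1, window))
    return averages

# Made-up sample rows for illustration.
sample = [
    {"step": 1, "is_fraud": True},
    {"step": 24, "is_fraud": True},
    {"step": 25, "is_fraud": True},
    {"step": 30, "is_fraud": False},
]
counts = daily_fraud_counts(sample, n_days=3)
smoothed = trailing_moving_average(counts, window=7)
print(counts, smoothed)
```

In PySpark the equivalent smoothing is usually a window function ordered by day with a frame covering the current row and the six preceding rows.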

Customer Risk Profiling — CLV segments and risk levels.

Fraud Hotspots — origination/destination accounts most impacted.

CLV vs Fraud Exposure — bubble chart of value vs risk.
🔗 Code
🚀 Work With PyLumeAI
This project shows how we deliver end-to-end fraud analytics — from raw data to dashboards.