Process mining fundamentals, PQL, platform architecture, data engineering, and interview Q&A.
The raw input. A structured dataset with three minimum fields: Case ID (unique business object identifier), Activity (what happened), and Timestamp (when it happened). Optionally enriched with attributes like user, cost, location.
The most foundational technique. Takes an event log and automatically generates a process model (Petri net, BPMN, DFG) showing all actual execution paths, frequencies, and durations β without any prior model definition.
Compares the actual process (from event log) against an expected/normative process model. Identifies deviations β missing activities, wrong order, unauthorized paths β and quantifies fitness and precision metrics.
Takes an existing process model and enriches it with real performance data from the event log β adding timing, bottleneck analysis, resource utilization, and KPI overlays to improve or extend the model.
A unique sequence of activities observed for a subset of cases. E.g., "PO β GR β Invoice β Payment" vs. "PO β Invoice β GR β Payment" are two distinct variants. High variant count = high process complexity.
The most common process model representation. Nodes = activities, edges = "activity A was directly followed by B." Edge weights show frequency and performance metrics. Foundation of Celonis's process visualization.
| Dimension | Traditional Case-Centric | Object-Centric (OCPM) |
|---|---|---|
| Event Log Structure | One case ID per event (forced flattening) | Events linked to multiple objects simultaneously |
| Business Object | Single object type (e.g., only "Order") | Multiple types (Order + Invoice + Delivery in parallel) |
| Key Problem | Convergence/divergence artifacts ("spaghetti") | Naturally handles many-to-many relationships |
| Use Case Fit | Simple, linear, single-object processes | O2C, P2P, complex supply chains with parallel flows |
| AI Accuracy | Biased by flattened/duplicated events | Higher-accuracy ML on true object relationships |
Returns the position/order of an activity within a case's execution path. Used to filter cases where an activity occurred before/after another.
Calculates time between activities (or first/last activity) for each case. Supports TOTAL, AVG, MIN, MAX aggregations. Critical for cycle time analysis.
Returns cases where the actual execution path exactly matches a specified sequence. Used for conformance KPIs and "happy path" rate calculations.
Counts the number of times an activity repeats within a case. Identifies rework loops β a major source of cost inefficiency in O2C, P2P, and ITSM processes.
Computes cumulative totals over ordered case events. Useful for aging analyses, open item tracking, and working capital calculations.
KU = "Knowledge Unique" β returns the distinct lookup value for a case. KM = "Knowledge Multiple" β returns comma-separated values when a case has multiple. Handle many-to-one and many-to-many relationships.
| Component | Description |
|---|---|
| Continuous Extractor | On-premise middleware that monitors SAP tables for changes via Change Data Capture (CDC). Extracts only delta records β not full table dumps. Outputs CSV files for transport. |
| RFC Module | Remote Function Call module installed in SAP. Extracts data from SAP ABAP tables via background jobs. Works without direct DB access. |
| Key SAP Tables | EKKO/EKPO (PO headers/items), EKBE (PO history/GR events), RBKP/RSEG (invoices), BKPF/BSEG (accounting docs), AUFK (orders) |
| Event Construction | SAP doesn't natively store "process events." Celonis constructs them by joining change document tables (CDHDR/CDPOS) with master data. Each document status change becomes an event. |
| Transport Options | Direct DB connection Β· SAP Message Server Β· PI/PO mediation Β· Cloud connector. Enterprise setups use dedicated SAP Basis team coordination. |
| Non-SAP Sources | 100+ pre-built connectors: Oracle EBS, Salesforce, ServiceNow, Workday, SAP S/4HANA, SAP ECC. Custom connectors via REST API and JDBC drivers. |
Given raw transactional tables (no explicit event log), reconstruct Case ID / Activity / Timestamp triplets. Requires SQL joins, UNION ALL across multiple event sources, and timestamp normalization.
Celonis data models use an Activity Table (events) as the spine, linked via foreign keys to Case Table (master data) and Activity Attribute Tables. Star-schema style, optimized for SaolaDB columnar queries.
Advanced SQL skills critical: window functions (LAG/LEAD, ROW_NUMBER, RANK), self-joins for activity sequencing, DATEDIFF for cycle times, CTEs for multi-step event reconstruction.
Real-world event logs have missing timestamps, duplicate events, out-of-order records, and NULL case IDs. Interview may test knowledge of how to detect, filter, or impute these issues.
SaolaDB is in-memory columnar β avoid row-by-row operations. Pushdown filtering to extraction layer, minimize cross-joins, use aggregate tables for large datasets. Columnar storage = fast aggregations on individual fields.
Celonis supports both: batch extraction (periodic full/delta loads) and real-time continuous extraction with near-real-time process monitoring. Architecture choice depends on SLA requirements and source system load tolerance.
LLM-powered next-best-action recommendations delivered directly in Slack and Microsoft Teams. Grounded in Process Intelligence Graph data β not hallucinations. Each recommendation is traceable to a specific process event.
Model Context Protocol server that exposes Celonis Process Intelligence Graph to any AI agent (Claude, GPT, etc.). Enables AI agents to answer questions grounded in real operational data β e.g., "Why is PO-12345 late?"
No-code GenAI tool for business experts to encode domain rules into recommendation models. Combines human domain knowledge with ML β addressing the "last mile" problem of AI adoption without data scientists.
Native ML functions integrated into PQL β anomaly detection, clustering, predictive scoring. Run directly in the query engine without requiring external ML pipelines. E.g., predict payment delay risk per invoice.
LLM-powered translation of plain English questions into valid PQL queries (Beta). E.g., "Show me all cases with late GR in the last 30 days" β generates PQL automatically. Democratizes process analysis.
AI agents without process context produce generic, often wrong outputs. The Process Intelligence Graph provides the enterprise operational context layer that makes AI actionable β grounding LLMs in real business state.
Order creation β Credit check β Delivery β Goods issue β Invoice β Payment receipt. Key KPIs: Order fulfillment time, on-time delivery rate, invoice accuracy, DSO (Days Sales Outstanding).
PR β PO β GR β Invoice β Payment. Key KPIs: PO cycle time, early payment discount capture, 3-way match rate, duplicate invoice rate, DPO (Days Payable Outstanding).
Incident β Triage β Assign β Resolve β Close. Key KPIs: MTTR, SLA breach rate, reopened tickets, escalation rate, and first contact resolution.
Invoice processing, dispute management, dunning, cash application. Key KPIs: Days past due, collection effectiveness index, automated match rate, working capital impact.
WITH events AS (SELECT case_id, vendor_id, activity, event_time FROM activities)CASE WHEN activity = 'PO Created' THEN event_time END with MIN() and MAX() over case partitions. Then: AVG(DATEDIFF(day, po_time, payment_time)) grouped by vendor. In Celonis PQL this would be AVG(CALC_THROUGHPUT(CASE_START, CASE_END, DAYS)) with a dimension filter on vendor.
KU() and KM() in PQL?
KU() (Knowledge Unique): Used when each case maps to exactly ONE value (1:1 or many-to-1). E.g., each PO has one vendor β KU("EKKO"."LIFNR"). Returns a single scalar per case.KM() (Knowledge Multiple): Used when a case can have multiple values (1:many or many:many). E.g., one PO with multiple line items across different cost centers β KM("EKPO"."KOSTL"). Returns a comma-separated string. If you use KU() on a many-valued column, it will pick an arbitrary value and silently lose data β a common bug.