One of the core innovations in Conduit is our federated query engine. In this post, we'll explore what federated queries are, how they work, and why they're essential for modern industrial data architectures.
The Traditional Approach: Centralization
Historically, when organizations wanted to analyze data from multiple systems, they followed a predictable pattern:
- Extract data from source systems
- Transform it into a common format
- Load it into a central data warehouse or lake
This ETL (Extract, Transform, Load) approach has been the standard for decades. But in industrial environments, it creates significant challenges:
- Latency: Batch ETL means your "current" data is always hours or days old
- Cost: Moving and storing petabytes of time-series data is expensive
- Governance: Duplicated data creates compliance and security concerns
- Maintenance: ETL pipelines are fragile and require constant attention
Enter Federated Queries
Federated query execution flips this model on its head. Instead of moving data to where the query runs, we move the query to where the data lives.
How It Works
When you submit a query to Conduit, here's what happens:
1. Parse the query and identify required data sources
2. Generate optimized sub-queries for each source system
3. Execute sub-queries in parallel against source systems
4. Stream results back to the federation layer
5. Merge, correlate, and return unified results
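As a rough illustration of step 1, the sketch below pulls source-qualified table names out of a query with a regular expression. It is a toy, not Conduit's parser: a real planner would work from a parsed syntax tree, and the `source.table` naming convention and the function name `identify_sources` are assumptions made for this example.

```python
import re
from collections import defaultdict

# Matches "FROM source.table" or "JOIN source.table" (assumed naming convention).
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_]\w*)\.([A-Za-z_]\w*)", re.IGNORECASE)

def identify_sources(sql: str) -> dict[str, list[str]]:
    """Group the tables referenced in a query by the source system that owns them."""
    sources: dict[str, list[str]] = defaultdict(list)
    for source, table in TABLE_REF.findall(sql):
        if table not in sources[source]:
            sources[source].append(table)
    return dict(sources)

query = """
SELECT p.timestamp, p.temperature, a.alarm_type
FROM pi.temperatures p
JOIN ignition.alarms a ON p.asset_id = a.asset_id
"""
plan = identify_sources(query)
print(plan)
```

Each key in the resulting dictionary corresponds to one source system that would receive its own sub-query in step 2.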
Let's walk through a concrete example.
Example: Cross-System Correlation
Suppose you want to correlate temperature readings from your PI historian with alarm events from Ignition:
SELECT
p.timestamp,
p.temperature,
a.alarm_type,
a.severity
FROM pi.temperatures p
JOIN ignition.alarms a
ON p.asset_id = a.asset_id
AND p.timestamp BETWEEN a.start_time AND a.end_time
WHERE p.timestamp > NOW() - INTERVAL '24 hours'
Conduit breaks this into two parallel operations:
Sub-query 1 (PI):
SELECT timestamp, temperature, asset_id
FROM temperatures
WHERE timestamp > NOW() - INTERVAL '24 hours'
Sub-query 2 (Ignition):
SELECT alarm_type, severity, asset_id, start_time, end_time
FROM alarms
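WHERE end_time > NOW() - INTERVAL '24 hours'
These execute simultaneously. The alarm filter is on end_time rather than start_time so that an alarm that began before the 24-hour window but was still active inside it is not dropped. Results stream back to Conduit, where the join operation correlates records by asset and time window.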
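The correlation step can be sketched in a few lines of Python. This is a minimal in-memory version under stated assumptions, not Conduit's join implementation: the row shapes, the sample values, and the function name `correlate` are hypothetical, and a production engine would stream and index rather than hold both result sets in lists.

```python
from collections import defaultdict
from datetime import datetime

def correlate(readings, alarms):
    """Join readings to alarms on asset_id where the reading falls inside the alarm window."""
    # Index alarms by asset so each reading is only compared against its own asset's alarms.
    alarms_by_asset = defaultdict(list)
    for alarm in alarms:
        alarms_by_asset[alarm["asset_id"]].append(alarm)

    joined = []
    for r in readings:
        for a in alarms_by_asset.get(r["asset_id"], []):
            # Same inclusive bounds as SQL BETWEEN.
            if a["start_time"] <= r["timestamp"] <= a["end_time"]:
                joined.append({
                    "timestamp": r["timestamp"],
                    "temperature": r["temperature"],
                    "alarm_type": a["alarm_type"],
                    "severity": a["severity"],
                })
    return joined

# Made-up sample rows standing in for the two sub-query results.
readings = [
    {"asset_id": "R1", "timestamp": datetime(2024, 1, 1, 10, 0), "temperature": 181.5},
    {"asset_id": "R1", "timestamp": datetime(2024, 1, 1, 12, 0), "temperature": 150.2},
    {"asset_id": "R2", "timestamp": datetime(2024, 1, 1, 10, 0), "temperature": 95.0},
]
alarms = [
    {"asset_id": "R1", "alarm_type": "HIGH_TEMP", "severity": 3,
     "start_time": datetime(2024, 1, 1, 9, 30), "end_time": datetime(2024, 1, 1, 10, 30)},
]
result = correlate(readings, alarms)
print(result)
```

Only the 10:00 reading for asset R1 falls inside an alarm window, so it is the only row that survives the join.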
Query Optimization
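Naive federation would be slow: pulling whole tables across the network and filtering them centrally wastes both bandwidth and time. The key to performance is intelligent query planning: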
Predicate Pushdown
Filter conditions are pushed to source systems, reducing data transfer:
Original: SELECT * FROM pi.temps WHERE value > 100
Pushed: PI applies the "value > 100" filter locally and returns only the matching rows
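One way to picture the planner's decision is as a split between filters that touch a single source and filters that span several. The sketch below assumes the predicates have already been extracted and tagged with the sources they reference; the `Predicate` class and `split_predicates` function are hypothetical names for illustration, not part of Conduit's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Predicate:
    sources: frozenset  # source systems whose columns the predicate references
    sql: str            # predicate text as written in the query

def split_predicates(predicates):
    """Separate single-source filters (pushable) from cross-source ones (kept for the merge step)."""
    pushed = {}
    residual = []
    for p in predicates:
        if len(p.sources) == 1:
            (source,) = p.sources
            pushed.setdefault(source, []).append(p.sql)
        else:
            residual.append(p.sql)
    return pushed, residual

predicates = [
    Predicate(frozenset({"pi"}), "value > 100"),
    Predicate(frozenset({"pi"}), "timestamp > NOW() - INTERVAL '24 hours'"),
    Predicate(frozenset({"pi", "ignition"}), "p.asset_id = a.asset_id"),
]
pushed, residual = split_predicates(predicates)
for source, filters in pushed.items():
    print(source, "WHERE", " AND ".join(filters))
print("evaluated in the federation layer:", residual)
```

The join condition references both systems, so it cannot be pushed to either one and stays with the federation layer.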
Projection Pruning
Only requested columns are retrieved:
Original: SELECT temperature FROM pi.readings
Pruned: PI returns only the temperature column, not all 50 columns
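Pruning has to keep more than the SELECT list: columns used in join conditions and filters must also come back from the source. The sketch below collects them for the cross-system correlation query from earlier in this post. The `(alias, column)` representation and the function name `required_columns` are assumptions for illustration.

```python
def required_columns(select_cols, join_cols, filter_cols):
    """Return, per table alias, the columns the query actually needs, sorted by name."""
    needed = {}
    for alias, column in [*select_cols, *join_cols, *filter_cols]:
        needed.setdefault(alias, set()).add(column)
    return {alias: sorted(cols) for alias, cols in needed.items()}

# Column references from the correlation query, as (alias, column) pairs.
select_cols = [("p", "timestamp"), ("p", "temperature"), ("a", "alarm_type"), ("a", "severity")]
join_cols = [("p", "asset_id"), ("a", "asset_id"), ("p", "timestamp"),
             ("a", "start_time"), ("a", "end_time")]
filter_cols = [("p", "timestamp")]

pruned = required_columns(select_cols, join_cols, filter_cols)
print(pruned)
```

The result matches the column lists of the two sub-queries shown in that example: three columns from PI and five from Ignition.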
Join Reordering
Joins are executed in an order chosen to minimize intermediate result sizes, typically by combining the most selective inputs first.
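To make the idea concrete, here is a small cost-based sketch that scores every left-deep join order by the sum of its estimated intermediate result sizes and picks the cheapest. It is one textbook-style approach, not a description of Conduit's optimizer. The row counts and selectivities are invented, and `mes.batches` is a hypothetical third table added so that the order matters.

```python
from itertools import permutations

def plan_cost(order, rows, selectivity):
    """Sum of estimated intermediate result sizes for a left-deep join in the given order."""
    joined = [order[0]]
    size = rows[order[0]]
    total = 0.0
    for table in order[1:]:
        size *= rows[table]
        # Apply the selectivity of every join predicate linking the new table to those already joined.
        for prev in joined:
            size *= selectivity.get(frozenset({prev, table}), 1.0)
        joined.append(table)
        total += size
    return total

def best_join_order(rows, selectivity):
    """Exhaustively pick the cheapest order; only practical for a handful of tables."""
    return min(permutations(rows), key=lambda order: plan_cost(order, rows, selectivity))

# Invented estimates: rows per input and the fraction of row pairs each join predicate keeps.
rows = {"pi.temperatures": 1_000_000, "ignition.alarms": 100, "mes.batches": 1_000}
selectivity = {
    frozenset({"pi.temperatures", "ignition.alarms"}): 0.0001,
    frozenset({"ignition.alarms", "mes.batches"}): 0.001,
}
order = best_join_order(rows, selectivity)
print(" JOIN ".join(order))
```

With these numbers the two small inputs are joined first and the large time-series input last, which keeps the first intermediate result to roughly a hundred rows instead of thousands or more.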
Parallel Execution
Independent sub-queries execute in parallel across source systems.
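The effect is easy to demonstrate with Python's standard thread pool. In this sketch `run_subquery` is a hypothetical stand-in that only sleeps to imitate a network round trip, and the latencies are made up; it does not call any real PI or Ignition client.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_subquery(source, delay_s):
    """Stand-in for a round trip to a source system (hypothetical; it just sleeps)."""
    time.sleep(delay_s)
    return f"{source}: rows"

subqueries = {"pi": 0.2, "ignition": 0.3}  # made-up latencies in seconds

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(subqueries)) as pool:
    futures = {source: pool.submit(run_subquery, source, delay)
               for source, delay in subqueries.items()}
    results = {source: f.result() for source, f in futures.items()}
elapsed = time.perf_counter() - start

# Wall-clock time tracks the slowest source, not the sum of all sources.
print(results, f"{elapsed:.2f}s")
```

Because the sub-queries overlap, total wait time is governed by the slowest source system rather than by the number of systems queried.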
Handling Heterogeneous Data
Industrial systems store data differently:
- Historians use time-series models (timestamp, tag, value)
- SCADA uses event-driven models (state changes)
- MES uses relational models (orders, batches, products)
Conduit's semantic layer maps these different models to a unified schema. When you query "temperature for asset X", Conduit knows:
- In PI, this is tag T-101.PV
- In Ignition, this is Tags/Building1/Reactor1/Temperature
- In the SQL database, this is sensors.temperature WHERE asset_id = 'X'
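At its simplest, such a mapping is a lookup table from a logical reference to each system's native address. The sketch below is an assumption about shape only: `SEMANTIC_MAP` and `resolve` are hypothetical names, and a real semantic layer would also handle units, data types, and hierarchies.

```python
# Hypothetical mapping from a logical (asset, measurement) pair to each system's native address.
SEMANTIC_MAP = {
    ("X", "temperature"): {
        "pi": "T-101.PV",
        "ignition": "Tags/Building1/Reactor1/Temperature",
        "sql": "sensors.temperature WHERE asset_id = 'X'",
    },
}

def resolve(asset, measurement, source):
    """Translate a logical reference into the identifier a given source system understands."""
    try:
        return SEMANTIC_MAP[(asset, measurement)][source]
    except KeyError:
        raise LookupError(
            f"no mapping for {measurement!r} on asset {asset!r} in {source!r}"
        ) from None

print(resolve("X", "temperature", "pi"))
print(resolve("X", "temperature", "ignition"))
```

Keeping the mapping in one place means a query author only ever writes the logical name, and a missing mapping fails loudly instead of silently returning nothing.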
Performance Characteristics
Federated queries have different performance characteristics than centralized queries:
| Aspect | Centralized | Federated |
|--------|-------------|-----------|
| Query latency | Lower (local data) | Higher (network hops) |
| Data freshness | Batch delayed | Real-time |
| Storage cost | High (copies) | Low (no copies) |
| Governance | Complex | Simple |
For most operational queries, the added latency from network hops is a fair trade for real-time data and a simpler architecture.
When to Use Federated Queries
Federated queries excel for:
- Operational dashboards requiring real-time data
- Ad-hoc analysis across multiple systems
- Compliance queries where data residency matters
- Integration without ETL pipelines
They're less suitable for:
- Heavy analytics requiring repeated scans of historical data
- Machine learning training on large datasets
For these use cases, consider using Conduit to populate a purpose-built analytics store.
Conclusion
Federated query execution is a paradigm shift in industrial data architecture. By moving queries to data instead of data to queries, organizations can get real-time insights without the cost and complexity of centralized data lakes.
Want to see federated queries in action? Request a demo and we'll show you cross-system correlation on your own data.
