Building Petabyte-Scale Analytics Solutions: A Comprehensive Architecture Comparison
When building analytics solutions that need to handle petabytes of data while serving multiple geographies at high concurrency levels, choosing the right architecture is crucial. This comparison explores different approaches for read-only analytics systems with high user interactivity.
The Challenge
We need to build a system that:
- Handles petabyte-scale analytical data
- Serves multiple geographies with low latency
- Supports high concurrency (thousands of simultaneous users)
- Provides read-only analytics with high user interactivity
- Integrates with a transactional RDBMS for raw data storage
- Delivers fast dashboard responses (sub-second for most queries)
Architecture Options Overview
1. PostgreSQL + Citus + Patroni
2. ClickHouse
3. Amazon Redshift
4. DuckDB (Multiple Geographical Instances)
5. Hybrid Architecture
Detailed Architecture Comparison
1. PostgreSQL + Citus + Patroni
Architecture Components
-- Citus distributed architecture
-- Coordinator node + Multiple worker nodes
-- Patroni for high availability clustering
-- Dedicated read replicas for analytical workloads
-- Distributed table setup
CREATE TABLE fact_orders (
    order_date DATE NOT NULL,
    customer_id BIGINT NOT NULL,
    product_id INTEGER NOT NULL,
    order_amount DECIMAL(10,2)
);
SELECT create_distributed_table('fact_orders', 'customer_id');

-- Small, frequently joined dimension tables are replicated to every worker
-- (dim_customers columns are assumed for this example)
CREATE TABLE dim_customers (
    customer_id BIGINT PRIMARY KEY,
    customer_region TEXT
);
SELECT create_reference_table('dim_customers');
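Scaling out later is mostly a matter of registering new workers and rebalancing shards. A minimal sketch, assuming Citus 10 or later, where these management functions are available (hostname and port are placeholders):

-- Register a new worker with the coordinator
SELECT citus_add_node('worker-3.internal', 5432);

-- Move shards onto the new worker to even out the load
SELECT rebalance_table_shards('fact_orders');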
Advantages ✅
- PostgreSQL Compatibility: Existing ecosystem and tooling
- High Availability: Patroni automates failover, keeping downtime minimal
- Horizontal Scaling: Citus provides distributed processing
- Mixed Workloads: Handles OLTP + OLAP simultaneously
- ACID Compliance: Full transactional guarantees
- Mature Ecosystem: Extensive documentation and community
Limitations ❌
- Complex Setup: Requires expertise in Citus and Patroni
- Higher Costs: More infrastructure and maintenance overhead
- Performance Trade-offs: Slower than specialized analytical databases
- Network Overhead: Distributed queries add latency
- Limited Compression: Not optimized for analytical workloads
Performance Characteristics
- Query Speed: Good for mixed workloads, slower for pure analytics
- Scalability: Excellent horizontal scaling capabilities
- Concurrency: Good with proper configuration
- Setup Complexity: High
2. ClickHouse
Architecture Features
-- ClickHouse columnar architecture with materialized views
CREATE TABLE fact_orders (
    order_date Date,
    customer_id UInt64,
    product_id UInt32,
    order_amount Decimal(10,2)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(order_date)
ORDER BY (order_date, customer_id);
-- Pre-aggregated materialized view for dashboards
-- (assumes dimension tables supplying customer_region and product_category;
--  AggregatingMergeTree stores -State aggregates, read back with -Merge,
--  as in the dashboard query later in this post)
CREATE MATERIALIZED VIEW dashboard_summary
ENGINE = AggregatingMergeTree()
ORDER BY (customer_region, product_category)
AS SELECT
    c.customer_region,
    p.product_category,
    uniqCombinedState(f.customer_id) AS unique_customers,
    sumState(f.order_amount) AS revenue
FROM fact_orders f
JOIN dim_customers c ON f.customer_id = c.customer_id
JOIN dim_products p ON f.product_id = p.product_id
GROUP BY c.customer_region, p.product_category;
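Because the table is partitioned by month and ordered by date, date-filtered queries touch only a fraction of the data. In recent ClickHouse versions you can confirm the pruning actually happens:

-- Show which partitions and index granules a query reads
EXPLAIN indexes = 1
SELECT sum(order_amount)
FROM fact_orders
WHERE order_date >= '2024-01-01';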
Advantages ✅
- Extreme Performance: Sub-second queries on petabyte datasets
- Massive Compression: 10:1 to 100:1 compression ratios
- Specialized Analytics: Built-in functions for analytics
- Horizontal Scaling: Excellent cluster support
- Cost Efficiency: Lower storage and compute costs
Limitations ❌
- Steep Learning Curve: Specialized knowledge required
- Limited Ecosystem: Fewer third-party tools
- SQL Compatibility: Not fully PostgreSQL compatible
- Complex Operations: Requires specialized DBA skills
- Real-time Updates: Row-level UPDATE/DELETE are heavyweight mutations; best suited to append-only ingestion
Performance Characteristics
- Query Speed: Excellent for analytical workloads
- Scalability: Outstanding for petabyte-scale data
- Concurrency: Good with proper cluster sizing
- Setup Complexity: Medium to High
3. Amazon Redshift
Architecture Features
-- Redshift MPP architecture
CREATE TABLE fact_orders (
    order_date DATE SORTKEY,
    customer_id BIGINT DISTKEY,
    product_id INTEGER,
    order_amount DECIMAL(10,2)
);
-- Result caching and workload management
-- Concurrency scaling for high demand periods
-- Integration with AWS ecosystem
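Result caching is on by default but can be toggled per session; concurrency scaling and workload management are configured through WLM queue settings rather than SQL. A small sketch of the session-level knob:

-- Serve repeated dashboard queries from Redshift's result cache
SET enable_result_cache_for_session TO on;

SELECT customer_id, SUM(order_amount) AS revenue
FROM fact_orders
GROUP BY customer_id
ORDER BY revenue DESC
LIMIT 100;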
Advantages ✅
- Managed Service: No infrastructure management
- Elastic Scaling: Easy to add/remove nodes
- AWS Integration: Seamless with other AWS services
- BI Tool Support: Native integration with popular tools
- Predictable Performance: Consistent query performance
Limitations ❌
- Vendor Lock-in: Tied to AWS ecosystem
- Higher Costs: Expensive for large-scale deployments
- Limited Customization: Less control over infrastructure
- Cold Start: Initial query performance can be slow
- Data Movement: Requires data to be in AWS
Performance Characteristics
- Query Speed: Good for analytical workloads
- Scalability: Excellent with elastic scaling
- Concurrency: Good with concurrency scaling
- Setup Complexity: Low (managed service)
4. DuckDB (Multiple Geographical Instances)
Architecture Features
# Multiple DuckDB instances across geographies:
# read-only analytical processing, data refreshed from the PostgreSQL master
import duckdb

class RegionalDuckDBManager:
    def __init__(self, region):
        self.region = region
        self.conn = duckdb.connect(f'duckdb_{region}.db')
        self.setup_optimizations()

    def setup_optimizations(self):
        self.conn.execute("PRAGMA memory_limit='8GB'")
        # One thread per connection: total concurrency comes from many connections
        self.conn.execute("PRAGMA threads=1")

    def refresh_data(self):
        # Rebuild local tables from a snapshot exported by the PostgreSQL master
        # (see the SQL sketch below)
        pass
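One way to implement refresh_data is to rebuild tables from Parquet snapshots exported by the PostgreSQL master; a minimal sketch in DuckDB SQL (the export path is hypothetical):

-- Atomically swap in the latest snapshot exported by the master
CREATE OR REPLACE TABLE fact_orders AS
SELECT * FROM read_parquet('exports/fact_orders/*.parquet');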
Advantages ✅
- Exceptional Performance: Often benchmarked at 10-100x faster than row-oriented databases for analytical scans
- Zero Setup Overhead: Simple deployment
- Columnar Processing: Optimized for analytical workloads
- Vectorized Execution: Modern CPU utilization
- Cost Efficiency: Minimal infrastructure costs
- Geographic Isolation: No cross-region latency
Limitations ❌
- Single-Node: Limited by machine resources
- Manual Scaling: Requires manual instance management
- Limited Ecosystem: Fewer enterprise tools
- Data Freshness: Requires refresh mechanisms
- No Built-in HA: Manual failover required
Performance Characteristics
- Query Speed: Excellent for analytical workloads
- Scalability: Good with horizontal instance scaling
- Concurrency: Excellent per instance, requires multiple instances
- Setup Complexity: Very Low
5. Hybrid Architecture
Architecture Features
-- Layered approach combining multiple technologies
-- Layer 1: Data Lake (S3) - Petabyte storage
-- Layer 2: ClickHouse - Pre-aggregated analytics
-- Layer 3: DuckDB - Ad-hoc analysis and dashboards
-- Layer 4: PostgreSQL + Citus - Transactional + mixed workloads
-- Layer 5: Redshift - Enterprise reporting
-- Example data flow:
--   PostgreSQL (OLTP) → ETL → S3 (Data Lake)
--                                 ↓
--                        ClickHouse (Aggregated)
--                                 ↓
--                  Regional DuckDB Instances (Dashboards)
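The Layer 1 to Layer 2 hop can be as simple as ClickHouse reading Parquet straight out of the data lake with its s3 table function; a sketch where the bucket and path are placeholders and the bucket is assumed readable without inline credentials:

-- Load a day's partition from the data lake into ClickHouse
INSERT INTO fact_orders
SELECT *
FROM s3('https://analytics-lake.s3.amazonaws.com/orders/2024-01-01/*.parquet', 'Parquet');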
Advantages ✅
- Best of All Worlds: Leverages strengths of each technology
- Flexible Scaling: Different layers can scale independently
- Optimized Costs: Right tool for right job
- Redundancy: Multiple backup options
- Future-Proof: Easy to evolve individual components
Limitations ❌
- High Complexity: Multiple systems to manage
- Data Synchronization: Complex ETL processes
- Higher Costs: Multiple technology stack
- Operational Overhead: More moving parts
- Expertise Required: Need skills in multiple technologies
Performance Characteristics
- Query Speed: Excellent across all layers
- Scalability: Outstanding with proper design
- Concurrency: Excellent with proper distribution
- Setup Complexity: High
Performance Comparison Matrix
More stars is better in every row (so five stars under Ease of Setup means the least setup effort).

| Aspect | PostgreSQL + Citus | ClickHouse | Redshift | DuckDB | Hybrid |
|---|---|---|---|---|---|
| Query Speed | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Scalability | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Ease of Setup | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ |
| Cost Efficiency | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| Ease of Maintenance | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ |
| HA & Reliability | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ |
| Ecosystem | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ |
Detailed Use Case Analysis
For Petabyte-Scale Enterprise BI:
Winner: ClickHouse
ClickHouse excels in this scenario due to:
- Massive compression reducing storage costs
- Sub-second queries on aggregated materialized views
- Horizontal scaling across clusters
- Specialized analytics functions for business intelligence
For Cost-Conscious Startups:
Winner: DuckDB (Multiple Instances)
DuckDB wins for budget-conscious deployments:
- Minimal infrastructure costs
- Exceptional performance for analytical queries
- Simple deployment and management
- Zero vendor lock-in
For Enterprise with AWS Investment:
Winner: Amazon Redshift
Redshift is ideal for AWS-native enterprises:
- Managed service reduces operational overhead
- Seamless AWS integration
- Enterprise-grade support
- Predictable performance
For Companies with PostgreSQL Expertise:
Winner: PostgreSQL + Citus + Patroni
Best for organizations already invested in PostgreSQL:
- Leverages existing expertise
- Maintains PostgreSQL compatibility
- Provides good scalability
- Supports mixed workloads
The Ultimate Winner: ClickHouse
Why ClickHouse is the Clear Winner
After weighing all the options for petabyte-scale analytics with high concurrency across multiple geographies, ClickHouse comes out on top, for the following reasons:
1. Unmatched Performance at Scale
-- ClickHouse can answer petabyte-scale queries in under a second
-- (assumes dimension tables supplying customer_region and product_category)
SELECT
    c.customer_region,
    p.product_category,
    uniqCombined(f.customer_id) AS unique_customers,
    sum(f.order_amount) AS revenue
FROM fact_orders f  -- 100+ billion rows
JOIN dim_customers c ON f.customer_id = c.customer_id
JOIN dim_products p ON f.product_id = p.product_id
WHERE f.order_date >= '2020-01-01'
GROUP BY c.customer_region, p.product_category
ORDER BY revenue DESC
LIMIT 1000;
-- Result: sub-second response times on petabyte datasets
2. Cost Efficiency at Petabyte Scale
- Extreme compression: 10-100x storage savings
- Efficient compute: Columnar processing reduces CPU usage
- Horizontal scaling: Add nodes only when needed
- Open source: No licensing costs
3. Perfect for Multi-Geography Deployments
-- ClickHouse cluster across multiple data centers
-- Each shard can be in different geography
-- Replication ensures high availability
-- Local queries for low latency
-- Cluster configuration example:
-- US-East Data Center: Shards 1-4
-- EU-West Data Center: Shards 5-8
-- AP-Southeast Data Center: Shards 9-12
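Cross-shard reads go through a Distributed table, which fans the query out to every shard and merges the results; a sketch assuming a cluster named analytics_cluster is defined in remote_servers:

-- Logical table that fans queries out across all shards
CREATE TABLE fact_orders_all AS fact_orders
ENGINE = Distributed(analytics_cluster, default, fact_orders, rand());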
4. Built for High Concurrency
- Thousands of concurrent queries supported
- Query prioritization and resource management (see the sketch after this list)
- Separate clusters for different workloads
- Automatic load balancing across shards
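Resource management is typically enforced with a settings profile per user class; a minimal sketch, assuming a dedicated dashboard_user account (the profile name and limits are illustrative):

-- Cap resources for dashboard traffic so ad-hoc queries cannot starve it
CREATE SETTINGS PROFILE dashboard_profile
SETTINGS max_threads = 8,
         max_memory_usage = 10000000000,  -- ~10 GB per query
         max_execution_time = 5           -- seconds
TO dashboard_user;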
5. Specialized for Analytics Workloads
-- Analytics-optimized functions
SELECT
    quantile(0.5)(order_amount) AS median_order,
    quantile(0.95)(order_amount) AS p95_order,
    uniqCombined(customer_id) AS unique_customers,
    topK(10)(product_id) AS top_products
FROM fact_orders
WHERE order_date = today();
6. Proven at Scale
- Yandex (original developer) handles petabytes daily
- Cloudflare uses ClickHouse for HTTP traffic analytics
- Uber runs its log analytics platform on ClickHouse
- Thousands of production deployments worldwide
Implementation Strategy for ClickHouse
1. Cluster Architecture
-- Recommended ClickHouse cluster setup:
-- 3 ZooKeeper (or ClickHouse Keeper) nodes for replication coordination
-- 12-24 Data nodes (depending on scale)
-- 3+ Geographies with local replicas
-- Separate clusters for different workloads:
-- - Real-time analytics cluster
-- - Batch processing cluster
-- - Dashboard cluster (pre-aggregated)
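Once remote_servers is configured, the topology each node knows about can be verified from SQL:

-- Inspect the shard/replica layout the server is configured with
SELECT cluster, shard_num, replica_num, host_name
FROM system.clusters;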
2. Data Modeling for Performance
-- Optimal table design for analytics:
CREATE TABLE fact_orders (
    order_date Date,
    order_timestamp DateTime,  -- toHour() needs a DateTime, so carry a timestamp
    customer_id UInt64,
    product_id UInt32,
    order_amount Decimal64(2),
    order_hour UInt8 MATERIALIZED toHour(order_timestamp)
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/fact_orders', '{replica}')
PARTITION BY toYYYYMM(order_date)
ORDER BY (order_date, customer_id)
SETTINGS index_granularity = 8192;
-- Materialized view for dashboard performance
-- (joins dim_customers to supply customer_region)
CREATE MATERIALIZED VIEW dashboard_hourly
ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/tables/{shard}/dashboard_hourly', '{replica}')
PARTITION BY toYYYYMM(order_date)
ORDER BY (order_date, order_hour, customer_region)
AS SELECT
    f.order_date,
    f.order_hour,
    c.customer_region,
    uniqCombinedState(f.customer_id) AS unique_customers_state,
    sumState(f.order_amount) AS revenue_state
FROM fact_orders f
JOIN dim_customers c ON f.customer_id = c.customer_id
GROUP BY f.order_date, f.order_hour, c.customer_region;
3. Geographic Distribution Strategy
-- Multi-region deployment:
-- Each geography has:
-- - Local ClickHouse cluster
-- - Regional data replication
-- - Edge caching layer (Redis)
-- - Load balancer for high availability
-- Data synchronization between regions:
-- Primary region: Real-time data ingestion
-- Secondary regions: Asynchronous replication
-- Cross-region queries: Federated queries when needed
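The occasional cross-region query can use the remote table function instead of permanently federating the clusters (the hostname below is a placeholder):

-- Ad-hoc federated read against the EU cluster from another region
SELECT count() FROM remote('eu-west-ch1:9000', default, fact_orders);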
4. Performance Optimization
-- Query optimization techniques:
-- 1. Pre-aggregated materialized views for dashboards
-- 2. Proper partitioning for date-based queries
-- 3. Column compression for different data types
-- 4. Distributed table engines for cross-shard queries
-- 5. Query caching for repeated dashboard queries
-- Example optimized dashboard query:
SELECT
    order_date,
    order_hour,
    customer_region,
    uniqCombinedMerge(unique_customers_state) AS unique_customers,
    sumMerge(revenue_state) AS revenue
FROM dashboard_hourly
WHERE order_date >= today() - 7
GROUP BY order_date, order_hour, customer_region
ORDER BY revenue DESC
LIMIT 1000;
-- Response time: < 200ms for most dashboard queries
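For repeated dashboard queries, recent ClickHouse releases also ship a server-side query cache that can be enabled per query:

-- Serve identical repeated queries from the query cache
SELECT customer_region, sumMerge(revenue_state) AS revenue
FROM dashboard_hourly
WHERE order_date >= today() - 7
GROUP BY customer_region
SETTINGS use_query_cache = 1;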
Conclusion
For building petabyte-scale analytics solutions with high concurrency across multiple geographies, ClickHouse is the clear winner due to:
Primary Advantages:
- Unmatched performance on petabyte-scale analytical queries
- Cost efficiency through extreme compression and efficient scaling
- Perfect fit for multi-geography deployments
- High concurrency support with proper cluster design
- Analytics-optimized with specialized functions and data structures
- Proven track record at massive scale in production
Implementation Recommendation:
- Start with ClickHouse cluster across your target geographies
- Implement materialized views for dashboard performance
- Use proper partitioning and data modeling
- Deploy edge caching (Redis) for frequently accessed data
- Implement proper monitoring and auto-scaling
- Plan for data synchronization from your PostgreSQL OLTP system
When to Consider Alternatives:
- DuckDB: For development, prototyping, or cost-constrained startups
- PostgreSQL + Citus: For organizations heavily invested in PostgreSQL ecosystem
- Redshift: For AWS-native enterprises requiring managed services
- Hybrid: For complex requirements needing multiple specialized tools
ClickHouse provides the optimal balance of performance, scalability, cost efficiency, and proven reliability for petabyte-scale analytics at high concurrency levels across multiple geographies, making it the definitive choice for modern analytical platforms.