As a global software-as-a-service (SaaS) company specializing in intuitive, AI-powered business solutions designed to enhance customer and employee experiences, Freshworks depends on real-time data to power decision-making and deliver better experiences to its 75,000+ customers. With millions of daily events across products, timely data processing is crucial. To meet this need, Freshworks has built a near-real-time ingestion pipeline on Databricks, capable of managing diverse schemas across products and handling millions of events per minute with a 30-minute SLA, while ensuring tenant-level data isolation in a multi-tenant setup.
Achieving this requires a powerful, flexible, and optimized data pipeline, which is exactly what we set out to build.
Legacy Architecture and the Case for Change
Freshworks’ legacy pipeline was built around Python consumers: each user action triggered events sent in real time from products to Kafka, and the Python consumers transformed and routed these events to new Kafka topics. A Rails batching system then converted the transformed data into CSV files stored in AWS S3, and Apache Airflow jobs loaded these batches into the data warehouse. After ingestion, intermediate files were deleted to manage storage. This architecture was well-suited for early growth but soon hit limits as event volume surged.
Rapid growth exposed core challenges:
- Scalability: The pipeline struggled to handle millions of messages per minute, especially during spikes, and required frequent manual scaling.
- Operational Complexity: The multi-stage flow made schema changes and maintenance risky and time-consuming, often resulting in mismatches and failures.
- Cost Inefficiency: Storage and compute expenses grew quickly, driven by redundant processing and lack of optimization.
- Responsiveness: The legacy setup couldn’t meet demands for real-time ingestion or fast, reliable analytics as Freshworks scaled. Prolonged ingestion delays impaired data freshness and impacted customer insights.
As scale and complexity increased, the fragility and overhead of the old system made clear the need for a unified, scalable, and autonomous data architecture to support business growth and analytics needs.
New Architecture: Real-Time Data Processing with Apache Spark and Delta Lake
The solution: a foundational redesign centered on Spark Structured Streaming and Delta Lake, purpose-built for near-real-time processing, scalable transformations, and operational simplicity.
We designed a single, streamlined architecture where Spark Structured Streaming directly consumes from Kafka, transforms data, and writes it into Delta Lake—all in one job, running entirely within Databricks.
This shift has reduced data movement, simplified maintenance and troubleshooting, and accelerated time-to-insight.
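A minimal sketch of this single-job pattern: one Structured Streaming query reads from Kafka, parses events, and merges each micro-batch into a Delta table. The broker, topic, schema, table name, and checkpoint path below are illustrative assumptions, not Freshworks’ actual configuration.

```python
from pyspark.sql import SparkSession, functions as F, types as T
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Assumed minimal event schema for illustration.
event_schema = T.StructType([
    T.StructField("event_id", T.StringType()),
    T.StructField("account_id", T.StringType()),
    T.StructField("payload", T.StringType()),
])

# Consume raw product events directly from Kafka (illustrative broker and topic).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "product_events")
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

def upsert_batch(batch_df, batch_id):
    # Transformation and flattening steps are elided; merge each micro-batch
    # into an existing Delta table (illustrative name and merge key).
    (DeltaTable.forName(spark, "analytics.events").alias("t")
        .merge(batch_df.alias("s"), "t.event_id = s.event_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

# One job does it all: consume, transform, and write to Delta Lake.
(events.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "s3://example-bucket/checkpoints/events")
    .trigger(processingTime="1 minute")
    .start())
```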
The key components of the new architecture:
The Streaming Component: Spark Structured Streaming
Each incoming event from Kafka passes through a carefully orchestrated series of transformation steps in Spark Structured Streaming, optimized for accuracy, scale, and cost-efficiency:
- Efficient Deduplication: Events, identified by UUIDs, are checked against a Delta table of previously processed UUIDs to filter duplicates across streaming batches (see the deduplication sketch after this list).
- Data Validation: Schema and business rules filter malformed records, ensure required fields are present, and handle nulls.
- Custom Transformations with JSON-e: The JSON-e engine supports conditionals, loops, and Python UDFs, enabling product teams to define dynamic, reusable transformation logic tailored to each product.
- Flattening to Tabular Form: Transformed JSON events are flattened into thousands of structured tables. A separate internal schema management tool (managing 20,000+ tables and 5M+ columns) lets product teams manage schema changes and automatically promote them to production; the changes are registered in Delta Lake and picked up seamlessly by Spark streaming.
- Flattened Data Deduplication: A hash of the stored columns is compared against the last 4 hours of processed data in Redis, preventing duplicate ingestion and reducing compute costs.
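A simplified sketch of the two deduplication layers described above, with illustrative table, column, and Redis connection names; the real pipeline’s bookkeeping is more involved (for example, recording UUIDs transactionally with the downstream write). `spark` is the active SparkSession, as on Databricks.

```python
import hashlib
import redis  # assumes the redis-py client is installed on the cluster

HASH_COLS = ["event_id", "account_id", "payload"]  # illustrative "stored columns"

def dedup_against_delta(batch_df):
    """Layer 1: drop events whose UUID already appears in a processed-UUIDs Delta table."""
    processed = spark.read.table("analytics.processed_event_uuids")  # illustrative table
    fresh = batch_df.join(processed, on="event_id", how="left_anti")
    # Record this batch's UUIDs so later batches can filter against them (simplified).
    (fresh.select("event_id")
        .write.format("delta").mode("append")
        .saveAsTable("analytics.processed_event_uuids"))
    return fresh

def dedup_against_redis(rows, ttl_seconds=4 * 3600):
    """Layer 2: drop rows whose column hash was seen in Redis within the last 4 hours."""
    client = redis.Redis(host="redis-host", port=6379)  # illustrative connection
    for row in rows:
        digest = hashlib.sha256(
            "|".join(str(row[c]) for c in HASH_COLS).encode()
        ).hexdigest()
        # SET with NX + TTL: only the first occurrence of a hash within the window survives.
        if client.set(digest, 1, nx=True, ex=ttl_seconds):
            yield row

def deduplicate(batch_df):
    fresh = dedup_against_delta(batch_df)
    return spark.createDataFrame(
        fresh.rdd.mapPartitions(dedup_against_redis), schema=fresh.schema
    )
```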
The Storage Component: Lakehouse
Once transformed, the data is written directly to Delta Lake tables using several powerful optimizations:
- Parallel Writes with Multiprocessing: A single Spark job typically writes to ~250 Delta tables, each with its own transformation logic. The merges are executed in parallel using Python multiprocessing, maximizing cluster utilization and reducing latency (a sketch of this pattern follows the list).
- Efficient Updates with Deletion Vectors: Up to 35% of records per batch are updates or deletes. Instead of rewriting large files, we leverage Deletion Vectors to perform soft deletes. This improves update performance by 3x, making real-time updates practical even at terabyte scale.
- Accelerated Merges with Disk Caching: Disk caching keeps frequently accessed (hot) data on fast local storage. By caching only the columns needed for merges, we achieve up to 4x faster merge operations while reducing I/O and compute costs. Today, 95% of merge reads are served directly from the cache.
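A simplified sketch of the parallel-write step. The pipeline described above uses Python multiprocessing; this version uses a thread pool, which is a common way to issue concurrent Delta merges from a single driver. Table names, the merge key, and the worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from delta.tables import DeltaTable

def merge_into(table_name, source_df, merge_key="id"):
    # Deletion vectors let MERGE mark changed rows as removed instead of rewriting
    # whole files; in practice this property is set once per table, not per batch.
    spark.sql(f"ALTER TABLE {table_name} "
              "SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')")
    (DeltaTable.forName(spark, table_name).alias("t")
        .merge(source_df.alias("s"), f"t.{merge_key} = s.{merge_key}")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

def write_all(per_table_frames, max_workers=16):
    """per_table_frames: {table_name: transformed DataFrame} for one micro-batch.
    Spark actions can be submitted concurrently from the driver, so merges into
    ~250 tables run in parallel instead of sequentially."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(merge_into, name, df)
                   for name, df in per_table_frames.items()]
        for future in futures:
            future.result()  # surface any merge failure instead of swallowing it
```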
Autoscaling & Adapting in Real Time
Autoscaling is built into the pipeline so the system scales up or down dynamically, handling volume spikes cost-efficiently without impacting performance.
Autoscaling is driven by batch lag and execution time, monitored in real time. Resizing is triggered via Databricks APIs from Spark’s StreamingQueryListener (in the onQueryProgress callback after each batch), ensuring in-flight processing isn’t disrupted. This keeps the system responsive, resilient, and efficient without manual intervention.
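The listener hook can be sketched as follows, assuming the Spark 3.4+ Python StreamingQueryListener API and the Databricks Clusters resize REST endpoint. The thresholds, worker counts, and environment variables are illustrative assumptions, not the actual scaling policy.

```python
import os
import requests
from pyspark.sql.streaming import StreamingQueryListener

class AutoscaleListener(StreamingQueryListener):
    """Watches per-batch execution time and resizes the cluster when lag builds up."""

    TARGET_BATCH_SECONDS = 60  # illustrative target

    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        # triggerExecution covers the full batch; scale up when batches run long,
        # scale back down when the cluster is clearly underutilized.
        seconds = (event.progress.durationMs or {}).get("triggerExecution", 0) / 1000
        if seconds > self.TARGET_BATCH_SECONDS:
            self._resize(num_workers=20)   # illustrative scale-up size
        elif seconds < self.TARGET_BATCH_SECONDS * 0.3:
            self._resize(num_workers=8)    # illustrative scale-down size

    def onQueryTerminated(self, event):
        pass

    def _resize(self, num_workers):
        # Databricks Clusters API; host, token, and cluster id come from the environment here.
        requests.post(
            f"{os.environ['DATABRICKS_HOST']}/api/2.0/clusters/resize",
            headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
            json={"cluster_id": os.environ["CLUSTER_ID"], "num_workers": num_workers},
            timeout=30,
        )

spark.streams.addListener(AutoscaleListener())
```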
Built-In Resilience: Handling Failures Gracefully
To maintain data integrity and availability, the architecture includes robust fault tolerance:
- Events that fail transformation are retried via Kafka with backoff logic.
- Permanently failed records are stored in a Delta table for offline review and reprocessing, ensuring no data is lost.
- This design guarantees data integrity without human intervention, even during peak loads or schema changes, and allows failed data to be republished later (a sketch of this dead-letter pattern follows).
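A minimal sketch of the dead-letter and replay flow, assuming illustrative table, broker, and topic names; `spark` is the active SparkSession.

```python
from pyspark.sql import functions as F

def quarantine(failed_df, error_message, batch_id):
    """Persist permanently failed records with error context for offline review."""
    (failed_df
        .withColumn("error_message", F.lit(error_message))
        .withColumn("failed_batch_id", F.lit(batch_id))
        .withColumn("failed_at", F.current_timestamp())
        .write.format("delta").mode("append")
        .saveAsTable("analytics.failed_events"))          # illustrative table

def republish(topic="product_events_retry"):
    """Replay quarantined payloads back to Kafka once the underlying issue is fixed."""
    (spark.read.table("analytics.failed_events")
        .select(F.col("payload").cast("string").alias("value"))
        .write.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # illustrative broker
        .option("topic", topic)
        .save())
```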
Observability and Monitoring at Every Step
A powerful monitoring stack built with Prometheus, Grafana, and Elasticsearch, integrated with Databricks, gives us end-to-end visibility:
- Metrics Collection: Every batch in Databricks logs key metrics such as input record count, transformed record count, and error rates. These are pushed to Prometheus, with real-time alerts to the support team (a sketch of the metrics push appears below).
- Event Tracking: Event statuses are logged in Elasticsearch, enabling fine-grained debugging and allowing both product (producer) and analytics (consumer) teams to trace issues.
Transformation and batch execution metrics are tracked to monitor pipeline health, identify issues, and trigger alerts for quick investigation.
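A minimal sketch of the per-batch metrics push, assuming a Prometheus Pushgateway is reachable from the job; the gateway URL, metric names, and job label are illustrative, and the actual stack may collect metrics differently.

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def publish_batch_metrics(input_count, transformed_count, error_count):
    """Push key per-batch pipeline metrics so Grafana dashboards and alerts can use them."""
    registry = CollectorRegistry()
    Gauge("pipeline_input_records", "Records read from Kafka in this batch",
          registry=registry).set(input_count)
    Gauge("pipeline_transformed_records", "Records successfully transformed",
          registry=registry).set(transformed_count)
    Gauge("pipeline_error_records", "Records that failed validation or transformation",
          registry=registry).set(error_count)
    # Illustrative Pushgateway address and job name.
    push_to_gateway("pushgateway:9091", job="ingestion_pipeline", registry=registry)
```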
From Complexity to Confidence
Perhaps the most transformative shift has been in simplicity.
What once involved five systems and countless integration points is now a single, observable, autoscaling pipeline running entirely within Databricks. We’ve eliminated brittle dependencies, streamlined operations, and enabled teams to work faster and with greater autonomy. Essentially, fewer moving parts means fewer surprises and more confidence.
By reimagining the data stack around streaming and Delta Lake, we’ve built a system that not only meets today’s scale but is ready for tomorrow’s growth.
Why Databricks?
As we reimagined the data architecture, we evaluated several technologies, including Amazon EMR with Spark, Apache Flink, and Databricks. After rigorous benchmarking, Databricks emerged as the clear choice, offering a unique blend of performance, simplicity, and ecosystem alignment that met the evolving needs of Freshworks.
A Unified Ecosystem for Data Processing
Rather than stitching together multiple tools, Databricks offers an end-to-end platform that spans job orchestration, data governance, and CI/CD integration, reducing complexity and accelerating development.
- Unity Catalog acts as the single source of truth for data governance. With granular access control, lineage tracking, and centralized schema management, it ensures:
  - our team can secure all data assets and keep data access well-organized for each tenant, preserving strict access boundaries, and
  - compliance with regulatory needs, with all events and actions audited in audit tables alongside information on who has access to which assets.
- Databricks Jobs provide built-in orchestration, replacing reliance on external orchestrators like Airflow. Native scheduling and pipeline execution reduced operational friction and improved reliability.
- CI/CD and REST APIs helped Freshworks’ teams automate everything, from job creation and cluster scaling to schema updates. This automation has accelerated releases, improved consistency, and minimized manual errors, allowing us to experiment fast and learn fast.
Optimized Spark Platform
- Key capabilities such as automated resource allocation, a unified batch and streaming architecture, executor fault recovery, and dynamic scaling to process millions of records allowed us to maintain consistent throughput, even during traffic spikes or infrastructure hiccups.
High-Performance Caching
- Databricks disk caching proved to be a key factor in meeting the required data latency, as most merges were served from hot data held in the disk cache.
- Its ability to automatically detect changes in underlying data files and keep the cache updated ensured that batch processing intervals consistently met the required SLA (a minimal configuration sketch follows).
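As a rough illustration, disk caching is controlled by a Spark configuration flag on Databricks (it is enabled by default on many node types), and pruning the merge read to the few columns involved keeps the hot working set small enough to stay cached. The table and column names below are assumptions.

```python
# Enable Databricks disk caching for the session.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Read only the columns needed for the merge so repeated merge reads are served
# from the local cache instead of object storage (illustrative names).
hot = spark.read.table("analytics.events").select("event_id", "row_hash", "updated_at")
```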
Delta Lake: Foundation for Real-Time and Reliable Ingestion
Delta Lake plays a critical role in the pipeline, enabling low-latency, ACID-compliant, high-integrity data processing at scale.
Delta Lake Feature | SaaS Pipeline Benefit
---|---
ACID Transactions | Freshworks performs high-frequency streaming writes from multiple sources, with concurrent writes on the same data. Delta Lake’s ACID compliance ensures data consistency across reads and writes.
Schema Evolution | Product schemas evolve quickly as the products grow. Delta Lake’s schema evolution adapts to changing requirements; changes are applied seamlessly to Delta tables and automatically picked up by Spark streaming applications.
Time Travel | With millions of transactions and audit requirements, the ability to go back to a snapshot of data in Delta Lake supports auditing and point-in-time rollback.
Scalable Change Handling & Deletion Vectors | Delta Lake enables efficient insert/update/delete operations through transaction logs without rewriting large data files. This proved crucial in reducing ingestion latencies from hours to minutes in our pipelines.
Open Format | As a multi-tenant SaaS system, Freshworks benefits from the open Delta format’s broad compatibility with analytics tools on top of the Lakehouse, supporting multi-tenant read operations.
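Two of the features above can be exercised with a few lines of PySpark; the timestamp, table name, and the `new_events` DataFrame are illustrative.

```python
# Time travel: read the table as of a previous point in time for audits or rollback.
snapshot = (spark.read.format("delta")
            .option("timestampAsOf", "2024-01-01 00:00:00")  # illustrative timestamp
            .table("analytics.events"))

# Schema evolution: let new columns added by product teams flow into the table on write.
(new_events.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("analytics.events"))
```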
So, by combining Spark’s speed, Delta Lake’s reliability, and Databricks’ integrated platform, we built a scalable, robust, cost-effective, and future-ready foundation for Freshworks’ real-time analytics.
What We Learned: Key Insights
No transformation is without its challenges. Along the way, we encountered a few surprises that taught us valuable lessons:
1. State Store Overhead: High Memory Footprint and Stability Issues
Using Spark’s dropDuplicatesWithinWatermark caused high memory use and instability, especially during autoscaling, and led to increased S3 list costs due to many small files.
Fix: Switching to Delta-based caching for deduplication drastically improved memory efficiency and stability, reduced S3 list costs and the memory footprint, and cut both the time and cost of deduplication.
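For reference, a sketch of the state-store approach that was replaced, assuming an `event_time` watermark column and Spark 3.5+ where `dropDuplicatesWithinWatermark` is available; the Delta-based alternative is the anti-join shown in the earlier deduplication sketch.

```python
# State-store dedup: duplicate-tracking state lives in Spark's state store, which
# grew memory pressure and produced many small state files in our case.
deduped = (events
    .withWatermark("event_time", "4 hours")           # assumed event-time column/window
    .dropDuplicatesWithinWatermark(["event_id"]))

# Replacement: keep processed UUIDs in a Delta table and left_anti join against it
# in each micro-batch (see the deduplication sketch earlier in this post).
```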
2. Liquid Clustering: Common Challenges
Our queries have one primary predicate and several secondary predicates. Clustering on multiple columns spread the data sparsely with respect to the primary predicate column, increasing data scans and reducing query performance.
Fix: Clustering on a single primary column led to better file organization and significantly faster queries by optimizing data scans.
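A sketch of the fix using Databricks liquid clustering DDL; the table and clustering column are illustrative.

```python
# Cluster on the single primary predicate column instead of several columns.
spark.sql("ALTER TABLE analytics.events CLUSTER BY (account_id)")

# OPTIMIZE incrementally reclusters newly written data according to the clustering keys.
spark.sql("OPTIMIZE analytics.events")
```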
3. Garbage Collection (GC) Issues: Job Restarts Needed
Long-running jobs (7+ days) started experiencing performance slowness and more frequent garbage collection cycles.
Fix: We had to introduce weekly job restarts to mitigate prolonged GC cycles and performance degradation.
4. Data Skew: Handling Kafka Topic Imbalance
Data skew was observed as different Kafka topics had disproportionately varying data volumes. This led to uneven data distribution across processing nodes, causing skewed task workloads and non-uniform resource utilization.
Fix: Repartitioning before transformations evened out the data distribution, balancing the processing load and improving throughput (a short sketch follows).
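The fix amounts to a repartition ahead of the transformation stage; the parallelism choice below is an illustrative default.

```python
# Rebalance records across tasks so a few high-volume Kafka topics don't
# overload a handful of executors while others sit idle.
balanced = events.repartition(spark.sparkContext.defaultParallelism)
```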
5. Conditional Merge: Optimizing Merge Performance
Even when only a few columns were needed, merge operations loaded all columns from the target table, leading to high merge times and I/O costs.
Fix: We implemented an anti-join before the merge and early discard of late-arriving or irrelevant records, significantly speeding up merges by preventing unnecessary data from being loaded (see the sketch below).
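A sketch of the conditional-merge pattern, assuming the target tables store a hash of the merged columns; `row_hash`, the table name, and the key are illustrative names.

```python
from delta.tables import DeltaTable

def conditional_merge(batch_df, table_name="analytics.tickets", key="id"):
    """Discard rows that would not change the target before merging,
    so the MERGE touches far less data."""
    target_keys = spark.read.table(table_name).select(key, "row_hash")

    # Anti-join on (key, hash): rows whose exact pair already exists are unchanged
    # and can be dropped; genuine inserts and updates survive.
    changed = batch_df.join(target_keys, on=[key, "row_hash"], how="left_anti")

    (DeltaTable.forName(spark, table_name).alias("t")
        .merge(changed.alias("s"), f"t.{key} = s.{key}")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
```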
Conclusion
By using Databricks and Delta Lake, Freshworks has redefined its data architecture—moving from fragmented, manual workflows to a modern, unified, real-time platform.
The impact?
- 4x improvement in data sync time during traffic surges
- ~25% cost savings from scalable, cost-efficient operations with zero downtime
- 50% reduction in maintenance effort
- High availability and SLA-compliant performance—even during peak loads
- Improved customer experience via real-time insights
This transformation empowers every Freshworks customer, from IT to Support, to make faster, data-driven decisions without worrying about how the data volumes behind their business needs are processed and served.