    From Lag to Agility: Reinventing Freshworks’ Data Ingestion Architecture

    By big tee tech hub · September 24, 2025 · 10 Mins Read


    As a global software-as-a-service (SaaS) company specializing in intuitive, AI-powered business solutions designed to enhance customer and employee experiences, Freshworks depends on real-time data to power decision-making and deliver better experiences to its 75,000+ customers. With millions of daily events across products, timely data processing is crucial. To meet this need, Freshworks has built a near-real-time ingestion pipeline on Databricks, capable of managing diverse schemas across products and handling millions of events per minute with a 30-minute SLA, while ensuring tenant-level data isolation in a multi-tenant setup.

    Achieving this requires a powerful, flexible, and optimized data pipeline, which is exactly what we set out to build.

    Legacy Architecture and the Case for Change

    Freshworks’ legacy pipeline was built around Python consumers: each user action triggered events sent in real time from products to Kafka, and the Python consumers transformed and routed these events to new Kafka topics. A Rails batching system then converted the transformed data into CSV files stored in AWS S3, and Apache Airflow jobs loaded these batches into the data warehouse. After ingestion, intermediate files were deleted to manage storage. This architecture was well suited for early growth but soon hit limits as event volume surged.

    Rapid growth exposed core challenges:

    • Scalability: The pipeline struggled to handle millions of messages per minute, especially during spikes, and required frequent manual scaling.
    • Operational Complexity: The multi-stage flow made schema changes and maintenance risky and time-consuming, often resulting in mismatches and failures.
    • Cost Inefficiency: Storage and compute expenses grew quickly, driven by redundant processing and lack of optimization.
    • Responsiveness: The legacy setup couldn’t meet demands for real-time ingestion or fast, reliable analytics as Freshworks scaled. Prolonged ingestion delays impaired data freshness and impacted customer insights.

    As scale and complexity increased, the fragility and overhead of the old system made clear the need for a unified, scalable, and autonomous data architecture to support business growth and analytics needs.

    New Architecture: Real-Time Data Processing with Apache Spark and Delta Lake

    The solution: a foundational redesign centred on Spark Structured Streaming and Delta Lake, purpose-built for near-real-time processing, scalable transformations, and operational simplicity.

    We designed a single, streamlined architecture where Spark Structured Streaming directly consumes from Kafka, transforms data, and writes it into Delta Lake—all in one job, running entirely within Databricks.

    This shift has reduced data movement, simplified maintenance and troubleshooting, and accelerated time-to-insight.

    The key components of the new architecture:

    The Streaming Component: Spark Structured Streaming

    Each incoming event from Kafka passes through a carefully orchestrated series of transformation steps in Spark Structured Streaming, optimized for accuracy, scale, and cost-efficiency (a simplified sketch of the flow follows the list):

    1. Efficient Deduplication:
      Events, identified by UUIDs, are checked against a Delta table of previously processed UUIDs to filter duplicates between streaming batches.
    2. Data Validation:
      Schema and business rules filter malformed records, ensure required fields, and handle nulls.
    3. Custom Transformations with JSON-e:
      The JSON-e templating engine supports conditionals, loops, and Python UDFs, letting product teams define dynamic, reusable transformation logic tailored to each product.
    4. Flattening to Tabular Form:
      Transformed JSON events are flattened into thousands of structured tables. A separate internal schema-management tool (covering 20,000+ tables and 5M+ columns) lets product teams manage schema changes and promote them to production automatically; the changes are registered in Delta Lake and picked up seamlessly by Spark Streaming.
    5. Flattened Data Deduplication:
      A hash of stored columns is compared against the last 4 hours of processed data in Redis, preventing duplicate ingestion and reducing compute costs.
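
    To make the flow concrete, below is a minimal PySpark sketch of what such a job can look like, not the production pipeline itself; the Kafka brokers and topic, the table names (ingest.seen_uuids, ingest.events), and the column names (event_uuid, account_id, payload) are assumptions for illustration, and the JSON-e transformation step is reduced to a simple projection.

```python
# Minimal sketch of a Kafka -> dedup -> validate -> flatten -> Delta streaming job.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # on Databricks, `spark` already exists

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder brokers
    .option("subscribe", "product_events")              # hypothetical topic
    .load()
)

events = raw.select(F.col("value").cast("string").alias("json"))

def process_batch(batch_df, batch_id):
    # 1. Deduplicate against a Delta table of previously processed UUIDs.
    parsed = batch_df.select(
        F.get_json_object("json", "$.uuid").alias("event_uuid"),
        F.col("json"),
    )
    seen = spark.read.table("ingest.seen_uuids")         # hypothetical table
    fresh = parsed.join(seen, "event_uuid", "left_anti")

    # 2. Validate: drop malformed records missing required fields.
    valid = fresh.where(F.col("event_uuid").isNotNull())

    # 3./4. Transform (JSON-e logic lives here in the real pipeline) and
    #       flatten into a tabular shape before writing.
    flattened = valid.select(
        "event_uuid",
        F.get_json_object("json", "$.account_id").alias("account_id"),
        F.get_json_object("json", "$.payload").alias("payload"),
    )

    # 5. Write to the target Delta table and record the processed UUIDs.
    flattened.write.format("delta").mode("append").saveAsTable("ingest.events")
    fresh.select("event_uuid").write.format("delta").mode("append") \
        .saveAsTable("ingest.seen_uuids")

query = (
    events.writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "/tmp/checkpoints/events")  # placeholder path
    .start()
)
```

    Running a foreachBatch loop like this keeps deduplication, validation, and the Delta write inside a single streaming job, which is the core simplification the new architecture delivers.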

    The Storage Component: Lakehouse

    Once transformed, the data is written directly to Delta Lake tables using several powerful optimizations:

    • Parallel Writes with Multiprocessing:
      A single Spark job typically writes to ~250 Delta tables, applying varying transformation logic. This is executed using Python multiprocessing, which performs Delta merges in parallel, maximising cluster utilization and reducing latency (see the sketch after this list).
    • Efficient Updates with Deletion Vectors:
      Up to 35% of records per batch are updates or deletes. Instead of rewriting large files, we leverage Deletion Vectors to enable soft deletes. This improves update performance by 3x, making real-time updates practical even at a terabyte scale.
    • Accelerated Merges with Disk Caching:
      Disk Caching ensures that frequently accessed (hot) data remains in memory. By caching only the columns needed for merges, we achieve up to 4x faster merge operations while reducing I/O and compute costs. Today, 95% of merge reads are served directly from the cache.
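
    A minimal sketch of the parallel-merge pattern, under stated assumptions: per_table_frames is a hypothetical dict mapping target table names to their prepared source DataFrames, the table and column names are placeholders, and a thread pool stands in for the Python multiprocessing used in the actual pipeline, since the merges themselves run on the cluster rather than the driver.

```python
from concurrent.futures import ThreadPoolExecutor
from delta.tables import DeltaTable

# `spark` is the active SparkSession (provided on Databricks).
# Soft deletes via deletion vectors, enabled once per target table (illustrative).
spark.sql(
    "ALTER TABLE ingest.events "
    "SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')"
)

def merge_into(table_name, source_df):
    # Upsert the prepared batch for one target table.
    (
        DeltaTable.forName(spark, table_name).alias("t")
        .merge(source_df.alias("s"), "t.event_uuid = s.event_uuid")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

# per_table_frames: hypothetical {"ingest.tickets": df1, "ingest.contacts": df2, ...}
with ThreadPoolExecutor(max_workers=16) as pool:
    futures = [
        pool.submit(merge_into, name, df) for name, df in per_table_frames.items()
    ]
    for future in futures:
        future.result()  # propagate any merge failure
```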

    Autoscaling & Adapting in Real Time

    Autoscaling is built into the pipeline so that the system scales up or down dynamically, handling volume spikes cost-efficiently without impacting performance.

    Autoscaling is driven by batch lag and execution time, monitored in real time. Resizing is triggered via the job APIs from Spark’s StreamingQueryListener (the onQueryProgress callback that fires after each batch), ensuring in-flight processing isn’t disrupted. This keeps the system responsive, resilient, and efficient without manual intervention.
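
    A hedged sketch of what listener-driven resizing can look like, assuming illustrative thresholds and worker counts and placeholder workspace credentials; the resize call uses the public Databricks clusters/resize REST endpoint, while the production policy also factors in batch lag.

```python
import requests
from pyspark.sql.streaming import StreamingQueryListener

class AutoscaleListener(StreamingQueryListener):
    """Resizes the job cluster based on per-batch execution time (illustrative policy)."""

    def __init__(self, host, token, cluster_id):
        self.host, self.token, self.cluster_id = host, token, cluster_id

    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        # triggerExecution covers the full batch; compare it to the trigger interval.
        batch_ms = event.progress.durationMs.get("triggerExecution", 0)
        if batch_ms > 60_000:        # slower than the assumed 60s trigger: scale up
            self._resize(num_workers=12)
        elif batch_ms < 20_000:      # comfortably fast batches: scale back down
            self._resize(num_workers=4)

    def onQueryTerminated(self, event):
        pass

    def _resize(self, num_workers):
        requests.post(
            f"{self.host}/api/2.0/clusters/resize",
            headers={"Authorization": f"Bearer {self.token}"},
            json={"cluster_id": self.cluster_id, "num_workers": num_workers},
            timeout=30,
        )

# `spark` is the active SparkSession; credentials below are placeholders.
spark.streams.addListener(
    AutoscaleListener("https://<workspace-url>", "<api-token>", "<cluster-id>")
)
```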

    Built-In Resilience: Handling Failures Gracefully

    To maintain data integrity and availability, the architecture includes robust fault tolerance:

    • Events that fail transformation are retried via Kafka with backoff logic.
    • Permanently failed records are stored in a Delta table for offline review and reprocessing, ensuring no data is lost.
    • This design guarantees data integrity without human intervention, even during peak loads or schema changes, and allows failed data to be republished later (a sketch of this retry and dead-letter pattern follows).
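
    A minimal sketch of the retry and dead-letter pattern, assuming hypothetical topic and table names, a retry_count column carried with each failed record, and a simple retry cap standing in for the pipeline's backoff logic.

```python
from pyspark.sql import functions as F

MAX_RETRIES = 3  # assumed limit

def route_failures(failed_df):
    """failed_df carries the original payload plus error and retry_count columns."""
    retryable = failed_df.where(F.col("retry_count") < MAX_RETRIES)
    exhausted = failed_df.where(F.col("retry_count") >= MAX_RETRIES)

    # Retry: publish back to a retry topic with an incremented counter.
    retry_payload = retryable.withColumn("retry_count", F.col("retry_count") + 1)
    (
        retry_payload
        .select(F.to_json(F.struct(*retry_payload.columns)).alias("value"))
        .write.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # placeholder brokers
        .option("topic", "events_retry")                     # hypothetical topic
        .save()
    )

    # Dead-letter: persist permanently failed records for offline review and replay.
    exhausted.write.format("delta").mode("append").saveAsTable("ingest.failed_events")
```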

    Observability and Monitoring at Every Step

    A powerful monitoring stack—built with Prometheus, Grafana, and Elasticsearch—integrated with Databricks gives us end-to-end visibility:

    • Metrics Collection:
      Every batch in Databricks logs key metrics, such as input record count, transformed record count, and error rates, which are exported to Prometheus with real-time alerts to the support team (a sketch of the export follows this list).
    • Event Tracking:
      Event statuses are logged in Elasticsearch, enabling fine-grained debugging and allowing both product (producer) and analytics (consumer) teams to trace issues.
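
    A minimal sketch of one way to export per-batch metrics to Prometheus, assuming a reachable Pushgateway and illustrative metric names; the actual integration path used by the pipeline may differ.

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def report_batch_metrics(input_count, transformed_count, error_count):
    # One registry per push keeps each batch's gauges isolated.
    registry = CollectorRegistry()
    Gauge("ingest_input_records", "Records read in the batch",
          registry=registry).set(input_count)
    Gauge("ingest_transformed_records", "Records successfully transformed",
          registry=registry).set(transformed_count)
    Gauge("ingest_error_records", "Records that failed transformation",
          registry=registry).set(error_count)
    push_to_gateway("pushgateway:9091", job="ingestion_pipeline", registry=registry)

# Called at the end of each foreachBatch invocation, e.g.:
# report_batch_metrics(input_count=1_200_000, transformed_count=1_198_500, error_count=1_500)
```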

    Transformation & Batch Execution Metrics

    [Dashboard: transformation and batch execution metrics]

    These metrics are used to track transformation health, identify issues, and trigger alerts for quick investigation.

    From Complexity to Confidence

    Perhaps the most transformative shift has been in simplicity.

    What once involved five systems and countless integration points is now a single, observable, autoscaling pipeline running entirely within Databricks. We’ve eliminated brittle dependencies, streamlined operations, and enabled teams to work faster and with greater autonomy. Essentially, fewer moving parts meant fewer surprises and more confidence.

    By reimagining the data stack around streaming and Delta Lake, we’ve built a system that not only meets today’s scale but is also ready for tomorrow’s growth.

    Why Databricks?

    As we reimagined the data architecture, we evaluated several technologies, including Amazon EMR with Spark, Apache Flink, and Databricks. After rigorous benchmarking, Databricks emerged as the clear choice, offering a unique blend of performance, simplicity, and ecosystem alignment that met the evolving needs of Freshworks.

    A Unified Ecosystem for Data Processing

    Rather than stitching together multiple tools, Databricks offers an end-to-end platform that spans job orchestration, data governance, and CI/CD integration, reducing complexity and accelerating development.

    • Unity Catalog acts as the single source of truth for data governance. With granular access control, lineage tracking, and centralized schema management, it ensures that
      • our team can secure all data assets and keep data access well organized for each tenant, preserving strict access boundaries, and
      • regulatory requirements are met, with all events and actions audited in audit tables alongside information on who has access to which assets.
    • Databricks Jobs provide built-in orchestration and replaced reliance on external orchestrators like Airflow. Native scheduling and pipeline execution reduced operational friction and improved reliability.
    • CI/CD and REST APIs helped Freshworks’ teams automate everything, from job creation and cluster scaling to schema updates (see the sketch after this list). This automation has accelerated releases, improved consistency, and minimized manual errors, allowing us to experiment fast and learn fast.
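
    As one example of the REST-driven automation, the sketch below creates a streaming job through the Databricks Jobs 2.1 API; the notebook path, runtime version, node type, and worker count are placeholder values, not the pipeline's actual configuration.

```python
import requests

def create_ingestion_job(host, token):
    # Minimal job definition with a single notebook task on a fresh job cluster.
    payload = {
        "name": "ingestion-streaming-job",
        "tasks": [{
            "task_key": "stream",
            "notebook_task": {"notebook_path": "/Repos/data/ingestion/stream"},  # placeholder
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",  # illustrative runtime
                "node_type_id": "i3.xlarge",           # illustrative node type
                "num_workers": 4,
            },
        }],
    }
    resp = requests.post(
        f"{host}/api/2.1/jobs/create",
        headers={"Authorization": f"Bearer {token}"},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["job_id"]
```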

    Optimized Spark Platform

    • Key capabilities like automated resource allocation, unified batch & streaming architecture, executor fault recovery, and dynamic scaling to process millions of records allowed us to maintain consistent throughput, even during traffic spikes or infra hiccups.

    High-Performance Caching

    • Databricks Disk Caching proved to be the key factor in meeting the required data latency, as most merges were served from hot data stored in the disk cache.
    • Its capability to automatically detect changes in underlying data files and keep the cache updated ensured that the batch processing intervals consistently met the required SLA.

    Delta Lake: Foundation for Real-Time and Reliable Ingestion

    Delta Lake plays a critical role in the pipeline, enabling low-latency, ACID-compliant, high-integrity data processing at scale. Each feature below maps directly to a pipeline benefit, and a short usage example follows the list.

    • ACID Transactions: Freshworks writes high-frequency streams from multiple sources, with concurrent writes on the same data. Delta Lake’s ACID compliance ensures data consistency across reads and writes.
    • Schema Evolution: Because the products grow quickly and change constantly, their schemas keep evolving. Delta Lake’s schema evolution adapts to changing requirements: changes are applied seamlessly to Delta tables and picked up automatically by Spark Streaming applications.
    • Time Travel: With millions of transactions and audit requirements, the ability to go back to a snapshot of the data in Delta Lake supports auditing and point-in-time rollback.
    • Scalable Change Handling & Deletion Vectors: Delta Lake enables efficient insert/update/delete operations through transaction logs without rewriting large data files. This proved crucial in reducing ingestion latencies from hours to a few minutes in our pipelines.
    • Open Format: Because Freshworks is a multi-tenant SaaS system, the open Delta format provides broad compatibility with analytics tools on top of the Lakehouse, supporting multi-tenant read operations.
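
    A short illustration of two of these features, using hypothetical table and DataFrame names (ingest.events, new_batch_df) and an illustrative version number; `spark` is the active SparkSession.

```python
# Time travel: read the table as of an earlier version.
prior = spark.read.option("versionAsOf", 42).table("ingest.events")

# Schema evolution: allow an append to add newly introduced columns.
(
    new_batch_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("ingest.events")
)
```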

    By combining Spark’s speed, Delta Lake’s reliability, and Databricks’ integrated platform, we built a scalable, robust, cost-effective, and future-ready foundation for Freshworks’ real-time analytics.

    What We Learned: Key Insights

    No transformation is without its challenges. Along the way, we encountered a few surprises that taught us valuable lessons:

    1. State Store Overhead: High Memory Footprint and Stability Issues

    Using Spark’s dropDuplicatesWithinWatermark caused high memory use and instability, especially during autoscaling, and led to increased S3 list costs due to many small files.

    Fix: Switching to a Delta-based cache for deduplication drastically improved memory efficiency and stability, reduced the overall S3 list cost and memory footprint, and cut both the time and the cost of deduplication (see the contrast sketch below).
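
    For contrast, a sketch of the stateful operator that was dropped and the Delta-backed check that replaced it; events, batch_df, and the table and column names follow the illustrative pipeline sketch earlier in the post.

```python
# Before: per-key streaming state bounded by a watermark, heavy on memory and small files.
deduped = (
    events
    .withWatermark("event_time", "4 hours")
    .dropDuplicatesWithinWatermark(["event_uuid"])
)

# After: inside foreachBatch, an anti-join against a Delta table of already-seen UUIDs.
seen = spark.read.table("ingest.seen_uuids")              # hypothetical table
fresh = batch_df.join(seen, "event_uuid", "left_anti")
```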

    2. Liquid Clustering: Common Challenges

    Clustering on multiple columns resulted in sparse data distribution and increased data scans, reducing query performance.

    The queries had a primary predicate with several secondary predicates; clustering on multiple columns led to a sparse distribution of data on the primary predicate column.

    Fix: Clustering on a single primary column led to better file organization and significantly faster queries by optimizing data scans.
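
    A hedged example of clustering on the single primary predicate column; the table and column names are placeholders, and `spark` is the active SparkSession.

```python
# Recluster around the primary predicate, then let OPTIMIZE incrementally apply it.
spark.sql("ALTER TABLE ingest.events CLUSTER BY (account_id)")
spark.sql("OPTIMIZE ingest.events")
```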

    3. Garbage Collection (GC) Issues: Job Restarts Needed

    Long-running jobs (7+ days) started experiencing performance slowness and more frequent garbage collection cycles.

    Fix: We had to introduce weekly job restarts to mitigate prolonged GC cycles and performance degradation.

    4. Data Skew: Handling Kafka Topic Imbalance

    Data skew was observed as different Kafka topics had disproportionately varying data volumes. This led to uneven data distribution across processing nodes, causing skewed task workloads and non-uniform resource utilization.

    Fix: Repartitioning before transformations ensured an even, balanced data distribution, spreading the processing load and improving throughput (a minimal example follows).
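
    A minimal sketch of the rebalancing step; the partition count is an assumed tuning value, kafka_df is the raw input DataFrame, and apply_transformations is a hypothetical function standing in for the transformation stage.

```python
# Redistribute rows evenly across tasks before the transformation stage.
balanced = kafka_df.repartition(400)                       # assumed partition count
transformed = balanced.transform(apply_transformations)    # hypothetical function
```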

    5. Conditional Merge: Optimizing Merge Performance

    Even if only a few columns were needed, the merge operations were loading all columns from the target table, which led to high merge times and I/O costs.

    Fix: We implemented an anti-join before merge and early discard of late-arriving or irrelevant records, significantly speeding up merges by preventing unnecessary data from being loaded.
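
    A hedged sketch of the anti-join plus conditional merge, with hypothetical table, key, and hash column names; only the columns needed for the comparison are read from the target, and `spark` is the active SparkSession.

```python
from delta.tables import DeltaTable

# Read only the comparison columns from the target table.
target_keys = spark.read.table("ingest.tickets").select(
    "event_uuid", "row_hash", "updated_at"
)

# Discard rows whose content hash already exists in the target (no-op updates).
to_merge = batch_df.join(target_keys, ["event_uuid", "row_hash"], "left_anti")

(
    DeltaTable.forName(spark, "ingest.tickets").alias("t")
    .merge(to_merge.alias("s"), "t.event_uuid = s.event_uuid")
    .whenMatchedUpdateAll(condition="s.updated_at > t.updated_at")  # skip late arrivals
    .whenNotMatchedInsertAll()
    .execute()
)
```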

    Conclusion

    By using Databricks and Delta Lake, Freshworks has redefined its data architecture—moving from fragmented, manual workflows to a modern, unified, real-time platform.

    The impact?

    • 4x improvement in data sync time during traffic surges
    • ~25% cost savings from scalable, cost-efficient operations with zero downtime
    • 50% reduction in maintenance effort
    • High availability and SLA-compliant performance—even during peak loads
    • Improved customer experience via real-time insights

    This transformation empowers every Freshworks customer, from IT to Support, to make faster, data-driven decisions without worrying about whether the data volumes behind their business needs can be served and processed.



