Building a Zero-Latency Analytics Database with MongoDB CDC
March 22, 2026 · Implementation Guide · Production Grade
The Metabase Meltdown
When BI tools like Metabase and MongoDB Charts are hooked directly into a live production database, the conflict between transactional operations (OLTP) and analytical aggregations (OLAP) becomes a ticking time bomb.
Heavy table scans evict "hot" application data from the WiredTiger cache, forcing the database back to disk and sending API latencies into the stratosphere.
CRITICAL
Prod DB Metrics
CPU Utilization: 98%
Read Query Latency: 4,250 ms
The Real-Time CDC Architecture
Decoupling analytics entirely by streaming the MongoDB operations log (oplog) to a dedicated read replica.
Production DB (Secondary Replica)
Transactions • High Load
↓
CDC Worker
TS Stream • Filter • Batch
↓
Analytics Target
Dedicated Read DB for Visualization Dashboards
Why Not Kafka and Debezium?
I already use Debezium CDC for MySQL workloads, and it's fantastic. However, Debezium requires heavyweight infrastructure: Kafka, ZooKeeper/KRaft, Kafka Connect, the connector, and a dedicated consumer service. For this specific MongoDB pipeline, the goal was to completely remove that dependency footprint. By utilizing MongoDB's native Change Streams within a lightweight Node.js/TypeScript worker, we tail the oplog directly. No Kafka required, meaning radically lower costs and maintenance.
Implementation Deep Dive
The complete, production-hardened TypeScript pipeline, broken into 9 logical steps.
Step 01
Safeguarding the Primary Node
Rule number one: Never extract heavy streaming data from the Primary node. We enforce this gracefully at the connection string level using secondaryPreferred and compress the wire protocol to slash AWS transfer egress costs.
Cuts Egress bandwidth by ~80%
db/connection.ts
import { MongoClient } from 'mongodb';
export async function createMongoCDCConnection(): Promise<MongoClient> {
if (!process.env.MONGODB_URI) throw new Error('MONGODB_URI is not set');
const client = new MongoClient(process.env.MONGODB_URI, {
// Read from secondaries to avoid taxing the primary node!
readPreference: 'secondaryPreferred',
// Enable compression to drastically reduce transfer egress costs
compressors: ['zstd', 'snappy', 'zlib'],
zlibCompressionLevel: 6, // only applies when the zlib fallback is negotiated
// Cap the pool so this worker can't exhaust replica connections
maxPoolSize: 50,
});
await client.connect();
return client;
}
Step 02
Resilient Change Streams
Using collection.watch(), we abstract away complex raw oplog tailing. Setting updateLookup ensures that even if only a single nested field is $set, we receive the entire current document, not just the changed fields.
Guarantees Full Schema Integrity
services/stream.ts
import { Collection, ChangeStream } from 'mongodb';
export function setupChangeStream(collection: Collection, resumeToken?: any): ChangeStream {
const pipeline = [
{ $match: { 'operationType': { $in: ['insert', 'update', 'replace'] } } }
];
const options = {
// Get the full document even on partial updates
fullDocument: 'updateLookup' as const,
// Resume processing exactly where we left off if a token is passed
...(resumeToken && { resumeAfter: resumeToken })
};
return collection.watch(pipeline, options);
}
Step 03
Resume Token Handling
If our CDC service restarts (e.g., a K8s pod rotation), we shouldn't re-process duplicate events. We capture the _id of each change event as our checkpoint. If the oplog rolls over entirely (throwing error 40515), we fall back safely.
Zero Data Loss on Restarts
services/recovery.ts
let currentResumeToken: any = null;

export async function startCDCListenerWithRecovery(collection: Collection) {
  const savedToken = await fetchPersistedToken();
  let stream: ChangeStream = setupChangeStream(collection, savedToken);

  const attach = (s: ChangeStream) => {
    s.on('change', async (event) => {
      currentResumeToken = event._id; // Capture resume token checkpoint
      // Route to the batching queue...
    });
    // An expired token (MongoError 40515) surfaces on the stream itself,
    // not at watch() time, so we recover inside the 'error' handler
    s.on('error', (error: any) => {
      if (error.code === 40515) {
        console.warn('Token expired! Falling back to initial sync.');
        stream = setupChangeStream(collection); // Start from 'now'
        attach(stream);
      } else throw error;
    });
  };

  attach(stream);
}
Step 04
PII Field Filtering Config
To reduce latency and absolutely guarantee that sensitive PII (like passwords or credit card blobs) never drops into the analytics database, we enforce strict inclusion rules based on a JSON config.
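A minimal sketch of that allow-list filter, assuming a simple JSON-style config keyed by collection name (FIELD_CONFIG and the field lists here are illustrative, not from the original pipeline):

```typescript
// Hypothetical allow-list config: only these fields ever reach analytics
type CollectionFilterConfig = Record<string, string[]>;

const FIELD_CONFIG: CollectionFilterConfig = {
  users: ['_id', 'createdAt', 'plan', 'country'], // no email, password, cards
  orders: ['_id', 'userId', 'total', 'status'],
};

export function filterDocument(
  collectionName: string,
  doc: Record<string, unknown>
): Record<string, unknown> {
  const allowed = FIELD_CONFIG[collectionName] ?? [];
  // Strict inclusion: anything not explicitly allowed is dropped,
  // so a new PII field added upstream can never leak by default
  return Object.fromEntries(
    Object.entries(doc).filter(([key]) => allowed.includes(key))
  );
}
```

Inclusion (rather than exclusion) rules are the safer default here: a field nobody thought to blocklist simply never leaves the worker.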
Step 05
Event Batching
Instead of writing events row-by-row and killing network throughput, we buffer incoming stream events into an array and trigger a flush when a threshold (size or time) is breached.
Step 06
Idempotent Bulk Writes
We drain the queue into a single bulkWrite payload. Setting ordered: false ensures one validation failure doesn't block the rest of the batch, while upserts provide idempotency.
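One way to build that payload, with the driver's shapes mirrored as local types so the sketch stands alone (the CDCEvent interface is an assumption modeled on what updateLookup events carry):

```typescript
// Minimal local types mirroring the MongoDB driver's event/op shapes
interface CDCEvent {
  documentKey: { _id: unknown };
  fullDocument: Record<string, unknown>;
}

interface ReplaceOneOp {
  replaceOne: {
    filter: { _id: unknown };
    replacement: Record<string, unknown>;
    upsert: boolean;
  };
}

export function buildBulkOps(events: CDCEvent[]): ReplaceOneOp[] {
  return events.map((event) => ({
    replaceOne: {
      filter: { _id: event.documentKey._id },
      replacement: event.fullDocument,
      upsert: true, // idempotent: a re-delivered event simply overwrites
    },
  }));
}

// Usage against a driver Collection (target name assumed):
//   await target.bulkWrite(buildBulkOps(batch), { ordered: false });
// ordered: false -> one validation failure doesn't block the rest of the batch
```

Because each op is keyed on _id and upserts the full document, replaying the same batch after a crash converges to the same target state.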
Step 07
Backpressure Management
A change stream reading from a fast replica but writing to a slow target DB will buffer events in memory until the Node.js heap OOMs. Managing backpressure by dynamically pausing the ChangeStream cursor is essential.
Prevents Memory Overflows
services/backpressure.ts
// pause()/resume()/isPaused() come from the stream's Readable interface
export function manageBackpressure(stream: ChangeStream, queueSize: number) {
  const MAX_LIMIT = 5000;
  // Inform the MongoDB cursor to hold off pushing events
  if (queueSize >= MAX_LIMIT && !stream.isPaused()) {
    console.warn('Queue saturated. Pausing Cursor...');
    stream.pause();
  }
  // Downstream caught up? Let data flow again
  if (queueSize < MAX_LIMIT / 2 && stream.isPaused()) {
    console.info('Queue drained. Resuming Cursor...');
    stream.resume();
  }
}
Step 08
Graceful Shutdown
Whenever the Node worker receives a SIGTERM, you must close the loop properly: flush the remaining queue to the DB, persist the final resume token, and close all underlying TCP connections so no orphaned cursors are left behind.
Clean Zero-Downtime Deployments
server.ts
async function shutdown(streams: ChangeStream[], client: MongoClient) {
console.log('Initiating Graceful Shutdown...');
// 1. Block stream from firing new events
await Promise.all(streams.map(s => s.close()));
// 2. Clear out remaining buffered data
await cdcQueue.flush();
// 3. Save final state to checkpoint table
if (currentResumeToken) await saveTokenToDB(currentResumeToken);
// 4. Teardown
await client.close();
process.exit(0);
}
process.on('SIGINT', () => shutdown(activeStreams, sourceClient));
process.on('SIGTERM', () => shutdown(activeStreams, sourceClient));
Step 09
Dynamic Watch Initializer
Instead of rigid environment maps, we loop over a centralized configuration array, spinning up parallel streams, each mapped to its own independent schema.
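Under an assumed config shape, the initializer reduces to a map over the array (WATCH_CONFIG and namespacesFor are illustrative names, not from the original codebase):

```typescript
interface WatchTarget {
  db: string;
  collection: string;
}

// Hypothetical centralized config: one entry per collection we mirror
export const WATCH_CONFIG: WatchTarget[] = [
  { db: 'app', collection: 'users' },
  { db: 'app', collection: 'orders' },
];

// Pure helper: derive the "db.collection" namespaces to watch
export function namespacesFor(config: WatchTarget[]): string[] {
  return config.map((t) => `${t.db}.${t.collection}`);
}

// With a connected MongoClient (see Step 01), the initializer is one loop:
//   const activeStreams = WATCH_CONFIG.map((t) =>
//     setupChangeStream(client.db(t.db).collection(t.collection))
//   );
```

Adding a new collection to the pipeline then becomes a one-line config change, with no new environment variables or deploy-time wiring.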
By completely decoupling our analytic read load into a dedicated replica managed by our streamlined node worker, the entire application stack restabilized.
API latency dropped back to baseline.
Dashboards load instantly with current data.
Insanely low infrastructure overhead compared to Kafka.
Prod DB Load: 0% (down from 98%)
Data Latency: <1.5s (transaction to dashboard)
Monthly Data Transit Cost: ~$20 (thanks to field filtering & zstd wire compression)
Build Smart, Not Heavy
You don't need a sprawling Kafka cluster to decouple analytics. A precisely engineered TypeScript microservice leveraging native database features isolates the production datastore gracefully, safely, and affordably.