Data Engineering & ETL Pipelines

Apache NiFi for Modern ETL Pipelines: Installation, Configuration & Data Integration Guide

A production-grade guide to building automated dataflow pipelines with Apache NiFi — from architecture internals to real ETL pipelines, performance tuning, and multi-source integration.

Flow-Based Programming
Real-Time ETL
Guaranteed Delivery

01 Introduction to Apache NiFi

Modern data engineering teams face a relentless volume of data from heterogeneous sources — relational databases, REST APIs, Kafka topics, IoT sensors, cloud storage, and log streams. Stitching these together with shell scripts or custom Python jobs creates brittle, unmaintainable pipelines that lack observability, fault tolerance, and scalability.

Traditional ETL tools were built for batch windows, not the always-on, event-driven reality of today's platforms. They struggle with real-time routing, dynamic schemas, and guaranteed data delivery under pressure. Apache NiFi was purpose-built to solve exactly these problems.

What is Apache NiFi?

Apache NiFi is an open-source, flow-based data integration platform originally developed at the NSA and donated to the Apache Software Foundation. It provides a visual, drag-and-drop interface for designing dataflows — graphs of processors that operate on FlowFiles — with built-in provenance tracking, backpressure control, and guaranteed delivery semantics.

Why Flow-Based Programming?

Flow-based programming (FBP) treats data as discrete packets flowing through a graph of processing components. Each component has a single responsibility — fetch, transform, route, or deliver — making the system composable, testable, and observable. NiFi's visual canvas makes this paradigm accessible without sacrificing operational depth.
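The FBP model is easy to see in miniature. The sketch below is illustrative Python, not NiFi's API: "flowfiles" are dicts carrying content plus attribute metadata, and each stage does exactly one job — fetch, transform, or route.

```python
# Illustrative sketch of flow-based processing (NOT NiFi code):
# discrete packets flow through single-responsibility stages.

def fetch():
    # Simulate fetching one raw record from a source
    return [{"content": "order,42,99.5", "attributes": {"source": "csv"}}]

def transform(flowfile):
    # One job: parse CSV content into structured fields
    _, order_id, amount = flowfile["content"].split(",")
    flowfile["attributes"]["order_id"] = order_id
    flowfile["content"] = {"order_id": int(order_id), "amount": float(amount)}
    return flowfile

def route(flowfile):
    # One job: pick a destination based on the data
    return "high_value" if flowfile["content"]["amount"] > 50 else "standard"

results = {ff["attributes"]["order_id"]: route(ff) for ff in map(transform, fetch())}
print(results)  # {'42': 'high_value'}
```

Because each stage touches only its own concern, stages can be tested, swapped, and observed independently — exactly the property NiFi's canvas gives you at operational scale.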

Core Problems NiFi Solves

  • Data ingestion at scale — pull from hundreds of sources simultaneously without code changes
  • Real-time ETL — process and route streaming data within milliseconds
  • Multi-destination fan-out — deliver to Kafka, S3, Elasticsearch, and DB simultaneously
  • Data lineage — track every byte from origin to destination with full provenance

02 NiFi Internal Architecture

Understanding NiFi's internal components is essential for designing resilient pipelines. Everything flows through a tightly-integrated system of repositories, controllers, and processors — all coordinated by the NiFi JVM process.

  • FlowFile

    The fundamental data unit in NiFi. Each FlowFile carries a content payload and a map of attributes (key-value metadata). Processors act on FlowFiles as they travel through the pipeline.

  • Processor

    The core processing unit. Each processor performs one task: GetFile, QueryDatabaseTable, ConvertRecord, PutS3Object, etc. Hundreds of built-in processors cover every integration pattern.

  • Connection

    A directed edge between two processors that acts as a queue. It buffers FlowFiles and enforces backpressure when the downstream processor is overwhelmed.

  • Controller Service

    Shared services used by processors — DBCPConnectionPool for database connections, SSLContextService for TLS, SchemaRegistry for Avro/JSON schemas. Configured once, reused everywhere.

  • Process Group

    A logical container for grouping related processors with their connections. Use them to modularize your flow, apply access controls, and manage versioning independently.

  • Repositories

    Three on-disk repositories: FlowFile (tracks active flow metadata), Content (stores FlowFile data), and Provenance (records every data event for auditing and replay).

NiFi Data Flow Architecture

[Figure: Apache NiFi architecture diagram]
  • Backpressure Control: Each Connection queue has configurable object count and data size thresholds. When exceeded, upstream processors pause automatically — protecting downstream systems from overload without data loss.
  • Guaranteed Delivery: NiFi writes FlowFile content and metadata to local disk before acknowledging receipt. If the JVM crashes mid-flow, data is recovered and replayed from the repository on restart — zero data loss by design.
  • Flow Control: Processors can be scheduled by time (every 10s), CRON expression, or event-driven triggers. Concurrent tasks per processor are tunable to match available CPU and downstream capacity.
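The backpressure mechanics can be mimicked in a few lines. This is a toy model, not NiFi internals: a connection accepts FlowFiles until either its object-count or data-size threshold is reached, at which point it signals upstream to pause — nothing is dropped.

```python
# Toy model of a NiFi connection queue with backpressure thresholds
# (illustrative only — not NiFi's actual implementation).

class Connection:
    def __init__(self, max_count=10_000, max_bytes=1_073_741_824):
        # Defaults mirror the documented 10000 objects / 1 GB thresholds
        self.max_count, self.max_bytes = max_count, max_bytes
        self.queue, self.total_bytes = [], 0

    def backpressure_engaged(self):
        # Upstream pauses when EITHER threshold is reached
        return len(self.queue) >= self.max_count or self.total_bytes >= self.max_bytes

    def offer(self, payload: bytes) -> bool:
        if self.backpressure_engaged():
            return False          # upstream must wait; no data loss
        self.queue.append(payload)
        self.total_bytes += len(payload)
        return True

conn = Connection(max_count=3, max_bytes=1_000)
accepted = [conn.offer(b"x" * 100) for _ in range(5)]
print(accepted)  # [True, True, True, False, False]
```

In NiFi the same thresholds are set per Connection in the UI, and the upstream processor is simply not scheduled until the queue drains below them.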

03 Key Features of Apache NiFi

  • Visual Data Flow Designer

    The NiFi UI canvas lets you drag, drop, and connect processors visually. No YAML, no code — just a clear, auditable flow graph. Non-engineers can understand and review pipelines at a glance.

  • Data Provenance Tracking

    Every FlowFile event (CREATE, RECEIVE, FETCH, SEND, DROP) is recorded with timestamp, processor, and data snapshot. You can replay any historical data point for debugging or compliance audits.

  • Backpressure & Prioritization

    Fine-grained queue management with object limits, size limits, and expiration policies. Prioritize FlowFiles by oldest, newest, or custom attributes — ensuring critical data is never starved.

  • Fault Tolerance

    All data is persisted to disk before processing. Failed FlowFiles are routed to dedicated 'failure' relationships where they can be retried, quarantined, or redirected — never silently dropped.

  • Security & Encryption

    TLS everywhere: NiFi-to-NiFi, NiFi-to-Kafka, NiFi-to-database. Attribute-level encryption with EncryptContent processor. Role-based access control (RBAC) via LDAP or Apache Ranger integration.

  • Cluster Support

    NiFi scales horizontally from a single node to a 10+ node cluster. A Zero-Leader Cluster model distributes flow execution across all nodes with an embedded Apache ZooKeeper for coordination and state.

04 Installing Apache NiFi (Single Node)

Minimum Recommended Specs

  • OS: Ubuntu 22.04 / RHEL 9
  • RAM: 8 GB minimum
  • CPU: 4+ vCores
  • Java: JDK 21 (required by NiFi 2.x; NiFi 1.x runs on JDK 8/11)
  • Disk: SSD, 100 GB+
  • Network: 1 Gbps+

4.1 Install Java (JDK 21)

# Ubuntu / Debian
sudo apt update && sudo apt install -y openjdk-21-jdk

# RHEL 9
sudo dnf install -y java-21-openjdk
# Amazon Linux 2023: sudo dnf install -y java-21-amazon-corretto

# Verify
java -version
# openjdk version "21.0.x" ...

4.2 Download Apache NiFi

# Download the latest release (check https://nifi.apache.org/download for current version)
export NIFI_VERSION=2.4.0
wget https://downloads.apache.org/nifi/${NIFI_VERSION}/nifi-${NIFI_VERSION}-bin.zip

# Verify SHA-256 checksum
wget https://downloads.apache.org/nifi/${NIFI_VERSION}/nifi-${NIFI_VERSION}-bin.zip.sha256
sha256sum -c nifi-${NIFI_VERSION}-bin.zip.sha256

4.3 Extract and Position

unzip nifi-${NIFI_VERSION}-bin.zip
sudo mv nifi-${NIFI_VERSION} /opt/nifi

# Create a dedicated system user first (security best practice),
# then hand ownership of the install to it
sudo useradd -r -s /sbin/nologin -d /opt/nifi nifi
sudo chown -R nifi:nifi /opt/nifi

4.4 Start NiFi & Verify

# Start NiFi (single-user mode with auto-generated credentials)
sudo -u nifi /opt/nifi/bin/nifi.sh start

# Check startup status (takes 60-90 seconds on first boot)
/opt/nifi/bin/nifi.sh status

# Get auto-generated username and password
grep -i "Generated" /opt/nifi/logs/nifi-app.log

# Access the Web UI
# https://localhost:8443/nifi

Systemd Service (Production)

For production deployments, register NiFi as a systemd service so it auto-starts on reboot and integrates with standard Linux service management. Use /opt/nifi/bin/nifi.sh install to auto-generate the service unit, then sudo systemctl enable nifi && sudo systemctl start nifi.
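If `nifi.sh install` is not available in your build, a hand-written unit file along these lines also works. The paths and user match the layout above; the exact directives are a sketch to adapt, not an official template.

```ini
# /etc/systemd/system/nifi.service — illustrative sketch (adapt paths/user)
[Unit]
Description=Apache NiFi
After=network.target

[Service]
Type=forking
User=nifi
Group=nifi
ExecStart=/opt/nifi/bin/nifi.sh start
ExecStop=/opt/nifi/bin/nifi.sh stop
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After writing the file, run `sudo systemctl daemon-reload`, then `sudo systemctl enable --now nifi`.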

05 Understanding nifi.properties Configuration

The conf/nifi.properties file is the master configuration for your NiFi instance. It controls web server binding, repository paths, security settings, and cluster membership. Getting this right at deployment time avoids painful migrations later.

Web Server Configuration

# conf/nifi.properties — Web Server
# -----------------------------------------------
# HTTPS (recommended for production)
nifi.web.https.host=0.0.0.0
nifi.web.https.port=8443

# HTTP (development only — disable in production)
nifi.web.http.host=
nifi.web.http.port=

# Proxy Configuration (behind load balancer / Nginx)
nifi.web.proxy.host=nifi.yourdomain.com:443
nifi.web.proxy.context.path=/

Repository Configuration

# conf/nifi.properties — Repositories
# -----------------------------------------------
# FlowFile Repository — tracks active FlowFile lifecycle state
# Recommended: dedicated SSD mount point
nifi.flowfile.repository.directory=/data/nifi/flowfile-repo

# Content Repository — stores actual FlowFile payload data
# Use multiple directories on separate disks for I/O parallelism
nifi.content.repository.directory.default=/data/nifi/content-repo
# Optional additional content directories:
# nifi.content.repository.directory.disk2=/data2/nifi/content-repo

# Provenance Repository — full data lineage audit trail
# This grows large. Set a retention policy.
nifi.provenance.repository.directory.default=/data/nifi/provenance-repo
nifi.provenance.repository.max.storage.time=30 days
nifi.provenance.repository.max.storage.size=10 GB

# Write-Ahead Log (WAL) for FlowFile repository
nifi.flowfile.repository.wal.implementation=org.apache.nifi.wali.SequentialAccessWriteAheadLog

Backpressure & Queue Settings

# conf/nifi.properties — Queue & Backpressure Defaults
# -----------------------------------------------
# Default object count before backpressure kicks in
nifi.queue.backpressure.count=10000

# Default data size before backpressure kicks in
nifi.queue.backpressure.size=1 GB

# These are global defaults; each Connection in the UI can override them
# For high-throughput streaming pipelines, raise count to 100000
# For memory-constrained environments, lower size to 256 MB

Cluster Configuration (Multi-Node)

# conf/nifi.properties — Cluster Settings
# -----------------------------------------------
nifi.cluster.is.node=true
nifi.cluster.node.address=10.0.1.10
nifi.cluster.node.protocol.port=11443

# ZooKeeper connection for cluster coordination
nifi.zookeeper.connect.string=10.0.1.10:2181,10.0.1.11:2181,10.0.1.12:2181
nifi.zookeeper.root.node=/nifi

# Flow Election — wait for a quorum before starting
nifi.cluster.flow.election.max.wait.time=5 mins
nifi.cluster.flow.election.max.candidates=3

Performance Impact: Repositories

Place the FlowFile and Content repositories on separate SSD volumes. The Provenance repository can be on a separate HDD since it's append-only. Separating I/O paths prevents write contention that silently throttles throughput.

JVM Heap Sizing (bootstrap.conf)

# conf/bootstrap.conf
java.arg.2=-Xms4g
java.arg.3=-Xmx4g
# Set Xms = Xmx to avoid GC-related
# heap resizing pauses in production

06 Building Your First ETL Pipeline

Let's build a real-world ETL pipeline that reads customer order records from MySQL, converts them to JSON, enriches with metadata, and indexes them into Elasticsearch for analytics dashboards.

[Figure: MySQL to Elasticsearch ETL pipeline]

Step-by-Step Processor Configuration

1. QueryDatabaseTable — Connect to MySQL
# Controller Service: DBCPConnectionPool
Database Connection URL:  jdbc:mysql://10.0.1.10:3306/ecommerce
Database Driver Class:    com.mysql.cj.jdbc.Driver
Database User:            nifi_reader
Password:                 <encrypted>

# Processor Properties
Table Name:               orders
Maximum-value Columns:    updated_at
Fetch Size:               5000
Max Rows Per Flow File:   1000
2. ConvertRecord — Transform Avro ResultSet to JSON
# Processor Properties
Record Reader:  AvroReader   (QueryDatabaseTable emits Avro by default)
Record Writer:  JsonRecordSetWriter
               (Output: array of JSON objects)

# Schema Registry (optional but recommended)
# Define your order schema in Avro to enforce types:
{
  "type": "record", "name": "Order",
  "fields": [
    {"name": "order_id",  "type": "long"},
    {"name": "customer",  "type": "string"},
    {"name": "amount",    "type": "double"},
    {"name": "status",    "type": "string"},
    {"name": "updated_at","type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
3. UpdateAttribute — Enrich FlowFile Metadata
# NiFi Expression Language attribute enrichment
es.index         = orders-${now():format('yyyy-MM')}
pipeline.version = 1.0
source.table     = orders
ingest.timestamp = ${now():toNumber()}
4. PutElasticsearchRecord — Index to Elasticsearch
# Note: PutElasticsearchHttp was removed in NiFi 2.x; use
# PutElasticsearchRecord backed by an ElasticSearchClientService
# (hosts, timeouts, and auth are configured on the client service).
# Processor Properties
Client Service:       ElasticSearchClientServiceImpl → http://es-cluster:9200
Index:                ${es.index}
Index Operation:      index
Record Reader:        JsonTreeReader

# Relationships
success  → LogAttribute (for audit)
failure  → PutFile (/data/nifi/failed-orders) → retry loop
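The Expression Language used in step 3 can be sanity-checked outside NiFi. Here is the equivalent logic in plain Python — a hypothetical helper, not NiFi code — producing the same monthly index name and epoch-millisecond timestamp (NiFi's now() uses the instance's local timezone; UTC is assumed here):

```python
from datetime import datetime, timezone

def enrich(attributes: dict) -> dict:
    # Mirrors the UpdateAttribute expressions:
    #   es.index         = orders-${now():format('yyyy-MM')}
    #   ingest.timestamp = ${now():toNumber()}   (epoch milliseconds)
    now = datetime.now(timezone.utc)
    attributes["es.index"] = f"orders-{now:%Y-%m}"
    attributes["pipeline.version"] = "1.0"
    attributes["source.table"] = "orders"
    attributes["ingest.timestamp"] = str(int(now.timestamp() * 1000))
    return attributes

attrs = enrich({})
print(attrs["es.index"])  # e.g. orders-2025-06
```

Deriving the index name from the event month is what makes the Elasticsearch side naturally time-partitioned: each month's orders land in their own index, which simplifies retention and rollover.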

07 Data Source Integration

NiFi's processor library covers virtually every source a modern data platform needs. Below are the most common integration patterns with their key processors and configuration notes.

Relational Databases (MySQL, PostgreSQL, Oracle)

  • QueryDatabaseTable: Incremental polling. Tracks a high-watermark column (e.g., updated_at or id) so only new/changed rows are fetched on each run. Ideal for slow-moving tables.
  • GenerateTableFetch: Generates SQL queries for parallel partition fetching. Combined with ExecuteSQL and partition strategies, it handles tables with billions of rows efficiently across concurrent tasks.
  • ExecuteSQL: Executes arbitrary SQL — JOINs, aggregations, stored procedures. Best for complex transformations that are more naturally expressed in SQL than as NiFi processor chains.
# PostgreSQL Connection (DBCPConnectionPool Controller Service)
Database Connection URL:  jdbc:postgresql://pg-primary:5432/analytics
Database Driver:          org.postgresql.Driver
Database User:            nifi_ro
Max Total Connections:    20
Min Idle Connections:     5
Validation Query:         SELECT 1
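QueryDatabaseTable's incremental contract is worth internalizing: each run fetches only rows whose watermark column exceeds the last stored maximum, then advances the watermark. A minimal sqlite3 sketch of that behavior (NiFi persists the watermark in processor state, not in a variable like this):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 100), (2, 200), (3, 300)])

watermark = 0  # NiFi stores this in local/cluster state between runs

def poll(conn, watermark):
    # Fetch only rows past the high-watermark,
    # like "Maximum-value Columns: updated_at"
    rows = conn.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][1] if rows else watermark
    return rows, new_watermark

rows, watermark = poll(conn, watermark)      # first run: all three rows
conn.execute("INSERT INTO orders VALUES (4, 400)")
new_rows, watermark = poll(conn, watermark)  # second run: only the new row
print(new_rows)  # [(4, 400)]
```

This is why the watermark column must be monotonically increasing (timestamps or sequence IDs); rows updated without advancing it will be missed.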

Streaming Sources (Kafka, REST APIs, Cloud Storage)

  • ConsumeKafka: Consumes from one or more Kafka topics with at-least-once delivery. Configurable consumer group, auto-commit control, and max poll records. Supports SASL/SSL and schema registry integration.
  • ListenHTTP: Spins up an embedded HTTP/HTTPS server to receive webhook payloads, API callbacks, or log shipper data. Supports authentication and SSL termination.
  • GetFile / ListFile: Monitors a local filesystem path (or an SFTP/FTP remote via the corresponding SFTP/FTP processors) for new files. The ListFile + FetchFile pattern gives more control over large directory trees.
  • ListS3 / FetchS3Object: Pulls objects from S3 (or S3-compatible storage like MinIO). ListS3 generates listings, FetchS3Object retrieves content — enabling parallel, resumable ingestion.
# Kafka Consumer (ConsumeKafka)
Kafka Brokers:         kafka1:9092,kafka2:9092
Topic Name(s):         orders.events
Group ID:              nifi-orders-consumer
Offset Reset:          latest
Message Demarcator:    (empty → one FlowFile per message)
Max Poll Records:      500
Honor Transactions:    true

# Kafka + Schema Registry
Schema Registry URL:   http://schema-registry:8081
Schema Access:         HWX Schema Reference
Key Deserializer:      String
Value Deserializer:    Avro

08 Output Channels & Destinations

One of NiFi's greatest strengths is its ability to fan out to multiple destinations from a single pipeline. Data can be delivered to a data lake, indexed in Elasticsearch, and published back to Kafka — all in parallel, from one flow.

[Figure: NiFi output destinations and routing]

PutS3Object / PutHDFS — Data Lake
# PutS3Object
Bucket:          raw-data-lake
Key Expression:  ${source.table}/${now():format('yyyy/MM/dd')}/${uuid}.json
Region:          us-east-1
Storage Class:   STANDARD_IA
Server-side Encryption: AES256

PutDatabaseRecord — Relational DB
# PutDatabaseRecord (PostgreSQL sink)
Record Reader:      JsonTreeReader
Statement Type:     INSERT (or UPSERT)
Table Name:         ${target.table}
Translate Fields:   true
Quote Column IDs:   false
Max Batch Size:     500

PublishKafkaRecord — Kafka Sink
# PublishKafkaRecord
Kafka Brokers:      kafka1:9092,kafka2:9092
Topic Name:         ${kafka.output.topic}
Record Reader:      JsonTreeReader
Record Writer:      AvroRecordSetWriter
Compression Type:   SNAPPY
Delivery Guarantee: REPLICATED

PutMongo — MongoDB
# PutMongo
Mongo URI:          mongodb://mongo1:27017,mongo2:27017
Database:           analytics
Collection:         ${mongo.collection}
Update Mode:        Insert
Write Concern:      MAJORITY
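The fan-out pattern above — one flow, many sinks — reduces to delivering a copy of each record to every destination. A toy Python model (sinks here are just lists; in a real flow they are the Put* processors shown, wired to the same success relationship):

```python
# Toy fan-out: every record is delivered to all configured sinks in one pass.
sinks = {"s3": [], "postgres": [], "kafka": [], "mongo": []}

def fan_out(record: dict) -> None:
    # In NiFi, one success relationship connected to multiple processors
    # gives each destination its own copy of the FlowFile.
    for sink in sinks.values():
        sink.append(dict(record))  # copy, so sinks can mutate independently

fan_out({"order_id": 42, "amount": 99.5})
print({name: len(q) for name, q in sinks.items()})
# {'s3': 1, 'postgres': 1, 'kafka': 1, 'mongo': 1}
```

Crucially, NiFi queues each copy independently: a slow MongoDB sink backs up only its own connection and never blocks delivery to S3 or Kafka.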

09 NiFi in Modern Data Platform Architecture

NiFi seldom operates in isolation. In production analytics platforms, it serves as the universal data mover — the connective tissue between raw sources and the analytical layer. Here's where it fits in a Lambda/Kappa-style architecture:

[Figure: NiFi in a modern data platform architecture]
  • NiFi + Kafka: NiFi ingests from databases and APIs, publishes to Kafka topics, providing durable buffering. Kafka consumers (Flink, Spark Streaming, ksqlDB) then derive real-time materialized views and alerts.
  • NiFi + Spark: NiFi lands raw data in the Data Lake (Parquet on S3). Spark batch jobs run nightly transformations, creating curated/enriched datasets. NiFi can trigger Spark jobs via ExecuteScript or invoke the Livy REST API.
  • NiFi + Data Lakes: NiFi writes to partitioned S3/HDFS paths using expression language templates. It handles format conversion (CSV → Parquet), schema evolution, and compaction — reducing Spark job complexity significantly.
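The expression-language templates NiFi uses for partitioned lake paths (as in the PutS3Object key earlier) boil down to simple string formatting. A Python equivalent — illustrative only, and the key layout is an assumption you should match to your lake's partitioning scheme:

```python
from datetime import datetime, timezone
from uuid import uuid4

def lake_key(table: str, when: datetime) -> str:
    # Mirrors: ${source.table}/${now():format('yyyy/MM/dd')}/${uuid}.json
    return f"{table}/{when:%Y/%m/%d}/{uuid4()}.json"

key = lake_key("orders", datetime(2025, 6, 1, tzinfo=timezone.utc))
print(key)  # e.g. orders/2025/06/01/<uuid>.json
```

Date-based prefixes like this are what let downstream engines (Spark, Athena, Trino) prune partitions instead of scanning the whole bucket.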

10 Performance Tuning & Production Best Practices

JVM & Memory Tuning

# conf/bootstrap.conf
# Set heap to 50-70% of node RAM
java.arg.2=-Xms8g
java.arg.3=-Xmx8g

# G1GC is recommended for NiFi 1.x
java.arg.13=-XX:+UseG1GC
java.arg.14=-XX:MaxGCPauseMillis=100

# For NiFi 2.x with JDK 21+
java.arg.13=-XX:+UseZGC
# ZGC delivers sub-millisecond GC pauses
# critical for real-time streaming pipelines

Processor Scheduling

# Per-Processor Scheduling (via UI)
Scheduling Strategy:   Timer Driven
Run Schedule:          0 sec  (runs constantly)
  or CRON expression:  0 0/5 * * * ?  (every 5 min)
Concurrent Tasks:      4     (match CPU cores)

# For high-throughput sources:
# Increase Concurrent Tasks to saturate I/O
# ConsumeKafka: 4-8 concurrent tasks
# PutS3Object:  2-4 (avoid S3 rate limiting)
# QueryDatabase: 1  (prevent DB overload)

Monitoring Metrics

  • FlowFiles Queued: ⚠ above 50k means backpressure risk. Action: increase downstream concurrency or add nodes.
  • Bytes Read/Written: ⚠ a sudden drop means source failure. Action: check processor logs and failure relationships.
  • Active Threads: ⚠ pinned at the maximum means processors are starved. Action: raise the Max Timer Driven Thread Count (Controller Settings in the UI).
  • GC Duration: ⚠ above 100 ms means heap pressure. Action: increase Xmx or reduce content repo objects.
  • Provenance Events/s: ⚠ a drop of more than 20% means a disk I/O issue. Action: move the provenance repo to a faster disk.
  • 5-min Bulletins: ⚠ any ERROR bulletins. Action: check the Bulletin Board in the NiFi UI for the root cause.
# Expose NiFi metrics to Prometheus: add a PrometheusReportingTask
# (Controller Settings → Reporting Tasks in the NiFi UI).
# It serves a /metrics endpoint, on port 9092 by default.

# Grafana visualization:
# https://grafana.com/grafana/dashboards/12398  (NiFi Operations Dashboard)
# Suggested Prometheus scrape interval: 15s
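A matching Prometheus scrape job might look like the fragment below; the target hostname is an assumption, and port 9092 is the reporting task's default.

```yaml
# prometheus.yml — scrape NiFi's metrics endpoint (sketch; adapt the target)
scrape_configs:
  - job_name: "nifi"
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["nifi-host:9092"]   # assumed hostname; 9092 is the task default
```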

Production Deployment Checklist

Use HTTPS only — disable nifi.web.http.port in production
Separate repository volumes: FlowFile, Content, Provenance on distinct disks
Set Provenance retention: max 30 days / 50 GB to prevent disk exhaustion
Enable NiFi Registry for flow version control and promotion between environments
Configure Sensitive Props Key in nifi.properties for encrypting passwords at rest
Use NiFi Parameter Contexts to externalize environment-specific config (avoid hardcoded values)
Enable Cluster Load Balancing connections to distribute FlowFiles across cluster nodes
Set up NiFi Registry with Git backend for CI/CD promoted flows
Tune backpressure thresholds per Connection — do not rely on global defaults only
Run load tests before going live: use GenerateFlowFile at expected throughput rate

11 When to Use Apache NiFi

NiFi is a powerful general-purpose data mover, but like any tool it has a natural fit. Use it where it excels — and be aware of the trade-offs for certain use cases.

Best Fit Use Cases

  • Database CDC & Replication

    Polling or CDC from relational sources into Kafka or a data lake. NiFi's incremental watermark tracking makes this reliable.

  • Multi-Source Data Ingestion

    Pulling from 20+ heterogeneous sources (APIs, files, DBs) into a unified landing zone without writing integration code.

  • IoT Data Ingestion

    Receiving MQTT / HTTP streams from IoT devices, normalizing, enriching with device metadata, and routing to HDFS + InfluxDB.

  • Log & Event Processing

    Collecting logs from servers via ListenSyslog or GetFile, parsing with ExtractText or LogAttribute, forwarding to Elasticsearch.

  • ETL Orchestration

    Complex multi-step flows with conditional routing, retry logic, schema conversion, and fan-out delivery — all visual and auditable.

Consider Alternatives When...

  • Sub-millisecond latency needed

    For ultra-low latency stream processing (complex event processing), Apache Flink or Kafka Streams are better choices. NiFi's per-FlowFile overhead adds ~1–10ms.

  • Heavy stateful aggregations

    Windowed aggregations, joins across streams, and stateful computations are better handled by Flink or ksqlDB, which have native state stores.

  • Pure batch job dependencies

    If your ETL is purely batch with DAG dependencies, dbt + Airflow is a more natural and testable fit than NiFi flow sequences.

  • Team with no DataFlow expertise

    NiFi has a learning curve. If the team is Python-native, tools like Airbyte or dlt may be faster to adopt for standard connector-based pipelines.

12 Conclusion

Apache NiFi has earned its place in the modern data engineering stack. It solves the right problems — data ingestion at scale, fault-tolerant routing, visual auditability, and guaranteed delivery — without requiring you to write and maintain thousands of lines of custom integration code.

For platform engineers, NiFi's operational profile is excellent: clear metrics, bulletin boards for errors, provenance for debugging, and systemd integration for host-level management. For data engineers, the visual canvas translates complex routing logic into reviewable, version-controlled flow definitions. Both audiences can collaborate on the same artifact.

  • Explore NiFi Clustering

    A 3-node NiFi cluster with embedded ZooKeeper provides horizontal scale and high availability. Load-balanced connections distribute FlowFiles automatically — no manual partitioning required.

  • Harden Security

    Add LDAP/OIDC user authentication, enable TLS end-to-end, integrate with Apache Ranger for policy-based access control, and use NiFi's built-in sensitive property encryption.

  • Adopt NiFi Registry

    NiFi Registry with a Git provider backend enables Git-based flow version control, environment promotion workflows (dev → staging → prod), and CI/CD pipeline integration for DataOps teams.