Processing 2.7 Million Records Overnight: A Data Pipeline Case Study
How I designed and operated a data pipeline that processed 2.7 million lead records — deduplication, cleaning, timezone routing, and CRM ingestion — in a single overnight run.
Haroon Mohamed
AI Automation & Lead Generation
The scale problem
When you're running 13 client campaigns simultaneously, data management becomes a different problem than most CRM tutorials prepare you for.
We weren't dealing with hundreds of leads. We were dealing with millions. Across multiple client databases, multiple lead vendors, multiple states, multiple timezone configurations. All of it needing to be cleaned, deduplicated, enriched, and correctly ingested before campaigns could run.
The manual alternative? Days of work. Prone to errors. Not repeatable.
Here's how I built a pipeline that handled 2.7 million records overnight.
The original problem: a data mess
Before the pipeline existed, here's what the data management process looked like:
- Lead vendor sends a CSV dump (anywhere from 20k to 200k records)
- Someone downloads it, opens in Excel
- Manually removes obvious duplicates (by eye)
- Reformats phone numbers (no standardisation from vendors)
- Guesses which timezone each record belongs to (based on area code, sometimes)
- Manually imports into GoHighLevel in batches of 1,000 (GHL import limit)
- Repeat for each vendor, each campaign, each client
For 13 clients and 3–5 lead vendors each, this was a full-time job for 2 people running 6–8 hours/day.
The quality was also inconsistent. Human error was everywhere. Leads got assigned to the wrong campaign. Timezone misassignments meant closers were calling leads at 5am or 10pm their local time — burning contacts and violating compliance requirements.
The pipeline design
Stage 1: Ingestion layer
All lead files from all vendors were deposited into a shared input folder (we used a simple network drive, but S3 would be the cloud equivalent).
An ingestion script ran on a nightly schedule:
- Detected new files in the input folder
- Determined file format (CSV, XLSX, JSON — vendors use different formats)
- Normalised all formats into a consistent internal schema
- Logged each file's ingestion: source vendor, row count, timestamp
Our internal schema had standardised fields:
first_name, last_name, email, phone_e164, address, city, state, zip, timezone, source_vendor, source_campaign, lead_date, raw_row_id
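To make the normalisation concrete, here is a minimal sketch of the format dispatch and column mapping in Python, assuming pandas; the vendor column names are invented for illustration rather than taken from a real vendor export:

```python
import pandas as pd
from pathlib import Path

# Illustrative per-vendor column mappings; the real mappings lived in config.
VENDOR_COLUMN_MAPS = {
    "vendor_a": {
        "FirstName": "first_name", "LastName": "last_name",
        "Email": "email", "Phone": "phone_raw",
        "City": "city", "State": "state", "Zip": "zip",
    },
}

def load_file(path: Path) -> pd.DataFrame:
    """Read CSV, XLSX, or JSON into a DataFrame based on the file extension."""
    suffix = path.suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path, dtype=str)
    if suffix in (".xlsx", ".xls"):
        return pd.read_excel(path, dtype=str)
    if suffix == ".json":
        return pd.read_json(path)
    raise ValueError(f"Unsupported file format: {path}")

def normalise(path: Path, vendor: str) -> pd.DataFrame:
    """Map vendor columns onto the internal schema and tag provenance."""
    df = load_file(path).rename(columns=VENDOR_COLUMN_MAPS[vendor])
    df["source_vendor"] = vendor
    df["raw_row_id"] = [f"{path.name}:{i}" for i in range(len(df))]
    return df
```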
The phone_e164 normalisation step alone eliminated a huge class of downstream problems. Every phone number was converted to E.164 format (+1XXXXXXXXXX for US numbers) using a phone parsing library. Numbers that couldn't be parsed were flagged as invalid and excluded from the main pipeline.
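The E.164 conversion itself is only a few lines. This sketch assumes the phonenumbers library (the pipeline used a phone parsing library; the specific library here is an assumption) with a US default region, returning None for anything that can't be parsed or validated so it can be flagged:

```python
import phonenumbers

def to_e164(raw: str, default_region: str = "US") -> str | None:
    """Return the number in E.164 format, or None if it can't be parsed or isn't valid."""
    try:
        parsed = phonenumbers.parse(raw, default_region)
    except phonenumbers.NumberParseException:
        return None
    if not phonenumbers.is_valid_number(parsed):
        return None
    return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)

# to_e164("(312) 555-0175") -> "+13125550175"
# to_e164("not a number")   -> None, flagged invalid and excluded from the main pipeline
```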
Stage 2: Deduplication
This is where the real complexity lives.
We had deduplication at three levels:
Level 1: Exact deduplication
Same email address = same person. We used email as the primary unique identifier and deduplicated across all records in the current batch AND against historical records already in the system.
Level 2: Phone deduplication
After email, phone number (E.164 formatted) was the next check. A lead with a different email but the same phone number as an existing record was flagged as a probable duplicate.
Level 3: Fuzzy name + address matching
For leads without email addresses (common with cold data vendors), we used fuzzy matching on first name + last name + address. A Jaro-Winkler similarity score above 0.85 was treated as a probable duplicate, held for review rather than automatically deduplicated.
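Here is a sketch of the three-level check applied to a single incoming record. The Jaro-Winkler metric and the 0.85 threshold are the ones described above; the jellyfish library and the function shape are assumptions for illustration:

```python
import jellyfish

JW_THRESHOLD = 0.85

def classify(record: dict, seen_emails: set, seen_phones: set,
             existing_name_addr_keys: list) -> str:
    """Return 'duplicate', 'review', or 'new' for a single incoming record."""
    email = (record.get("email") or "").strip().lower()
    phone = record.get("phone_e164") or ""

    # Level 1: exact email match against the current batch and historical records
    if email and email in seen_emails:
        return "duplicate"

    # Level 2: same E.164 phone under a different email
    if phone and phone in seen_phones:
        return "duplicate"

    # Level 3: fuzzy first name + last name + address, held for review
    key = " ".join([
        record.get("first_name", ""), record.get("last_name", ""),
        record.get("address", ""),
    ]).lower()
    for existing_key in existing_name_addr_keys:
        if jellyfish.jaro_winkler_similarity(key, existing_key) > JW_THRESHOLD:
            return "review"

    return "new"
```

The naive loop in level 3 is only there to show the logic; at this scale the fuzzy comparison has to be blocked (for example, only comparing records that share a zip code) rather than run pairwise across millions of records.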
Results from the 2.7M record run:
- Exact email duplicates: 340,000 (12.6%)
- Phone duplicates (different email): 88,000 (3.3%)
- Fuzzy duplicates (held for review): 42,000 (1.6%)
- Net records after deduplication: 2,230,000
Stage 3: Data cleaning
Cleaning addressed the most common quality problems in lead vendor data:
Phone number validation:
- Remove leads with numbers flagged as invalid by the parser
- Run against a carrier lookup to identify mobile vs. landline (we only called mobile)
- Exclude known non-working numbers (carrier lookup returns "disconnected")
Address cleaning:
- USPS address standardisation for US leads
- Zip code verification (flag leads where city doesn't match zip)
Email validation (a code sketch follows these lists):
- Basic syntax check (regex)
- MX record lookup to verify domain has a mail server
- Optional: proprietary email validation API for deliverability scoring
Name cleaning:
- Strip all-caps formatting ("JOHN SMITH" → "John Smith")
- Remove leading/trailing whitespace
- Flag obviously fake names (test, asdfg, etc.) for review
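Here is the email check in sketch form, assuming dnspython for the MX lookup; the regex is deliberately loose, and the deliverability-scoring API is omitted:

```python
import re
import dns.resolver   # dnspython
import dns.exception

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def email_passes(email: str) -> bool:
    """Basic syntax check, then an MX lookup to confirm the domain can receive mail."""
    email = email.strip().lower()
    if not EMAIL_RE.match(email):
        return False
    domain = email.rsplit("@", 1)[1]
    try:
        answers = dns.resolver.resolve(domain, "MX")
    except dns.exception.DNSException:
        return False
    return len(answers) > 0
```

In practice the MX result gets cached per domain; consumer lead data concentrates on a handful of mail providers, so the number of lookups is far smaller than the number of records.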
After cleaning, we had approximately 2.1M records that passed all validation checks.
Stage 4: Timezone routing
For US leads, timezone assignment was based on:
- State (primary signal)
- Area code (secondary signal for states that span multiple timezones)
- City/zip code (for edge cases — cities on state borders)
We mapped every US area code to a timezone. For states that split across timezone boundaries (Indiana, Kentucky, the Texas borderlands), the area code resolved the ambiguity in nearly every case, with the city/zip check covering the remaining edge cases.
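A sketch of the lookup order, with a deliberately tiny slice of each map (the real table covered every US area code, and the specific timezone names here are illustrative):

```python
# State-level defaults for states that sit entirely in one timezone (tiny slice).
STATE_TZ = {
    "CA": "America/Los_Angeles",
    "NY": "America/New_York",
    "AZ": "America/Phoenix",
}

# Area-code overrides for split states (tiny slice; the real map covered all area codes).
AREA_CODE_TZ = {
    "317": "America/New_York",   # Indianapolis, IN (Eastern)
    "502": "America/New_York",   # Louisville, KY (Eastern)
    "214": "America/Chicago",    # Dallas, TX (Central)
    "915": "America/Denver",     # El Paso, TX (Mountain)
}

SPLIT_STATES = {"IN", "KY", "TX"}  # plus the other states that span a boundary

def assign_timezone(state: str, phone_e164: str) -> str | None:
    """State first; area code for split states; None falls through to the city/zip check."""
    area_code = phone_e164[2:5] if phone_e164.startswith("+1") else ""
    if state.upper() in SPLIT_STATES:
        return AREA_CODE_TZ.get(area_code)
    return STATE_TZ.get(state.upper())
```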
Why does this matter? We were running campaigns with calling windows, and the window has to be evaluated in the lead's local time, not the closer's. A closer in the Eastern timezone calling a California lead at 6pm their own time is fine, because that's 3pm Pacific. The same call placed at 10am Eastern lands at 7am Pacific, outside the compliant window. The system needs to know the difference.
With correct timezone data on every record, we could set calling windows correctly and ensure no leads were called outside TCPA-compliant hours in their local timezone.
Stage 5: Campaign assignment
Each record was routed to the correct client campaign based on:
- State (solar campaigns are state-specific due to incentive programs)
- Lead source (Facebook leads vs. vendor leads sometimes went to different campaigns)
- Homeowner flag (if present in the data)
- Age of the lead (fresh leads went to immediate-contact campaigns; older leads went to re-engagement)
Routing rules were stored as a configuration file rather than hardcoded in the pipeline. This meant we could update routing rules (e.g., a client pauses their Arizona campaign) without touching pipeline code.
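A sketch of what config-driven routing can look like; the rule fields and campaign names are invented for illustration:

```python
import json

# routing_rules.json (illustrative shape): first matching rule wins.
RULES_JSON = """
[
  {"campaign": "client_a_az_fresh",    "states": ["AZ"], "max_lead_age_days": 7},
  {"campaign": "client_a_az_reengage", "states": ["AZ"], "max_lead_age_days": 365},
  {"campaign": "client_b_tx_fb",       "states": ["TX"], "sources": ["facebook"], "max_lead_age_days": 30}
]
"""

def route(record: dict, rules: list) -> str | None:
    """Return the first campaign whose rule matches, or None for the holding queue."""
    for rule in rules:
        if record["state"] not in rule["states"]:
            continue
        if "sources" in rule and record["source_vendor"] not in rule["sources"]:
            continue
        if record["lead_age_days"] > rule["max_lead_age_days"]:
            continue
        return rule["campaign"]
    return None

rules = json.loads(RULES_JSON)
# route({"state": "AZ", "source_vendor": "vendor_a", "lead_age_days": 2}, rules)
#   -> "client_a_az_fresh"
```

Pausing a client's Arizona campaign is then an edit to the rules file, not a code change.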
Stage 6: CRM ingestion
GoHighLevel has an API for bulk contact creation, but the practical limit for reliable ingestion is around 1,000 records per minute before you start hitting rate limit errors.
For 2.1M records across 13 client sub-accounts, this required the following (a code sketch of the pattern follows the list):
- Batched API calls (500 records per batch)
- Rate limiting logic (respect GHL's API limits)
- Retry logic with exponential backoff for failed batches
- Progress checkpointing (so if the job failed mid-run, it could resume without re-ingesting records already processed)
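In sketch form, the pattern looks like this. The endpoint URL and payload shape are placeholders rather than GoHighLevel's documented API; the point is the batching, backoff, and checkpoint structure:

```python
import json
import time
import requests

BATCH_SIZE = 500
CHECKPOINT_FILE = "ingest_checkpoint.json"

def load_checkpoint() -> int:
    """Resume point: index of the first batch not yet confirmed written."""
    try:
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["next_batch"]
    except FileNotFoundError:
        return 0

def save_checkpoint(next_batch: int) -> None:
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"next_batch": next_batch}, f)

def push_batch(batch: list, api_url: str, api_key: str) -> None:
    """POST one batch, with exponential backoff on 429 and 5xx responses."""
    delay = 2.0
    for _ in range(6):
        resp = requests.post(api_url, json={"contacts": batch},
                             headers={"Authorization": f"Bearer {api_key}"},
                             timeout=30)
        if resp.status_code < 400:
            return
        if resp.status_code == 429 or resp.status_code >= 500:
            time.sleep(delay)
            delay *= 2           # exponential backoff before the next retry
            continue
        resp.raise_for_status()  # other 4xx errors: don't retry
    raise RuntimeError("batch failed after retries")

def ingest(records: list, api_url: str, api_key: str) -> None:
    """Batch, push, and checkpoint so a failed run can resume where it stopped."""
    batches = [records[i:i + BATCH_SIZE] for i in range(0, len(records), BATCH_SIZE)]
    for i in range(load_checkpoint(), len(batches)):
        push_batch(batches[i], api_url, api_key)
        save_checkpoint(i + 1)
        time.sleep(0.5)  # crude pacing to stay under the per-minute limit
```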
Each contact was tagged with:
- Source vendor
- Source campaign
- Lead date
- Timezone
- Qualification status (uncontacted, initially)
Total ingestion time for 2.1M records across 13 sub-accounts: 7 hours.
Error handling and monitoring
Any pipeline processing millions of records will have errors. The question is how you handle them.
We logged every error with enough context to diagnose it:
- Failed records were written to a separate rejection log with the reason for rejection
- Processing progress was logged every 10,000 records
- A summary report was generated at completion: total processed, total rejected, total per client, error rate
At completion, an automated Slack notification fired with the summary. Any error rate above 2% triggered an alert for manual review.
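The notification itself is a plain Slack incoming-webhook post; this sketch uses a placeholder webhook URL and wires in the 2% threshold:

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
ALERT_THRESHOLD = 0.02  # error rates above 2% get flagged for manual review

def send_summary(total: int, rejected: int, per_client: dict) -> None:
    """Post the end-of-run summary to Slack, flagging high error rates."""
    error_rate = rejected / total if total else 0.0
    lines = [
        f"Pipeline run complete: {total:,} processed, {rejected:,} rejected "
        f"({error_rate:.1%} error rate)",
    ]
    lines += [f"• {client}: {count:,}" for client, count in sorted(per_client.items())]
    if error_rate > ALERT_THRESHOLD:
        lines.append(":warning: Error rate above 2%: manual review required")
    requests.post(SLACK_WEBHOOK_URL, json={"text": "\n".join(lines)}, timeout=10)
```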
Performance and scale
On the 2.7M record run:
- Ingestion + normalisation: 40 minutes
- Deduplication: 2.5 hours (the most computationally expensive step)
- Cleaning and timezone routing: 1.5 hours
- CRM ingestion: 7 hours
- Total end-to-end: ~11 hours (ran overnight)
The bottleneck was GoHighLevel's API. The data processing itself was fast — the constraint was how quickly we could write to the CRM.
For comparison, doing this manually would have taken 2–3 full working days minimum, with significantly more errors.
What I'd change if building this today
If I were building this pipeline from scratch today, I'd make a few changes:
- Supabase as the intermediate store: instead of writing directly to GHL from the pipeline, stage everything in Supabase first. This gives you a queryable copy of all your leads, better analytics, and the ability to re-ingest if something goes wrong without reprocessing the source files.
- Streaming ingestion: the nightly batch model works, but real-time vendor delivery (via posting URL) would get leads into campaigns faster.
- Better fuzzy deduplication: the fuzzy name matching was the weakest part of the pipeline. A proper similarity model would reduce false negatives.
- Automated rejection review: flagged records currently require manual review. A lightweight UI for reviewing and approving fuzzy duplicates would speed up the QA step.
The meta-lesson about data quality
The pipeline itself was weeks of work. But the ROI was immediate and ongoing.
Before: 2 people spending 6–8 hours/day on manual data management. Error-prone. Non-repeatable. Impossible to scale.
After: Fully automated. Consistent output. 2 people freed up for work that actually required human judgment.
The lesson applies everywhere: the bottlenecks in your business that feel like they require human attention usually just require proper engineering.
Questions about building something similar? Get in touch.
Haroon Mohamed
Full-stack automation, AI, and lead generation specialist. 2+ years running 13+ concurrent client campaigns using GoHighLevel, multiple AI voice providers, Zapier, APIs, and custom data pipelines. Founder of HMX Zone.