When Notifications Went Rogue in Bengaluru
A Kafka debugging tale from the backend trenches - #1
🌆 The Calm Before the Chaos
It was just another chaotic evening in Bengaluru. Dhruv, a backend engineer, swiped out of his office on Brigade Road, dreaming of dinner and Netflix. But as every engineer knows, peace is a myth.
“Why did I even get into software?”
MONNEEYYY 💰, his brain replied.
Just as his cab arrived (miraculously not cancelled), his phone buzzed. It was Rohan.
“Check Slack. Support ticket group. Urgent.”
Dhruv sighed. The cab hadn’t even crossed MG Road.
🧨 The Notification Nightmare
Users were getting duplicate notifications - dozens of them. Something was clearly broken in their Kafka-powered notification system.
Dhruv asked the cabbie to turn around.
“Do you use Apache Kafka?” the cab driver asked.
Dhruv blinked. “Yes… how do you know?”
“I’m a software engineer too. Part-time cabbie. I love open-source. But you gotta vet those libraries.”
#peak_bengaluru, Dhruv thought. He made a mental note to refer the guy.
🧪 The Bug Bash
Back in the office, Dhruv and Rohan dove into the logs. Arjun, the ever-curious intern, hovered nearby.
“Okay,” Dhruv said. “Let’s break this down. Three suspects:
The service pushing events
Kafka broker itself
The consumer”
They ruled out the producer, as logs showed no duplicates. Kafka broker? Healthy. No replays. No partition rebalancing drama.
That left the consumer.
“Wait,” Arjun asked, “what is Kafka?”
Rohan grinned.
“Imagine Kafka as a conveyor belt. Producers drop messages. Consumers pick them up. Kafka keeps the belt moving.”
“So our consumer is like a sleepy worker picking the same box again and again?”
“Exactly.”
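For readers in Arjun's shoes, here is roughly what that conveyor belt looks like in code. This is a minimal sketch using kafkajs purely for illustration; the broker address, topic, and group ID are made-up placeholders, not our actual service.

```typescript
import { Kafka } from 'kafkajs';

// Placeholder broker, topic, and group ID: illustration only.
const kafka = new Kafka({ clientId: 'demo-app', brokers: ['localhost:9092'] });

async function main() {
  // Producer: drops messages onto the belt (a topic).
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: 'notifications',
    messages: [{ key: 'user-42', value: JSON.stringify({ type: 'order_shipped' }) }],
  });

  // Consumer: picks messages off the belt. Kafka remembers how far it has
  // read as an "offset" per partition, per consumer group.
  const consumer = kafka.consumer({ groupId: 'notification-senders' });
  await consumer.connect();
  await consumer.subscribe({ topics: ['notifications'], fromBeginning: false });
  await consumer.run({
    eachMessage: async ({ partition, message }) => {
      console.log(`partition ${partition}, offset ${message.offset}:`, message.value?.toString());
    },
  });
}

main().catch(console.error);
```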
They examined the consumer logic. It fetched messages in batches and committed the last offset after processing. But the logs showed that some messages from a few batches were being reprocessed.
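Their handler followed roughly the batch-then-commit pattern below. This is a reconstruction in kafkajs, not the broken library's actual API, and sendNotification is a hypothetical stand-in for the real push logic. The comment near the bottom marks where duplicates sneak in.

```typescript
import type { Consumer } from 'kafkajs';

// Hypothetical sender; the real one fans out to push/SMS/email providers.
async function sendNotification(message: { value: Buffer | null }): Promise<void> {
  console.log('sending', message.value?.toString());
}

// Batch-then-commit, wired onto a consumer like the one in the sketch above.
async function runBatchConsumer(consumer: Consumer): Promise<void> {
  await consumer.run({
    eachBatchAutoResolve: false,
    eachBatch: async ({ batch, resolveOffset, heartbeat, commitOffsetsIfNecessary }) => {
      for (const message of batch.messages) {
        await sendNotification(message); // side effect happens first...
        resolveOffset(message.offset);   // ...then the offset is marked processed locally
        await heartbeat();               // keep the group coordinator from evicting us
      }
      // Offsets only reach the broker here, after the whole batch. If this
      // commit fails, or the process dies before it runs, the broker still
      // thinks nothing was consumed and redelivers the entire batch:
      // users get their notifications all over again.
      await commitOffsetsIfNecessary();
    },
  });
}
```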
“Why would that happen?” Dhruv asked.
They listed their hunches on the whiteboard:
Async offset commits: The new library might be committing offsets asynchronously, and failed commits aren't retried (see the sketch after this list).
Exceptions during processing: If something crashes mid-batch, the offset commit might be skipped.
Multiple consumers with the same group ID: “If someone accidentally deployed another instance with the same group ID, we could be double-processing.”
Offset commit timeout: “Maybe the broker is taking too long to acknowledge the commit, and the consumer retries the batch.”
Improper error handling: “What if the error handler swallows exceptions and retries silently?”
Library bug: “And of course, the new library might just be broken. Which… it is.”
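To make the first hunch concrete, the defensive version they were comparing against looks roughly like this: disable background auto-commit and await the commit yourself, so a failed commit throws instead of being quietly dropped. Again a kafkajs sketch under those assumptions, not the new library's API; the handler is passed in as a parameter standing in for the hypothetical sendNotification above.

```typescript
import type { Consumer } from 'kafkajs';

// Explicit, awaited commits: a failed commit surfaces as an error instead of
// being silently dropped by a background task.
async function runWithExplicitCommits(
  consumer: Consumer,
  send: (value: Buffer | null) => Promise<void>,
): Promise<void> {
  await consumer.run({
    autoCommit: false, // no periodic background commits from the client
    eachMessage: async ({ topic, partition, message }) => {
      await send(message.value);
      // Commit the *next* offset synchronously. If the broker rejects or
      // times out the commit, this rejects loudly rather than letting the
      // consumer quietly fall behind and reprocess later.
      await consumer.commitOffsets([
        { topic, partition, offset: (Number(message.offset) + 1).toString() },
      ]);
    },
  });
}
```

Even then, Kafka only promises at-least-once delivery out of the box, so a notification sender should ideally also dedupe on a stable event ID.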
“Who worked on this service originally?” Dhruv asked.
“Let’s call them up.”
“Can’t. He left the company.”
“Okay… what about the tech lead?”
“Moved to Canada.”
“QA who tested this?”
“She’s on leave this week.”
Dhruv sighed. “Perfect. Please tell me we at least have documentation, right? Right??”
Rohan turned his laptop around. His wallpaper was a meme.
“Oh boy, we don’t have documentation,” Dhruv said, disappointed.
“Let’s check when this started,” Rohan said.
Rohan recalled something.
Cue flashback.
🔙 Flashback: The Temptation
Two weeks ago…
“This new Kafka library supports TypeScript and promises better throughput,” Rohan had said.
“Looks promising,” Dhruv replied. “Let’s try it in staging first.”
They did. And it worked. Or so they thought.
🧟 The GitHub Repo of Doom
Back to the present. They opened the GitHub repo of the new library.
Silence.
Then:
“Oh no…” Arjun whispered.
The repo was a graveyard of open issues. Most related to offset commits. None resolved.
They also looked at the offset commit logic, and it didn’t make any sense to them.
“We need to revert,” Dhruv said.
“Agreed. The old library may be boring, but it works.”
🧩 But Wait…
As they prepped the rollback, Arjun raised a sharp question:
“If we deployed this two weeks ago, why are we seeing issues only now?”
Dhruv nodded. “Good catch. It could be the slow, strategic rollout the PMs recommended. Maybe the system only started handling large volumes today.”
“So, the bug only shows up under pressure?”
“Exactly. Our staging environment barely gets any traffic. This issue needed real-world chaos to reveal itself.”
✅ Rollback & Redemption
They reverted the library, redeployed, and monitored the service for two hours.
No duplicate notifications.
Just peace. And big slices of pizza.
“Nice work, team,” Dhruv smiled. “And Arjun, great question back there.”
“Thanks! Also… can I push a PR to update the README with this issue?”
“Please do. Let’s save the next team from this horror.”
☕ The Aftermath
The next morning, Dhruv opened his to-do list on his laptop.
Refer cab driver – ✅
Negotiate next year’s rent – 📝 (to do)
Delegate work to interns – 📝 (to do)
He checked his email.
First mail: Manager praising the team for the quick resolution of last night’s issue.
Second mail: HR.
“Hi Dhruv, due to financial constraints, we will have to let you go…”
Dhruv sighed. He opened four tabs in his browser:
LinkedIn. GitHub. LeetCode. Crunchyroll.
🎬THE END
🎯 Takeaways
Always vet your open-source libraries.
Kafka is powerful, but unforgiving.
Interns ask the best questions.
Bengaluru cab drivers might just debug your system.
Last but not least: it’s better to have other sources of income, because you never know when your company stops being your “family”. 😁