From Intern to SRE | Bhavya Tyagi

Three years ago, I was a college student at Thapar Institute submitting my first pull request at a real company. Today, I think in terms of SLOs, error budgets, and distributed system failure modes. Here's the story of that transformation, and the hard-won lessons that came with it.

The Beginning: When "It Works on My Machine" Was Good Enough

My first real coding experience beyond college assignments was at Fint, a fintech startup where I joined as a backend developer intern. I was handed a Node.js codebase, a MongoDB Atlas connection string, and a Slack channel. That was my onboarding.

At a startup, you learn fast because you have to. There's no senior engineer reviewing your code for three days. You push, you break things, you fix them at 11 PM, and you learn why input validation matters the hard way.

"The best way to learn how systems fail is to be the person who accidentally made them fail."

I built APIs, designed database schemas, and learned that the gap between "working code" and "production-ready code" is approximately the width of the Grand Canyon.

The Research Detour: When I Tried to Predict Space Weather

Alongside my startup work, I was doing a research internship at IIT Indore, building CNN models for solar wind prediction. This was a completely different world — instead of shipping features, I was reading papers, tuning hyperparameters, and learning that machine learning is 90% data cleaning and 10% feeling clever.

But this experience taught me something crucial that would serve me well in SRE: the importance of metrics and measurement. In ML, you live and die by your evaluation metrics. You can't just "feel" like your model is better — you need precision, recall, F1 scores. This mindset of rigorous measurement would become central to my SRE work later.
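To make those metrics concrete, here is a minimal sketch of how precision, recall, and F1 are computed for a binary classifier. The function name and the labels are made up for illustration; they are not from my research code.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical labels, purely for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.75 recall=0.75 f1=0.75
```

The point is less the formulas than the habit: a claim like "the model got better" only means something once it is tied to numbers like these.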

TigerGraph: Learning to Explain Technical Concepts

My stint as a Developer Advocate at TigerGraph through the GitHub Externship Program was unique. Instead of just writing code, I had to explain code. I built a video recommendation system and analytics dashboard, but the real challenge was making graph databases accessible to developers who'd never used one.

This taught me that communication is a technical skill. The ability to write clear documentation, explain system architecture in a design doc, or break down a complex incident during a postmortem — these are not soft skills. They're engineering skills.

LinkedIn: Where Everything Changed

When I joined LinkedIn's Infrastructure - Tools SRE team as an intern in the summer of 2022, I entered a different universe. This was a company serving hundreds of millions of users, where a single misconfiguration could cascade into a site-wide incident.

My project was building an internal tool to reduce operational toil. Sounds simple enough, right? But at LinkedIn's scale, "reducing toil" means understanding:

How dozens of microservices interact with each other
Why that one cron job from 2018 is actually keeping three systems alive
The difference between a 99.9% and a 99.99% availability target (hint: roughly 8.8 hours versus 53 minutes of allowed downtime per year, a gap of about 8 hours)
Why you should never deploy on a Friday
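The availability arithmetic from the list above is easy to check. This is a minimal sketch that assumes a 365-day year and treats the target as a simple fraction of total time; the function name is my own.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability_pct):
    """Hours per year a service may be down while still meeting the target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


three_nines = allowed_downtime_hours(99.9)   # about 8.76 hours
four_nines = allowed_downtime_hours(99.99)   # about 0.88 hours

print(f"99.9%:  {three_nines:.2f} hours/year")
print(f"99.99%: {four_nines * 60:.0f} minutes/year")
print(f"gap:    {three_nines - four_nines:.2f} hours/year")
```

That one extra nine shrinks the yearly downtime allowance from most of a working day to less than an hour, which is why each additional nine tends to cost far more engineering effort than the last.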

The Five Principles That Stuck

Looking back across all four of these experiences, here are the principles that fundamentally changed how I approach software engineering:

1. Everything Fails. Design for It.

At a startup, you build for the happy path. At scale, you build for the unhappy one. Networks partition. Disks fill up. That API you depend on? It will go down at 2 AM on a Saturday. The question isn't if, but when, and whether your system degrades gracefully or falls over like a house of cards.

The Retry Pattern (Python)

import logging
import time
from functools import wraps

# Module-level logger used by the decorator below.
logger = logging.getLogger(__name__)

def retry(max_attempts=3, backoff_factor=2):
    """Exponential backoff retry decorator."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    # Out of attempts: re-raise the last error to the caller.
                    if attempt == max_attempts - 1:
                        raise
                    # With the defaults, wait 1s, then 2s, between attempts.
                    wait = backoff_factor ** attempt
                    logger.warning(
                        f"Attempt {attempt+1} failed: {e}. "
                        f"Retrying in {wait}s..."
                    )
                    time.sleep(wait)
        return wrapper
    return decorator
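Retries help with transient failures, but they don't cover the "degrades gracefully" part when a dependency stays down. Here is a minimal sketch of one common approach, falling back to a safe default. The names `with_fallback` and `fetch_recommendations` are hypothetical, and a real system would usually catch narrower exception types.

```python
import logging

logger = logging.getLogger(__name__)


def with_fallback(primary, fallback_value):
    """Call primary(); if it raises, log the error and return fallback_value."""
    try:
        return primary()
    except Exception as exc:
        logger.warning("Primary call failed (%s); serving fallback.", exc)
        return fallback_value


def fetch_recommendations():
    # Hypothetical dependency that is currently unavailable.
    raise ConnectionError("recommendation service unavailable")


# Serve an empty list instead of failing the whole request.
result = with_fallback(fetch_recommendations, fallback_value=[])
print(result)  # prints: []
```

The user sees a page without recommendations rather than an error page, which is usually the better trade.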

2. Observability Is Not Optional

You can't fix what you can't see. Before LinkedIn, I thought "logging" meant sprinkling print() statements around. Now I think in terms of the three pillars: metrics, logs, and traces. If your system doesn't have all three, you're flying blind.
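As a rough illustration of the three pillars, here is a minimal sketch using only the Python standard library. The in-memory `metrics` counter and `spans` list are stand-ins I made up for real metrics and tracing backends, and the service name is hypothetical.

```python
import logging
import time
import uuid
from collections import Counter
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("checkout")  # hypothetical service name

metrics = Counter()  # metrics: stand-in for a real metrics backend
spans = []           # traces: stand-in for a real tracing backend


@contextmanager
def span(name, trace_id):
    """Record how long a named step took, tagged with the request's trace ID."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_request():
    trace_id = uuid.uuid4().hex          # ties the log line and the span together
    metrics["requests_total"] += 1       # metric: how many requests we served
    with span("handle_request", trace_id):
        logger.info("request handled trace_id=%s", trace_id)  # log: what happened
    return trace_id


trace_id = handle_request()
print(metrics["requests_total"], spans[-1]["name"])
```

The shared trace ID is the important detail: it lets you jump from a spike in a metric to the matching log lines and timing data for a single request.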

3. Automate the Toil, Keep the Thinking

Toil is any work that is manual, repetitive, automatable, and devoid of lasting value. Every minute spent on toil is a minute not spent on engineering. My entire project at LinkedIn was about this: taking repetitive operational tasks and turning them into self-service tools. The goal isn't to remove humans from the loop — it's to free them up for the work that actually requires human judgment.

4. Incidents Are Learning Opportunities

The blameless postmortem culture at LinkedIn was eye-opening. When something breaks, the question is never "whose fault is it?" but "what can we learn from this?" Every incident is a gift — it's your system telling you where it's weak. The best engineering teams I've seen are the ones that treat failure as data.

5. Simple Beats Clever, Every Single Time

In college, I wanted to write the most elegant, abstract, design-pattern-heavy code possible. In production, I've learned that the best code is the code your on-call engineer can understand at 3 AM with one eye open. Boring technology is beautiful technology.

"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."
— Brian Kernighan

What's Next

The journey from intern to engineer is not about learning specific technologies — those change every few years anyway. It's about developing an engineering mindset: thinking in systems, measuring what matters, communicating clearly, and always, always planning for failure.

I'm still early in my career, and I know the lessons ahead will be just as transformative as the ones behind. But if there's one thing I'd tell my past self, it's this: don't be afraid to break things. That's how you learn how they work.

If this resonated with you, feel free to connect with me on LinkedIn. I'd love to hear about your own journey.
