<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Aniruddha]]></title><description><![CDATA[Aniruddha]]></description><link>https://i0exception.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!lluD!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26394448-1a6b-4c4d-b5ba-31f799cbc55e_400x400.jpeg</url><title>Aniruddha</title><link>https://i0exception.substack.com</link></image><generator>Substack</generator><lastBuildDate>Mon, 06 Apr 2026 07:02:18 GMT</lastBuildDate><atom:link href="https://i0exception.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Aniruddha]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[i0exception@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[i0exception@substack.com]]></itunes:email><itunes:name><![CDATA[Aniruddha]]></itunes:name></itunes:owner><itunes:author><![CDATA[Aniruddha]]></itunes:author><googleplay:owner><![CDATA[i0exception@substack.com]]></googleplay:owner><googleplay:email><![CDATA[i0exception@substack.com]]></googleplay:email><googleplay:author><![CDATA[Aniruddha]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Rethinking Code Reviews]]></title><description><![CDATA[Engineering teams are drowning in PRs and nobody wants to say it out loud.]]></description><link>https://i0exception.substack.com/p/rethinking-code-reviews</link><guid isPermaLink="false">https://i0exception.substack.com/p/rethinking-code-reviews</guid><dc:creator><![CDATA[Aniruddha]]></dc:creator><pubDate>Sun, 05 Apr 2026 20:57:10 GMT</pubDate><enclosure 
url="https://substack-post-media.s3.amazonaws.com/public/images/ab6d0716-78a0-472a-8d8f-bb2b7eeff342_610x317.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>Engineering teams are drowning in PRs and nobody wants to say it out loud.</p><p>At Pocus, as we adopted more AI coding tools, code reviews became the bottleneck. I had some time over the weekend to dig into why we even have code reviews, which led me to <a href="https://graphite.com/blog/the-ancient-origins-of-code-review">this post</a> by the <a href="http://graphite.dev">Graphite</a> team. Long story short, code reviews started as a way to catch bugs, which were expensive to fix. As tests took over that job, reviews quietly reinvented themselves around readability and coherence. That shift made sense: source code was the highest level of representation available, and keeping it readable was the best way to keep a codebase maintainable over time.</p><p>Coding agents have fundamentally changed how code gets written, and the review process hasn't kept up. Something that took days now takes minutes, and PR volume has gone up by an order of magnitude. Engineers used to carve out focused time to review carefully, which was feasible when frequency was manageable. Today, the same engineer has a dozen PRs waiting before lunch. This is not a discipline problem. It is a structural one.</p><h3>Abstractions</h3><p>The reason for this is simple. We have never reviewed the code that actually runs on the machine. When you push code to production, a compiler transforms it. A JIT runtime might recompile it again at execution time. The binary executing on the CPU looks nothing like what you wrote. Nobody asks to review the optimized assembly. We implicitly agreed that the right level to reason about code is the highest level of representation available.</p><p>For most of software history, that abstraction was source code. It no longer is. 
When most engineers open an unfamiliar codebase today, they ask Claude to explain what the relevant module does, how data flows, and what the abstractions are. English is now where intent is expressed and decisions are made. Code is increasingly the compiled artifact of that English, the same way assembly is the output of a compiler given source code. The same logic that pointed reviews toward source code now points them toward English. Reviewing below the highest level of representation is a broken process. We are doing it because we haven&#8217;t fully accepted what has changed.</p><h3>Pipeline</h3><p>Think about how the code generation pipeline has changed today.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_qTs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67aeff24-8118-4bc1-a654-6eccb31412aa_1440x890.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_qTs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67aeff24-8118-4bc1-a654-6eccb31412aa_1440x890.png 424w, https://substackcdn.com/image/fetch/$s_!_qTs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67aeff24-8118-4bc1-a654-6eccb31412aa_1440x890.png 848w, https://substackcdn.com/image/fetch/$s_!_qTs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67aeff24-8118-4bc1-a654-6eccb31412aa_1440x890.png 1272w, 
https://substackcdn.com/image/fetch/$s_!_qTs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67aeff24-8118-4bc1-a654-6eccb31412aa_1440x890.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_qTs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67aeff24-8118-4bc1-a654-6eccb31412aa_1440x890.png" width="602" height="372.06944444444446" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/67aeff24-8118-4bc1-a654-6eccb31412aa_1440x890.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:890,&quot;width&quot;:1440,&quot;resizeWidth&quot;:602,&quot;bytes&quot;:70740,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://i0exception.substack.com/i/193232313?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67aeff24-8118-4bc1-a654-6eccb31412aa_1440x890.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_qTs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67aeff24-8118-4bc1-a654-6eccb31412aa_1440x890.png 424w, https://substackcdn.com/image/fetch/$s_!_qTs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67aeff24-8118-4bc1-a654-6eccb31412aa_1440x890.png 848w, 
https://substackcdn.com/image/fetch/$s_!_qTs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67aeff24-8118-4bc1-a654-6eccb31412aa_1440x890.png 1272w, https://substackcdn.com/image/fetch/$s_!_qTs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67aeff24-8118-4bc1-a654-6eccb31412aa_1440x890.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>When humans wrote the plan and the code, it made sense for humans to review both. 
Reviewing the code in the new world is like reviewing the assembly in the old one. It is the wrong level.</p><p>The engineer&#8217;s job is to own the plan. That means spending real time in plan mode, working with the agent to understand the problem space, identify the right abstractions, and surface the business context no tool can fully internalize. Most engineers doing this well already stay in plan mode much longer before letting the agent write any code. Once the plan exists, let the agent iterate without interruption. At Pocus, the difference in quality between code reviewed after one agent pass versus several was consistently significant.</p><p>This also means accepting that agent-generated code won&#8217;t always be as clean as what a skilled engineer would write by hand, similar to how compilers sacrifice readability for performance. That&#8217;s manageable. Refactoring is easy with high test coverage and a written plan. Rewriting in a new language becomes tractable when the plan exists independently of the implementation.</p><h3>Future</h3><p>My strong belief is that the code review process needs to be rethought from the ground up. Not just to deal with the influx of code, but to open up contribution to people who aren&#8217;t traditionally software engineers. Today, making a change to a codebase requires knowing the language, the conventions, the patterns. In a plan-first world, an engineer who understands the logic and knows language X can meaningfully contribute to a codebase written in language Y. The plan is language-agnostic. The agent handles the translation. This is a bigger shift than it sounds. It fundamentally changes who can contribute to a codebase and how teams are structured.</p><p>Plans need to become first-class artifacts. Not a one-liner PR description, but a structured document: which new classes we are adding, which existing tools we are reusing, which invariants we might be breaking, and so on. 
The back and forth happens on the plan, not the code. Code review gets scoped down to one question: did the agent implement what the plan said? The plan gets committed alongside the code. The plan shows the diff. The code is the current snapshot of the world.</p><h3>Today</h3><p>We're in a transitional period. We haven't fully arrived at the future, but we've definitely left the past behind. Things are moving fast enough that some of this might be outdated in a month. Here are some things I've found to be helpful.</p><p>Let agents loop and review their own code before you jump in. Looking earlier means reviewing something that&#8217;s about to change anyway. Iterate on your CLAUDE.md every day. Every time the agent produces something wrong, check in a note on how you want it to think about the codebase. You get to good output faster than you&#8217;d expect. <a href="https://graphite.com/docs/learn-to-stack">Stack your PRs</a>. Agents produce large, sprawling changes, and stacking forces the right level of decomposition.</p><p>Invest heavily in tests, fuzzy as well as deterministic. Non-determinism in code generation should be caught by tests, not human reviewers. Agents make high coverage easy to achieve, and tests are what make refactoring safe when agent output needs cleaning up.</p><p>Start breaking up the monorepo. Most large monorepos contain things that have no business being together. A data ingestion pipeline and internal warehouse models share almost no code, yet they often live in the same repo. Agents struggle with giant context. Clearer boundaries with explicit contracts make agent output dramatically better.</p><h3>Coda</h3><p>Try a thought experiment: imagine deleting your IDE. What would actually break? The answer tells you a lot about where you're still working the way you used to.</p><p>The engineers who figure this out earliest will have a real advantage. The instinct to read every diff and personally verify every change made sense in the old world. 
The new instinct is to stay close to the plan, trust the harness, and spend your judgment on what only you can provide. Things are chaotic and moving fast right now. Embrace the chaos. </p>]]></content:encoded></item><item><title><![CDATA[Reflections on Technical Tradeoffs]]></title><description><![CDATA[The tools, systems, and architectural bets behind Pocus]]></description><link>https://i0exception.substack.com/p/reflections-on-technical-tradeoffs</link><guid isPermaLink="false">https://i0exception.substack.com/p/reflections-on-technical-tradeoffs</guid><dc:creator><![CDATA[Aniruddha]]></dc:creator><pubDate>Mon, 30 Mar 2026 06:59:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lluD!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26394448-1a6b-4c4d-b5ba-31f799cbc55e_400x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A little under three years ago, I joined a tiny startup called Pocus, back when the product did a much narrower set of things than it does today. I had a front-row seat to that transformation, both organizationally and technically. Last week, we joined Apollo, and that journey now continues at a much larger scale.</p><p>I&#8217;ve had some time to reflect on the path here and on the decisions we made, both right and wrong. One of the most important parts of growth is not repeating the same mistakes and compounding on the good decisions. This felt like the right time to write some of them down before they get blurry. Many of these technical decisions shaped not just the product, but the team as well, because at some level, you do ship your org chart.</p><p>Pocus is not a typical CRUD SaaS application. 
We worked with large enterprise and mid-market tech companies like Canva, Asana, and many others that wanted to bring together huge volumes of fragmented go-to-market data and actually do something useful with it.</p><p>That meant ingesting data from CRMs (hello Salesforce), data warehouses, emails, call transcripts, product signals, event streams, and third-party datasets; matching and enriching that data; mapping it onto messy, customer-specific hierarchies; and then making it queryable and actionable inside the product. Customers wanted to build lists, define audiences, run workflows, monitor signals, and later use AI systems on top of both structured and unstructured data. In many cases, they were dealing with tens of millions of records and deeply imperfect data, and they still expected the product to feel interactive.</p><p>Then, at some point in the middle of all this, AI and LLMs became real product primitives. All of that messy data now had to work in a world where models could reason over it. So the product we have today looks completely different from the one that existed when I joined.</p><p>That&#8217;s the context for this post. A lot of the important technical decisions were not about building isolated systems in the abstract. They were about building a product that could support large-scale, messy, high-consequence data workloads with a relatively small engineering team.</p><p>A lot more went into building Pocus than the decisions in this post. Hiring, team design, execution, and the processes around building and operating the system mattered just as much, if not more. 
Each of those probably deserves its own separate post.</p><p>This post is only about the technical side: the architecture, data platform, infrastructure, orchestration, reliability, and AI system decisions that shaped how we built it.</p><p>Another thing that is worth calling out: a lot of these decisions might be relevant to other startups building agentic AI products over messy data while still working toward product-market fit. At the same time, many of them will not make sense if you are too early or too late. They may be wrong if you are still pre-PMF, and they may also be wrong if you already have strong PMF and are mostly dealing with scaling challenges. That matters because decisions do not work in isolation. Context matters a lot, and the right learning applied at the wrong stage can be actively harmful.</p><p>Also, somewhat obvious, but everything here reflects my own experience and opinions. Nothing in this post should be read as an official position or endorsement from my current employer, any former employer, or anyone else I&#8217;ve worked with.</p><p>I&#8217;ve graded the decisions using three viewpoints:</p><ul><li><p>&#128994; <strong>Repeat</strong> &#8212; I would do this again</p></li><li><p>&#128993; <strong>Revisit</strong> &#8212; This worked fine, but I&#8217;m not fully convinced it would be the right choice if I were doing something new</p></li><li><p>&#128308; <strong>Regret</strong> &#8212; I would not do this again</p></li></ul><p>I&#8217;ve grouped the technical decisions into three buckets:</p><ul><li><p><strong>Tools that helped us build and operate the platform</strong> &#8212; the things the engineering team used to move faster and run the system</p></li><li><p><strong>Core technologies behind the platform</strong> &#8212; the systems and infrastructure that actually stored, processed, and served the data</p></li><li><p><strong>System design and architecture decisions</strong> &#8212; the deeper bets around complexity, 
consistency, abstractions, and overall system shape</p></li></ul><p>I&#8217;ve tried to be as exhaustive as I can about the decisions I remember, though I&#8217;m sure I&#8217;ve missed a few! </p><h2>Tools</h2><h4>Building</h4><p>These were tools we used to build the system. Some of these already existed when I joined, while others were introduced by the team.</p><ul><li><p>&#128994; <strong>Statsig feature flags for rollouts</strong></p><ul><li><p>Probably the most generous free plan around if you&#8217;re using just feature flags.</p></li></ul></li><li><p>&#128994; <strong>DuckDB for misc debugging</strong></p></li><li><p>&#128994; <strong>Momentic over Playwright for webapp testing</strong></p><ul><li><p>Playwright tests were flaky, and no one really trusted them. We moved to Momentic about half a year ago, wrote most of our tests there, and finally got to a place where we could deploy with confidence.</p></li></ul></li><li><p>&#128994; <strong>GitHub Actions for CI/CD</strong></p><ul><li><p>It&#8217;s the most basic tool ever, but it was more than good enough to get started. I don&#8217;t think we were ever seriously worried about outgrowing it.</p></li></ul></li><li><p>&#128994; <strong>Namespace / Blacksmith instead of GitHub runners</strong></p><ul><li><p>GitHub runners are horrible. They&#8217;re not just slow; they&#8217;re also incredibly expensive. We moved to Blacksmith and eventually to Namespace. Both platforms were great.</p></li></ul></li><li><p>&#128994; <strong>Graphite instead of GitHub for reviews</strong></p><ul><li><p>GitHub hasn&#8217;t meaningfully improved the PR and code review process in years. Moving to Graphite, especially with stacked PRs, made a noticeable difference to our velocity. Stacking made AI-generated code easier to reason about. 
Later, we also adopted their merge queue, which was another clear improvement over the status quo.</p></li></ul></li><li><p>&#128994; <strong>BiomeJS instead of ESLint</strong></p><ul><li><p>This actually made it feasible for us to run lint as a pre-commit hook.</p></li></ul></li><li><p>&#128994; <strong>Giving engineers an AI budget and freedom to use tools</strong></p><ul><li><p>The agentic coding space is evolving so quickly that we avoided locking ourselves into any long-term contract. Instead, we gave engineers a reasonable AI budget and the freedom to use whatever worked best for them. Most people used Cursor or Claude. Some used Codex.</p></li></ul></li><li><p>&#128993; <strong>Tilt for local development</strong></p><ul><li><p>This worked great until coding agents came around. Supporting local dev environments across multiple parallel agentic coding tools is still an unsolved problem. Tilt was hard for agents to integrate with, which made it difficult for them to close the loop.</p></li></ul></li><li><p>&#128993; <strong>Automated code review</strong></p><ul><li><p>We tried quite a few tools here. We probably tested most of them before they were ready for primetime. Codex was especially good at catching hard-to-find logic bugs. Claude Code was great at reviewing its own code.</p></li></ul></li><li><p>&#128993; <strong>GitHub</strong></p><ul><li><p>Reliability is probably the biggest reason I&#8217;d consider moving off it.</p></li></ul></li><li><p>&#128993; <strong>Zapier for coordinating state between tools</strong></p><ul><li><p>In a world where writing code is fast and cheap, no-code solutions mostly add a layer of complexity you just don&#8217;t need. That said, Zapier had a huge number of integrations that simply worked. 
We looked at n8n, Gumloop, and others, but never got them working reliably.</p></li></ul></li><li><p>&#128308; <strong>Retool for one-off tools</strong></p><ul><li><p>In practice, it was often easier to just check in custom scripts or build one-off UIs directly into the product.</p></li></ul></li><li><p>&#128308; <strong>Kotlin + TypeScript as two backend languages</strong></p><ul><li><p>Going from one language to two was a mistake. The operational overhead of supporting multiple backend languages is real. I think agentic coding makes this somewhat less painful now, but we made this decision before Claude, and it was a massive pain.</p></li></ul></li></ul><h4>Monitoring</h4><ul><li><p>&#128994; <strong>Honeycomb for tracing</strong></p><ul><li><p>This was probably one of the most heavily used tools on the engineering team. Investing early in a strong tracing solution paid off. Many of our alerts and dashboards lived in Honeycomb. Moving from Tempo to Honeycomb was a very good decision.</p></li></ul></li><li><p>&#128994; <strong>LogRocket / Jam.dev for bug reporting</strong></p><ul><li><p>Highly recommend this. It&#8217;s a huge improvement over getting bug reports as written descriptions or videos.</p></li></ul></li><li><p>&#128994; <strong>incident.io for alerting and oncall</strong></p></li><li><p>&#128993; <strong>Sentry</strong></p><ul><li><p>Sentry was very helpful for frontend errors. We never really got enough out of it for backend error and exception monitoring. Maybe we were just using it wrong.</p></li></ul></li><li><p>&#128308; <strong>Grafana + Prometheus + Loki self-hosted</strong></p><ul><li><p>I would happily pay for the cloud version if I were doing this again.</p></li></ul></li><li><p>&#128308; <strong>Standalone product analytics tool</strong></p><ul><li><p>As someone who helped build one of these tools, this one hurts a bit. But not being able to combine all of our business data in one place made it hard to get enough value out of it. 
As the marginal cost of writing SQL kept dropping because of LLMs, most of our operational and exploratory analysis ended up happening in Snowflake instead of a standalone analytics tool.</p></li></ul></li><li><p>&#128308; <strong>Snowflake for internal reporting</strong></p><ul><li><p>This gets expensive very quickly. Then you end up spending a lot of time doing unnatural things just to save money. Snowflake is a great tool, but if you&#8217;re trying to use your budget as efficiently as possible, I&#8217;d probably start somewhere else.</p></li></ul></li></ul><h2>Core technologies behind the platform</h2><h4>Fullstack</h4><ul><li><p>&#128994; <strong>Prisma for database schemas</strong></p></li><li><p>&#128994; <strong>Vite over Next.js</strong></p><ul><li><p>We started on Next.js, but quickly realized we were barely using the features that justified the added complexity. Moving to Vite was a huge improvement.</p></li></ul></li><li><p>&#128994; <strong>Tailwind</strong></p><ul><li><p>I still remember someone on our team telling me at our first offsite how amazing Tailwind was. It took us a while to fully migrate, but it was absolutely the right decision.</p></li></ul></li><li><p>&#128993; <strong>Vendor for auth and RBAC</strong></p><ul><li><p>This is one of those type 1 decisions that&#8217;s very hard to undo later. Be very careful about which vendor you pick.</p></li></ul></li><li><p>&#128993; <strong>NestJS</strong></p><ul><li><p>Nest adds a layer of complexity and abstraction that I think we probably could have done without.</p></li></ul></li></ul><h4>Platform &amp; Infra</h4><ul><li><p>&#128994; <strong>DBT wherever possible</strong></p><ul><li><p>At first, we only used DBT to build models for internal use. Over time, we started using it in many more places, including flows that weren&#8217;t just about internal reporting. 
It&#8217;s a very simple idea, but it works extremely well.</p></li></ul></li><li><p>&#128994; <strong>Airbyte for connecting to SaaS tools</strong></p><ul><li><p>Over time, we leaned much more heavily on self-hosted Airbyte to move data around. Cloud pricing is pretty prohibitive, and this is one of those tools that is actually fairly straightforward to self-host.</p></li></ul></li><li><p>&#128994; <strong>Temporal for workflows and async work</strong></p><ul><li><p>Moving from Prefect to Temporal was one of those very high-leverage decisions that open up an entirely new way of building systems. The cloud version was too expensive for our use cases, but it was easy enough to self-host.</p></li></ul></li><li><p>&#128994; <strong>ClickHouse, then StarRocks, instead of ES or Postgres</strong></p><ul><li><p>A lot of people I&#8217;ve talked to lean toward Elasticsearch or Postgres with some form of columnar extension. Having built a columnar database at a previous job, I was very motivated not to build one again. ClickHouse, and later StarRocks, gave us a very solid foundation for the kinds of interactive queries we needed.</p></li></ul></li><li><p>&#128994; <strong>Postgres as source of truth for frequently updated data</strong></p><ul><li><p>Postgres is great when the workload is dominated by frequent updates. Over time, we moved more and more of that kind of data into Postgres.</p></li></ul></li><li><p>&#128994; <strong>S3 as source of truth for unstructured and semi-structured data</strong></p></li><li><p>&#128994; <strong>Athena for async data processing</strong></p><ul><li><p>Some of our Athena queries ran for many minutes. The 60-minute timeout still makes little sense to me. 
</p></li></ul></li><li><p>&#128994; <strong>Parquet as the storage format</strong></p></li><li><p>&#128994; <strong>WarpStream instead of Kafka</strong></p><ul><li><p>Having worked with Kafka at all of my previous jobs, I was honestly surprised by how easy WarpStream was to run and use.</p></li></ul></li><li><p>&#128993; <strong>Iceberg and S3Tables</strong></p><ul><li><p>Iceberg is great. I think it was still a bit early for some of our use cases, but I&#8217;d watch this space closely, especially with S3Tables on AWS.</p></li></ul></li><li><p>&#128993; <strong>RisingWave instead of Flink / Spark</strong></p><ul><li><p>First impressions were great. It&#8217;s just too early for me to say whether it&#8217;s stable enough long term.</p></li></ul></li></ul><h4>Infra</h4><ul><li><p>&#128994; <strong>nOps instead of committed spend</strong></p><ul><li><p>We used to commit to a certain amount of EC2 spend, but we were still too early to really benefit from 3-year commitments. Our AWS rep pointed us to nOps, though there are a bunch of companies that do this kind of machine and spend arbitrage to get you better discounts than standard 1-year commitments.</p></li></ul></li><li><p>&#128994; <strong>TypeScript as the sole language</strong></p><ul><li><p>We leaned heavily on TypeScript as the main language across the stack. I was pretty skeptical going in, especially after spending a decade mostly writing Scala, C++, and Go, but I ended up being pleasantly surprised by how productive it was.</p></li></ul></li><li><p>&#128994; <strong>Managed EKS / Kubernetes</strong></p><ul><li><p>Kubernetes gets a bad reputation for being overly complex. In my experience, a lot of that pain comes from people running their own clusters. 
Managed EKS was an absolute breeze, and Kubernetes was consistently a net positive for us.</p></li></ul></li><li><p>&#128994; <strong>Tailscale for VPN</strong></p></li><li><p>&#128993; <strong>Vercel for frontend hosting</strong></p><ul><li><p>It&#8217;s pretty straightforward to host a React app on AWS. The marginal benefits of some of Vercel&#8217;s features didn&#8217;t really justify the extra complexity for us.</p></li></ul></li><li><p>&#128993; <strong>AWS instead of GCP</strong></p><ul><li><p>Having worked with GCP for many years, I really missed its simplicity and structure. That said, the AWS account team was great. AWS feels like a platform that works as much because of the people behind it as the technology itself.</p></li></ul></li><li><p>&#128993; <strong>Cloudflare</strong></p><ul><li><p>The marginal benefit of using Cloudflare over the cloud provider&#8217;s native tooling, like Route53, probably wasn&#8217;t worth the extra operational burden.</p></li></ul></li><li><p>&#128308; <strong>Postgres via RDS</strong></p><ul><li><p>RDS is great, but it gets expensive fast, and the pricing levers are pretty opaque unless you&#8217;re running at a scale where a dedicated cluster clearly makes sense. There are a lot more options now that I&#8217;d want to explore.</p></li></ul></li><li><p>&#128308; <strong>Terraforming everything</strong></p><ul><li><p>A declarative language only really makes sense if you truly need the same infra stack to be deployable across multiple customers or regions. We didn&#8217;t. Terraform mostly got in the way, and every time you needed procedural logic, the workaround was painful.</p></li></ul></li></ul><h2>Systems Design and Architecture</h2><ul><li><p>&#128994; <strong>Scaling vertically instead of horizontally</strong></p><ul><li><p>An engineer&#8217;s first instinct is often to scale horizontally as soon as the system needs to support more customers. 
But modern CPUs keep getting better, and I had already seen a lot of the pain that comes with horizontal scaling in previous jobs. We made a deliberate decision to scale vertically for as long as possible, and it was incredibly helpful while we were moving fast.</p></li></ul></li><li><p>&#128994; <strong>SQL as the data processing layer</strong></p><ul><li><p>I didn&#8217;t fully appreciate how versatile SQL could be until I saw it used for all kinds of workloads at Pocus. Offloading complexity to SQL-speaking systems like Athena, ClickHouse, Snowflake, and Postgres was a game changer for what the product could support. LLMs also happen to be ridiculously good at writing complex SQL.</p></li></ul></li><li><p>&#128994; <strong>No microservices</strong></p><ul><li><p>Microservice hell is very real. We avoided microservices entirely. Over time, our APIs grew to support hundreds of GraphQL operations, all deployed as a monolith.</p></li></ul></li><li><p>&#128994; <strong>Separating deployments by sync, async, and stateful workload types</strong></p><ul><li><p>There&#8217;s a real difference between how synchronous request paths behave, how long-running async workloads behave, and how stateful agents behave. Treating those as separate deployment types made the system much more stable.</p></li></ul></li><li><p>&#128994; <strong>Monorepo over multiple repos</strong></p></li><li><p>&#128994; <strong>Eventual consistency over strong consistency</strong></p><ul><li><p>Any strong consistency guarantee becomes very hard once your system spans multiple machines. You need to design with that in mind from the beginning. It&#8217;s very hard to retrofit later.</p></li></ul></li><li><p>&#128994; <strong>Overprovisioning instead of early multitenancy work</strong></p><ul><li><p>Overprovisioning worked surprisingly well for us. We never really ran out of capacity. Compared to the complexity of solving multitenancy early, the extra infrastructure spend was cheap. 
I expected we&#8217;d outgrow this faster than we did.</p></li></ul></li><li><p>&#128994; <strong>Not investing in RAG early</strong></p><ul><li><p>Sometimes the right move is to ignore the hype and trust your own experiments. We never got results from RAG that matched how it was being marketed at the time. Vector embeddings are a pretty weak proxy for the right contextual knowledge. It was one of those ideas that sounded much better in theory than it worked in practice.</p></li></ul></li><li><p>&#128994; <strong>Not picking an agent framework early</strong></p><ul><li><p>Models are evolving too quickly to lock yourself into a framework too early.</p></li></ul></li><li><p>&#128993; <strong>Using AI abstractions like Vercel AI SDK</strong></p><ul><li><p>AI is moving so fast that targeting the lowest common denominator across providers can end up holding you back.</p></li></ul></li><li><p>&#128993; <strong>Investing in canaries</strong></p><ul><li><p>We invested relatively late in production canaries. At our scale, most of the issues that mattered only really showed up under real production load. Over time, we got much better about building canaries, even for stateful systems.</p></li></ul></li><li><p>&#128993; <strong>Not using self-hosted models for LLMs</strong></p><ul><li><p>As costs keep rising, self-hosted models are definitely something I&#8217;d want to explore more seriously.</p></li></ul></li><li><p>&#128993; <strong>Using a staging cluster for pre-prod testing</strong></p><ul><li><p>At one point we had dev, staging, and prod clusters. In practice, staging never gave us much confidence that we didn&#8217;t already have from dev. Moving away from staging and toward canaries was the better path for building real deployment confidence.</p></li></ul></li><li><p>&#128308; <strong>GraphQL as the query layer</strong></p><ul><li><p>GraphQL is complex and full of footguns. 
I&#8217;m still not sure the added complexity was worth it.</p></li></ul></li><li><p>&#128308; <strong>Not implementing soft deletions early</strong></p><ul><li><p>This is one of those type 1 decisions that is very hard to retrofit. I would absolutely do this from day 1 if I were doing it again.</p></li></ul></li><li><p>&#128308; <strong>Not investing in evals early</strong></p><ul><li><p>We invested too little, too late in AI evals. Evals are essential if you want to understand whether your AI product is actually working. Unlike more deterministic systems, AI outputs are much harder to judge. If you&#8217;re building any kind of AI platform, evals should be a day 1 concern.</p></li></ul></li><li><p>&#128308; <strong>Prompt versioning outside the repository</strong></p><ul><li><p>We tried versioning and managing prompts in a separate tool. Prompts were much easier to iterate in our codebase. Iterating on them in isolation sounds nice, but it&#8217;s harder to make work in practice than it seems.</p></li></ul></li><li><p>&#128308; <strong>Docs in code</strong></p><ul><li><p>I know almost no one who actually likes writing Markdown. I&#8217;d strongly consider tools that still connect back to the repository, but offer a much better WYSIWYG experience, more like Notion.</p></li></ul></li></ul><p>I&#8217;m sure I&#8217;ll disagree with some of this a few years from now. That&#8217;s probably a good thing. The tools will change, the constraints will change, and hopefully my thinking will improve too. 
But this is the clearest snapshot I can give of what seemed to matter at the time.</p><p>If you&#8217;ve found tools that worked especially well for your team and aren&#8217;t on this list, I&#8217;d genuinely love to hear about them.</p>]]></content:encoded></item><item><title><![CDATA[Rendezvous Hashing: An alternative to Consistent Hashing]]></title><description><![CDATA[In any kind of stateful distributed system, the problem of mapping a key to a set of machines is pretty common.]]></description><link>https://i0exception.substack.com/p/rendezvous-hashing-8c00e2fb58b0</link><guid isPermaLink="false">https://i0exception.substack.com/p/rendezvous-hashing-8c00e2fb58b0</guid><dc:creator><![CDATA[Aniruddha]]></dc:creator><pubDate>Tue, 07 Jan 2020 07:45:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lluD!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26394448-1a6b-4c4d-b5ba-31f799cbc55e_400x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In any kind of stateful distributed system, the problem of mapping a key to a set of machines is pretty common. Even if a distributed system is stateless, you might still want to map a key to the same set of machines for better locality of processing. In its essence, this is very similar to how hash tables work&#8202;&#8212;&#8202;map a set of <em>k </em>keys to <em>n&nbsp;</em>buckets.</p><p>The simplest way to do this is to use modular operations. <a href="https://en.wikipedia.org/wiki/Hash_function">Hash</a> your key to get a fixed length value, then compute the modulo with <em>n </em>and pick the machine in that slot. For a uniform hash function, this works well if the number of endpoints doesn&#8217;t change very frequently and if the cost of re-mapping keys between endpoints is low. 
If either of those two is not true, this performs very poorly because all of your keys could get remapped if the size of the list&nbsp;changes.</p><p>These days, the standard way to limit the number of keys being re-mapped is to use <a href="https://en.wikipedia.org/wiki/Consistent_hashing">consistent hashing</a>. Most major distributed databases use it in some form or another. Consistent hashing is a special kind of hashing where, on average, <em>K/n </em>keys are remapped whenever the list of endpoints changes (<em>K </em>is the total number of keys). The term <em>consistent hashing</em> first appeared in the literature in 1997 in <a href="https://dl.acm.org/doi/10.1145/258533.258660">this paper</a>. In consistent hashing, both the keys and the buckets are hashed onto a circle. A key maps to the first bucket that is encountered in the clockwise direction (or counter-clockwise&#8202;&#8212;&#8202;it doesn&#8217;t really matter). Searching for the bucket responsible for a key is pretty simple&#8202;&#8212;&#8202;pre-compute the hash values for all buckets and sort them, hash the key, and then run a binary search (in<em> O(log(n))</em>) to find the first bucket hash that&#8217;s at least as high as the hash of the key, wrapping around to the first bucket if none is. When the buckets are resized, some keys move over to the closest new bucket. On average, the number of keys that need to move is <em>K/n&#8202;&#8212;&#8202;</em>which is&nbsp;ideal.</p><p>One of the biggest drawbacks of consistent hashing is that keys can be imbalanced across buckets. This is mainly because of how resizing is handled. For example, if a bucket is removed, all keys mapped to that bucket move over to the next one (similar for the case where a bucket is added). Ideally, these keys would be distributed equally across all the remaining buckets. To overcome this problem, most implementations divide each physical machine into multiple virtual nodes. 
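</p><p>A minimal sketch of that ring lookup, including virtual nodes, might look like the following (the FNV hash choice and all names here are illustrative, not from the original post):</p>

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring is a minimal consistent-hash ring: each node is hashed onto the
// circle several times (virtual nodes), positions are kept sorted, and a
// key maps to the first position at or after its hash, wrapping around.
type ring struct {
	hashes []uint64          // sorted virtual-node positions on the circle
	owner  map[uint64]string // position -> physical node
}

func hashOf(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

func newRing(nodes []string, vnodes int) *ring {
	r := &ring{owner: make(map[uint64]string)}
	for _, n := range nodes {
		for i := 0; i < vnodes; i++ {
			p := hashOf(fmt.Sprintf("%s#%d", n, i))
			r.hashes = append(r.hashes, p)
			r.owner[p] = n
		}
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// lookup is the O(log n) binary search described above.
func (r *ring) lookup(key string) string {
	h := hashOf(key)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0 // wrap around the circle
	}
	return r.owner[r.hashes[i]]
}

func main() {
	r := newRing([]string{"node-a", "node-b", "node-c"}, 16)
	fmt.Println(r.lookup("some-key"))
}
```

<p>More virtual nodes per machine interleave the positions more finely, which is what smooths out the imbalance after a resize.</p><p>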
Even then, the keys now spread out over only as many virtual nodes as you assign to a physical machine, instead of the ideal state of the load spreading out over all of them. If the number of virtual nodes is not significantly higher than the number of machines, the load can still be distributed unevenly.</p><p>Rendezvous hashing predates consistent hashing <a href="http://www.eecs.umich.edu/techreports/cse/96/CSE-TR-316-96.pdf">by a year</a> and takes a very different approach to solving these problems, while maintaining the <em>K/n</em> re-mapping invariant. Unfortunately, it&#8217;s not as well known as consistent hashing. It&#8217;s also known as <em>Highest Random Weight</em> hashing, because of how it&#8217;s implemented. Conceptually and practically, it&#8217;s much simpler to understand and implement. You hash the <em>key </em>and the <em>machine </em>together and then pick the one with the highest hash&nbsp;value.</p><pre><code>type router struct {
  endpoints []*Endpoint
}</code></pre><pre><code>func (r *router) Get(key string) *Endpoint {
  var ep *Endpoint
  var hashVal uint64</code></pre><pre><code>  for _, e := range r.endpoints {
    h := hash(key, e)
    if ep == nil || h &gt; hashVal {
      ep = e
      hashVal = h
    }
  }
  return ep
}</code></pre><p>In case of a uniform hash function, if the buckets change, the keys (on average, <em>K/n</em> keys) get spread out over all other buckets instead of just one or the number of virtual nodes that were assigned to a machine. The biggest drawback of rendezvous hashing is that it runs in <em>O(n) </em>instead of <em>O(log(n))</em>. However, because you don&#8217;t typically have to break each node into multiple virtual nodes, <em>n </em>is typically not large enough for the run-time to be a significant factor.</p><p>We actually used this at Twitter in our internal pub/sub platform, EventBus. EventBus was modeled similarly to Kafka&#8202;&#8212;&#8202;there were topics, and topics had subscriptions. A group of clients together consumed a subscription. We called this smallest unit a stream. Unlike Kafka, EventBus had separate storage and serving layers&#8202;&#8212;&#8202;so you could scale out the serving layer horizontally. More importantly, any machine could serve a stream. Also, unlike Kafka, we supported a mode where all clients within a subscription could choose to receive a full copy of the stream and implement their own filtering.</p><p>Initially, we randomly assigned these streams to different serving machines. This worked fine when the number of streams was in the low hundreds. However, over time, some of our most popular topics (like the one with tweets) gathered streams numbering in the tens of thousands, many with client-side filtering enabled. Because the serving layer kept a local cache of events for each stream and different streams could be reading data at different offsets, every machine started keeping a large amount of data in memory&#8202;&#8212;&#8202;leading to horrendous GC pressure. We needed an easy way for a group of clients to independently converge on the same serving machine for a particular stream so that an item, once cached, could be sent to multiple clients. 
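</p><p>A concrete, client-side version of that convergence could look like the sketch below. The FNV-1a hash and all names are assumptions for illustration, not the actual EventBus code:</p>

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// score hashes the stream and the server together, as described above.
func score(streamID, server string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(streamID))
	h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") differ
	h.Write([]byte(server))
	return h.Sum64()
}

// pick returns the server with the highest score for this stream; every
// client computes this independently and converges on the same answer.
func pick(streamID string, servers []string) string {
	var best string
	var bestScore uint64
	for i, s := range servers {
		if sc := score(streamID, s); i == 0 || sc > bestScore {
			best, bestScore = s, sc
		}
	}
	return best
}

func main() {
	servers := []string{"serve-1", "serve-2", "serve-3"}
	fmt.Println(pick("tweets/consumer-42", servers))
}
```

<p>Because each (stream, server) pair gets an independent score, removing a server never changes the winner for streams it didn&#8217;t own; only the streams that scored highest on the removed server move, which is the <em>K/n</em> property from earlier.</p><p>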
We used rendezvous hashing to do this with pretty good results. The clients would select a machine while starting consumption and then periodically rebalance every 5&#8211;10 minutes till the throughput across different machines stabilized.</p><p>Sometimes, elegant and obscure algorithms tend to outperform conventional wisdom.</p>]]></content:encoded></item><item><title><![CDATA[Sampling — the good, the bad, and the ugly]]></title><description><![CDATA[Benjamin Franklin once said &#8212; &#8220;Those who give up essential accuracy for temporary speed deserve neither speed nor accuracy&#8221;.]]></description><link>https://i0exception.substack.com/p/sampling-the-good-the-bad-and-the-ugly-4b5f85e8ce2</link><guid isPermaLink="false">https://i0exception.substack.com/p/sampling-the-good-the-bad-and-the-ugly-4b5f85e8ce2</guid><dc:creator><![CDATA[Aniruddha]]></dc:creator><pubDate>Wed, 16 Oct 2019 08:25:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lluD!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26394448-1a6b-4c4d-b5ba-31f799cbc55e_400x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Benjamin Franklin once said&#8202;&#8212;&#8202;&#8220;Those who give up essential accuracy for temporary speed deserve neither speed nor accuracy&#8221;. In the real world, however, you frequently have to trade off one for the other. This tradeoff becomes increasingly appealing as your data volume increases. In this post we&#8217;ll discuss some common sampling techniques and ways to get the most out of your data, especially as it relates to <a href="https://mixpanel.com/behavioral-analytics/">product or behavioral user analytics</a>.</p><p>Before we look at how to sample, it&#8217;s important to understand what the data being sampled looks like. 
In most cases you&#8217;re going to collect <a href="https://support.google.com/analytics/answer/1033068?hl=en">events</a>. These are an immutable record of user interactions. Each event typically has a timestamp and a user identifier associated with it. Optionally, you might want to collect some <a href="https://help.mixpanel.com/hc/en-us/articles/115004708186-Event-Properties-Super-Properties-People-Properties">metadata</a> with each event. Typically, once you add instrumentation to your apps or websites (or use a tool that automatically collects everything), these events are generated in response to every user interaction. So, if you&#8217;re already generating these events&#8202;&#8212;&#8202;why&nbsp;sample?</p><p>There are two main reasons why you might want to consider looking at a smaller subset of your data for insights&#8202;&#8212;&#8202;<em><strong>speed </strong></em>and <em><strong>cost</strong></em>. In some cases, you can get faster results if you decide to spend more on computation. However, not all computations are infinitely parallelizable.</p><h3>How to&nbsp;sample</h3><p>Whether you decide to sample by dropping data during collection or at query time, how you choose to ignore data matters. It&#8217;s important to have the sampling be random&#8202;&#8212;&#8202;otherwise you&#8217;ll run into <a href="https://en.wikipedia.org/wiki/Sampling_bias">sampling bias</a>, which makes analysis hard. There are a few ways to do&nbsp;this.</p><h4>Sample every&nbsp;event</h4><p>This is the most naive way to sample data, but it works well if you only care about aggregates. Here, the decision to sample is independent of the event or user being tracked. When you run aggregates, you can multiply by the inverse of the sampling factor to get an approximate value. 
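</p><p>The inverse-scaling step is a one-liner; as a tiny worked example (the numbers and names are illustrative):</p>

```go
package main

import "fmt"

// estimateTotal scales a count computed on sampled data back up by the
// inverse of the sampling factor (i.e. we kept 1 in invFactor events).
func estimateTotal(sampledCount, invFactor int) int {
	return sampledCount * invFactor
}

func main() {
	// 12,340 events counted in a 1-in-10 sample suggest
	// roughly 123,400 events overall.
	fmt.Println(estimateTotal(12340, 10)) // prints 123400
}
```

<p>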
The biggest drawback of this approach is that any kind of analysis that depends on a sequence of events (like a funnel report) is usually incorrect.</p><h4>Sample high volume&nbsp;events</h4><p>Not all events are made equal. You are likely to have a few outliers that contribute the most to event volume. Random sampling for these outliers and collecting all other events un-sampled works well in practice. This has the same drawbacks as the previous approach, but the impact is restricted to analysis that spans the outliers.</p><h4>Sample all&nbsp;users</h4><p>For user analytics, every event is likely to have an associated user identifier. This represents the individual you are tracking information about. The goal of this approach is to keep all activity for a small sample of users and discard all activity for the others. If the user identifiers you keep data for are selected at random, you can extrapolate the results to get accurate numbers. The good part about this approach is that it works for aggregates as well as for any analysis that depends on a sequence of events, so long as it is per-user.</p><h4>Sample high volume events by&nbsp;user</h4><p>Here, we take the good parts of approaches 2 and 3 and combine them. We sample events by user but only restrict the sampling to high volume events. Any analysis that you do on events that don&#8217;t involve outliers gets full fidelity whereas anything that&#8217;s done across outliers still has accurate numbers as long as the analysis is done per-user. Most <a href="http://mixpanel.com">major analytics providers</a> let you do&nbsp;this.</p><h4>Sample by users but always track certain populations</h4><p>This is basically the same as the previous approach, except you have some way to always track a specific set of users based on some pre-defined criteria. 
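</p><p>In code, a deterministic per-user decision with an always-include set might look like this sketch (the hash choice and names are assumptions, not from the post):</p>

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// keepUser samples users deterministically: the same user always gets the
// same decision, so per-user event sequences stay intact in the sample.
// pct is the percentage of users to keep; alwaysTrack overrides sampling.
func keepUser(userID string, pct uint64, alwaysTrack map[string]bool) bool {
	if alwaysTrack[userID] {
		return true
	}
	h := fnv.New64a()
	h.Write([]byte(userID))
	return h.Sum64()%100 < pct
}

func main() {
	vips := map[string]bool{"acct-big-spender": true}
	fmt.Println(keepUser("acct-big-spender", 0, vips)) // prints true
	// Any other user is kept or dropped consistently across all events.
	fmt.Println(keepUser("acct-123", 10, vips) == keepUser("acct-123", 10, vips)) // prints true
}
```

<p>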
Say, you want to track everything about users who pay you more than $1000 per month&#8202;&#8212;&#8202;you can do that by always including every event for any of these users. This works well if this sample is relatively small compared to the rest of your user base and you are careful about addressing the edge cases of tracking when someone enters or exits this special population.</p><h3>User Identity Management</h3><p>Before we look at where to sample, it&#8217;s important to look at one of the biggest challenges with user based sampling. In this day and age, people have multiple devices that they might interact with your application or website on. In addition, users might do the bulk of their activity while logged out and only identify themselves when they have to. This flexibility makes it hard to decide which user identifiers should be in the sample and which shouldn&#8217;t&#8202;&#8212;&#8202;mainly because it&#8217;s a chicken and egg&nbsp;problem.</p><h4>Anonymous and Logged In&nbsp;activity</h4><p>For most applications and websites, some kind of user identification is required to interact with the product. There are exceptions (as we&#8217;ll see later), but for a majority of products, Login acts as the great filter. Once a user logs in, the decision of whether to include it in the sample can mostly be made on the basis of whatever unique identifier your database has for that user. It&#8217;s also possible to tie back any of their anonymous activity to the identified user based either on heuristics or actual knowledge (if someone logs in on a device where you previously tracked anonymous activity, there is a good chance that the anonymous activity can be tied to the logged-in user).</p><p>However, there are entire verticals where a bulk of your users are going to be anonymous. Travel, e-commerce, search, video etc. 
see a lot of anonymous activity and users don&#8217;t necessarily identify themselves during an interaction with your&nbsp;product.</p><h4>Users with multiple&nbsp;accounts</h4><p>The other challenge that some products run into is where users have different identifiers on different platforms. You might use a phone number to identify on the app and an email address to identify on a website. Additionally, you might also allow your users to identify using social media accounts. Tying all these activities back to the same user typically requires some kind of heuristics or best-effort matching and it usually happens long after you&#8217;ve been tracking information with these different identifiers.</p><h3>Where to&nbsp;sample</h3><p>User identity management plays a big role in determining the utility of your approach to sampling, because once you decide <em>how</em> you want to sample your data, the other important decision you&#8217;ll have to make is <em>where</em> to do&nbsp;this.</p><p>Broadly speaking, you have 2 choices&nbsp;&#8212;</p><ol><li><p>collect everything and sample when you run&nbsp;queries.</p></li><li><p>drop data during collection and run queries on the sampled&nbsp;data.</p></li></ol><p>If you&#8217;re considering sampling as a way to reduce costs, it&#8217;s helpful to understand the 3 types of costs associated with data&nbsp;&#8212;</p><p><em><strong>Collection</strong></em> costs are those associated with tracking and processing the data all the way up to the point where you can decide whether to include the event in the sample or&nbsp;not.</p><p><em><strong>Storage</strong></em> costs are what you pay for keeping the data around at rest. 
These costs compound over time as the data footprint increases.</p><p><em><strong>Query</strong></em> costs are what you pay for processing the sampled data to get meaningful insights.</p><p>Here&#8217;s what the costs look like based on the approach you&nbsp;take</p><pre><code>+----------------------+-------------+----------+----------+
|        Option        |  Collection |  Storage |   Query  |
+----------------------+-------------+----------+----------+
| Sample at query      |  Full       |  Full    |  Sampled |
| Sample at collection |  Full       |  Sampled |  Sampled |
+----------------------+-------------+----------+----------+</code></pre><h4>Sample at&nbsp;query</h4><p>This is a little more expensive than the second one because you pay full storage costs and your collection costs <em>might</em> be a little higher depending on how early in your collection process you can determine whether a user falls in the sample or not. However, it has very few drawbacks because you don&#8217;t drop any of the data so you can merge user activity as and when you discover connections between anonymous, logged-in and users with multiple accounts. If you can afford to, collect everything.</p><h4>Sample at collection</h4><p>If you absolutely must drop data, there are a few ways to try to minimize the impact&nbsp;&#8212;</p><h4>Sampling Technique</h4><p>Always sample just the high volume events by the user identifier and keep everything around at full fidelity. This reduces the impact of sampling to any analysis that involves the outliers in terms of&nbsp;volume.</p><h4>Anonymous vs. Logged In&nbsp;users</h4><p>If your product has low anonymous activity and most users identify themselves before any interaction, you might be able to get by with keeping a copy of all the anonymous data and only sample data for users who have identified themselves. This gives you full visibility into any anonymous activity and at the same time, any analysis that spans anonymous and logged-in usage is&nbsp;correct.</p><p>If your product has high anonymous activity and low logged in activity, flip the two&#8202;&#8212;&#8202;sample all the anonymous data and keep all the logged in activity around for analysis. 
The drawback of doing this is that analysis spanning anonymous and logged-in usage will be incorrect.</p><p>Unfortunately, for most other cases, sampling at collection results in either incomplete or inaccurate data and there&#8217;s no real way to counter&nbsp;that.</p><h3>Conclusion</h3><p>Sampling plays an important role in improving the speed of analysis and, in some cases, reducing costs. If you can afford to pay the additional processing and storage costs, always sample while running queries. The costs for running queries are typically much higher than the other two, especially as you scan the data multiple times for different kinds of analysis. Keeping a full copy of all the data also lets you use that data for any kind of analysis that involves machine learning or statistical modeling. It also lets you run exploratory analysis on a sample while still retaining the ability to run more important queries on the full&nbsp;dataset.</p><p>If you absolutely can&#8217;t afford to keep a full copy, try to minimize the impact of sampling by reducing the scope of user identity management challenges on your choice of sampling technique and make sure that you factor in all the corner cases when interpreting the results of your&nbsp;queries.</p>]]></content:encoded></item><item><title><![CDATA[Iterating over maps in Go]]></title><description><![CDATA[While the Go programming language specification only states that the iteration order over maps is not guaranteed to be the same across invocations, Go maps in action goes a step further to say that the order is randomized.]]></description><link>https://i0exception.substack.com/p/map-iteration-in-go-275abb76f721</link><guid isPermaLink="false">https://i0exception.substack.com/p/map-iteration-in-go-275abb76f721</guid><dc:creator><![CDATA[Aniruddha]]></dc:creator><pubDate>Sat, 27 Jul 2019 04:29:45 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!lluD!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26394448-1a6b-4c4d-b5ba-31f799cbc55e_400x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>While the <a href="https://golang.org/ref/spec">Go programming language specification</a> only states that the iteration order over maps is not guaranteed to be the same across invocations, <a href="https://blog.golang.org/go-maps-in-action">Go maps in action</a> goes a step further to say that the order is randomized. So, when someone at work asked how I would design a set of integers that returned a random entry on every get&nbsp;, I suggested this neat&nbsp;trick</p><pre><code>type intSet map[int]struct{}</code></pre><pre><code>func (s intSet) put(v int) {
        s[v] = struct{}{}
}</code></pre><pre><code>func (s intSet) get() (int, bool) {
        for k := range s {
                return k, true
        }
        return 0, false
}</code></pre><p>Turns out that this approach is incorrect because, while it returns a &#8220;random&#8221; number on every get, the probability for every element is not the&nbsp;same.</p><p>To test this implementation, let&#8217;s actually fill up a map with some values and see the distribution over a million&nbsp;runs.</p><pre><code>func main() {
        s := make(intSet)
        for i := 0; i &lt; 8; i++ {
                s.put(i)
        }</code></pre><pre><code>        counts := make(map[int]int)
        for i := 0; i &lt; 1024*1024; i++ {
                v, ok := s.get()
                if !ok {
                        return
                }
                counts[v]++
        }</code></pre><pre><code>        for k, v := range counts {
                fmt.Printf("Value: %v, Count: %v\n", k, v)
        }
}</code></pre><p>This is the output you get on running&nbsp;this</p><pre><code>code|&#8658; ./code
Value: 1, Count: 131026
Value: 7, Count: 130957
Value: 3, Count: 131064
Value: 5, Count: 131288
Value: 2, Count: 131080
Value: 0, Count: 130813
Value: 4, Count: 131137
Value: 6, Count: 131211</code></pre><p>That&#8217;s good, right? The distribution of each number is roughly equal. Let&#8217;s change the numbers a bit and see what happens. For the next run, I added the numbers 0 to 4&nbsp;instead.</p><pre><code>code|&#8658; ./code
Value: 1, Count: 131175
Value: 2, Count: 131593
Value: 3, Count: 130904
Value: 0, Count: 654904</code></pre><p>While the counts for 1&nbsp;, 2 and 3 are roughly the same, 0 occurs almost 5 times as often. A truly random distribution would have been around 250000 occurrences of each&nbsp;number.</p><p>To explain this anomaly, it&#8217;s important to understand how maps are implemented in go. Unsurprisingly, maps are implemented using go. The <a href="https://github.com/golang/go/blob/master/src/runtime/map.go">map.go</a> file in <a href="https://github.com/golang/go/tree/master/src/runtime">src/runtime</a> contains the common parts of the implementation (there are some optimized map implementations for common types like integers and strings). The comments in map.go help lay out the structure of a&nbsp;map</p><pre><code>// A map is just a hash table. The data is arranged
// into an array of buckets. Each bucket contains up to
// 8 key/value pairs. The low-order bits of the hash are
// used to select a bucket. Each bucket contains a few
// high-order bits of each hash to distinguish the entries
// within a single bucket.
//
// If more than 8 keys hash to a bucket, we chain on
// extra buckets.</code></pre><p>Let&#8217;s take a look at what happens when you&#8217;re iterating over a map. If you disassemble the for loop, you&#8217;ll see something like&nbsp;this.</p><pre><code>TEXT main.intSet.get(SB) /home/aniruddha/code/main.go
  ...
  main.go:10  0x488e56  4889442408   MOVQ AX, 0x8(SP)
  main.go:10  0x488e5b  488d442418   LEAQ 0x18(SP), AX
  main.go:10  0x488e60  4889442410   MOVQ AX, 0x10(SP)
  main.go:10  0x488e65  e8763df8ff   CALL runtime.mapiterinit(SB)
  main.go:10  0x488e6a  488b442418   MOVQ 0x18(SP), AX
  ...
  main.go:11  0x488e93  c3    RET</code></pre><p>The call to mapiterinit is what sets up the iterator and then calls the mapiternext function to get the first element in the map. Here&#8217;s the part of the code in mapiterinit that actually computes where to start iterating &#8212;</p><pre><code>r := uintptr(fastrand())
if h.B &gt; 31-bucketCntBits {
  r += uintptr(fastrand()) &lt;&lt; 31
}
it.startBucket = r &amp; bucketMask(h.B)
it.offset = uint8(r &gt;&gt; h.B &amp; (bucketCnt - 1))
it.bucket = it.startBucket</code></pre><p>We generate a random number using fastrand() and then use it to get the starting bucket and a random offset within that bucket (remember, maps in go are implemented as an array of buckets with 8 elements in each bucket). mapiternext then iterates over the elements to return the first valid entity&#8202;&#8212;&#8202;while doing so, it skips over any empty&nbsp;ones</p><pre><code>for ; i &lt; bucketCnt; i++ {
  offi := (i + it.offset) &amp; (bucketCnt - 1)
  if isEmpty(b.tophash[offi]) || b.tophash[offi] == evacuatedEmpty {
    // TODO: emptyRest is hard to use here, as we start iterating
    // in the middle of a bucket. It's feasible, just tricky.
        continue
  }
  ...
}</code></pre><p>Because the element we start with could be empty, the probability of getting a valid element is actually dependent on the number of empty buckets and elements immediately preceding it. For example, if there is 1 bucket with 2 valid entities like in the example below&nbsp;&#8212;</p><pre><code>[NULL, NULL, 10, NULL, NULL, NULL, NULL, 20]</code></pre><p>We&#8217;ll get 10 if we start with elements 0, 1 or 2 and 20 if we start with 3, 4, 5, 6 or 7. So the perceived probability of getting a 10 is 3/8 and for 20 is 5/8&nbsp;.</p><p>While this was a toy problem that I was trying to solve, the broader learning for me was to not base solutions on ones interpretation of library documentation. It&#8217;s almost always a good idea to test how things behave in practice even if the documentation feels clear and&nbsp;correct.</p>]]></content:encoded></item><item><title><![CDATA[Common traps while using defer in go]]></title><description><![CDATA[The defer statement in go is really handy in improving code readability.]]></description><link>https://i0exception.substack.com/p/some-common-traps-while-using-defer-205ebbdc0a3b</link><guid isPermaLink="false">https://i0exception.substack.com/p/some-common-traps-while-using-defer-205ebbdc0a3b</guid><dc:creator><![CDATA[Aniruddha]]></dc:creator><pubDate>Tue, 20 Mar 2018 22:02:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lluD!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26394448-1a6b-4c4d-b5ba-31f799cbc55e_400x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The <em>defer</em> statement in <em>go</em> is really handy in improving code readability. However, in some cases its behavior is confusing and not immediately obvious. Even after writing <em>go</em> for over 2 years, there are times when a <em>defer</em> in the wild leaves me scratching my head. 
My goal is to compile a list of behaviors which have stumped me in the past, mainly as a note to&nbsp;myself.</p><h4>Defer scopes to a function, not a&nbsp;block</h4><p>A variable exists only within the scope of a code block. However, a <em>defer</em> statement within a block is only executed when the enclosing function returns. I&#8217;m not sure what the rationale for this is, but it can catch you off guard if you&#8217;re, say, allocating resources in a loop but <em>defer</em> the deallocation.</p><pre><code>func do(files []string) error {
  for _, file := range files {
    f, err := os.Open(file)
    if err != nil {
      return err
    }
    defer f.Close() // This is wrong!!
    // use f
  }
  return nil
}</code></pre><h4>Chaining methods</h4><p>If you chain methods in a <em>defer</em> statement, everything except the last function will be evaluated at call time. <em>defer</em> expects a function as the &#8220;<em>argument&#8221;.</em></p><pre><code>type logger struct {}
func (l *logger) Print(s string) {
  fmt.Printf("Log: %v\n", s)
}</code></pre><pre><code>type foo struct {
  l *logger
}</code></pre><pre><code>func (f *foo) Logger() *logger {
  fmt.Println("Logger()")
  return f.l
}</code></pre><pre><code>func do(f *foo) {
  defer f.Logger().Print("done")
  fmt.Println("do")
}
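 
// Aside (not in the original post): if the goal is to delay the whole
// chain, defer an anonymous function. Everything inside the closure,
// including the Logger() call, then runs only when the function returns.
// doWrapped is a made-up name for illustration.
func doWrapped(f *foo) {
  defer func() { f.Logger().Print("done") }()
  fmt.Println("do")
}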
 
func main() {
  f := &amp;foo{
    l: &amp;logger{},
  }
  do(f)
}</code></pre><p>Prints &#8212;</p><pre><code>Logger()
do
Log: done</code></pre><p>The Logger() function is called before any of the work in do() is executed.</p><h4>Function arguments</h4><p>Okay, but what if the last method in the chain takes an argument? Surely, if it is executed after the enclosing function returns, any changes made to the variables will be captured.</p><pre><code>type logger struct {}
func (l *logger) Print(err error) {
  fmt.Printf("Log: %v\n", err)
}</code></pre><pre><code>type foo struct {
  l *logger
}</code></pre><pre><code>func (f *foo) Logger() *logger {
  fmt.Println("Logger()")
  return f.l
}</code></pre><pre><code>func do(f *foo) (err error) {
  defer f.Logger().Print(err)
  fmt.Println("do")
  return fmt.Errorf("ERROR")
}
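 
// Aside (not in the original post): to log the value err has when the
// function actually returns, defer a closure that reads the named return
// value at run time instead of passing it as an argument.
// doLate is a made-up name for illustration.
func doLate(f *foo) (err error) {
  defer func() { f.Logger().Print(err) }()
  fmt.Println("do")
  return fmt.Errorf("ERROR")
}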
 
func main() {
  f := &amp;foo{
    l: &amp;logger{},
  }
  do(f)
}</code></pre><p>Guess what this&nbsp;prints?</p><pre><code>Logger()
do
Log: &lt;nil&gt;</code></pre><p>The value of err is captured at call time. The arguments to a deferred call are evaluated and copied when the <em>defer</em> statement executes, so later changes to the variable are not seen by the deferred&nbsp;call.</p><h4>Calling methods on non-pointer types</h4><p>We saw how chained methods behave in a <em>defer</em> statement. Exploring this further, if the called method is not defined on a pointer receiver type, calling it in a <em>defer</em> will actually make a copy of the instance.</p><pre><code>type metrics struct {
  success bool
  latency time.Duration
}</code></pre><pre><code>func (m metrics) Log() {
  fmt.Printf("Success: %v, Latency: %v\n", m.success, m.latency)
}</code></pre><pre><code>func foo() {
  var m metrics
  defer m.Log() // m is copied at this point; a closure or a pointer receiver would see the later updates

  start := time.Now()
  // Do something
  time.Sleep(2*time.Second)
  
  m.success = true
  m.latency = time.Now().Sub(start)
}</code></pre><p>This prints&nbsp;&#8212;</p><pre><code>Success: false, Latency: 0s</code></pre><p>m is copied when <em>defer</em> is called, because m.Log() is basically shorthand for&nbsp;metrics.Log(m).</p><h4>Conclusion</h4><p>If you&#8217;ve spent enough time writing <em>go</em>, these might not feel like &#8220;<em>traps</em>&#8221;. But for someone new to the language, there are definitely a lot of places where the <em>defer</em> statement does not satisfy the <a href="https://en.wikipedia.org/wiki/Principle_of_least_astonishment">principle of least astonishment</a>. There are a <a href="http://devs.cloudimmunity.com/gotchas-and-common-mistakes-in-go-golang/">bunch</a> <a href="https://blog.learngoprogramming.com/gotchas-of-defer-in-go-1-8d070894cb01">of</a> <a href="https://blog.learngoprogramming.com/5-gotchas-of-defer-in-go-golang-part-ii-cc550f6ad9aa">other</a> <a href="https://blog.learngoprogramming.com/5-gotchas-of-defer-in-go-golang-part-iii-36a1ab3d6ef1">places</a> that go into more detail about some other common mistakes while writing <em>go</em>. Do check them&nbsp;out.</p>]]></content:encoded></item><item><title><![CDATA[Runtime overhead of using defer in go]]></title><description><![CDATA[Golang has a pretty nifty keyword named defer.]]></description><link>https://i0exception.substack.com/p/runtime-overhead-of-using-defer-in-go-7140d5c40e32</link><guid isPermaLink="false">https://i0exception.substack.com/p/runtime-overhead-of-using-defer-in-go-7140d5c40e32</guid><dc:creator><![CDATA[Aniruddha]]></dc:creator><pubDate>Wed, 07 Mar 2018 08:37:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lluD!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26394448-1a6b-4c4d-b5ba-31f799cbc55e_400x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Golang has a pretty nifty keyword named defer. 
As explained <a href="https://blog.golang.org/defer-panic-and-recover">here</a>, a defer statement pushes a function call onto a list. The list of saved calls is executed after the surrounding function returns. Defer is commonly used to simplify functions that perform various clean-up&nbsp;actions.</p><p>Using defer, however, is not free. Using go&#8217;s benchmarking support, we can try to quantify this overhead.</p><p>The following two functions do the same work, but one calls a function in a defer statement while the other&nbsp;doesn&#8217;t.</p><pre><code>func doNoDefer(t *int) {
  func() {
    *t++
  }()
}</code></pre><pre><code>func doDefer(t *int) {
  defer func() {
    *t++
  }()
}</code></pre><p>Let&#8217;s benchmark these&nbsp;&#8212;</p><pre><code>package main</code></pre><pre><code>import (
  "testing"
)</code></pre><pre><code>func BenchmarkDeferYes(b *testing.B) {
  t := 0
  for i := 0; i &lt; b.N; i++ {
    doDefer(&amp;t)
  }
}</code></pre><pre><code>func BenchmarkDeferNo(b *testing.B) {
  t := 0
  for i := 0; i &lt; b.N; i++ {
    doNoDefer(&amp;t)
  }
}</code></pre><p>Running this with go -bench on an 8 core google cloud VM gives&nbsp;us</p><pre><code>&#8658; go test -v -bench BenchmarkDefer -benchmem
goos: linux
goarch: amd64
pkg: cmd
BenchmarkDeferYes-8  20000000   62.4 ns/op  0 B/op  0 allocs/op
BenchmarkDeferNo-8   500000000  3.70 ns/op  0 B/op  0 allocs/op</code></pre><p>As expected, both these functions don&#8217;t allocate any memory. But doDefer is roughly <em><strong>16 times</strong></em> more expensive than doNoDefer. To understand why defer is this expensive, let&#8217;s look at the disassembled code.</p><p>The disassembly for the actual functions called inside doDefer and doNoDefer is the&nbsp;same</p><pre><code>main.go:10   MOVQ 0x8(SP), AX
main.go:11   MOVQ 0(AX), CX
main.go:11   INCQ CX
main.go:11   MOVQ CX, 0(AX)
main.go:12   RET</code></pre><p>The doNoDefer function sets up the necessary registers and then calls main.doNoDefer.func1.</p><pre><code>TEXT main.doNoDefer(SB) main.go
main.go:3  MOVQ FS:0xfffffff8, CX
main.go:3  CMPQ 0x10(CX), SP
main.go:3  JBE 0x450b65
main.go:3  SUBQ $0x10, SP
main.go:3  MOVQ BP, 0x8(SP)
main.go:3  LEAQ 0x8(SP), BP
main.go:3  MOVQ 0x18(SP), AX
main.go:6  MOVQ AX, 0(SP)
main.go:6  CALL main.doNoDefer.func1(SB)
main.go:7  MOVQ 0x8(SP), BP
main.go:7  ADDQ $0x10, SP
main.go:7  RET
main.go:3  CALL runtime.morestack_noctxt(SB)
main.go:3  JMP main.doNoDefer(SB)</code></pre><p>The doDefer function also sets up registers, but there are additional function calls&#8202;&#8212;&#8202;the first one to runtime.deferproc which sets up the deferred function to be called. The second one is to runtime.deferreturn&#8202;&#8212;&#8202;which in turn calls itself for every defer statement encountered in the function.</p><pre><code>TEXT main.doDefer(SB) main.go
main.go:9    MOVQ FS:0xfffffff8, CX
main.go:9    CMPQ 0x10(CX), SP
main.go:9    JBE 0x450bd3
main.go:9    SUBQ $0x20, SP
main.go:9    MOVQ BP, 0x18(SP)
main.go:9    LEAQ 0x18(SP), BP
main.go:9    MOVQ 0x28(SP), AX
main.go:12   MOVQ AX, 0x10(SP)
main.go:10   MOVL $0x8, 0(SP)
main.go:10   LEAQ 0x218e3(IP), AX
main.go:10   MOVQ AX, 0x8(SP)
main.go:10   CALL runtime.deferproc(SB)
main.go:10   TESTL AX, AX
main.go:10   JNE 0x450bc3
main.go:13   NOPL
main.go:13   CALL runtime.deferreturn(SB)
main.go:13   MOVQ 0x18(SP), BP
main.go:13   ADDQ $0x20, SP
main.go:13   RET
main.go:10   NOPL
main.go:10   CALL runtime.deferreturn(SB)
main.go:10   MOVQ 0x18(SP), BP
main.go:10   ADDQ $0x20, SP
main.go:10   RET
main.go:9    CALL runtime.morestack_noctxt(SB)
main.go:9    JMP main.doDefer(SB)</code></pre><p><a href="https://golang.org/src/runtime/panic.go?s=1703:1741#L63">deferproc</a> and <a href="https://golang.org/src/runtime/panic.go?s=8427:8457#L306">deferreturn</a> are both non-trivial functions and they do a bunch of accounting and setup at entry and exit. In short, don&#8217;t use defer in hot code paths. The overhead is substantial and easy to&nbsp;miss.</p>]]></content:encoded></item><item><title><![CDATA[Memory Mapped Files]]></title><description><![CDATA[Memory mapping of files is a very powerful abstraction that many operating systems support out of the box.]]></description><link>https://i0exception.substack.com/p/memory-mapped-files-5e083e653b1</link><guid isPermaLink="false">https://i0exception.substack.com/p/memory-mapped-files-5e083e653b1</guid><dc:creator><![CDATA[Aniruddha]]></dc:creator><pubDate>Sat, 03 Feb 2018 09:27:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lluD!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26394448-1a6b-4c4d-b5ba-31f799cbc55e_400x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Memory mapping of files is a very powerful abstraction that many operating systems support out of the box. Linux does this via the mmap system call. In most cases where an application reads (or writes) to a file at arbitrary positions, using mmap is a solid alternative to the more traditional read/write system calls. We&#8217;ve used it in the analytics database at Mixpanel to improve performance or make code more readable and I wanted to spend some time figuring out what actually happens under the&nbsp;hood.</p><p>At a high level, the mmap system call lets you read and write to a file as if you were accessing an array in memory. There are two main modes in which files can be mapped&#8202;&#8212;&#8202;MAP_PRIVATE and MAP_SHARED. 
In MAP_PRIVATE, any changes that you make to the file are in memory and not written back to it. In MAP_SHARED, changes made to the file are visible to other memory mappings of that file and are eventually committed to&nbsp;disk.</p><p>To understand what happens on calling mmap, it&#8217;s important to understand two things&#8202;&#8212;&#8202;how linux handles files and how memory addressing works.</p><p>You can open a file for reading or writing using the open system call. This returns a file descriptor. Linux maintains a per-process file descriptor table and adds an entry to it representing the opened file. This entry points to a <a href="https://elixir.free-electrons.com/linux/v4.15/source/include/linux/fs.h#L852">file</a> structure. Internally, linux uses the <a href="https://elixir.free-electrons.com/linux/v4.15/source/include/linux/fs.h#L570">inode</a> struct to represent the file. The file struct has a pointer to this and linux ensures that multiple file descriptors that touch the same file point to the same inode so that their changes are visible to each other. The i_mapping field on the inode struct is what&#8217;s used to get the right set of pages from the page cache for an offset in the&nbsp;file.</p><p>In linux, processes have a virtual memory address space that&#8217;s, well, virtual. This memory is not usually backed by physical memory unless you&#8217;re actually reading or writing to some part of it. Linux further divides the memory space into equal-sized pages and a page is the unit of access as far as the kernel is concerned. So, when a process calls mmap, the short answer is that nothing really happens. The kernel simply reserves some part of this virtual memory address space and returns the address. 
The <a href="https://elixir.free-electrons.com/linux/v4.15/source/mm/mmap.c#L1321">do_mmap</a> function is what eventually gets called after some bookkeeping and does most of the work for allocating this virtual memory in the process&#8217; address space. This function stores a pointer to the file struct in the <a href="https://elixir.free-electrons.com/linux/v4.15/source/include/linux/mm_types.h#L280">vm_area_struct</a> struct that represents the returned&nbsp;address.</p><p>When the process accesses the address, a page fault occurs. The page fault handler locates the vm_area_struct struct in the process&#8217;s address space and eventually finds the pages in the page cache that map to the file offsets being accessed. These pages are marked as dirty if there&#8217;s a write and mapped directly to user space&#8202;&#8212;&#8202;this way there is no need to copy data from kernel to user&nbsp;space.</p><p>Once you&#8217;re done using the memory mapped area, the munmap system call can be used to free up the memory. Any data written to the page cache is periodically committed to disk, although you can force it with msync. While mmap is useful, it definitely has drawbacks. Misses in the page cache always result in the page being read into the cache even if a write is going to overwrite the contents. Offsets need to be aligned to page boundaries. Error handling happens via signals because there is no way to indicate otherwise. And finally, you can&#8217;t mmap all types of file descriptors (pipes, for example). 
As usual, conditions apply&#8202;&#8212;&#8202;so make sure you don&#8217;t use mmap indiscriminately.</p>]]></content:encoded></item><item><title><![CDATA[Writing tests in Go]]></title><description><![CDATA[Recently, I bumped into this article by Segment&#8217;s engineering team.]]></description><link>https://i0exception.substack.com/p/some-thoughts-on-testing-in-go-d5fdf58fa471</link><guid isPermaLink="false">https://i0exception.substack.com/p/some-thoughts-on-testing-in-go-d5fdf58fa471</guid><dc:creator><![CDATA[Aniruddha]]></dc:creator><pubDate>Tue, 23 Jan 2018 10:32:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lluD!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26394448-1a6b-4c4d-b5ba-31f799cbc55e_400x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Recently, I bumped into <a href="https://segment.com/blog/5-advanced-testing-techniques-in-go/">this</a> article by Segment&#8217;s engineering team. It has a lot of good advice and some helpful links about writing good tests in Go. I wanted to discuss a few more things that I&#8217;ve found useful in the two-ish years that I&#8217;ve been using the language.</p><p>For those of you who haven&#8217;t used Go, consider using it for your next side project. It&#8217;s a very opinionated programming language with a spec that you can mostly hold in your head and a fairly comprehensive standard library. Also, unlike most other languages, you don&#8217;t have to deal with a ton of testing frameworks. Although the standard library provides good support for writing tests, I&#8217;ve found the following techniques useful in writing more testable&nbsp;code.</p><h4>It&#8217;s okay to test unexposed functions</h4><p>The default package layout in Go encourages housing code and tests in the same package. Tests can access &#8220;private&#8221; functions and members&#8202;&#8212;&#8202;this is okay. 
I&#8217;ve mostly encountered this in the context of helper functions that are only used within a package. The alternative is to expose them publicly, which has its drawbacks.</p><h4>Control time</h4><p>Avoid using the time package to block or schedule execution. Consider using something like <a href="https://github.com/jonboulle/clockwork">clockwork</a> to pass in a fake, controllable clock in unit tests. Controlling time lets you write more deterministic unit tests. This is useful when you&#8217;re testing behavior that depends on time&#8202;&#8212;&#8202;timeouts, retries, scheduled runs&nbsp;etc.</p><h4>Use Go&#8217;s race&nbsp;detector</h4><p>Data races are really hard to debug. Fortunately, Go has support for detecting them&#8202;&#8212;&#8202;so use it. <a href="https://golang.org/doc/articles/race_detector.html">This</a> is a good starting point to understand how to use the race detector. Remember that it will only test the code paths that your tests execute. So you still need to write a test that exercises the&nbsp;race.</p><h4>Write benchmarks</h4><p>Go makes writing benchmarks easy. <a href="https://golang.org/pkg/testing/#hdr-Benchmarks">This</a> is a good starting point to understand how to write them. Make sure you have benchmarks for the performance-sensitive parts of your&nbsp;code.</p><h4>Use setup functions</h4><p>This is useful if you want to set up some external state that is used by the function or implementation being tested. An example would be something that operates on a directory. Instead of having every test function create a temporary directory and clean up after itself, write a helper function that does&nbsp;this.</p><pre><code>func withTempDir(t *testing.T, f func(d string)) {
 dir, err := ioutil.TempDir(...)
 if assert.NoError(t, err) {
  defer os.RemoveAll(dir)
  f(dir)
 }
}</code></pre><pre><code>func Test(t *testing.T) {
 withTempDir(t, func(dir string) {
  // use dir in test
 })
}</code></pre><h4>Accept interfaces, return&nbsp;structs</h4><p>Interfaces can be mocked; structs cannot. Having interfaces as member variables makes it easy to mock their behavior. Returning structs (concrete implementations) means that the caller gets to decide how to use the returned&nbsp;value.</p><p>That said, use mocks carefully. With mocks, you&#8217;re testing your understanding of the interface at the time the test was written. While this is ideal, it&#8217;s not always practical&#8202;&#8212;&#8202;especially in high-velocity codebases. If you think the underlying implementation is unstable, test it in a separate package to avoid diverging.</p><p>Lastly, use a mock generator like <a href="https://github.com/vektra/mockery">mockery</a> instead of writing them yourself.</p><h4>Use self-referential interfaces</h4><p>This is a neat trick that I&#8217;ve found useful for testing behavior that is either non-deterministic or doesn&#8217;t fit well in a unit test because it makes network calls or depends on an external service. Let&#8217;s say you want to test the behavior of a function A() on a struct of type Foo that makes a non-deterministic function call that uses a member variable (like a network connection) in Foo. An easy way to do this is to move the non-determinism into a function B() on Foo and introduce a new member variable on Foo that satisfies an interface exposed by B() and call B() on this member. The actual code can use the Foo instance itself as the member variable and the tests can provide a mock. The code below should make things&nbsp;clearer.</p><pre><code>package main</code></pre><pre><code>import (
 "fmt"
)</code></pre><pre><code>type doer interface {
 B()
}</code></pre><pre><code>type Foo struct {
 msg string
 d doer
}</code></pre><pre><code>func (f *Foo) B() {
 fmt.Printf("i am non deterministic: %v\n", f.msg)
}</code></pre><pre><code>func (f *Foo) A() {
 f.d.B()
 fmt.Println("test me")
}</code></pre><pre><code>func main() {
 x := &amp;Foo{
  msg:"go",
 }
 x.d = x // x.d = MockDoer() in tests
 x.A()
}</code></pre><p>Although many of these techniques are useful, deciding where to use them is always a judgement call. Choose&nbsp;wisely!</p>]]></content:encoded></item></channel></rss>