Part 2 of 5
By: Dan Cohn, Sabre Labs
The first post in this series relays how we at Sabre began a multi-year effort to increase the efficiency and morale of software developers across the company. Early in the journey, we decided to house our source code in a monorepo, a single repository that would encourage consistency, foster transparency, and simplify collaboration among teams. It would serve as the foundation for many of our “EngProd” (engineering productivity) efforts.
As we ventured into the realm of monorepos, it was clear that we’d have to overcome some technical challenges. For starters, how do you clone (or download) a large monorepo into each user’s development environment without consuming excessive amounts of time, storage, and bandwidth? How do you maintain uniformity across the many contents of the repo? And how do you prevent users from “bloating” the shared repo with large binary files or other questionable content?
Although monorepos are not a new concept, they fell out of favor in the early days of Git, giving way to smaller repositories containing individual microservices or software libraries. Recently, a few trailblazing software companies have begun adopting monorepos once again. Some companies – most famously Google – have been using monorepos for years but built them on proprietary platforms with homegrown tooling.
One of our primary objectives was to reuse as much free and open-source software as possible. We were certainly not eager to invent our own version control system. On the contrary, we wanted to make Git work for us because of its widespread use within Sabre and popularity in the software development world at large. Git was by far the most commonly used tool among professional software developers in Stack Overflow’s 2021 Developer Survey, with a whopping 94.41% saying they use Git in their work.
Git is a distributed version control system (DVCS) that helps track changes in a set of files. With a DVCS, each user has their own copy – known as a clone – of the centralized repository. This is beneficial from a reliability standpoint and enables users to continue working while offline. However, the “everyone has their own private copy of the repo” approach becomes unwieldy when dealing with larger repositories. Fortunately, Git provides several features that facilitate working with large repos:
- Sparse checkouts
- Shallow clones
- Partial clones
- Single branch clones
- Sparse indexes
Monorepo-friendly Git features
Sparse checkouts limit the number of files available for viewing and editing in a user’s local checkout of the repo. This is especially valuable in a monorepo with many files and directories. Aside from the obvious benefit of reducing load on the filesystem, sparse checkouts minimize the number of files that Git must unpack when first cloning the repo and later when switching branches and pulling down updates. They also mean fewer files to track for changes. This is a critical feature from a usability standpoint since it allows developers to focus on their own project files and prevents tools like IDEs from having to index the entire repository (for auto-completion, cross-reference, and search purposes).
Shallow clones restrict the number of commits – transactions that capture snapshots of the project’s changes over time – that Git downloads from the remote server. With a shallow clone, the commit log displays only the most recent commits on a given branch. This feature allows you to select a desired commit “depth” and later increase or decrease it on subsequent fetches. Fewer commits mean less data to download and process.
Partial clones enable you to postpone the download of certain objects until they are required. When used in conjunction with sparse checkout, the Git client avoids downloading file contents (known as “blobs”) for files that aren’t visible in the working directory tree. The result is a much faster clone with a smaller footprint. It is worth noting that this feature is relatively new and requires server-side support. The types of filters that are available depend on what the SCM offers.
Combining these three features gives you a clone command that looks something like this:
git clone --no-checkout --filter=blob:none --depth=100 https://…
This command produces a partial, shallow clone of the specified repo with no files checked out. On its own, this is of limited use because you end up with an empty directory (other than the “.git” directory where Git keeps its data). The next step is to activate the sparse checkout feature and add file patterns of interest. We take advantage of the “cone pattern set” to improve performance and simplify the sparse checkout patterns to what is essentially a list of directory paths. For example:
git sparse-checkout init --cone
git checkout main
git sparse-checkout add apple banana carrot
The result is a partial checkout of the “main” branch with all the root-level files present along with the full contents of the apple, banana, and carrot directories.
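The clone and sparse-checkout steps above can be wrapped in one small shell function. This is only a sketch: the name sparse_clone and its argument order are invented for illustration, the branch name "main" and the depth of 100 are copied from the commands above, and the server is assumed to support partial clone filters.

```shell
#!/bin/sh
# sparse_clone URL DEST [DIR...]
# Hypothetical helper (not part of Git): partial, shallow, sparse clone of "main".
sparse_clone() {
    url=$1
    dest=$2
    shift 2

    # Blob-less, shallow clone with nothing checked out yet.
    git clone --quiet --no-checkout --filter=blob:none --depth=100 "$url" "$dest" || return 1

    # Turn on cone-mode sparse checkout, then populate the root-level files.
    git -C "$dest" sparse-checkout init --cone || return 1
    git -C "$dest" checkout --quiet main || return 1

    # Add any directories the caller asked for.
    if [ "$#" -gt 0 ]; then
        git -C "$dest" sparse-checkout add "$@" || return 1
    fi
}
```

Calling sparse_clone with a repository URL, a destination directory, and the names apple, banana, and carrot would reproduce the checkout described above.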
As it turns out, this process has also produced a single branch clone containing the commit history for just one branch (the main branch). Although there is a separate flag for it (--single-branch), use of the “depth” option when cloning implies a single branch clone. What does that mean? It means that Git does not fetch remote branch references and all their associated commits. For a monorepo that may have thousands of branches, this is another performance boost for the initial clone. On the downside, viewing the list of available branches and checking one of them out requires a set of commands that may be less familiar to many Git users. For example:
# list all branches in the repo
git ls-remote -h origin
# checkout a remote branch named "development"
git config --add remote.origin.fetch '+refs/heads/development:refs/remotes/origin/development'
git fetch origin development
git checkout development
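As a possible shorthand for the refspec edit above, git remote set-branches --add registers an extra branch on an existing remote. The sketch below assumes the remote is named origin and is run from inside the clone; the function name track_branch is hypothetical.

```shell
#!/bin/sh
# track_branch BRANCH
# Hypothetical helper: start tracking one more remote branch in a
# single-branch clone, then check it out.
track_branch() {
    branch=$1

    # Adds +refs/heads/BRANCH:refs/remotes/origin/BRANCH to remote.origin.fetch.
    git remote set-branches --add origin "$branch" || return 1

    # Fetch only that branch, then create and switch to the local branch.
    git fetch --quiet origin "$branch" || return 1
    git checkout --quiet "$branch"
}
```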
The fifth and final feature in our list is the sparse index. This new feature was officially released in June 2022 as part of Git 2.37 and is touted to significantly improve the performance of many Git operations on a large repo with a sparse checkout. Truth be told, we have not started using it yet, but I look forward to trying it out. You can enable the sparse index feature in conjunction with sparse checkout as follows:
git sparse-checkout init --cone --sparse-index
Establishing monorepo-friendly policies
Git features will take you only so far when it comes to maintaining a monorepo shared by many disparate teams. Like a communal living space in a large multi-family home, the environment will quickly become unlivable if you don’t establish some ground rules and figure out how to enforce them. We made several decisions early on (and added some later) with the goal of maintaining a healthy repo for our user community. They include:
- A branching model
- Code ownership
- Repository rules
- Large file handling
- Merge strategy
- Naming conventions
- Pre-commit hooks
We follow a trunk-based development model with short-lived topic branches – this means that developers frequently merge small updates into the main branch. This approach has two main advantages. First, it creates discipline around the content of the main branch, ensuring that it is always reasonably up to date and stable from a code-build perspective. Second, short-lived topic branches result in more frequent but smaller pre-merge code reviews (typically known as pull requests) that are more manageable for code owners. By nature, these branches contain fewer commits, in most cases created by a single developer per branch. This means there is a clear owner for each branch, and most code integration occurs on the main branch, or trunk.
In addition, we discourage the use of release branches and require a formal exception to create one. This limits the number of long-lived branches that might clutter the repo. It also aligns with our goal of delivering software faster through continuous integration and deployment (CI/CD). The topic of release branches could easily fill its own blog post, so let’s set it aside for now.
Code ownership is a must for almost any repo and essential for a monorepo. Code owners are defined in GitHub-style CODEOWNERS files sprinkled throughout the repo. Each file designates a list of users responsible for the files in a particular directory tree. A lower-level CODEOWNERS file overrides the one found in a higher-level directory. One or more of the owners must approve each change – in the form of a pull request – before the author is allowed to merge it into the main branch. Please note that code ownership and pull requests are not core Git features and require server-side support that varies from product to product.
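To make the override rule concrete, here is a rough sketch of how the governing CODEOWNERS file for a path could be located: walk up from the file's directory and stop at the first CODEOWNERS file found. The function name find_codeowners is made up for this example, and a real Git server may resolve ownership differently.

```shell
#!/bin/sh
# find_codeowners REPO_ROOT RELATIVE_PATH
# Hypothetical helper: print the nearest CODEOWNERS file (relative to the
# repo root) that governs RELATIVE_PATH, or fail if there is none.
find_codeowners() {
    root=$1
    dir=$(dirname "$2")

    while :; do
        if [ "$dir" = "." ]; then
            candidate=CODEOWNERS
        else
            candidate=$dir/CODEOWNERS
        fi

        # The closest file wins, so stop at the first match.
        if [ -f "$root/$candidate" ]; then
            echo "$candidate"
            return 0
        fi

        # Reached the repo root without finding a CODEOWNERS file.
        if [ "$dir" = "." ]; then
            return 1
        fi
        dir=$(dirname "$dir")
    done
}
```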
Again, depending on what the Git server supports, there are often many configurable repository rules, some in the form of server-side “hooks” (scripts that are triggered automatically whenever a certain action occurs in the repository). We take advantage of these rules to help maintain consistency, including the following:
- No forking. Imagine if users created a fork of the entire monorepo for every new project or change they needed to make. Not only would this be a huge strain on the Git server, but it would go against the convention of submitting changes via topic branches within the monorepo itself.
- No submodules. Submodules add complexity and create more of a multi-repo environment masquerading as a monorepo.
- No commits to trunk without a pull request. There are some exceptions to this rule to allow for automation and emergency fixes, but the general policy is that users cannot push commits directly to the main branch.
- No history rewrites on trunk. Hopefully the reasons are obvious. I shudder to think of the implications of amended or deleted commits on the main branch shared by everyone and used as the source of all topic branches.
Large file handling relates back to clone and checkout performance. Even with partial cloning, the presence of extremely large files would create a burden for the Git server and for anyone who inadvertently checked out a directory containing one or more of these files. In the early days of our monorepo, we tried a solution called Git LFS (large file storage) that replaces files above a certain size with a URL and stores the actual files in a separate repository. Unfortunately, in addition to being cumbersome for each user to set up, LFS does not play nicely with sparse checkouts. We learned this the hard way and decided to abandon it in favor of file size limits. In our case, the maximum file size is 5 MB unless it falls in the category of “files that don’t really belong in a code repository” – in other words, most binary files except graphical images for web UIs – in which case the limit is 1 MB. We enforce this through a combination of client and server-side hooks.
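A client-side size check along these lines could run from a pre-commit hook. This is a sketch under stated assumptions, not our actual hook: the function name check_staged_sizes is invented, "binary" simply means whatever Git itself reports as binary, and the exemption for web UI images is left out.

```shell
#!/bin/sh
# Limits taken from the text above; the binary detection is an assumption.
MAX_TEXT_BYTES=$((5 * 1024 * 1024))    # 5 MB for ordinary files
MAX_BINARY_BYTES=$((1 * 1024 * 1024))  # 1 MB for binary files

# check_staged_sizes
# Report staged (added or modified) files over their limit on stderr and
# return non-zero if any are found. Run from inside the repository.
check_staged_sizes() {
    tab=$(printf '\t')
    oversized=$(
        # --numstat prints "-<TAB>-<TAB>path" for files Git treats as binary.
        git diff --cached --numstat --no-renames --diff-filter=AM |
        while IFS="$tab" read -r added deleted path; do
            # Size of the staged blob, not the working-tree file.
            size=$(git cat-file -s ":$path") || continue
            if [ "$added" = "-" ]; then
                limit=$MAX_BINARY_BYTES
            else
                limit=$MAX_TEXT_BYTES
            fi
            if [ "$size" -gt "$limit" ]; then
                echo "$path ($size bytes, limit $limit)"
            fi
        done
    )

    if [ -n "$oversized" ]; then
        echo "Files over the size limit:" >&2
        echo "$oversized" >&2
        return 1
    fi
    return 0
}
```

File names containing tabs or newlines are not handled; a production hook would want the NUL-delimited output of git diff -z instead.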
Although arguably less important, it is good to have a standard merge strategy for your repository. Our preferred strategy is known as Squash (or --squash). It combines the commits from the source branch into a single commit on trunk that (typically) references the pull request associated with the merge. This strategy minimizes the number of commits on the main branch, thereby making the commit history a little more manageable.
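Git servers normally perform the squash when a pull request is merged, but the local equivalent is short. The sketch below assumes the trunk is named main; the function name squash_merge is hypothetical.

```shell
#!/bin/sh
# squash_merge TOPIC_BRANCH MESSAGE
# Hypothetical helper: fold every commit on TOPIC_BRANCH into a single
# commit on main. Run from inside the repository.
squash_merge() {
    topic=$1
    message=$2

    git checkout --quiet main || return 1

    # Stage the combined changes from the topic branch without committing.
    git merge --squash --quiet "$topic" || return 1

    # Record them as one commit on trunk.
    git commit --quiet -m "$message"
}
```

However many commits the topic branch contains, main gains exactly one.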
Naming conventions provide another level of consistency across the monorepo. This applies primarily to directories and branches. We disallow spaces and certain specific characters in directory names that tend to create problems for tools that interface with the repo. Top-level directory names must be lowercase with underscores in place of spaces or hyphens. By convention, we try to limit the number of top-level directories by using broad categories of functionality.
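A check for the top-level naming rule might look like the following sketch. The function name is invented, and allowing digits is an assumption since the rule above mentions only lowercase letters and underscores.

```shell
#!/bin/sh
# is_valid_top_level_name NAME
# Hypothetical check: succeed only if NAME consists of lowercase letters,
# digits (an assumption), and underscores.
is_valid_top_level_name() {
    case $1 in
        # Reject empty names and names containing any other character.
        # The characters are listed explicitly to avoid locale-dependent ranges.
        '' | *[!abcdefghijklmnopqrstuvwxyz0123456789_]*) return 1 ;;
        *) return 0 ;;
    esac
}
```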
Finally, in the vein of ensuring consistency, our monorepo (like many other code repositories) requires the use of pre-commit hooks. These are specific automated checks that run when a developer commits a set of changes to their local repo. We have close to 40 hooks that run the gamut from linters and style formatters to security checks and more. Many are language-specific, so they only apply to certain file types.
Orchestration for a consistent user experience
By now you may be feeling a little overwhelmed by all the practices and conventions and wondering how you might train thousands of software engineers to follow them. In particular, the Git commands required to interact with the monorepo in an efficient manner are somewhat arcane and cumbersome. Git has powerful features, but it takes experience to know when and how to leverage them.
For this reason, we decided to create our own command line tool to replace, or at least supplement, the standard “git” client. As much as we wanted to avoid building our own custom tooling, we felt this was the best way to maintain consistency and create a better user experience through automation. To reduce the learning curve, we modeled the command line interface (CLI) after Git and reused as much Git functionality as possible. In other words, our CLI is a Git wrapper; it overlays the existing tool and assists the user by orchestrating some of its commands.
Our CLI is called “s2”, an abbreviation for sabre2, the name of our monorepo. The most basic function it provides is s2 clone, a single preparatory command that generates a minimal clone of the repo with a small set of default directories. It runs in under 30 seconds. Adding to the sparse checkout is a cinch with s2 get, a command that behaves just like git sparse-checkout add but with added features, such as dealing with relative paths inside subdirectories, and less typing. In truth, one of the initial motivations for the “s2” tool was the “experimental” nature of git sparse-checkout (according to its early documentation).
I won’t bore you with a rundown of the various “s2” commands. However, it has a few features that demonstrate the value of orchestration. For instance, the command for rebasing a branch starts out by fetching the latest main branch commits and, if necessary, increasing the commit depth until it finds the base commit for the branch to be rebased. The same is true for s2 update, a command that updates (“pulls”) the current branch with the latest commits from the Git server and filters out the “fast-forward” information that often fills many screens and provides little value in a monorepo because of the quantity of daily merges into the main branch.
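The deepen-until-found idea can be sketched with plain Git commands. This is not the s2 implementation: the function name find_rebase_base, the step of 100 commits, and the cap of 50 rounds are all assumptions, and the remote is assumed to be named origin.

```shell
#!/bin/sh
# find_rebase_base [BRANCH] [STEP] [MAX_ROUNDS]
# Hypothetical helper: fetch BRANCH, then deepen its history in steps until
# it shares a commit with HEAD. Prints that merge base on success.
find_rebase_base() {
    branch=${1:-main}
    step=${2:-100}
    max_rounds=${3:-50}

    # Pick up the latest commits on the target branch.
    git fetch --quiet origin "$branch" || return 1

    rounds=0
    # git merge-base exits non-zero when no common ancestor is available locally.
    until git merge-base HEAD "origin/$branch" >/dev/null 2>&1; do
        rounds=$((rounds + 1))
        # Give up rather than loop forever if the histories never connect.
        if [ "$rounds" -gt "$max_rounds" ]; then
            return 1
        fi
        # Extend the shallow boundary of the target branch by STEP commits.
        git fetch --quiet --deepen="$step" origin "$branch" || return 1
    done

    git merge-base HEAD "origin/$branch"
}
```

Once the function prints a commit, a regular git rebase onto origin/main has the history it needs.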
Another unique feature is the s2 find command. One of the challenges of a sparse checkout is the inability to search the full contents of the repo when looking for certain code references or examples. Search is possible through the Git server, but that requires opening a web browser and doesn’t offer the power of shell tools for filtering and parsing the output. s2 find takes advantage of the git grep command to search the entire repo and not just the contents of the sparse checkout. But there’s a catch! The use of partial cloning renders git grep useless because Git has no local copy of files that have not been checked out. Our solution is to establish a local cache of the full repo (with a “.git” directory only and no files checked out) on the first use of s2 find and then refresh this copy on subsequent calls.
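The caching approach can be approximated as follows. This sketch substitutes a bare clone for the ".git"-only cache described above, searches only the main branch, and uses an invented function name (repo_find) and argument order; the real s2 find may work differently.

```shell
#!/bin/sh
# repo_find URL CACHE_DIR PATTERN
# Hypothetical helper: keep a full bare clone of main in CACHE_DIR and
# search it with git grep, independent of any sparse checkout.
repo_find() {
    url=$1
    cache=$2
    pattern=$3

    if [ -d "$cache" ]; then
        # Later calls: refresh the cached copy of main.
        git --git-dir="$cache" fetch --quiet origin \
            '+refs/heads/main:refs/heads/main' || return 1
    else
        # First call: full clone (all blobs) with no working tree.
        git clone --quiet --bare --single-branch --branch main "$url" "$cache" || return 1
    fi

    # Searching a tree-ish works in a bare repo; matches are prefixed with "main:".
    git --git-dir="$cache" grep -n -e "$pattern" main
}
```

Because the output is ordinary text, it can be piped through the usual shell tools for further filtering.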
With a little effort, it is indeed possible to coax Git to work well with large repositories. The maintainers of Git clearly have monorepos in mind as they introduce new capabilities like the sparse index. Nevertheless, wide-scale use of Git with a monorepo is best attempted with automation and a strong set of policies that promote consistency and maintainability.
Are there still challenges? Definitely. One of them is how to distribute tools like “s2” and other prerequisites that developers need when working with the monorepo, including those required by the pre-commit hooks. And what about build tools? Should the monorepo have a single build tool to rule them all or a mix of tools for different languages or team preferences? These are some of the topics we will tackle in parts 3 and 4 of this series.