Using Git Subtree for Repository Mirroring

In a recent client project, we needed to mirror a specific subdirectory from our GitLab monorepo to a GitHub repository, complete with version history. This article explains why we chose Git Subtree over simpler approaches, the challenges we encountered, and how we implemented a mirroring solution in our CI pipeline.

The Original Problem

Our team maintains a monorepo in GitLab containing both app and deployment code. The directory structure looks like this:

├── deployments/
├── app/   (Root directory in GitHub)
├── tests/
├── .gitlab-ci.yml
├── .dockerignore
├── Dockerfile
└── README.md

A client requires access to only the app portion of the codebase, hosted on GitHub. We set the following requirements:

  • Only the app/ code should be visible, and it should sit at the repository root.
  • Tags and new versions of the codebase should mirror seamlessly.
  • The solution needs to be automated through GitLab CI.
  • Ideally, version history is preserved.

At first, we initialized a new Git repository in app/ and pushed the code from app/ to GitHub. While this worked, GitHub displayed a concerning warning for every push/tag:

This commit does not belong to any branch in this repository, and may belong to a fork outside of this repository.

This warning raised questions about whether our workflow followed best practices, and it could also worry the client, so we looked for a better solution.

What is Git Subtree?

Git Subtree is a contributed module (it ships in Git's contrib/ directory) that enables you to nest one repository inside another as a subdirectory while maintaining a separate history. The key command we use is git subtree split, which extracts the history of a subdirectory and creates a synthetic history containing only that content. It prints the hash of the newest commit in that history and can optionally store it in a new branch.

A visualization of what Git Subtree split does

For instance, let's imagine the main branch of our monorepo has six commits and we want to push only the app directory:

A---B---C---D---E---F (main)

Where each commit touches different parts of the repo:

  • A: add app skeleton (app/)
  • B: add test scripts (tests/)
  • C: update app styles (app/)
  • D: modify CI jobs (.gitlab-ci.yml)
  • E: new app feature (app/)
  • F: update README (README.md)

So the history is mixed: some commits touch app/, some don’t.

We now run git subtree split --prefix=app main. This walks the main branch and keeps only the commits that touched app/, rewriting each one so that app/ becomes the repository root. The command prints the hash of a new commit that sits at the tip of the following history:

A'---C'---E'

  • A' = copy of A, but with repo root = app/
  • C' = copy of C (app changes only)
  • E' = copy of E (app changes only)
  • Commits B, D, and F are dropped, because they never touched app/.

The hash that git subtree split prints belongs to E', the tip of this new history. We can now push E' just as we would push any regular commit.
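To try the walkthrough on a real repository, the six commits can be recreated in a throwaway directory. This is only an illustrative sketch: the file names, the commit_file helper, and the author identity are invented for the demo, and it assumes git subtree is installed.

```shell
#!/bin/sh
set -eu

# Work in a temporary directory so no real repository is touched.
demo=$(mktemp -d)
cd "$demo"
git init -q monorepo
cd monorepo
git config user.name "Demo Author"
git config user.email "demo@example.com"

# Hypothetical helper: append a line to one file and commit it.
commit_file() {
    mkdir -p "$(dirname "$1")"
    echo "$2" >> "$1"
    git add "$1"
    git commit -q -m "$2"
}

commit_file app/index.html  "A: add app skeleton"
commit_file tests/run.sh    "B: add test scripts"
commit_file app/style.css   "C: update app styles"
commit_file .gitlab-ci.yml  "D: modify CI jobs"
commit_file app/feature.js  "E: new app feature"
commit_file README.md       "F: update README"

# split prints the hash of the tip of the rewritten history (E' in the text).
split_tip=$(git subtree split --prefix=app HEAD)

# Commit messages of the rewritten history, oldest first.
git log --reverse --format='%s' "$split_tip"

# Files at the root of the rewritten tip: the former contents of app/.
git ls-tree --name-only "$split_tip"
```

The log should list only the A, C, and E commit messages, and the tree listing should show the files from app/ at the top level, without the app/ prefix.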

This example was created with the help of ChatGPT.

Why Git Subtree?

The warning we encountered on GitHub stems from how Git tracks commits across repositories, and from our method of pushing the code. When you force-push commits from one repository to another without properly rewriting history, Git maintains references to the original parent commits that don't exist in the destination repository. This creates orphaned commits that technically work but violate Git's expected commit graph structure. If you want to learn more, this Stack Overflow answer covers the topic in depth.

Git Subtree solves this problem by:

  • Extracting only the relevant subdirectory’s history.
  • Rewriting commit hashes to create a clean, independent history.
  • Preserving commit messages and author information.

Unlike git push --force, which blindly copies commits, Git Subtree creates a proper commit graph that GitHub can recognize as a legitimate version history.
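The points above can be checked on a small scratch repository. The sketch below uses an invented one-commit repository and author identity, and assumes git subtree is installed. It compares a monorepo commit with its split counterpart.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Demo Author"
git config user.email "demo@example.com"

# One commit that touches both app/ and a file outside it.
mkdir app
echo "hello" > app/main.txt
echo "docs" > README.md
git add .
git commit -q -m "add app and README"

orig=$(git rev-parse HEAD)
split=$(git subtree split --prefix=app HEAD)

# Author and message are carried over to the rewritten commit...
orig_meta=$(git log -1 --format='%an <%ae> | %s' "$orig")
split_meta=$(git log -1 --format='%an <%ae> | %s' "$split")
echo "original: $orig | $orig_meta"
echo "split:    $split | $split_meta"

# ...but the hash differs, because the commit now records only the app/ tree.
# The first split commit has no parent, so nothing refers back to the monorepo.
parents=$(git log -1 --format='%P' "$split")
echo "parents of split commit: '${parents}'"
```

The two lines should show identical author and message fields next to different hashes, and the parent list of the split commit should be empty.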

Installation Notes

Git Subtree isn’t always included by default, so installation varies by platform:

  • Alpine Linux: apk add git-subtree
  • Ubuntu/Debian: Usually included with the git package, so apt-get install git is typically enough.
  • macOS: Included with Git when installed via Homebrew (brew install git).
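Before relying on it in a pipeline, it can help to confirm that the helper is actually present. The following is a minimal sketch. The has_git_subtree function name is made up for illustration, and the check assumes the helper is installed under the name git-subtree, either in Git's exec path or somewhere on PATH.

```shell
#!/bin/sh
# Git looks for external subcommands such as git-subtree in its exec path
# and on PATH, so check both locations.
has_git_subtree() {
    [ -x "$(git --exec-path)/git-subtree" ] || command -v git-subtree >/dev/null 2>&1
}

if has_git_subtree; then
    subtree_status="available"
else
    subtree_status="missing"
fi
echo "git subtree: $subtree_status"
```

A CI job could run a check like this early and fail with a clear message instead of failing later at the split step.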

Our Implementation of Git Subtree

Here’s our GitLab CI job that mirrors the app directory to GitHub:

mirror to github:
  stage: build
  before_script:
    - apk add --no-cache openssh git-subtree
    # setup github deployment key
  script:
    # setup git config 
    - git remote add github ssh-link
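    # CI clones are often shallow; subtree split needs the full history
    - git fetch --unshallow
    # split prints the hash of the tip of the app/-only history
    - SUBTREE_COMMIT=$(git subtree split --prefix=app $CI_COMMIT_REF_NAME)
    # move the tag from the monorepo commit to the subtree commit
    - git tag -d $CI_COMMIT_TAG
    - git tag -a $CI_COMMIT_TAG $SUBTREE_COMMIT -m "$CI_COMMIT_MESSAGE"
    # publish the subtree history as main, together with the tag
    - git push github $SUBTREE_COMMIT:main $CI_COMMIT_TAG
  rules:
    # run only in tag pipelines
    - if: $CI_COMMIT_TAG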

Let’s break it down:

  1. git fetch --unshallow: GitLab CI often uses shallow clones for performance. We need the full history for subtree operations to work correctly.
  2. git subtree split --prefix=app $CI_COMMIT_REF_NAME: This is where the magic happens. It extracts all commits affecting the app/ directory and prints the hash of a new commit representing that filtered history. We pass $CI_COMMIT_REF_NAME explicitly so the split runs against the expected branch or tag; during development, the job sometimes split a stale branch instead of the tag that triggered it.
  3. git tag -d $CI_COMMIT_TAG: We delete the existing tag (we create and push a tag to GitLab in order to run this job) before creating a new one. This is necessary because the original tag points to the monorepo commit, but we need it to point to the new subtree commit hash instead.
  4. git push github $SUBTREE_COMMIT:main $CI_COMMIT_TAG: We push both the subtree commit to the main branch and the newly created tag in a single operation.
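The steps above can be rehearsed locally before touching a real pipeline. The sketch below makes several assumptions: a bare repository on disk stands in for the GitHub remote, the CI_* variables are set by hand to mimic what GitLab would provide, and the tag name and file contents are invented. It also adds two defensive tweaks that are not part of our job. It only unshallows when the clone is shallow, since git fetch --unshallow fails on a complete clone. It also spells out refs/heads/main, because Git cannot guess the ref type when a bare hash is pushed to a branch that does not exist yet.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
cd "$work"

# Stand-in for the GitHub repository (assumption for this demo).
git init -q --bare github.git

git init -q monorepo
cd monorepo
git config user.name "Demo Author"
git config user.email "demo@example.com"

mkdir app
echo "v1" > app/main.txt
echo "ci" > .gitlab-ci.yml
git add .
git commit -q -m "initial app and CI"
git tag v1.0.0    # in GitLab, pushing this tag would trigger the job

git remote add github "$work/github.git"

# Hand-set stand-ins for GitLab's predefined variables.
CI_COMMIT_TAG="v1.0.0"
CI_COMMIT_REF_NAME="$CI_COMMIT_TAG"
CI_COMMIT_MESSAGE="release v1.0.0"

# Unshallow only if needed; a complete clone rejects --unshallow.
if [ "$(git rev-parse --is-shallow-repository)" = "true" ]; then
    git fetch --unshallow
fi

SUBTREE_COMMIT=$(git subtree split --prefix=app "$CI_COMMIT_REF_NAME")

# Re-point the tag at the subtree commit.
git tag -d "$CI_COMMIT_TAG" >/dev/null
git tag -a "$CI_COMMIT_TAG" "$SUBTREE_COMMIT" -m "$CI_COMMIT_MESSAGE"

# Push the subtree history as main together with the tag.
git push -q github "$SUBTREE_COMMIT:refs/heads/main" "$CI_COMMIT_TAG"

# Inspect the mirror: files on main, and the commit the tag resolves to.
git --git-dir="$work/github.git" ls-tree -r --name-only main
git --git-dir="$work/github.git" rev-parse "v1.0.0^{commit}"
```

After the push, the mirror's main branch should contain only main.txt at its root, and the tag should resolve to the same commit as main.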

A Note About Git Subrepo

Some further research led us to git-subrepo, a third-party tool that aims to improve on both Git Subtree and git submodule. git-subrepo offers some advantages:

  • Clear and intuitive command-line syntax.
  • Better handling of bidirectional syncing of repositories. Bidirectional means that we would not only push to the target repo, but also pull new changes from it.
  • More robust for complex workflows and multiple target repository locations.

In our particular use case, we don’t expect our client’s GitHub repo to receive any contributions; our mirroring case is unidirectional (we expect to only push to the target repo, not pull). Git Subtree is sufficient for our straightforward mirroring needs.

Final Thoughts

Git Subtree transformed our mirroring workflow from a hacky force-push solution to a clean, Git-compliant process. The GitHub warnings disappeared, our commit history is current, and the client receives exactly what they need.

The key lesson: when Git complains about your workflow with warnings, it’s usually trying to tell you there’s a better way. In our case, Git Subtree was that better way :).

Resources