Sebastian participated in a CTF (capture the flag) a couple of months ago. One challenge he faced was restoring a git repository from a webserver with directory listing enabled. With directory listing, it was pretty easy, but Sebastian was curious whether it’s possible to restore git repositories without directory listing, and how common this misconfiguration is.
With that idea in mind, we began to develop some tiny tools and started to do some research. The results were not as bad as anticipated, but nevertheless surprising.
Some websites host their version control repository (e.g. .git/) in production. Bad people can use tools to download/restore the repository to gain access to your website’s sourcecode. Check your webserver’s configuration now and make sure that it blocks access to these folders.
What is a version control system?
Decades ago, programmers were facing a serious problem: developing a tool together, often remotely. To solve it, version control systems (VCS) were created. Their primary task is to make distributed work on one codebase possible, which they achieve by keeping track of each code change (often called a commit). One well-known VCS is git, which was created by the famous Linus Torvalds. You have probably also heard of the online code hosting platform github.com, which offers to host your
During our research, we focused on the
There are plenty of other VCS which we’re not covering here, but the same issue may apply too
- SVN (Apache Subversion) - Another scan netted us about 900 vulnerable sites
- HG (Mercurial)
- CVS (Concurrent Versions System)
Why is hosting your VCS in production bad?
When deploying a web application, some administrators simply clone the repository. Most version control systems create a meta/tracking folder in the root directory of the project. For example:
- Git creates a .git/ folder containing a full copy of the repository.
- Others may follow a similar approach (e.g. SVN)
You probably get the idea of what bad guys can do if you do not deny access to these repository folders on the server. Not only do they almost certainly contain your website’s sourcecode and all previous revisions, but sometimes also configuration files with sensitive information. This gives attackers a kick-start for hacking your website, because they can use the sourcecode to find more severe security issues.
Downloading the website’s sourcecode
So let’s get to the interesting part - how do we download and restore the aforementioned repositories to get access to the website’s sourcecode? Basically there are two ways to do it:
- Easy way, if webserver has directory listing enabled
- Hard way, otherwise
As mentioned before, most version control systems manage the repository in a lot of small files (objects). The filenames are often the result of a hash function, so guessing them is hard. We need to find a way to obtain as many of those files as possible.
The easy way
First of all, it’s considered bad practice to have directory listing enabled on your production server. If you have, stop reading and fix that first :) - but that’s not enough to stop the attacker. (see “The hard way”)
Directory-listing helps the attacker a lot, because all he has to do is to issue one command to download all files.
It’s enough to run the following wget command:
After the download has finished, switch into the new folder. A fancy shell (e.g. fish or zsh) should tell you that you’re in a git-tracked directory (see the ‘master’-branch hint).
git status shows only deleted files, because we only have downloaded the
git checkout -- . to reset the repository, we recovered all files.
We have successfully obtained a copy of the website’s sourcecode.
The hard way
As mentioned before, our intention was to see if there’s a way to download the repository without directory listing. For this, you have to dig a bit into the git internals to understand how git manages the repository.
We won’t go into too much detail here (we recommend the appropriate chapters on git-scm.com/book for interested readers), but basically there are three kinds of objects in a git repository:
- Blob - The actual data (e.g. sourcecode)
- Tree - Grouping blobs together
- Commit - A specific state of a tree with more meta information (e.g. author/date/message)
All these together are used by git under the hood to maintain the repository. However, the problem we face is that these objects are stored as .git/objects/[first-2-hex-characters]/[last-38-hex-characters] files, where [first-2-hex-characters][last-38-hex-characters] is the SHA1 hash of the object written as 40 hexadecimal characters. We need to be smart and guess/extract the filenames of all objects to completely restore the repository, because brute forcing the SHA1 keyspace would be far too time consuming.
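To make the path layout concrete, here is a minimal Python sketch that maps an object hash to the location it would have on the server. The function name object_path is purely illustrative; it is not part of git or of the tools described in this post.

```python
import re

HEX40 = re.compile(r"[0-9a-f]{40}")

def object_path(sha1_hex):
    """Return the relative path of a loose object for a given SHA1 hash."""
    sha1_hex = sha1_hex.lower()
    if not HEX40.fullmatch(sha1_hex):
        raise ValueError("expected a 40-character hexadecimal SHA1 hash")
    # The first two hex characters name the directory, the other 38 the file.
    return ".git/objects/{}/{}".format(sha1_hex[:2], sha1_hex[2:])

print(object_path("6916ae52c0b20b04569c262275d27422fc4fcd34"))
# prints .git/objects/69/16ae52c0b20b04569c262275d27422fc4fcd34
```

Appending the returned path to the website’s base URL gives the address to request for that object.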
What helps us a lot is the fact that there are some standard files in a git repository:
These files either reference an object by its hash or point to another file that references an object, and so on. Thus the easiest way is to start by downloading and parsing the aforementioned files. The hashes they contain tell us which object files to download next.
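The parsing step can be sketched roughly as follows. This is an assumption about how such a parser might look, not the code of our tools: it treats every 40-character hexadecimal string as an object hash and every "ref: ..." line as a pointer to another file (as found, for example, in git’s HEAD file). The names parse_ref_file, SHA1_RE and SYMREF_RE are hypothetical.

```python
import re

SHA1_RE = re.compile(r"\b[0-9a-f]{40}\b")
SYMREF_RE = re.compile(r"^ref: (\S+)$", re.MULTILINE)

def parse_ref_file(text):
    """Split a downloaded file into (symbolic references, object hashes)."""
    symrefs = SYMREF_RE.findall(text)   # e.g. "refs/heads/master"
    hashes = SHA1_RE.findall(text)      # e.g. a commit hash
    return symrefs, hashes

# A file that points to another file inside .git/ ...
print(parse_ref_file("ref: refs/heads/master\n"))
# ... and a file that names an object directly.
print(parse_ref_file("6916ae52c0b20b04569c262275d27422fc4fcd34\n"))
```

Each symbolic reference is another file to request below .git/, and each hash is an object to download.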
So for example, we have downloaded the
master points to a commit with the hash 6916ae52c0b20b04569c262275d27422fc4fcd34. After downloading the commit object from the server (note that the URL should be .git/objects/69/16ae52c0b20b04569c262275d27422fc4fcd34), we can analyse it further:
This tells us that the downloaded object is indeed a commit. Let’s get some details about it:
Okay, now we know the hash of the related tree and parent object as well as some information about the author, the committer and the commit message.
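For readers who want to see what happens under the hood, the same information can be read without git. A loose object is zlib-compressed and starts with a header of the form "type size", followed by a NUL byte and the content. The sketch below relies on that layout; the helper names read_loose_object and parse_commit, as well as the sample commit, are made up for illustration.

```python
import zlib

def read_loose_object(compressed):
    """Decompress a loose object and split it into (type, body)."""
    raw = zlib.decompress(compressed)
    header, _, body = raw.partition(b"\x00")
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("object size does not match its header")
    return obj_type.decode("ascii"), body

def parse_commit(body):
    """Collect the tree hash, parent hashes and message of a commit."""
    headers, _, message = body.decode("utf-8", "replace").partition("\n\n")
    info = {"tree": None, "parents": [], "message": message}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "tree":
            info["tree"] = value
        elif key == "parent":
            info["parents"].append(value)
    return info

# Build a made-up commit object in memory so the sketch runs offline.
body = (b"tree " + b"a" * 40 + b"\n"
        b"parent " + b"b" * 40 + b"\n"
        b"author Jane Doe <jane@example.com> 1438000000 +0200\n"
        b"committer Jane Doe <jane@example.com> 1438000000 +0200\n"
        b"\n"
        b"Example commit\n")
sample = zlib.compress(b"commit %d\x00" % len(body) + body)

obj_type, payload = read_loose_object(sample)
print(obj_type, parse_commit(payload))
```

The tree and parent hashes found this way are the next objects to download.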
We download the tree-object and analyse it:
This tells us which files are stored in that tree. Note that
Finder are also trees (directories). The final step is to download the README.md blob object and cat its content:
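Tree objects are binary, so they need a little more care than commits. After decompression and removal of the "tree size" header, each entry consists of the file mode, a space, the file name, a NUL byte and the 20 raw bytes of the SHA1 hash. A minimal sketch of a parser for that layout follows; parse_tree and is_directory are hypothetical names, and the hashes in the sample are dummy values.

```python
def parse_tree(body):
    """Return a list of (mode, name, sha1_hex) for a tree object body."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\x00", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8", "replace")
        # The hash is stored as 20 raw bytes, not as hex text.
        sha1_hex = body[nul + 1:nul + 21].hex()
        entries.append((mode, name, sha1_hex))
        pos = nul + 21
    return entries

def is_directory(mode):
    # Sub-trees (directories) carry mode 40000; files use e.g. 100644.
    return mode == "40000"

# Made-up tree with one file and one directory.
sample = (b"100644 README.md\x00" + bytes(range(20)) +
          b"40000 Finder\x00" + bytes(range(20, 40)))

for mode, name, sha1_hex in parse_tree(sample):
    kind = "tree" if is_directory(mode) else "blob"
    print(mode, kind, sha1_hex, name)
```

Entries marked as trees are parsed again in the same way, while blobs hold the actual file contents.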
We need to take special care of packed files. We can find a list of all packs in
The appropriate pack file is stored in
In that case, we need to download both files and then run the following command to extract the packed data:
As you can see, by repeating this procedure recursively for every hash we find in the already downloaded files, we can slowly restore the repository and extract its contents.
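The recursive procedure boils down to a simple worklist algorithm. The following Python sketch shows the idea under a few assumptions: fetch is a hypothetical callback that returns the compressed bytes of an object (or None if the server does not have it), packed objects are ignored, and any 40-character hexadecimal string in a commit header is treated as a reference. It is an illustration of the approach, not the implementation of our tools.

```python
import re
import zlib
from collections import deque

SHA1_RE = re.compile(rb"\b[0-9a-f]{40}\b")

def referenced_hashes(obj_type, body):
    """Return the hashes of all objects that `body` points to."""
    if obj_type == b"commit":
        # Only the header block before the first blank line holds references.
        header_block = body.split(b"\n\n", 1)[0]
        return [h.decode("ascii") for h in SHA1_RE.findall(header_block)]
    if obj_type == b"tree":
        found, pos = [], 0
        while pos < len(body):
            nul = body.index(b"\x00", pos)
            found.append(body[nul + 1:nul + 21].hex())
            pos = nul + 21
        return found
    return []  # blobs do not reference other objects

def crawl(start_hashes, fetch):
    """Breadth-first download of every object reachable from start_hashes."""
    objects, missing = {}, set()
    queue = deque(start_hashes)
    while queue:
        sha = queue.popleft()
        if sha in objects or sha in missing:
            continue  # already handled
        compressed = fetch(sha)
        if compressed is None:
            missing.add(sha)
            continue
        header, _, body = zlib.decompress(compressed).partition(b"\x00")
        obj_type = header.split(b" ")[0]
        objects[sha] = (obj_type.decode("ascii"), body)
        queue.extend(referenced_hashes(obj_type, body))
    return objects, missing
```

In practice, fetch would request the object’s path below .git/objects/ from the target server. The missing set lists the objects that could not be retrieved.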
Sometimes downloading a specific object will fail, leaving us with an incomplete repository. In that case, we can use the git fsck command to search for these missing/broken object files.
We’ve released the Python/Bash scripts used for this research on GitHub: https://github.com/internetwache/GitTools
We used three different tools: A tool to discover, one to download and one to extract git repositories.
Preview of the recovery tool:
Scanning Alexa’s Top 1M
After running the ‘Finder’ on the Alexa Top 1M list, we found about 9700 publicly accessible Git repositories. That means that slightly under 1% of these websites are prone to this kind of attack.
Taking a look at the research data, we found that the following business sectors and websites were most affected:
- Big websites of German and US political parties / NGOs and a few governmental websites (.gov)
- MTV-channels and radio stations (>20)
- Online communities (one of them with >6 million users)
- Trading websites (one with bitcoin and many other “banking websites”)
- A very famous privacy online service
- Soccer clubs of the German “Bundesliga”
- Porn websites
- Bigger and smaller online shops
On some websites the accessible git repository seemed to be intentional. These were mostly open source projects where the website’s sourcecode is available online anyway.
The lower a website ranked on the Alexa list, the higher the probability that it was affected by this issue.
Here is an overview of the most prominent top level domains.
It’s interesting to see the distribution of protocols used, especially that unencrypted protocols like git are used.
A lot of vulnerable websites use either GitHub or BitBucket as the remote.
Most common branch names. It’s interesting to see some develop branches here.
On the other hand, we had to hold our breath when we noticed that more than 100 projects used HTTP authentication for server-client communication.
That means that the protocol://user:password@host/repository combination is saved in the .git/config file, giving attackers access to the user’s (or company’s) GitLab instance or GitHub/BitBucket account. With a bit of luck, an attacker gets access to the CI server and can then run malicious code to further compromise your infrastructure.
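To check your own repositories for this problem, a few lines of Python are enough. The sketch below looks for url entries in the text of a .git/config file and reports the ones that embed a password. The function name remotes_with_credentials and the sample configuration (host, user and password) are invented for this example.

```python
import re
from urllib.parse import urlsplit

URL_LINE = re.compile(r"^\s*url\s*=\s*(\S+)\s*$", re.MULTILINE)

def remotes_with_credentials(config_text):
    """Return (host, user) pairs for remote URLs that embed a password."""
    found = []
    for url in URL_LINE.findall(config_text):
        parts = urlsplit(url)
        if parts.password:
            # Report only host and user, never the password itself.
            found.append((parts.hostname, parts.username))
    return found

sample_config = """[core]
\trepositoryformatversion = 0
[remote "origin"]
\turl = https://deploy:s3cret@git.example.com/team/site.git
\tfetch = +refs/heads/*:refs/remotes/origin/*
[remote "backup"]
\turl = git@github.com:example/site.git
"""

print(remotes_with_credentials(sample_config))
```

SSH-style remotes such as the second one carry no password and are therefore not reported.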
Other than that, we’ve found a lot of AWS/Database/SMTP/FTP credentials in some repositories.
How to fix this issue?
First: git is a great tool if you use it right, so not using git is not an option. Instead, you should look at the access rules of your webserver. We have prepared a list of fixes for the issue.
It’s really easy to deny access to
Put that into your
Put that as the first entry in your
server-block in the
First, you need to enable the
After that, we can block access to the
Put that into your
Another approach is to use git’s --work-tree switches to move the git repository out of the document root.
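Whichever fix you choose, it is worth verifying the result from the outside. The following sketch requests /.git/HEAD from a website and checks whether the response has the shape of a real git HEAD file. It is an assumption about how such a check could be done, not necessarily how our ‘Finder’ works, and the names looks_like_git_head and git_folder_exposed are hypothetical.

```python
import re
import urllib.error
import urllib.request

HEAD_RE = re.compile(rb"\A(ref: refs/\S+|[0-9a-f]{40})\s*\Z")

def looks_like_git_head(body):
    """True if `body` has the shape of a .git/HEAD file."""
    return HEAD_RE.match(body) is not None

def git_folder_exposed(base_url, timeout=10):
    """Request <base_url>/.git/HEAD and report whether a HEAD file came back."""
    url = base_url.rstrip("/") + "/.git/HEAD"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # A real HEAD file is tiny, so 200 bytes are plenty.
            return looks_like_git_head(response.read(200))
    except (urllib.error.URLError, OSError):
        # 403/404 responses and connection problems both end up here,
        # so False can also mean that the site was unreachable.
        return False

print(looks_like_git_head(b"ref: refs/heads/master\n"))
print(looks_like_git_head(b"<html><body>Not found</body></html>"))
```

Calling git_folder_exposed with the address of your own website should return False once access to the folder is denied. Only run it against websites you are responsible for.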
There are a bunch of famous websites which do not deny access to the /.git/ folder, so anyone may download their sourcecode and possibly other sensitive data. This issue isn’t hard to mitigate, so take a minute to make sure that your webserver isn’t misconfigured.
Stay safe, the team of internetwache.org