Don't publicly expose .git or how we downloaded your website's sourcecode - An analysis of Alexa's 1M

Sebastian participated in a CTF (capture the flag) a couple of months ago. One challenge he faced was the task of restoring a git repository from a webserver with directory listing enabled. With directory listing, it was pretty easy, but Sebastian was curious whether it’s possible to restore git repositories without directory listing and how common this misconfiguration is.

With that idea in mind, we began to develop some tiny tools and started to do some research. The results were not as bad as anticipated, but nevertheless surprising.

TL;DR

Some websites host their version control repository (e.g. .git/) in production. Bad people can use tools to download/restore the repository to gain access to your website’s sourcecode. Check your webserver’s configuration now and make sure that it blocks access to these folders.

What is a version control system?

Decades ago, programmers faced a serious problem: developing a tool together, often remotely. Version control systems (VCS) were created to solve it. Their primary task is to make distributed work on one codebase possible, which they achieve by keeping track of each code change (often called a commit). One well-known VCS is git, originally created by Linus Torvalds. You have probably also heard of the online code hosting platform github.com, which offers to host your git repository.

During our research, we focused on git. There are plenty of other VCS which we’re not covering here, but the same issue may apply to them too.

Why is hosting your VCS in production bad?

When deploying a web application, some administrators simply clone the repository. Most version control systems create a meta/tracking folder in the root directory of the project. For example:

  • git creates a .git/ folder containing a full copy of the repository.
  • Others may follow a similar approach (e.g. SVN)

You can probably guess what attackers can do if you do not deny access to these repository folders. They almost certainly contain your website’s sourcecode and all previous revisions, and sometimes also configuration files with sensitive information. This gives attackers a kick-start for hacking your website, because they can use the sourcecode to find more severe security issues.

Downloading the website’s sourcecode

So let’s get to the interesting part - how do we download and restore the aforementioned repositories to get access to the website’s sourcecode? Basically there are two ways to do it:

  • Easy way, if webserver has directory listing enabled
  • Hard way, otherwise

As mentioned before, most version control systems manage the repository in a lot of small files (objects). The filenames are often the result of a hash function, so guessing them is hard. We need to find a way to obtain as many of those files as possible.

The easy way

First of all, it’s considered bad practice to have directory listing enabled on your production server. If you have, stop reading and fix that first :) - but that’s not enough to stop the attacker. (see “The hard way”)

Directory-listing helps the attacker a lot, because all he has to do is to issue one command to download all files.

View of a .git directory with directory listing

It’s enough to run the following wget command:

wget --mirror -I .git TARGET.COM/.git/

After the download has finished, switch into the new folder. A fancy shell (e.g. fish or zsh) should tell you that you’re in a git-tracked directory (see the ‘master’-branch hint).

/tmp/test/ttrss.me (0) (master)
> /usr/bin/ls -a
.
..
.git

Running git status shows only deleted files, because we have only downloaded the .git folder.

/tmp/test/ttrss.me (0) (master)
> git status | head -n 10 
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        deleted:       .buildpath
        deleted:       .gitignore
        deleted:       .htaccess
        deleted:       .project

After running git checkout -- . to reset the repository, we recovered all files.

/tmp/test/ttrss.me (0) (master)
> /usr/bin/ls | head -n 10
api
atom-to-html.xsl
backend.php
cache
classes
config.php-dist
css
errors.php
feed-icons
image.php

We have successfully obtained a copy of the website’s sourcecode.

The hard way

As mentioned before, our intention was to see if there’s a way to download the repository without directory listing. For this, you have to dig a bit into the git internals to understand how git manages the repository.

We won’t go into too much detail here (we recommend the appropriate chapters on git-scm.com/book for interested readers), but basically there are three kinds of objects in a git repository:

  • Blob - The actual data (e.g. sourcecode)
  • Tree - Grouping blobs together
  • Commit - A specific state of a tree with more meta information (e.g. author/date/message)

All these together are used by git under the hood to maintain the repository. However, the problem that we face is that these objects are stored as .git/objects/[First-2-characters]/[Last-38-characters] files, where [First-2-characters][Last-38-characters] is the 40-character hexadecimal SHA1-hash of the object. We need to be smart and guess/extract the filenames of all objects to completely restore the repository, because brute forcing the SHA1 keyspace would be far too time consuming.
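
To make the path layout concrete, here is a minimal Python sketch that turns a hash into the path we would request from the server. The function name `loose_object_path` is our own choice for illustration and is not taken from any tool.

```python
import re

# A SHA-1 in hexadecimal notation is exactly 40 lowercase hex characters.
SHA1_HEX = re.compile(r"[0-9a-f]{40}")


def loose_object_path(sha1_hex: str) -> str:
    """Return the path of a loose object, relative to the web root."""
    if not SHA1_HEX.fullmatch(sha1_hex):
        raise ValueError(f"not a 40-character hex SHA-1: {sha1_hex!r}")
    # The first two hex characters name the directory, the other 38 the file.
    return f".git/objects/{sha1_hex[:2]}/{sha1_hex[2:]}"


print(loose_object_path("6916ae52c0b20b04569c262275d27422fc4fcd34"))
```

Validating the input first keeps garbage strings found while parsing from being turned into requests.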

What helps us a lot is the fact that there are some standard files in a git repository:

  • HEAD
  • objects/info/packs
  • description
  • config
  • COMMIT_EDITMSG
  • index
  • packed-refs
  • refs/heads/master
  • refs/remotes/origin/HEAD
  • refs/stash
  • logs/HEAD
  • logs/refs/heads/master
  • logs/refs/remotes/origin/HEAD
  • info/refs
  • info/exclude

These files either refer to an object by its hash or to another file that references an object, and so on. Thus the easiest way is to start by downloading and parsing the aforementioned files. The hashes we find in them tell us which object files to download next.
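
Most of these files are plain text, so a simple regular expression is enough to pull candidate hashes out of them. The following is a rough sketch and not the code of our tools: `find_hashes` is a hypothetical helper, and the sample reflog line is only assembled from the hashes used in this post for illustration.

```python
import re

# 40 hex characters surrounded by word boundaries.
HASH_RE = re.compile(r"\b[0-9a-f]{40}\b")
# The all-zero hash marks "no previous value" in reflogs; it is not an object.
NULL_HASH = "0" * 40


def find_hashes(text: str) -> list[str]:
    """Return all distinct object hashes in text, in order of appearance."""
    found = []
    for candidate in HASH_RE.findall(text):
        if candidate != NULL_HASH and candidate not in found:
            found.append(candidate)
    return found


# Illustrative logs/HEAD content (old hash, new hash, identity, message).
sample_reflog = (
    "0000000000000000000000000000000000000000 "
    "9264d57c621f66208d689ef653ce8a62c3bccfae XY <foo@bar> 1429391000 +0200\tcommit (initial): init\n"
    "9264d57c621f66208d689ef653ce8a62c3bccfae "
    "6916ae52c0b20b04569c262275d27422fc4fcd34 XY <foo@bar> 1429391394 +0200\tcommit: Added another readme file\n"
)
print(find_hashes(sample_reflog))
```

A regex like this will occasionally pick up strings that are not object hashes. That is harmless here, because the download of such an "object" simply fails.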

So for example, we have downloaded the refs/heads/master file:

> cat .git/refs/heads/master 
6916ae52c0b20b04569c262275d27422fc4fcd34

The reference master points to a commit with the hash 6916ae52c0b20b04569c262275d27422fc4fcd34. After downloading the commit-object from the server (note the url should be .git/objects/69/16ae52c0b20b04569c262275d27422fc4fcd34), we can analyse it further:

> git cat-file -t 6916ae52c0b20b04569c262275d27422fc4fcd34 
commit

This tells us that the downloaded object is indeed a commit. Let’s get some details about it:

> git cat-file -p 6916ae52c0b20b04569c262275d27422fc4fcd34 
tree fa3887a0b798346c122afdd7c5ecc605bf3c18c0
parent 9264d57c621f66208d689ef653ce8a62c3bccfae
author XY <foo@bar> 1429391394 +0200
committer XY <foo@bar> 1429391394 +0200

Added another readme file

Okay, now we know the hash of the related tree and parent object as well as some information about the author, the committer and the commit message.
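
If you want to do the same without the git binary: a loose object is a zlib-compressed byte string consisting of a header (`<type> <size>`, terminated by a NUL byte) followed by the content. The Python sketch below parses such an object and pulls the tree and parent hashes out of a commit. The helper names are our own, and the demo object is rebuilt in memory from the output shown above, so it is not byte-identical to the real commit.

```python
import zlib


def parse_loose_object(raw: bytes) -> tuple[str, bytes]:
    """Decompress a loose object and split it into (type, content)."""
    data = zlib.decompress(raw)
    header, _, body = data.partition(b"\x00")
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match content length")
    return obj_type.decode(), body


def commit_links(body: bytes) -> dict[str, list[str]]:
    """Collect the tree and parent hashes from a commit's header lines."""
    links = {"tree": [], "parent": []}
    for line in body.decode(errors="replace").split("\n"):
        if line == "":
            break  # a blank line separates the headers from the message
        key, _, value = line.partition(" ")
        if key in links:
            links[key].append(value)
    return links


# Demo: an in-memory stand-in for a downloaded commit object.
demo_body = (
    b"tree fa3887a0b798346c122afdd7c5ecc605bf3c18c0\n"
    b"parent 9264d57c621f66208d689ef653ce8a62c3bccfae\n"
    b"author XY <foo@bar> 1429391394 +0200\n"
    b"committer XY <foo@bar> 1429391394 +0200\n"
    b"\n"
    b"Added another readme file\n"
)
demo_raw = zlib.compress(b"commit %d\x00" % len(demo_body) + demo_body)

obj_type, body = parse_loose_object(demo_raw)
print(obj_type, commit_links(body))
```

A merge commit has several parent lines, which is why the parents are collected in a list.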

We download the tree-object and analyse it:

> git cat-file -p fa3887a0b798346c122afdd7c5ecc605bf3c18c0
040000 tree 532fc6055e09e0a2d5602f4b84c0dbadce1b5f3e        Dumper
040000 tree 077ce769dedcf19d0f063246256e8ae0394fd8df        Extractor
040000 tree d6e1bd4677a256e760cce5ddaa7db7ea6f9a8900        Finder
100644 blob 9670cf17dfeec351c395493058044b9f9dadbe2a        README.md

This tells us which files are stored in that tree. Note that Dumper, Extractor and Finder are also trees (directories). The final step is to download the README.md blob object and cat its content:

> git cat-file -p 9670cf17dfeec351c395493058044b9f9dadbe2a
Git Tools
=============
[...]
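
Tree objects are the only ones that are not plain text after decompression: each entry is stored as `<mode> <name>`, a NUL byte, and then the 20 raw bytes of the hash. As a sketch (with `parse_tree` being a name we made up), parsing the decompressed content of a tree could look like this. The demo input is rebuilt from the listing above. Note that git stores the directory mode as 40000, while git cat-file -p prints it padded as 040000.

```python
def parse_tree(body: bytes) -> list[tuple[str, str, str]]:
    """Parse decompressed tree content into (mode, name, hex hash) tuples."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)     # end of the mode
        nul = body.index(b"\x00", space)  # end of the file name
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode(errors="replace")
        sha = body[nul + 1:nul + 21].hex()  # 20 raw bytes -> 40 hex characters
        entries.append((mode, name, sha))
        pos = nul + 21
    return entries


# Demo: rebuild the binary form of the tree shown above.
listing = [
    ("40000", "Dumper", "532fc6055e09e0a2d5602f4b84c0dbadce1b5f3e"),
    ("40000", "Extractor", "077ce769dedcf19d0f063246256e8ae0394fd8df"),
    ("40000", "Finder", "d6e1bd4677a256e760cce5ddaa7db7ea6f9a8900"),
    ("100644", "README.md", "9670cf17dfeec351c395493058044b9f9dadbe2a"),
]
tree_body = b"".join(
    f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(sha)
    for mode, name, sha in listing
)

for mode, name, sha in parse_tree(tree_body):
    kind = "tree" if mode == "40000" else "blob"
    print(mode, kind, sha, name)
```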

We need to take special care of packed files, because objects stored in a pack cannot be downloaded one by one. We can find a list of all packs in .git/objects/info/packs:

> cat .git/objects/info/packs 
P pack-e38660e6be24bb79d8d929ddea3d194e0dd3cd13.pack

The appropriate pack file is stored in .git/objects/pack/:

> /usr/bin/ls .git/objects/pack/
pack-e38660e6be24bb79d8d929ddea3d194e0dd3cd13.idx
pack-e38660e6be24bb79d8d929ddea3d194e0dd3cd13.pack

In that case, we need to download both files and then run the following command to extract the packed data:

> git unpack-objects -r < .git/objects/pack/pack-e38660e6be24bb79d8d929ddea3d194e0dd3cd13.pack
Unpacking objects: 100% (15/15), done.
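
Since the pack names cannot be guessed, the list in objects/info/packs is what makes them downloadable. A small sketch of turning that list into the two paths per pack, with `pack_paths` being a hypothetical helper name:

```python
def pack_paths(info_packs: str) -> list[str]:
    """Return the .pack and .idx paths for every pack listed in info/packs."""
    paths = []
    for line in info_packs.splitlines():
        # Lines describing a pack start with "P " followed by the file name.
        if line.startswith("P "):
            stem = line[2:].strip().removesuffix(".pack")
            paths.append(f".git/objects/pack/{stem}.pack")
            paths.append(f".git/objects/pack/{stem}.idx")
    return paths


sample = "P pack-e38660e6be24bb79d8d929ddea3d194e0dd3cd13.pack\n\n"
for path in pack_paths(sample):
    print(path)
```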

As you can see, by repeating this procedure recursively for every hash that we find in the already downloaded files, we can slowly restore the repository and extract its contents.
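
The whole procedure boils down to a worklist: take a hash, fetch the object, look for new hashes in it, repeat. The sketch below illustrates that loop in Python. It is not the code of our tools: all names are made up for this example, the `fetch` callable is assumed to return the raw bytes for a path or None on failure, and a dictionary plays the role of the webserver so that the example runs without network access.

```python
import hashlib
import re
import zlib
from collections import deque

HASH_RE = re.compile(rb"\b[0-9a-f]{40}\b")


def object_path(sha):
    return f".git/objects/{sha[:2]}/{sha[2:]}"


def referenced_hashes(raw):
    """Return the hashes that a loose object (compressed bytes) points to."""
    data = zlib.decompress(raw)
    header, _, body = data.partition(b"\x00")
    kind = header.split(b" ")[0]
    if kind == b"commit":
        # Tree and parent hashes are hex text in the header lines.
        head = body.split(b"\n\n", 1)[0]
        return [h.decode() for h in HASH_RE.findall(head)]
    if kind == b"tree":
        hashes, pos = [], 0
        while pos < len(body):
            nul = body.index(b"\x00", pos)
            hashes.append(body[nul + 1:nul + 21].hex())
            pos = nul + 21
        return hashes
    return []  # blobs reference nothing; tags are ignored in this sketch


def crawl(fetch, start_hashes):
    """Fetch objects breadth-first; return (downloaded, missing)."""
    queue = deque(start_hashes)
    store, missing = {}, set()
    while queue:
        sha = queue.popleft()
        if sha in store or sha in missing:
            continue
        raw = fetch(object_path(sha))
        if raw is None:
            missing.add(sha)
            continue
        store[sha] = raw
        queue.extend(referenced_hashes(raw))
    return store, missing


def make_object(kind, body):
    """Build a loose object in memory; returns (hash, compressed bytes)."""
    data = kind.encode() + b" " + str(len(body)).encode() + b"\x00" + body
    return hashlib.sha1(data).hexdigest(), zlib.compress(data)


# Demo: a tiny fake repository served from a dictionary.
blob_sha, blob = make_object("blob", b"Git Tools\n")
tree_sha, tree = make_object("tree", b"100644 README.md\x00" + bytes.fromhex(blob_sha))
commit_sha, commit = make_object(
    "commit",
    f"tree {tree_sha}\nauthor XY <foo@bar> 1429391394 +0200\n"
    f"committer XY <foo@bar> 1429391394 +0200\n\nAdded another readme file\n".encode(),
)
server = {object_path(s): r for s, r in [(blob_sha, blob), (tree_sha, tree), (commit_sha, commit)]}

store, missing = crawl(server.get, [commit_sha])
print(len(store), "objects downloaded,", len(missing), "missing")
```

In a real run, `fetch` would issue HTTP requests and the start hashes would come from the standard files listed earlier. Keeping track of the missing hashes shows how incomplete the restored repository is.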

Sometimes downloading a specific object will fail, leaving us with an incomplete repository. In that case, we can use the git fsck command to search for these missing/broken object files.
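
git fsck reports objects it cannot find as lines of the form `missing <type> <hash>`, so its output can be fed back into the download step. A small sketch of extracting those lines, where `missing_objects` is a made-up helper and the sample text is illustrative rather than captured from a real run:

```python
import re

MISSING_RE = re.compile(
    r"^missing (blob|tree|commit|tag) ([0-9a-f]{40})$", re.MULTILINE
)


def missing_objects(fsck_output: str) -> list[tuple[str, str]]:
    """Return (type, hash) pairs for every object reported as missing."""
    return MISSING_RE.findall(fsck_output)


# Illustrative output only; other kinds of fsck lines are ignored.
sample_output = (
    "missing blob 9670cf17dfeec351c395493058044b9f9dadbe2a\n"
    "dangling commit 9264d57c621f66208d689ef653ce8a62c3bccfae\n"
    "missing tree d6e1bd4677a256e760cce5ddaa7db7ea6f9a8900\n"
)
for obj_type, sha in missing_objects(sample_output):
    print(obj_type, sha)
```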

Tools

We’ve released the python/bash scripts used for this research on GitHub: https://github.com/internetwache/GitTools

We used three different tools: A tool to discover, one to download and one to extract git repositories.

Preview of the recovery tool:

Scanning Alexa’s Top 1M

After running the ‘Finder’ on the Alexa Top 1M list, we found about 9700 publicly accessible Git repositories - that means that just under 1% of the listed websites are prone to this kind of attack.

Taking a look at the research data, we found that the following business sectors and websites were most affected:
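
A scan like this needs a cheap check to tell a real repository from a custom error page that answers every request with 200 OK. One possible check, shown here as a sketch and not necessarily identical to what our ‘Finder’ does, is to request /.git/HEAD and test whether the body looks like a HEAD file: either a symbolic reference or a bare hash. The function name is hypothetical.

```python
import re

# HEAD holds either "ref: refs/..." or, when detached, a 40-character hash.
HEAD_RE = re.compile(r"(ref: refs/\S+|[0-9a-f]{40})\s*")


def looks_like_git_head(body: str) -> bool:
    """Return True if body has the shape of a .git/HEAD file."""
    return HEAD_RE.fullmatch(body) is not None


print(looks_like_git_head("ref: refs/heads/master\n"))
print(looks_like_git_head("<html><body>Not found</body></html>"))
```

Checking the content rather than the status code avoids counting the many sites that serve an HTML page for every URL.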

  • Big websites of German and US political parties / NGOs and a few governmental websites (.gov)
  • MTV-channels and radio stations (>20)
  • Online communities (one of them with >6 million users)
  • Trading websites (one with bitcoin and many other “banking websites”)
  • A very famous privacy online service
  • Soccer clubs of the German “Bundesliga”
  • Porn websites
  • Bigger and smaller online shops

It seemed like an accessible git repository was intended on some websites - mostly open source projects where the website’s sourcecode is available online.

The further down the Alexa ranking we went, the higher the probability of finding a website affected by this issue.

Alexa rank and cumulated vulnerable websites

Here is an overview of the most prominent top level domains.

TLDs of affected websites

It’s interesting to see the distribution of protocols used, especially that unencrypted protocols like http or git are still in use.

List of protocols used

A lot of vulnerable websites use either GitHub or BitBucket as the remote.

Most used repository hosting service

Most common branch names. It’s interesting to see some dev / develop branches here.

List with used branch names

On the other hand, we had to hold our breath when we noticed that more than 100 projects used HTTP authentication for server-client communication. That means that the protocol://user:password@host/repository combination is saved in the .git/config file, giving attackers access to the user’s (or company’s) GitLab instance or GitHub/BitBucket account. With a bit of luck, an attacker gets access to the CI server and then runs malicious code to further compromise the infrastructure.
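
To check your own repositories for this, you can look for remote URLs with an embedded password. The Python sketch below does that for the text of a .git/config file. The helper name and the sample configuration (host, user and password) are made up for illustration.

```python
import re
from urllib.parse import urlsplit

# Matches "url = <value>" lines as they appear in remote sections.
URL_RE = re.compile(r"^\s*url\s*=\s*(\S+)\s*$", re.MULTILINE)


def remotes_with_credentials(config_text: str) -> list[tuple[str, str, str]]:
    """Return (scheme, user, host) for every remote URL carrying a password."""
    hits = []
    for url in URL_RE.findall(config_text):
        parts = urlsplit(url)
        if parts.password:
            # The password itself is deliberately not returned.
            hits.append((parts.scheme, parts.username, parts.hostname))
    return hits


sample_config = """[core]
\trepositoryformatversion = 0
[remote "origin"]
\turl = https://deploy:s3cret@git.example.com/shop/website.git
\tfetch = +refs/heads/*:refs/remotes/origin/*
[remote "mirror"]
\turl = git@github.com:example/website.git
"""
print(remotes_with_credentials(sample_config))
```

SSH-style remotes such as the second one carry no password and are not reported.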

Other than that, we’ve found a lot of AWS/Database/SMTP/FTP credentials in some repositories.

How to fix this issue?

First: Git is a great tool - if you use it right. So giving up on Git is not the answer. Instead, look at the access rules of your webserver - we have prepared a list of configurations that fix the issue.

It’s really easy to deny access to .git folders:

Apache

For 2.2:

<DirectoryMatch "^/.*/\.git/">
    Order deny,allow
    Deny from all
</DirectoryMatch>

For 2.4:

<DirectoryMatch ".git*">
    Require all denied
</DirectoryMatch>

Put that into your httpd.conf.

Nginx

location ~ /\.git/ {
      deny all;
}

Put that as the first entry in your server-block in the nginx.conf file.

Lighttpd

First, you need to enable the mod_access module:

server.modules += ( "mod_access" )

After that, we can block access to the .git folder:

$HTTP["url"] =~ "^/\.git/" {
     url.access-deny = ("")
}

Put that into your lighttpd.conf.

Another approach is to use git’s --git-dir and --work-tree switches to move the git repository out of the document root.

Conclusion

There are a bunch of famous websites which do not deny access to the /.git/ folder - anyone may download their sourcecode and possibly other sensitive data. This issue isn’t hard to mitigate, so take a minute to make sure that your webserver isn’t misconfigured.

Stay safe, the team of internetwache.org
