Yes, it's well known that big companies with big continuously integrated codebases don't manage the entire codebase with Git. It's slow, and splitting repositories means you can't have company-wide atomic commits. It's convenient to have a bunch of separate projects that share no state or code, but also wasteful.
So often, the tool used to manage the central repository, which needs to cleanly handle a large codebase, is different from the tool developers use for day-to-day work, which only needs to handle a small subset. At Google, everything is in Perforce, but since I personally need only four or five projects from Perforce for my work, I mirror that to git and interact with git on a day-to-day basis. This model seems to scale fairly well; Google has a big codebase with a lot of reuse, but all my git operations execute instantaneously.
Many projects can "shard" their code across repositories, but this is usually an unhappy compromise.
People always use the Linux kernel as an example of a big project, but even as open source projects go, it's pretty tiny. Compare the entire CPAN to Linux, for example. It's nice that I can update CPAN modules one at a time, but it would be nicer if I could fix a bug in my module and all modules that depend on it in one commit. But I can't, because CPAN is sharded across many different developers and repositories. This makes working on one module fast but working on a subset of modules impossible.
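To make the CPAN point concrete, here is a minimal sketch of how far a single bug fix can reach. It walks a dependency graph backwards to find every module that would have to change alongside the fixed one. The module names and the `depends_on` mapping are made up for illustration; nothing here reads real CPAN metadata.

```python
from collections import deque


def affected_modules(depends_on, changed):
    """Return the changed module plus everything that depends on it,
    directly or transitively.

    depends_on maps each module name to the set of modules it uses.
    """
    # Invert the edges: for each module, record who depends on it.
    dependents = {}
    for module, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(module)

    # Breadth-first walk over the inverted graph, starting at the fix.
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for module in dependents.get(current, ()):
            if module not in seen:
                seen.add(module)
                queue.append(module)
    return seen


if __name__ == "__main__":
    # Hypothetical modules, each imagined as living in its own repository.
    graph = {
        "My::Module": set(),
        "App::A": {"My::Module"},
        "App::B": {"App::A"},
        "Unrelated::C": set(),
    }
    print(sorted(affected_modules(graph, "My::Module")))
```

In a single repository, every module in that set can be updated in one commit. When each module lives in its own repository, the same fix becomes one commit per repository, with no way to land them atomically.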
So really, Facebook is not being ridiculous here. Many companies have the same problem and decide not to handle it at all. Facebook realizes they want great developer tools and continuous integration across all their projects. And Git just doesn't work for that.
> At Google, everything is in Perforce, but since I personally need only four or five projects from Perforce for my work, I mirror that to git and interact with git on a day-to-day basis.
At MS we also use Perforce (aka Source Depot), and I've toyed with the idea of doing something similar. Have you found any guides for "gotchas" or care to share what you've learned going this route?
I used git-p4 at my last job, and the only thing that ever got weird was p4 branches. At Google we have an internal tool that's similar to git-p4, and it always works perfectly for me. Enough developers use it that most of the internal tools understand that a working copy could be a git repository instead of a p4 client.
So if you're planning on doing this at your own company, my advice is to write your own scripts that make whatever conventions you have automatic, and to move everyone over at the same time. That way, you won't be the weird one whose stuff is always broken.
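For what it's worth, a wrapper like that can start out very small. Below is a rough sketch of one built on the public git-p4 tool (not Google's internal one, which I can't show). It only builds the command lines for mirroring a handful of depot paths. The depot paths, the `work` directory, and the function names are placeholders I made up; `git p4 clone` and the `@all` revision suffix are standard git-p4.

```python
import posixpath
import shlex

# Hypothetical depot paths: replace with the few projects you actually need.
PROJECTS = ["//depot/project_a", "//depot/project_b"]


def clone_command(depot_path, dest, full_history=False):
    """Build the `git p4 clone` command line for one depot path."""
    # "@all" asks git-p4 to import every changelist rather than only
    # the current head revision.
    source = depot_path + "@all" if full_history else depot_path
    return ["git", "p4", "clone", source, dest]


def plan_mirror(projects, workdir, full_history=False):
    """Return one clone command per project, each into its own directory."""
    commands = []
    for depot_path in projects:
        # Name the local directory after the last depot path component.
        name = posixpath.basename(depot_path.rstrip("/"))
        dest = posixpath.join(workdir, name)
        commands.append(clone_command(depot_path, dest, full_history))
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for command in plan_mirror(PROJECTS, "work"):
        print(shlex.join(command))
```

Once the dry-run output looks right, the same lists can be passed to `subprocess.run`. Day-to-day updates and submits would then go through `git p4 rebase` and `git p4 submit` inside each mirrored directory, and this is the layer where your local conventions (branch naming, review hooks, and so on) would get baked in.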
I think most people got burned by cvs2svn and git-svn and think that using two version control systems at once is intrinsically broken. It's not. svn was just too weird to translate to or from. (People who skipped svn and went right from cvs to git had almost no problems, I'm told.)
Eric Raymond talks about the problems of converting svn repos to git and is promising a new release of reposurgeon soon that handles svn well. http://esr.ibiblio.org/?p=4071
Facebook uses Subversion for its trunk, actually, and just gets developers to use git-svn. This issue is primarily a problem because git-svn is a lot more serious about replicating the true git experience (keep everything local) than Google's p4-git wrapper is. They really just need to be a little less religious about keeping everything local.
> Yes, it's well known that big companies with big continuously integrated codebases don't manage the entire codebase with Git. It's slow, and splitting repositories means you can't have company-wide atomic commits. It's convenient to have a bunch of separate projects that share no state or code,
Can you expand on this? I would love to talk more about the "well known" part; I've never run across it before. I am a maintainer (tools guy, actually) of an hg repo with about 120 subrepos, and we're not thrilled with the whole subrepo approach. Oh, and if you want to communicate via email, I'd be up for that too.