Git partial clone lets you fetch only the large file you need (about.gitlab.com)
229 points by moyer on March 14, 2020 | hide | past | favorite | 86 comments


There is one more piece of the puzzle needed to make git perfect for every use case I can think of: store large files as a list of blobs, broken down by some rolling hash a la rsync/borg/bup.

That would e.g. make it reasonable to check virtual machine images or ISO images into a repository. Extra storage (and by extension, network bandwidth) would be proportional to the size of the change.

git has delta compression for text as an optimization, but it's not used on big binary files, and it isn't even online (it only happens when making a pack). This would provide it online for large files.

Junio posted a patch that did that ages ago, but it was pushed back until after the sha1->sha256 transition.


Do ISOs and other large blob types support only partial (block) modification? Wouldn't all subsequent blocks change too?


Sometimes they do - e.g. if you replace a file in the ISO that is the same size up to block alignment, which is common when e.g. editing a text file or recompiling an executable with a minor change. They almost always do when it's a VM image representing a disk - only some blocks change every write.

However, with self-synchronizing hashes of the kind used by rsync, bup, and borg, it doesn't matter: you could have a 1TB file, delete a single byte at position 100, and you would only need to store or transfer one new block (with average size 8KB for rsync, configurable for borg) if you already have a copy of the version before the change.

It's somewhat comparable to diff/patch, but not exactly. It's worse in that change granularity is only guaranteed on average; it's better in that it works well on binary files, doesn't require a specific reference version to diff against (it can reference all previous history), and efficiently supports reordering as well as small changes. If you divide a 4000-line text file into four 1000-line sections and reorder them 1,2,3,4 -> 3,1,4,2, you will find the diff/patch to be as long as a new copy, whereas a self-synchronizing hash decomposition will hardly take any space for the reordered file given the original.
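A toy sketch of that resynchronization property, in Python. The window size, mask, and adler32-based boundary test here are illustrative only; bup and borg use different rolling hashes and parameters:

```python
import hashlib
import zlib

def chunks(data: bytes, window: int = 16, mask: int = 0x3FF):
    """Content-defined chunking: cut wherever a hash of the trailing
    `window` bytes matches a fixed bit pattern. Boundaries depend only
    on nearby bytes, not absolute offsets, so they re-align after edits."""
    out, start = [], 0
    for i in range(window, len(data)):
        if zlib.adler32(data[i - window:i]) & mask == mask:
            out.append(data[start:i])
            start = i
    out.append(data[start:])
    return out

# 128 KiB of deterministic pseudo-random data
data = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(4096))

v1 = chunks(data)
v2 = chunks(data[:100] + data[101:])  # delete one byte near the front

hashes_v1 = {hashlib.sha256(c).digest() for c in v1}
hashes_v2 = {hashlib.sha256(c).digest() for c in v2}
# only the chunk(s) around the edit are new; later boundaries shift with
# the content, so later chunks hash identically and need no new storage
```

Because a boundary is declared wherever the local bytes hit the bit pattern, an insertion or deletion only disturbs the chunks it touches; everything downstream re-synchronizes.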


Oh, I used rsync many times but I thought it simply retransmits changed files. (Oh, it needs the --checksum argument to do this, okay.)

So how do these self-synchronizing hashes work? Like a Merkle Tree? (Ah, okay https://en.wikipedia.org/wiki/Rsync#Determining_which_parts_... )

So rsync uses 8KB chunks, so a 1GB file has about 125,000 of them. (And if every chunk needs 16 bytes of hash data to send, that's about 2MB, pretty darn efficient, especially if it can spot reorders.) Though according to Wikipedia it only does this if the target file has the same size, so adding new files to ISOs might not work in rsync's case, but still, the possibility is there for diff algos and version control systems.


No, target doesn’t have to be same size. As an optimization, if size and datetime are the same, rsync will assume no change and will not hash at all (though you can force it to).

But it will definitely use hashes when size differs (unless forced to copy whole files, or copying between local file systems)
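The weak checksum rsync slides along the file can be sketched like this (a simplified version of the two-part rolling sum; the real implementation differs in constants and details):

```python
def weak_sum(block: bytes) -> int:
    """rsync-style weak checksum: two 16-bit running sums packed into 32 bits."""
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * x for i, x in enumerate(block)) & 0xFFFF
    return (b << 16) | a

def roll(prev: int, out_byte: int, in_byte: int, blocklen: int) -> int:
    """Slide the window one byte to the right in O(1) instead of O(blocklen)."""
    a = ((prev & 0xFFFF) - out_byte + in_byte) & 0xFFFF
    b = ((prev >> 16) - blocklen * out_byte + a) & 0xFFFF
    return (b << 16) | a

data = bytes(range(256)) * 16  # 4 KiB of sample data
s = weak_sum(data[:1024])
for k in range(1, 512):  # slide the 1 KiB window 511 bytes to the right
    s = roll(s, data[k - 1], data[k + 1023], 1024)
```

The O(1) slide is what makes it cheap to test a match at every byte offset; candidate matches are then confirmed with a strong hash.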


It really depends on the type of file. ("Other large blob types" is a rather broad category.)

One obvious example where you could have a lot of common blocks (even following the offset where a change was made) is zip files. The zip format basically compresses each file individually and then concatenates all that together.

Let's say you have a build and it packages the results up as a big zip file. (Java builds often do this. A jar is a special type of zip file.) If you change a few source files and rebuild, and if your build is deterministic (and/or incremental), then the new zip file will contain a lot of the same stuff as the previous version. And if your zip archiver is deterministic (pretty safe assumption), it should produce a zip file that is mostly the same sequences of bytes as the previous zip file, even if there are changed files in the middle.

If you write a .tar.gz archive, then one change in the middle will throw everything off from that point on, because it compresses the whole archive instead of individual files. In theory a binary diff can work around this by first undoing the gzip that was done to create each large blob, then doing a binary diff on that, and then arranging to be able to recreate what gzip did. Obviously that's messy.
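That contrast can be sketched with zlib standing in for the two container styles (headers and framing are omitted, and the file contents are made up; this doesn't change the point):

```python
import zlib

files_v1 = [b"alpha" * 800, b"bravo" * 800, b"charlie" * 800]
files_v2 = [b"alpha" * 800, b"brXvo" + b"bravo" * 799, b"charlie" * 800]  # one file edited

# .tar.gz-style: one compression stream over the whole archive, so bytes
# after the edit point generally stop matching the old version
solid_v1 = zlib.compress(b"".join(files_v1))
solid_v2 = zlib.compress(b"".join(files_v2))

# zip-style: each member compressed independently, so untouched members
# come out byte-identical and can dedup/delta against the old archive
members_v1 = [zlib.compress(f) for f in files_v1]
members_v2 = [zlib.compress(f) for f in files_v2]
```

With per-member compression, only the edited member's compressed bytes change; the solid stream offers no such guarantee.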

Of course, not every file is an archive. Some are filesystems. But any writable filesystem (notably not including ISOs) that is capable of being used on a hard disk will of necessity not rewrite everything. If it did, changing one file on a filesystem would take hours because the rest of the partition would have to be rewritten.

Another obvious type of big blob is multimedia. I don't know a lot of specifics, but I would think file formats meant for editors would keep changes localized for reducing IO (for example, so that changes in a non-linear video editor don't need to write a giant file), but formats meant for export and delivery might change the whole file since they're aiming for small size.


So ZIPs don't have any "global" directory thing? :o


They don't have a global compression dictionary thing.

A similar effect can be achieved with gzip --rsyncable, which IIRC resets the dictionary based on a rolling sum.
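The dictionary-reset idea can be imitated in a few lines with zlib's full flush. This is a sketch of the mechanism only, not of gzip --rsyncable's actual trigger, which picks its flush points from a rolling sum of the input:

```python
import zlib

def compress_with_resets(parts):
    """Compress a sequence of segments into one zlib stream, doing a full
    flush between segments so the compressor forgets earlier history."""
    c = zlib.compressobj()
    out = [c.compress(p) + c.flush(zlib.Z_FULL_FLUSH) for p in parts]
    out.append(c.flush())  # stream trailer
    return out

v1 = compress_with_resets([b"north" * 500, b"south" * 500, b"east" * 500])
v2 = compress_with_resets([b"north" * 500, b"sXuth" * 500, b"east" * 500])
# segments before AND after the changed one compress to identical bytes,
# which is what lets rsync-style tools resynchronize on such streams
```

The whole thing is still one valid zlib stream; the flushes just cost a little compression ratio in exchange for the resync points.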


They have a non-essential copy of the directory at the end, for speed; tools exist to rebuild it from the entries inside the file if it is corrupted. But it is usually very small (the only real-life exception I've met is the HVSC archive, where the directory size is very significant, so they zip it again).


Has anyone used Git submodules to isolate large binary assets into their own repos? Seems like the obvious solution to me. You already get fine-grained control over which submodules you initialize. And, unlike Git LFS, it might be something you’re already using for other reasons.


Using submodules requires that everyone on your team has at least a vague idea of what's going on and how not to foot-gun themselves. That's hard enough with git itself. I don't think I've ever seen submodules used without becoming a major pain point.


That is a straight up nonstarter.

Someone was trying to talk me into git subtrees though...


The problem with git submodules is they can't be used like a hyperlink to another repository. Updating the submodule requires updating the superproject as well. The new commits are invisible to the superproject until that is done.

It'd be great if they worked like Python's editable package installations.


Then the state of the superproject would depend on when the checkout occurred. That would be disastrous for consistency, you’d be unable to replicate a checkout later or elsewhere. The state of a repo after a checkout should only depend on the commit that was checked out.


It’s interesting that we’ve never developed the equivalent for Git of what every programming-language ecosystem has: keeping two parallel listings of dependencies, one in terms of version constraints to satisfy, and the other in terms of exact refs.

I could totally see a .gitmodules.reqs file specified in terms of semver specs against tags, or just listing a branch to check out the HEAD of; resolving to the same .gitmodules file we already have. Not even a breaking change!
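For illustration only, such a hypothetical .gitmodules.reqs might look like the following. Every key and value here is invented; git supports none of this today:

```ini
# hypothetical .gitmodules.reqs -- nothing here is real git syntax
[submodule "assets"]
	path = assets
	url = https://example.com/game/assets.git
	req = semver:^2.4        ; resolve to the highest tag matching 2.4.x or compatible
[submodule "vendor/widgets"]
	path = vendor/widgets
	url = https://example.com/widgets.git
	req = branch:main        ; resolve to the branch tip at update time
```

A resolver would then pin whatever commits matched as ordinary gitlink entries, so checkouts stay reproducible exactly as they are now.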


It would mean attaching a semantic meaning to tags, but git doesn't do that, ever, for any reference. You don't even have to have a master branch, much less tags that follow semver. Linux doesn't even use semver!


Correct, this feature should be built on top of the source control system, not as part of it.


This is great for vendoring external dependencies that aren't under the developer's control. When the same developer is working on several related but separate projects at the same time, it's too cumbersome.

Would be nice if git submodules could also point to a branch instead of specific commits. That way, the superproject's state would not be modified every time the branch is updated.


If your software is small enough that a small handful of engineers can keep the whole thing in their heads, you probably don't need submodules. If you have different teams working on different parts of the project, submodules start making more sense.


They can now, with the new-ish submodule update/init --remote. But the problem with submodules is that you cannot do a shallow fetch (depth 1) because most hosts won’t serve unadvertised refs.


Submodules are almost always the wrong answer. If you need to version huge files, use Git LFS.


I have tried submodules, but it’s way too easy to shoot yourself in the foot. Not very sustainable in a team with different levels of git knowledge.


I’ve done that. Especially if you want specific versions of data to build ML models, this makes a nice audit log for reproducibility


Also known as workspace views in P4.

It's interesting to see the wheel reinvented. We used to run a 500GB art sync / 200GB code sync with a ~2TB back-end repo back when I was in gamedev. P4 also has proper locking; it really is the right tool if you've got large assets that need to be coordinated and versioned.

Only downside of course is that it isn't free.


> Only downside of course is that it isn't free.

Another downside is that it consumes insane resources (our servers are in the dozens of TiB of ram, with huge NVMe based storage arrays directly attached)

Another downside is that you have to maintain connection to p4 to do any VCS operations (stashing included).

Another downside is that branches are very "expensive" (often taking days) and are impossible to reconcile. We never re-merge to MAIN.


This kind of comment isn't helpful. Of course, there have been ways to copy large files around since there were networks. What's new in this protocol enhancement is that it works within the context of a Merkle-tree-based technology (upon which all DVCSs are built). To use your analogy, yes this is a wheel, but it's built with rubber instead of wood and iron.


I guess I should have expanded more.

DVCS is in direct opposition to workflows that include binary files (yes, I'm aware that git lfs has locking; it's also centrally orchestrated) because almost no binary format can be merged.

We were using P4 ~15 years ago for these workflows and rather than understanding what made them work people are just rediscovering the same problems that have already been solved.

My guess is that we'll next see a solution that dynamically caches most downloaded files in a geographic friendly way, heck we may even call it "P4Proxy".

I've seen so much FUD around how git is the "one true workflow" because other solutions "don't scale" when they don't understand the constraints that certain workflows impose. Git/DVCS is great for a lot of things but sometimes you should use the right tool for the job rather than hack something together.

[Edit] These reasons are exactly why you see Unreal supporting P4/SVN out of the box[1] and no mention of git.

[1] https://docs.unrealengine.com/en-US/Engine/UI/SourceControl/...


I think this is a very p4 centric view of the world.

Locking helps with preventing collisions, but honestly the issue is still always communication. Why are people even touching files they shouldn't be touching?

Meanwhile perforce is a pain for code heavy projects and requiring a central perforce server. Git works great there.

The issue is neither is a silver bullet for the others workflow and needs, and they both suck horribly for mixed code and binary asset workflows.

That's not even considering cost.

Meanwhile film studios generally prefer keeping the considerations separate and using symlinks or URI to their data store and that works really well. But that doesn't work great for remote workflows.

So again, I think you're applying a very p4, game centric view to this. There are lots of different use cases and team structures that none of these version control systems are able to address in their entirety.


> but honestly the issue is still always communication. Why are people even touching files they shouldn't be touching?

Because there's 300+ people working on a project, and it's not feasible to know what every other person is working on or planning on working on.

The file lock (code can be merged too, just like git, so this is only really for binary assets) is a crude communication tool saying "hey I'm using this file".

> Meanwhile perforce is a pain for code heavy projects and requiring a central perforce server. Git works great there.

I think you're applying a git biased view here. I don't think perforce is unsuitable for code heavy projects at all (see workspace views as a prime example), and for the majority of people, having a central server isn't an issue. Most people treat github/gitlab as a centralised server anyway. I've _never_ in a decade of programming heard someone suggest adding an extra remote to git so I can share your changes, it's always been "push it as a separate branch and I'll merge it". If you have a missing internet connection, you're likely not able to share code _anyway_, and with p4 you can always reconcile offline work when you're back online.


Meanwhile film studios get on with 300+ people without needing locking and without people clobbering the same asset files.

I think locking is a fine utility to have but I think a lot of workflows use it to workaround poor communication.

And I think you're constraining your views of git to just your workflow.

I've worked in a lot of scenarios where you need multiple remotes such as having an internal repo and an external one.

And similarly there are lots of scenarios where having a decentralized copy of the repo is very useful for being able to work offline and compare multiple branches, like when commuting on a plane or being in low-connectivity areas.

I don't see how my view is git centric. I'm saying each VCS has very useful areas and equally big rough spots. The problem is that each VCS group believes theirs is the only right system.


If by P4 centric you also mean SVN as well then sure.

I will say however that if you think locking is optional then you already don't understand these workflows and why they're so critical. Art/design/animation doesn't care that "they should not have been touching the file" they just care that they have to throw away two days of work because someone made multiple edits to the same package file. I've literally seen multiple teams almost come to blows when this happens.

You can separate code from art/assets. That comes at an integration and iteration cost, it'll drive your designers mad.

There are workflows out there where code is not the first class citizen, in those cases I've seen git shoehorned in and untold pain follows.


I've worked as a pipeline supervisor at one of the biggest VFX studios. I very much understand these workflows. I've worked as an artist and a tech artist in perforce and SVN workflows too. I've had to support the workflows of a 1000+ workforce across multiple locations.

I don't think attributing my disagreement with you to not understanding workflows is a fair characterization.

I still think locking, while useful, is only mandatory for work cultures with poor communication. Plenty of companies get by with very large workforces that don't hit these issues without having locking.

Separating code and art assets also doesn't need to be painful. It's very doable but does require some amount of architectural consideration.

And I very much acknowledge there are projects where code isn't the majority makeup, which is why I say that none of the VCS systems cover mixed projects well or cover all the needs of the others well.


I think we'll just have to agree to disagree.

It sounds like we just come from different development cultures. Your solution to lack of locking sounds like a top-down hierarchy that wouldn't be flexible enough to support the teams I've worked with.

Having seen both approaches (and how they break down), I'll take a centralized locking solution over communication mistakes that lead to days of work being lost.


That's fair to agree to disagree but I also again think it's unfair to characterize my pipelines as not flexible enough.

For context I developed the publishing pipelines for the majority of departments in the studio. I have several hundreds of assets being published through my pipelines on a daily basis, if not more, from a variety of departments.

We only hit collisions on a very rare basis, and which were often resolved in an hour or two in the worst case.

We've scaled this from very small teams to large ones, from very scrappy realtime productions to feature length offline rendered films.

I don't doubt that locking helps. I just argue that maybe it's not as critical as people make it out to seem.


What happens if you're not around to drive the process? What about if you don't have the organizational backing to drive the process? What if a team goes AWOL or isn't bought in to your process? I've seen variants of all those happen in production in one form or another.

At the end of the day humans make mistakes, especially when involving communication. I'd rather have a physical system that prevents breaks instead of requiring cross-team/cross-discipline coordination.

Maybe gamedev is much more coupled than film (we regularly had design, animation, art and code touching the same common core packages). Look at Unreal or any other gamedev pipeline and you'll see a bias for locking source control solutions.


It's very rare that someone needs to be around to oversee the process. Tooling guides the vast majority of users in to a workflow that works while still being flexible should they need it.

Lack of organizational backing: do you mean cultural backing from the studio, or infrastructure? Both are a problem no matter what solution you pick.

If a team goes AWOL, that's on them. The tooling usually allows for some amount of arbitrary workflow but they can't go completely off the rails. But that's true of p4 too. So I think that scenario would have to be more specific.

And yes people make mistakes, and you need tooling to guide them. Locking is a tool, but it's not the only tool. I feel very much that many workflows use it to hide deeper issues. That's not to say it's not valid, it is, but it's not a panacea either.

Unreal heavily favors perforce and SVN because that's what it was designed around. There's no absolute reason it could not work with other versioning systems and their paradigms if it came to being necessary.

Unity on the other hand is quite happy to work with any version control system, and works quite well with git or perforce.

You again seem to be trying to approach this from the angle of only the system you're familiar with working. But maybe try stepping outside the box and seeing if your workflow isn't a byproduct of your tools.

After all, you were asking git users to look at perforce as the solution. I don't think it's fair to then go ahead and assume that p4 is the only workable solution.


Oh I've been working with git for ~9 years now, it's not a lack of familiarity.

Take AOSP, even Google had to overlay the repo[1] tool to scale past git. It's a hot pile of garbage that won't let you sync all repos to a specific point in time. Not to mention the nature of cross-repo commits are not atomic. Good luck bisecting a breaking change across millions of lines of code and build files.

I've spent over a week chasing down how some homespun tool for storing binary assets side-by-side with git works so I could get a single file into a build.

Last company I was at which was a leader in the Android space just put the whole thing in P4, branch per device and it worked without many major issues. Pulling source took 1/50th the time a repo sync took. Literally an A/B comparison of one tech vs the other. That's before you even start to consider prebuilts.

Like I said, I think we're just going to have to agree to disagree and leave it at that.

[1] https://gerrit.googlesource.com/git-repo/


Hi, I work at Google and replied to you way up-thread. I built a CI system used by ChromeOS based on Repo and even contributed some changes to it. While I don't like it much, it is useful. You misunderstand or are misinformed about many aspects of it.

> Google had to overlay the repo[1] tool to scale past git

It was created to allow for a forest of git repos to all coexist in a world in which git submodules wasn't suitable yet OR the repos spanned security domains. However, almost all of the shortcomings of submodules have been addressed and so—at least—the team that I lead now is considering migration to it from Repo.

> cross-repo commits are not atomic

Yes, that is a feature. But I think you meant that there's no cross-repo coordinate in the timeline to sync to. However, there is. That's exactly what a Repo tool manifest snapshot is. Our CI system ensured that change that had deps across repos were committed and a Repo manifest snapshot only was taken with all inter-commit deps satisfied.

> Good luck bisecting a breaking change across millions of lines of code and build files

The team that I led implemented this. We simply snapshotted the forest at every T time intervals. For bisection, we walked the snapshots. Once a specific manifest snapshot was identified as the culprit, we further bisected within repos for a specific change.

> Pulling source took 1/50th the time a repo sync took.

Yes, that's what the partial clones (the article you replied to) and sparse checkouts solves. Once these two things are widely available, I don't see any benefits to P4 remaining.


None of the things you're talking about exist within the repo tool, it sounds like these are all things that you had to layer on top with a separate CI system.

It's been 12 years since the Dream was released and we're still not to the same level of perf/features as just stuffing the whole of AOSP in P4. I get that git has advantages, and it's an awesome tool when used appropriately but the desire to use it to solve every SCM problem under the sun is a bit misguided.


> Why are people even touching files they shouldn't be touching?

I don't think it's accurate to say they shouldn't be touching the files.

It's possible to make two unrelated changes in the same binary file just as it is possible to make two unrelated changes in a source file (or other merge-friendly text file).

Just as there may be nothing wrong if one person changes the function foo() in a file and someone else changes the function bar() in that same file, there may be nothing wrong if one person opens a CAD file and makes a change to a drawing in one part and another person makes a change to a different part of the same drawing.

In that case, they could coordinate by communicating (even though their tasks are unrelated) but then they're just doing the same thing as locking the files but manually and informally (and probably inconsistently) without the benefits of automation.

Of course there are times when locking catches failures to communicate, but that doesn't mean that that's what locking is for.


Sure, but I would argue it's a poor system that is set up to require serial editing of a non-diffable file.

Again, I'm not saying locking is an invalid solution. It is. But to me, it's often (but not always), a crutch for a deeper issue.

That binary file should be set up to be modular if it is intended to have areas that multiple users can touch without directly affecting each other.


I love P4 for just working, but I absolutely can't stand the limited shelving ability. I ended up writing a helper program that lets me shuffle local changes off to a git repo just so I could manage working on several overlapping changelists. Perforce would be so much more usable if they would include this sort of basic functionality right out of the box. The thing git gets right is that you often need to juggle several threads of change at the same time, and those threads may have complex branching as you try out different approaches and combine the best pieces at the end.


P4 was great for its time if you could pay for it, but it is definitely not a competitor anymore.

Git LFS has been just fine for multiterabyte repositories for years.


p4 is still great; the existence of workarounds that make git usable on similar workloads doesn't take away p4's inherent advantages. They're two very different tools.


Thank you, that phrases it much better than I could. They are very different tools (although I hear good stuff about P4 Fusion).

Now if someone wanted to make an open source P4 replacement that would be a neat thing.


P4 is fine on paper. I just wish the client didn't crash and the server didn't lock up as often as they do. P4V is a mess.


This is interesting and could be a savior for machine learning (ML) engineering teams. In a typical ML workflow, there are three main entities to be managed: code, data, and models. Systems like Data Version Control (DVC) [1] are useful for versioning the data and models. DVC improves usability by residing inside the project's main git repo while maintaining versions of the data/models on a remote. With Git partial clone, it seems like the gap between code and data/models could be reduced further.

[1] - https://dvc.org/


Also --reference (or --shared) is a good parameter to speed up cloning (for builds, for example) if you have your repository cached somewhere else. I used it a long time ago when I was working on a system that required cloning 20-40 repos to build. This approach decreased clone times by an order of magnitude.


Do you actually need clones in that scenario? I worked on a build system that grabbed source from several hundred repos at the starting point, and it turned out to be way faster to just grab it all as tarballs with aria2c.


Grabbing the tarball from where? To the best of my knowledge, tarball export is not part of git, but something git hosts provide.

Git is a distributed VCS, and we should support keeping it that way.


Almost any project you work on will have an authoritative copy of the repo in some kind of web-accessible tool, most of which provide a tarball-download function.

And GitHub's scheme is pretty much a de-facto standard at this point—GitLab's implementation is an exact copy of it, for example:

    https://<host>/<org/project>/archive/<ref/branch/tag>.tar.gz
Edit to add: Also, git-archive --remote is actually most of the way there, but it's not an HTTP download, of course. :(


GitHub doing something one way and GitLab copying it doesn't make a standard.


Careful: with extra-large repositories it actually slows down the cloning while, obviously, significantly reducing the space usage.


That seems quite useful, though Git LFS mostly does the job.

One of my biggest remaining pain points is resumable clone/fetch. I find it near impossible to clone large repos (or fetch if there were lots of new commits) over a slow, unstable link, so almost always I end up cloning a copy to a machine closer to the repo, and rsyncing it over to my machine.


What’s your take on this line?

> Partial Clone is a new feature of Git that replaces Git LFS and makes working with very large repositories better by teaching Git how to work without downloading every file.


I believe partial clone makes the situation a little better, but it's not nearly as good as resumable cloning, because you have to partition your repo in advance.


This is great. We use git lfs extensively, and one of our biggest complaints is that users have to clone 7GB of data just to get the source files. There's a workaround where you don't enter your username and password for the LFS repo and let it time out, but that's a kludge.


There’s an option for that: GIT_LFS_SKIP_SMUDGE=1 git clone SERVER-REPOSITORY


In the AAA games industry git has been a bit slower on the uptake (although that’s changing quickly) as large warehouses of data are often required (eg: version history of video files, 3D audio, music, etc.). It’s nice to see git have more options for this sort of thing.


Surprised this new idea doesn’t support object storage. Sounds like Git LFS would still be the right way to go for repos with assets for games like meshes, sounds, etc.

However I’ve heard many studios use Perforce instead. However not being open source is a downside to some, but I don’t really know too much about it personally.

Then, if working with a lot of non-code files, it sounds like some solutions have locking. I guess two people couldn't edit the same Blender or PSD file at the same time and then merge them later on.

Kinda wouldn’t surprise me if some companies actually run multiple versioning control systems. Code on one system, game assets on another.


I think in terms of game production, software licensing usually isn’t the largest cost center for a project. Proprietary software isn’t a concern as much, given that games traditionally are “shipped” and then completed. (Note that this changes as games that are more online service-based with live operations, rather than a specific release date and a “final” copy sent for production; the internet has changed things a lot)

You’re more right than you think about multiple versioning systems, although keeping synchronized becomes an issue. Perforce is a bit of a boon for management, as they get a GUI for versioning across a multidisciplinary team.


Git LFS has been a thing for years, though.


You’re absolutely right, but larger developers and publishers have been slower to adopt.

P4’s GUI/model has also historically been more intuitive for non-programming roles to learn and use than git, so a team with a wide range of skills can ramp up quickly with a unified toolset. A less-technical manager gets a GUI that has versioning across changes from a multidisciplinary team. You can probably guess what inertia that has in a space with higher turnover compared to other industries.

As mentioned, things are changing though. git and GitHub have become a mainstay and are what new programmers likely learn in schools. This has a trickle effect on new projects with smaller teams and results in more investment into git setups. I use git in a AAA context at work, and it’s not uncommon to find sentiments from more seasoned game programmers on git that are similar to HN comments about the latest fad in web frameworks.


The problem is that you blamed Git, rather than your legacy workflows and corporate culture.

And again, you keep blaming Git here, now around a lack of intuitiveness and a lack of GUI. Again, Git has multiple GUIs around to choose from and multiple integrations with almost any editor and IDE you can think of, some meant for beginners and trivial usage.

And no, things are not "changing" and Git is not to be compared with a "web framework fad". Git became the version control system more than 5 years ago.


Sorry if my tone came off as blaming like you said. I didn't mean to be blaming git or holding it responsible for something. I see a lot of existing inertia for Perforce in the AAA game development industry, and wanted to express that.

I think you're correct in that git is the go-to version control software. I'd reach for it as a default tool every time. I do work with older programmers where the majority of their careers have been in Visual C++ and Perforce, and I've definitely heard sentiments seeing it as wheel-reinventing. I don't agree with them, but it's what I've experienced.


So I fully agree with you, but one area for more artist heavy workflows where git still struggles is ease of use.

The two biggest issues are:

- git lfs doesn't automatically identify large files or binary files, so it's very easy for even experienced engineers to have set up LFS but forgotten to track a file or extension.

- git exposes too much of its internals. It's really cumbersome for artists even with UI tools.

That's not to fault git as a technology but I think there's a place for an artist friendly layer on top of git, perhaps a very artist centric UI and set of tools and workflow guides
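One mitigation for the first issue is a pre-commit-style check that flags large files whose `filter` attribute isn't `lfs`. This is only a sketch (the 5 MB cutoff, file names, and demo repo are my own invention, not part of any LFS tooling), and it only needs `.gitattributes` rules, so it works even before `git lfs install` has run:

```shell
# Demo setup: a throwaway repo with one LFS-tracked pattern
# and one large file that was forgotten.
tmp=$(mktemp -d) && cd "$tmp" && git init -q
printf '*.psd filter=lfs\n' > .gitattributes                  # artists' files go to LFS...
dd if=/dev/zero of=hero.psd  bs=1048576 count=6 2>/dev/null
dd if=/dev/zero of=level.bin bs=1048576 count=6 2>/dev/null   # ...but .bin was forgotten

# The check: flag files over 5 MB whose "filter" attribute is not "lfs".
# `git check-attr filter -- FILE` prints "FILE: filter: lfs" for tracked
# paths, or "FILE: filter: unspecified" otherwise.
find . -path ./.git -prune -o -type f -size +5M -print |
  while read -r f; do
    f=${f#./}   # strip leading ./ so .gitattributes patterns match
    attr=$(git check-attr filter -- "$f" | awk -F': ' '{print $3}')
    [ "$attr" = lfs ] || echo "WARNING: large file not tracked by LFS: $f"
  done
```

Run from the repository root, this would warn about `level.bin` but not `hero.psd`.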


This could actually be a really good solution to the maximum supported size of a Go module. If you place a go.mod at the root of your repo, every file in the repo becomes part of the module, and there's a hardcoded maximum module size of 500 MB. Problem is, I've got 1 GB+ of vendored assets in one of my repos, so I had to trick Go into treating the vendored assets as a separate Go module[0]. Go would have to add support for this, but it would be a pretty elegant solution to the problem.

[0]: https://github.com/golang/go/issues/37724
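The workaround described above (paths and module names here are hypothetical; the linked issue has the real discussion) amounts to dropping a go.mod into the asset directory, so the `go` tool treats everything under it as a separate module and excludes it from the parent module's zip:

```shell
# Hypothetical layout: carve vendored assets out of the parent module
# by giving their directory its own go.mod.
tmp=$(mktemp -d) && cd "$tmp"
mkdir -p myproject/assets
cat > myproject/go.mod <<'EOF'
module example.com/myproject

go 1.14
EOF
cat > myproject/assets/go.mod <<'EOF'
module example.com/myproject/assets

go 1.14
EOF
```

Since nothing imports `example.com/myproject/assets`, that module never has to be downloaded or fit under the size limit.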


That does sound like a "you're holding it wrong" issue. As one of the Go team members pointed out, defining a separate module is not a hack, but the intended way of doing it.

How would a partial checkout help?


Go modules are built around git, unlike many other languages package systems. That means you don't get to pick and choose what goes into them. Imagine if you had to put an empty package.json in every (non-node) directory of your git repo to exclude it from an NPM package, or an install.py in every (non-python) directory to exclude it from a PyPI package. Multi-language repos would get ridiculous pretty quickly.


> Go modules are built around git

Not really. Modules are specced based on zip files and metadata in text files. There's just support for extracting that data from git repos transparently.

Here's a slightly out of date write-up: https://research.swtch.com/vgo-module


This is only an issue if you put a go.mod file at the top level of your repo. We have monorepos with hundreds of modules.


I started a project recently, and for the first time ever I've wanted to keep large files in my repo. I looked into git LFS and was disappointed to learn that it requires either third-party hosting or setting up a git LFS server myself. I looked into git annex and it seems decent. This, once it is ready for prime time, will hopefully be even better.


Is it possible, given a git repo (hosted on, say, GitHub), to only 'clone' (download) certain files from it, without the `.git` directory?


I believe you're looking for the 'working tree' only. You could list the files with the following (pipe to `tar -x` instead to actually extract them; note that not all hosts allow `git archive --remote`):

git archive --remote=<your-URL> HEAD | tar -t

source: https://stackoverflow.com/questions/3946538


If you only want a subset of the repo's files, you can use Github's Subversion interface: https://stackoverflow.com/a/18194523


Short answer is, not easily: https://stackoverflow.com/a/14610427

You can get the most recent tree for a repository (no history, just the current state of the repo) with `git clone --depth=1`. That's often good enough for slow connections.
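For comparison, here's a sketch of both approaches against a throwaway local repository (standing in for a real remote URL), showing that `--depth=1` truncates history while a blobless partial clone keeps all of it:

```shell
# Demo setup: a tiny local repo standing in for a real remote.
tmp=$(mktemp -d) && cd "$tmp"
git init -q src && cd src
git config user.email demo@example.com && git config user.name demo
echo v1 > file.txt && git add file.txt && git commit -qm 'first'
echo v2 > file.txt && git commit -aqm 'second'
git config uploadpack.allowfilter true   # let clients request --filter
cd ..

# Shallow clone: full tree at HEAD, but history truncated to one commit.
git clone -q --depth=1 "file://$tmp/src" shallow
git -C shallow rev-list --count HEAD     # prints 1

# Blobless partial clone: full history, blobs fetched lazily on checkout.
git clone -q --filter=blob:none "file://$tmp/src" partial
git -C partial rev-list --count HEAD     # prints 2
```

The partial clone is what the article describes: history and trees come down immediately, while file contents are fetched on demand.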


I'm still unconvinced. Will this provide a user-friendly approach to managing design assets?


My impression is that it will use the normal git experience for managing design assets, i.e. with this there should be no need for additional tooling. If it works, that would be so great!


> One reason projects with large binary files don't use Git is because, when a Git repository is cloned, Git will download every version of every file in the repository.

Wrong? There's a --depth option for the git fetch command that lets the user specify how many commits they want to fetch from the repository.


depth is broken; it cannot be used for submodules/recursive submodules dependably because most hosts will refuse to serve unadvertised refs. We learned this the hard way. Or maybe it is submodules that are broken. Or git itself.


Depth, submodules and multiple work trees are all half-baked features that work fine up to a point, then start falling over frantically—most notably if you try to use them together.


--depth just lets you control the amount of history. It won't let you exclude unwanted files from the most recent commits. So while that statement wasn't accurate, limiting history isn't what this feature is intended for.


Yes, but 95% of devs, even fairly talented ones, don't really know how to use Git.


Author seems to be a manager, not necessarily a dev.


Git is fundamentally very simple. Any dev who doesn't understand exactly how it works is not even remotely talented.


In AWS, it's worth considering putting those large files in an S3 bucket.



