Maybe you had a lot of files scattered around on different drives, and you added them all into a single git-annex repository. Some of the files are surely duplicates of others.

While git-annex stores the file contents efficiently, it would still help in cleaning up this mess if you could find, and perhaps remove, the duplicate files.

Here's a command line that will show duplicate sets of files grouped together:

git annex find --include '*' --format='${file} ${escaped_key}\n' | \
    sort -k2 | uniq --all-repeated=separate -f1 | \
    sed 's/ [^ ]*$//'
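
For illustration, if hypothetical files music/song.mp3 and backup/song_copy.mp3 had identical content (and likewise two copies of a photo), the output would look something like this, with each set of duplicates grouped together and a blank line between sets:

    music/song.mp3
    backup/song_copy.mp3

    photos/beach.jpg
    photos/beach_copy.jpg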

Here's a command line that will remove one of each duplicate set of files:

git annex find --include '*' --format='${file} ${escaped_key}\n' | \
    sort -k2 | uniq --repeated -f1 | sed 's/ [^ ]*$//' | \
    xargs -d '\n' git rm
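
If you would rather see what is going to happen before anything is deleted, the same pipeline can be pointed at git rm --dry-run (a cautious variant of the command above; it only reports what would be removed):

git annex find --include '*' --format='${file} ${escaped_key}\n' | \
    sort -k2 | uniq --repeated -f1 | sed 's/ [^ ]*$//' | \
    xargs -d '\n' git rm --dry-run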

--Joey

Very nice :) Just for reference, here's my Perl implementation. As per this discussion it would be interesting to benchmark these two approaches and see if one is substantially more efficient than the other w.r.t. CPU and memory usage.
Comment by http://adamspiers.myopenid.com/ Fri Dec 23 15:16:50 2011
Note that the sort -k2 doesn't work right for filenames with spaces in them. On the other hand, git rm doesn't seem to like the escaped names from escaped_file.
Comment by bremner Tue Sep 4 22:12:18 2012
problems with spaces in filenames

Spaces and other special characters can make filename handling ugly. If you don't have a restriction on keeping the exact filenames, then it might be easiest just to get rid of the problematic characters.

#!/bin/bash

# Recursively rename files and directories, dropping characters that tend
# to cause trouble in shell pipelines: '[', ']', single quotes, and spaces
# (spaces become underscores).
function process() {
    dir="$1"
    echo "processing $dir"
    pushd "$dir" >/dev/null 2>&1 || return

    for fileOrDir in *; do
        nfileOrDir=$(echo "$fileOrDir" | sed -e 's/\[//g' -e 's/\]//g' -e 's/ /_/g' -e "s/'//g")
        if [ "$fileOrDir" != "$nfileOrDir" ]; then
            echo "renaming $fileOrDir to $nfileOrDir"
            git mv "$fileOrDir" "$nfileOrDir"
        else
            echo "skipping $fileOrDir, no need to rename."
        fi
    done

    # recurse into subdirectories (using their possibly renamed names)
    find ./ -mindepth 1 -maxdepth 1 -type d | while read -r d; do
        process "$d"
    done
    popd >/dev/null 2>&1
}

process .

Maybe you can run something like this before checking for duplicates.
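
A possible way to run it (the script name and path below are only placeholders):

    cd /path/to/annex              # placeholder: the top of your git-annex repository
    bash strip-special-chars.sh    # placeholder name for the script above
    git commit -m "rename files to drop problematic characters"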

Comment by mhameed Wed Sep 5 04:38:56 2012
Ironically, a previous round of renaming to remove spaces, plus some syncing, is how I ended up with these duplicates. For what it is worth, aspiers' Perl script worked out for me with a small modification: I just printed out the duplicates with spaces in them (quoted).
Comment by bremner Sun Sep 9 15:33:01 2012
Since the keys are sure to have no spaces in them, putting them first makes working with the output with tools like sort, uniq, and awk simpler.
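
For example, a key-first variant of the removal pipeline could lean on awk's associative arrays instead of sort and uniq (just a sketch; it still assumes filenames contain no newlines):

    # print every file after the first one seen for each key; the filename is
    # everything after the first space, so embedded spaces pass through intact
    git annex find --include '*' --format='${key} ${file}\n' | \
        awk 'seen[$1]++ { print substr($0, index($0, " ") + 1) }' | \
        xargs -d '\n' git rm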

Is there any simple way to search for files with a given key?

At the moment, the best I've come up with is this:

git annex find --include '*' --format='${key} ${file}\n' | grep <KEY>

where <KEY> is the key. This seems like an awfully longwinded approach, but I don't see anything in the docs indicating a simpler way to do it. Am I missing something?

@Chris I guess there's no really easy way because searching for a given key is not something many people need to do.

However, git does provide a way. Try git log --stat -S $KEY
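
For example (the key below is only a placeholder):

    # placeholder key, not a real one; -S finds the commits whose diffs
    # added or removed the string, i.e. the commits touching that annexed file
    KEY='SHA256E-s12345--0123456789abcdef0123456789abcdef.pdf'
    git log --stat -S "$KEY"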

Comment by http://joeyh.name/ Mon May 13 14:42:14 2013

Thanks. I have quite a lot of papers in PDF format. Now I'm saving space, have them under control, synchronized with many devices, and I have found more than 200 duplicates. Is there a way to donate to the project? You really deserve it. Thanks.

@Juan the best thing to do is tell people about git-annex, help them use it, and file bug reports. Just generally be part of the git-annex community.

(If you really want to donate to me, http://campaign.joeyh.name/ is still open.)

Comment by http://joeyh.name/ Wed Aug 28 16:25:20 2013

I'm already spreading the word. Handling scientific papers, data, simulations and code has been quite a challenge during my academic career. While code was solved long ago, the first three items remained a huge problem. I'm sure many of my colleagues will be happy to use it. Is there any hashtag or Twitter account? I've seen that you collected some of my tweets, but I don't know how you did it. Did you search for git-annex? Best, Juan

I used the following shell pipeline to remove duplicate files in one go:

(1) git annex find --format='${key}:${file}\n' \
(2)    | cut -d '-' -f 4- \
(3)    | sort \
(4)    | uniq --all-repeated=separate -w 40 \
(5)    | awk -vRS= -vFS='\n' '{for (i = 2; i <= NF; i++) print $i}' \
(6)    | cut -d ':' -f 2- \
(7)    | xargs -d '\n' git rm
  1. Generate a list of keys and file names separated by a colon (':').
  2. Cut out the initial part of the key so that the hash is at the beginning of the line. The -f 4- ensures that dashes in the filename do not result in truncation.
  3. Sort the entire list.
  4. Uniquify and print duplicates in groups separated by blank lines. Use the first 40 characters, which matches the length of a SHA1 hash; other hashes will require a different length (see the sketch after this list).
  5. Use awk to print all but the first line in each group. The empty -vRS sets blank line as the record separator, and the -vFS sets newline as the field separator. The for-loop prints each field except the first.
  6. Cut out the key and keep only the file name by relying on the colon introduced in the first step.
  7. Use xargs to separate file names by newline, which takes care of spaces in the file names. Send this list of arguments to git rm.
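
For instance, with the SHA256 or SHA256E backend the hex digest is 64 characters long, so (as a sketch, assuming that backend) step 4 would use -w 64 and everything else stays the same:

    git annex find --format='${key}:${file}\n' \
        | cut -d '-' -f 4- \
        | sort \
        | uniq --all-repeated=separate -w 64 \
        | awk -vRS= -vFS='\n' '{for (i = 2; i <= NF; i++) print $i}' \
        | cut -d ':' -f 2- \
        | xargs -d '\n' git rm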
Comment by sameerds Tue Dec 31 06:24:06 2013