Exporting to BibTeX from Zotero

As I explained in a previous post (in French), Zotero is a wonderful tool for managing bibliographies, but that is not the end of the story: to use these bibliographies in LaTeX documents, one needs to export them from Zotero to BibTeX. This is where things get complicated: Zotero provides a set of translators, including one for BibTeX, but it is far from perfect. For this reason I have written a bash script that cleans up the BibTeX files generated by Zotero.

In my Zotero library I have a collection « Papers » whose subcollections are my various projects. I export each of these projects to a common folder (the variable $BIBPATH in the code below; in my case it is ~/Documents/Bibliographies/zotero/). Every time I modify a collection, I export it again and then rerun my script (saved as « ~/bin/bibtrim »). The script has no effect on files that have already been cleaned, so it is safe to run it several times on the same files.

#!/bin/bash

# in principle applying this a second time is without any effect

BIBPATH=~/Projets/Bibliographies/zotero

for file in $BIBPATH/*; do
    # delete abstract, file, note, urldate, address, isbn and issn entries
    # ("note" is deleted because it also stores "extra")
    sed -i -r '/(abstract|file|note|urldate|address|isbn|issn) =/d' $file

    # surround titles with brackets to protect from lowercase, but first remove added brackets
    sed -i -r '/title/s/\{([a-zA-Z]+)\}^,/\1/' $file
    sed -i -r 's/title = (\{.+\})/title = {\1}/' $file

    # for arxiv entries, replace url and journal by preprint fields
    sed -i -r 's/url = \{http:\/\/arxiv.org\/abs\/([a-z0-9/-]+)\}/archivePrefix = "arXiv",\n\teprint = "\1"/' $file
    sed -i -r 's/url = \{http:\/\/arxiv.org\/abs\/([0-9.]+)\}/archivePrefix = "arXiv",\n\teprint = "\1"/' $file
    sed -i -r '/journal = \{+arXiv/d' $file

    # replace the protected dollar
    sed -i -r 's/\\\$/$/g' $file
    # fix latex formula (only in title)
    sed -i -r '/title/s/\{\\textbackslash\}/\\/g' $file
    sed -i -r '/title/s/\\\{/\{/g' $file
    sed -i -r '/title/s/\\\}/\}/g' $file
    # note: as a consequence, escaped braces (e.g. a set written \{...\}) cannot be used in titles

    # delete url if doi is present (works only if it is the line just before)
    # http://www.theunixschool.com/2012/06/sed-25-examples-to-delete-line-or.html
    sed -i -n '/doi =/{x;/url =/d;x;};1h;1!{x;p;};${x;p}' $file

    # delete empty url field (e.g. because of arxiv deletion)
    sed -i -r '/url = \{\},/d' $file
    # reduce runs of three or more brackets to exactly two
    sed -i -r 's/\{{3,}(.+)\}{3,}/{{\1}}/' $file

    # remove leading blank lines
    sed -i -r '/./,$!d' $file
done

As you can see, the whole script relies on sed to remove or replace content. Here is a summary of the different operations:

  • First, delete fields that are useless or poorly filled (see below for arXiv), such as abstract, address, etc. A side effect is that the files get smaller.
  • Double the brackets around titles to protect the uppercase letters.
  • For arXiv entries, define the fields « archivePrefix » (set to arXiv) and « eprint » (set to the paper id) so that modern BibTeX styles display them correctly (they print arXiv followed by the identifier, with a link). These fields are built from the arXiv url, which is then removed to avoid duplication.
    The journal field is then removed for articles with a modern id (of the form xxxx.xxxx), because it only contains information about the arXiv eprint and not about the final publication.
    Note that I do not save the primary class, since it is not needed to build the url and it cannot be obtained (for the moment) for modern ids.
  • Fix LaTeX formulas in the title (symbols like \ and $ are escaped on export, which we do not want).
  • Delete the url field if the doi field is present (to avoid duplication).
  • Finally, clean up the garbage left by the previous commands (empty url fields, more than two brackets, etc.).
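
To make the list above concrete, here is a minimal sketch that applies a subset of the script's substitutions to one made-up entry read from standard input, instead of editing files in place. The entry, its citation key and the helper name clean_entry are invented for illustration, and GNU sed is assumed (for -r and for \n in the replacement). It also runs the cleanup twice to show that, on this sample at least, the second pass changes nothing.

```shell
#!/bin/bash
# clean_entry: hypothetical helper that applies a few of the script's
# substitutions to stdin (GNU sed assumed).
clean_entry() {
    sed -r \
        -e '/(abstract|file|note|urldate|address|isbn|issn) =/d' \
        -e 's/title = (\{.+\})/title = {\1}/' \
        -e 's/url = \{http:\/\/arxiv.org\/abs\/([0-9.]+)\}/archivePrefix = "arXiv",\n  eprint = "\1"/' \
        -e '/journal = \{+arXiv/d' \
        -e 's/\\\$/$/g' \
        -e 's/\{{3,}(.+)\}{3,}/{{\1}}/'
}

# Made-up entry, loosely shaped like a Zotero export (two-space indentation).
entry=$(cat <<'EOF'
@article{doe_example_2014,
  title = {An Example with \$N=2\$ Supersymmetry},
  abstract = {Some long abstract.},
  journal = {arXiv:1234.5678 [hep-th]},
  url = {http://arxiv.org/abs/1234.5678},
}
EOF
)

once=$(printf '%s\n' "$entry" | clean_entry)
twice=$(printf '%s\n' "$once" | clean_entry)

printf '%s\n' "$once"
[ "$once" = "$twice" ] && echo "second pass changed nothing"
```

On the second pass the title briefly gets a third pair of brackets, which the last substitution reduces back to two; this is why the bracket cleanup has to come after the doubling.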

The main problem with arXiv entries (with a modern id) is that the publication data (journal, etc.) are stored in the extra field, which is exported into the note field. They are therefore removed by the first command, although it would be interesting to keep them; but it is quite difficult to extract them from the extra field because it is just a string of characters.

Update 12/07/2014: the previous script introduced a new line at the beginning of the file and deleted the second-to-last line on each run. This bad behavior came from the code used to delete the url fields that come just before a doi field. This problem is now fixed. I have also added a line to delete leading blank lines.
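
The url-before-doi command is the least readable line of the script, so here is a small sketch of what it does, run on standard input rather than on a file. The sed program is the one from the script; the function name drop_url_before_doi and the sample fields (including the placeholder url and doi values) are made up for illustration. The idea is that each line is kept in sed's hold space and printed one step late, so that a url line can still be dropped when the next line turns out to be a doi; the final ${x;p} flushes the last buffered line.

```shell
#!/bin/bash
# Same sed program as in the script, reading stdin instead of editing a file.
drop_url_before_doi() {
    sed -n '/doi =/{x;/url =/d;x;};1h;1!{x;p;};${x;p}'
}

# Made-up fields; the url and doi values are placeholders.
sample=$(cat <<'EOF'
  year = {2014},
  url = {http://example.org/paper},
  doi = {10.1000/example},
}
EOF
)

result=$(printf '%s\n' "$sample" | drop_url_before_doi)
printf '%s\n' "$result"
```

Only a url on the line immediately before the doi is removed, as the comment in the script says. The command also assumes that the doi is not the very last line of the file, which holds for exported entries since they end with a closing bracket.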

Update 11/08/2014: there were some problems with the brackets in titles, so now all brackets inside the title are removed.

Update 17/04/2016: here is a new version of the script that is intended to work with Better BibTeX. I am not replacing the previous version because I have not tested whether the new one is compatible with the normal BibTeX export.

#!/bin/bash

# in principle applying this a second time is without any effect

# give the location of the bibtex files to process
BIBPATH=$1

# if no location is given, define a default one
if [[ -z "$BIBPATH" ]]; then
    BIBPATH=~/Projets/Bibliographies/zotero
fi

for file in $BIBPATH/*; do
    # delete abstract, file, note, urldate, address, isbn, issn and keywords entries
    # ("note" is deleted because it also stores "extra")
    sed -i -r '/(abstract|file|note|urldate|address|isbn|issn|keywords) =/d' $file
    
    # for arxiv entries, replace url and journal by preprint fields
    sed -i -r 's/url = \{http:\/\/arxiv.org\/abs\/([a-z0-9/-]+)\}/archivePrefix = "arXiv",\n  eprint = "\1"/' $file
    sed -i -r 's/url = \{http:\/\/arxiv.org\/abs\/([0-9.]+)\}/archivePrefix = "arXiv",\n  eprint = "\1"/' $file
    sed -i -r '/journal = \{+arXiv/d' $file
    
    # replace "arxiv" by "eprint" (compatibility with biblatex)
    sed -i -r 's/arxiv = /eprint = /' $file
    
    # replace the protected dollar
    sed -i -r 's/\\\$/$/g' $file
    sed -i -r 's/\\textdollar\{\}/$/g' $file
    # fix latex formula (only in title)
    sed -i -r '/title/s/\{\\textbackslash\}/\\/g' $file
    sed -i -r '/title/s/\\ensuremath\{\\backslash\}/\\/g' $file
    sed -i -r '/title/s/\\\{/\{/g' $file
    sed -i -r '/title/s/\\\}/\}/g' $file
    sed -i -r '/title/s/\\\^\{\}/^/g' $file
    
    # delete url if doi is present (works only if url is the line just before doi)
    # http://www.theunixschool.com/2012/06/sed-25-examples-to-delete-line-or.html
    sed -i -n '/doi =/{x;/url =/d;x;};1h;1!{x;p;};${x;p}' $file
    
    # delete empty url field (e.g. because of arxiv deletion)
    sed -i -r '/url = \{\},/d' $file

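    # delete doubled inner brackets, then surround titles with double brackets
    # to protect from lowercase (two cases, so that a second run changes nothing)
    sed -i -r '/title/s/\{\{//g' $file
    sed -i -r '/title/s/\}\}//g' $file
    # case 1: no outer brackets left (the title was already double-wrapped)
    sed -i -r 's/title = ([^{].*),/title = {{\1}},/' $file
    # case 2: single outer brackets still present (fresh export)
    sed -i -r 's/title = (\{[^{].*\}),/title = {\1},/' $file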
    
    # delete all brackets when there are more than 2
    #sed -i -r 's/\{{3,}(.+)\}{3,}/{{\1}}/' $file
    
    # remove leading blank lines
    sed -i -r '/./,$!d' $file
    
    # convert video type
    sed -i 's/@video/@misc/' $file
    
    # delete line with bibtex key (if export in bibtex and not better bibtex)
    sed -i -r '/bibtex:/d' $file
done
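
Since this version takes the folder as its first argument, the claim that a second run has no effect can be checked mechanically. The following is only a sketch: check_idempotent is a hypothetical helper, not part of the script, and it assumes that the cleaner accepts the folder as its first argument and that mktemp, cp and diff are available. It modifies the folder in place, so it is best tried on a copy of the exported files.

```shell
#!/bin/bash
# check_idempotent CLEANER DIR   (hypothetical helper, not part of the script)
# Runs CLEANER on DIR, snapshots the result, runs CLEANER again and compares.
# Returns 0 if the second run changed nothing, 1 if it changed something,
# 2 if the setup failed. CLEANER must accept the folder as its first argument.
check_idempotent() {
    local cleaner=$1 dir=$2 snapshot status
    "$cleaner" "$dir" || return 2                  # first pass
    snapshot=$(mktemp -d) || return 2
    cp -R "$dir"/. "$snapshot"/ || { rm -rf "$snapshot"; return 2; }
    "$cleaner" "$dir"                              # second pass
    diff -r "$snapshot" "$dir" > /dev/null         # 0 = identical, 1 = differs
    status=$?
    rm -rf "$snapshot"
    return "$status"
}
```

With the script saved as ~/bin/bibtrim, a call could look like: check_idempotent ~/bin/bibtrim /path/to/a/copy && echo "second run changed nothing".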
