I recently discovered that Library of Congress authority data sets for books, subjects, names, and other bibliographic elements are available for download as open-source, free data for research.
The process of uploading the files into Koha was quite straightforward. I chose to begin with their subject file, which contains approximately 320,000 entries. The files are located here: https://www.loc.gov/cds/products/marcDist.php
1. Download the MARC-8 files
2. For each, change the file extension from .marc8 to .mrc
3. Open a file in MarcEdit via its Batch Edit and choose MARCSplit:
4. You can then split the file into manageable pieces. I chose to split into files of 10,000 records each; I have also used files of 20,000. (Koha takes about 5-10 minutes to import 10,000 records.)
5. Reindex Zebra after uploading all your files (library2 is the name of my Koha instance):
sudo koha-rebuild-zebra -f -v library2
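If MarcEdit is not at hand, the rename and split steps above can be approximated with a short script. This is only a minimal sketch, not what I actually ran: the function name split_marc and the output file naming are my own invention, and it relies on the fact that every MARC (ISO 2709) record ends with the byte 0x1D. It does not convert the character encoding, so the output is still MARC-8, and it reads the whole file into memory.

```python
from pathlib import Path

RECORD_TERMINATOR = b"\x1d"  # ISO 2709 end-of-record byte


def split_marc(source, out_dir, chunk_size=10_000):
    """Write `source` out as numbered .mrc files of at most `chunk_size` records each.

    Hypothetical helper: returns the list of files it wrote.
    """
    source = Path(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    data = source.read_bytes()
    # Splitting drops the terminator, so re-attach it to every record.
    records = [r + RECORD_TERMINATOR
               for r in data.split(RECORD_TERMINATOR) if r.strip()]

    written = []
    for start in range(0, len(records), chunk_size):
        part_no = start // chunk_size + 1
        # e.g. subjects.marc8 -> subjects_001.mrc, subjects_002.mrc, ...
        target = out_dir / f"{source.stem}_{part_no:03d}.mrc"
        target.write_bytes(b"".join(records[start:start + chunk_size]))
        written.append(target)
    return written
```

Calling split_marc("subjects.marc8", "pieces") (both names are placeholders) would leave a set of .mrc files in the pieces folder, each ready to be staged in Koha.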
CAUTION: The subject data set also contains free-floating subdivisions. They are MARC records in their own right. As of Koha 18.11, if you navigate to Search Authorities and click on the Details link of a subdivision entry, you will get an Internal Server Error. This may be a bug.
After clicking on Submit, you are presented with the entire authority file in Koha. In my case, the first 40 pages or so are free-floating subdivisions. You can tell because there is no subheading within each entry before the term (e.g., “Topical Heading”, “Corporate Name”, etc.):
If you proceed to click on Details, an Internal Server Error occurs (a Plack error, I think). Update: This issue has been resolved.
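To get a feel for how many subdivision records a file contains before importing it, you can look at each record's directory. The sketch below is an assumption-laden illustration rather than part of my workflow: the function names are made up, and it relies on the MARC 21 authority convention that subdivision headings are carried in tags 180, 181, 182, and 185.

```python
SUBDIVISION_TAGS = {b"180", b"181", b"182", b"185"}  # MARC 21 subdivision headings


def heading_tags(record):
    """Return the 1XX tags listed in the directory of one raw MARC record."""
    base = int(record[12:17])        # leader bytes 12-16: base address of data
    directory = record[24:base - 1]  # between the leader and the field terminator
    # Each directory entry is 12 bytes; the first 3 are the tag.
    tags = {directory[i:i + 3] for i in range(0, len(directory), 12)}
    return {t for t in tags if t.startswith(b"1")}


def count_subdivisions(path):
    """Count records in the file at `path` whose heading is a subdivision."""
    with open(path, "rb") as handle:
        data = handle.read()
    # Records end with 0x1D; anything shorter than a leader is not a record.
    records = [r for r in data.split(b"\x1d") if len(r) > 24]
    return sum(1 for r in records if heading_tags(r) & SUBDIVISION_TAGS)
```

Running count_subdivisions on one of the split files gives a rough idea of how many of its entries will show up in Koha without a heading type.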
Record Matching Rule for Importing Authorities
CAUTION: To prevent duplicates during upload, I created a record matching rule for authorities, based on the information in the Koha 18.11 manual: https://koha-community.org/manual//18.11/en/html/administration.html#record-matching-rules
I created a match rule on the authority’s tag 010$a (the LCCN), with a match threshold of 100. Then, I created fail-safe match sets using each of the following indexes:
The match rule then works in this way: If the record to be imported has a value in tag 010$a that is identical to the 010$a value of an authority record already in Koha, then a match is found (it scores 101 points, which is above the match threshold of 100) and the record is not imported (as defined by the import rule in Koha Staged Record Management). If, for any reason, no match is found on tag 010$a, the other indexes (as above) are checked to see if a match exists.
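The scoring idea can be written out in a few lines. This is a toy model of the logic as I understand it, not Koha's actual code: the function name is_duplicate, the dictionary layout, and the LCCN values are all hypothetical, and I am assuming a record counts as a match once its score reaches the threshold.

```python
def is_duplicate(incoming, existing, match_points, threshold=100):
    """Score two records against each other and say whether they match.

    `match_points` maps a field (e.g. "010$a") to the points it is worth.
    A match point only scores when both records carry the same non-empty value.
    Returns (matched, score).
    """
    score = sum(
        points
        for field, points in match_points.items()
        if incoming.get(field) and incoming.get(field) == existing.get(field)
    )
    # Assumption: reaching the threshold is enough to count as a match.
    return score >= threshold, score


# Placeholder LCCNs, only to show the shape of the data.
rule = {"010$a": 101}
staged = {"010$a": "sh 00000001"}
in_koha = {"010$a": "sh 00000001"}
print(is_duplicate(staged, in_koha, rule))
```

With the 010$a match point worth 101 and a threshold of 100, identical LCCNs are enough on their own to flag the staged record as a duplicate, which is why it is then skipped at import.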
The benefit of having the subdivisions as part of your authority file is that you can control them in your MARC framework as you would the 650$a field. Neat!