Web ARChive
Filename extension | .warc |
---|---|
Internet media type | application/warc[1] |
Extended from | ARC[2] |
Standard | ISO 28500:2009[3][4] |
Open format? | Yes |
Website | archive-access |
The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format,[5] which has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. The WARC format generalizes the older format to better support the harvesting, access, and exchange needs of archiving organizations. In addition to the primary content currently recorded, the revision accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, and later-date transformations.[6]
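The idea of an aggregate file made of content blocks plus related information can be illustrated with a minimal sketch. It assumes the WARC/1.0 record layout (a version line, named header fields, a blank line, the content block, then two CRLFs) and handles uncompressed data only. The helper names `build_record` and `read_records` are hypothetical, chosen for illustration; they are not part of any of the tools listed under Software.

```python
import io
import uuid
from datetime import datetime, timezone


def build_record(record_type, target_uri, payload, content_type):
    """Serialize one record: version line, header fields, blank line, block, two CRLFs."""
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    return head.encode("utf-8") + payload + b"\r\n\r\n"


def read_records(stream):
    """Yield (headers, block) pairs from a stream of concatenated records."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of archive
        if line.strip() == b"":
            continue  # skip the CRLFs that close the previous record
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length tells the reader exactly how many block bytes follow.
        block = stream.read(int(headers["Content-Length"]))
        yield headers, block


# Primary content and a piece of secondary content in one aggregate archive.
archive = build_record("resource", "http://example.com/",
                       b"<html>hello</html>", "text/html")
archive += build_record("metadata", "http://example.com/",
                        b"note: captured for illustration\r\n", "text/plain")

records = list(read_records(io.BytesIO(archive)))
for headers, block in records:
    print(headers["WARC-Type"], headers["Content-Length"], len(block))
```

Because every record carries its own length, a reader can walk the archive record by record without parsing the content blocks themselves. Files produced by real crawlers are frequently gzip-compressed, which this sketch does not handle.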
References
- "application/warc". Retrieved 5 March 2015.
- "Introduction". Retrieved 5 March 2015.
- "Information and documentation -- WARC file format". Retrieved 5 March 2015.
- http://www.iso.org/iso/pressrelease.htm?refid=Ref1255
- "ARC_IA, Internet Archive ARC file format". www.digitalpreservation.gov. Retrieved 9 May 2015.
- "WARC, Web ARChive file format". www.digitalpreservation.gov. Retrieved 9 May 2015.
External links
- http://archive-access.sourceforge.net/warc/
- http://bibnum.bnf.fr/WARC/
- http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
- http://www.netpreserve.org/publications/WARC_Guidelines_v1.pdf
Software
- Heritrix web archiver in Java
- wget (since version 1.14)
- WARC software library in Python
- warc-explorer, a Java tool to browse WARC archives
- ArchiveFS, a filesystem to mount WARC archives
- WSDK, a set of simple, compact, and highly optimized Erlang modules to manipulate (create/read/write) WARC files
This article is derived from Wikipedia (version of Tuesday, May 03, 2016). The text is available under the Creative Commons Attribution/Share Alike license; additional terms may apply to the media files.