Draft:WebArchiver
Free and open-source tool to help preserve web content for the future
From Wikipedia, the free encyclopedia
WebArchiver is a free and open-source web archiving tool that creates, reads, replays, and manages WARC files. It provides a web-based graphical interface and REST API for capturing websites, viewing archived content directly in the browser, and downloading standard WARC files. The project was created with the goal of providing a no-cost alternative to commercial web archiving services, allowing users to work with the WARC file format without requiring paid software.[1]
Comment: In accordance with Wikipedia's Conflict of interest guideline, I disclose that I have a conflict of interest regarding the subject of this article. MulderCW (talk) 13:40, 9 April 2026 (UTC)
| WebArchiver | |
|---|---|
| Stable release | 0.1.0 |
Background
The Web ARChive (WARC) format, standardized as ISO 28500:2017, is the primary file format used for storing web crawl data.[2] It was originally developed by the International Internet Preservation Consortium (IIPC) and is used by major institutions including the Internet Archive and national libraries worldwide.[3]
While the WARC format itself is an open standard, many of the tools for creating and viewing WARC files are either commercial products, command-line-only utilities, or components of larger institutional archiving systems. WebArchiver was developed to address this gap by providing a self-hosted, browser-based tool that handles the complete archiving workflow — from crawling to replay — in a single application.
Architecture
WebArchiver uses a client–server model with a Node.js backend and React frontend, packaged as a Docker container for deployment.
WARC engine
The software includes a custom WARC 1.1 engine that implements the full ISO 28500:2017 specification. It supports all eight WARC record types (warcinfo, response, resource, request, revisit, metadata, conversion, and continuation), per-record gzip compression for random access, SHA-256 digest computation for integrity verification, and deduplication through revisit records.[2]
The parser operates as a streaming finite-state machine that processes WARC files incrementally, supporting multi-gigabyte files without loading them entirely into memory. A CDX (Capture Index) maps each archived URL to its byte offset within the WARC file, enabling direct access to individual records.
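The CDX lookup described above can be sketched as a small in-memory index mapping each URL to the byte offsets of its captures. This is an illustrative sketch in TypeScript (the project's backend language), not WebArchiver's actual code; the field layout and lookup policy are assumptions.

```typescript
// Illustrative CDX-style index: maps an archived URL to the byte offset
// of its record inside the WARC file, so a record can be read directly
// without scanning the whole archive.
interface CdxEntry {
  url: string;       // original captured URL
  timestamp: string; // capture time as a sortable string, e.g. "20260409134000"
  offset: number;    // byte offset of the gzip member within the WARC file
  length: number;    // compressed length of the record
}

class CdxIndex {
  private entries = new Map<string, CdxEntry[]>();

  add(entry: CdxEntry): void {
    const list = this.entries.get(entry.url) ?? [];
    list.push(entry);
    // Keep captures in chronological order for timestamp lookups.
    list.sort((a, b) => a.timestamp.localeCompare(b.timestamp));
    this.entries.set(entry.url, list);
  }

  // Return the latest capture at or before the requested timestamp,
  // falling back to the earliest capture if none is older.
  lookup(url: string, timestamp: string): CdxEntry | undefined {
    const list = this.entries.get(url);
    if (!list || list.length === 0) return undefined;
    let best = list[0];
    for (const e of list) {
      if (e.timestamp <= timestamp) best = e;
    }
    return best;
  }
}
```

Because each record is gzip-compressed individually (per-record compression), the replay server can seek to `offset`, read `length` bytes, and decompress just that one record.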
Web crawler
The crawler uses Playwright to control a headless Chromium browser, capturing not only the initial HTML but also all resources loaded dynamically by JavaScript — including ECMAScript module imports, XHR/Fetch responses, web fonts, and CSS. The crawler waits for all network activity to settle before completing a page capture, ensuring that single-page applications and sites using dynamic module loading are fully archived.[1]
Multiple pages are crawled concurrently using a worker pool model. A per-domain rate limiter serializes requests to the same domain while allowing parallel requests to different domains, preventing excessive load on target servers. The crawler respects robots.txt directives by default.
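The per-domain serialization described above can be sketched by chaining each domain's requests onto a per-domain promise queue. This is a minimal illustration; the class name, delay value, and API shape are assumptions, not WebArchiver's actual implementation.

```typescript
// Illustrative per-domain rate limiter: tasks for the same domain run
// one at a time with a pause between them, while tasks for different
// domains proceed in parallel.
class DomainRateLimiter {
  private queues = new Map<string, Promise<unknown>>();

  constructor(private delayMs = 250) {}

  schedule<T>(url: string, task: () => Promise<T>): Promise<T> {
    const domain = new URL(url).hostname;
    const prev = this.queues.get(domain) ?? Promise.resolve();
    const next = prev
      .catch(() => {}) // a failed task must not block the rest of the queue
      .then(() => new Promise((resolve) => setTimeout(resolve, this.delayMs)))
      .then(task);
    this.queues.set(domain, next.catch(() => {}));
    return next;
  }
}
```

A worker pool would then pull pages off a frontier queue and call `schedule()` for each fetch, so concurrency across domains stays high while each target server sees at most one request at a time.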
During each crawl, a full-page screenshot is captured and stored as a WARC resource record within the archive file.
Replay system
The replay system serves archived content directly from WARC files in the user's browser. It employs a multi-layer approach to ensure archives are fully self-contained:
- Server-side URL rewriting: all URLs in HTML attributes (`href`, `src`, `srcset`, etc.), CSS (`url()`, `@import`), and HTTP redirect `Location` headers are rewritten to route through the replay server
- Client-side interception: an injected JavaScript shim overrides `fetch()` and `XMLHttpRequest` to rewrite dynamically constructed URLs at runtime
- Content Security Policy: a strict CSP header (`connect-src 'self'`) instructs the browser to block any request that escapes the URL rewriting
- Iframe sandboxing: archived content is rendered in a sandboxed `iframe` that prevents navigation to external sites
HTTP redirect chains (such as CDN 302 redirects) are resolved internally within the archive, serving the final destination content transparently.
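The server-side rewriting layer can be sketched as a pass over the HTML that redirects absolute URLs through a replay prefix. The `/replay/<timestamp>/` route shape below is an illustrative assumption, not WebArchiver's actual route, and a real implementation would also handle `srcset`, CSS, and `Location` headers.

```typescript
// Illustrative server-side URL rewriting: absolute URLs in href/src
// attributes are redirected through a replay prefix so the browser
// requests archived copies instead of the live web.
function rewriteHtml(html: string, prefix: string): string {
  return html.replace(
    /\b(href|src)=(["'])(https?:\/\/[^"']+)\2/g,
    (_match, attr, quote, url) =>
      `${attr}=${quote}${prefix}${encodeURIComponent(url)}${quote}`
  );
}
```

URLs built at runtime by JavaScript never pass through this function, which is why the client-side `fetch()`/`XMLHttpRequest` shim and the CSP header exist as additional layers.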
Version control
WebArchiver supports re-crawling archived sites to create new versions. A diff engine compares versions at the URL level using payload digests, identifying resources that were added, removed, or changed between crawls. This enables tracking of how websites evolve over time.
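The URL-level diff described above reduces to comparing payload digests across two crawls. A minimal sketch, assuming each crawl is represented as a map from URL to digest (the type names are illustrative, not WebArchiver's actual code):

```typescript
// Illustrative crawl diff: a resource is "changed" when the same URL has
// a different payload digest, and "added"/"removed" when it appears in
// only one of the two crawls.
type Crawl = Map<string, string>; // URL -> payload digest (e.g. SHA-256)

interface CrawlDiff {
  added: string[];
  removed: string[];
  changed: string[];
}

function diffCrawls(oldCrawl: Crawl, newCrawl: Crawl): CrawlDiff {
  const diff: CrawlDiff = { added: [], removed: [], changed: [] };
  for (const [url, digest] of newCrawl) {
    const prev = oldCrawl.get(url);
    if (prev === undefined) diff.added.push(url);
    else if (prev !== digest) diff.changed.push(url);
  }
  for (const url of oldCrawl.keys()) {
    if (!newCrawl.has(url)) diff.removed.push(url);
  }
  return diff;
}
```

Because digests are already computed when records are written, this comparison needs only the CDX metadata, not the archived payloads themselves; unchanged resources can also be stored as revisit records rather than duplicated.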
Technology
| Component | Technology |
|---|---|
| Backend | Node.js 20, TypeScript, Fastify |
| Frontend | React 19, Vite, Tailwind CSS |
| Database | SQLite with WAL mode |
| Crawler | Playwright with Chromium |
| Container | Docker |
| License | MIT License |
The application uses SQLite as its database, requiring no external database server. All data — the WARC files, CDX indexes, crawl metadata, and audit logs — is stored in a single data directory, simplifying backup and migration.
Comparison with other tools
| Tool | Type | WARC support | Browser rendering | GUI | Self-hosted | License |
|---|---|---|---|---|---|---|
| Wayback Machine | Service | Yes | No (server-side) | Yes (web) | No | N/A |
| Heritrix | Crawler | Yes | No | No (CLI) | Yes | Apache 2.0 |
| pywb | Replay | Yes | No (server-side) | Minimal | Yes | GPL 3.0 |
| Webrecorder | Service/Tool | Yes (WACZ) | Yes (Service Worker) | Yes | Partial | AGPL 3.0 |
| wget | Crawler | Yes | No | No (CLI) | Yes | GPL 3.0 |
| WebArchiver | Full stack | Yes (1.1) | Yes (Playwright) | Yes (web) | Yes | MIT License |
