For a little side project, I’ve recently had to dig into the details of the ubiquitous ZIP file format. That might sound boring, but it’s worth understanding that ZIPs are absolutely everywhere, even when it doesn’t look like it. For example, Microsoft Office formats (.xlsx, .pptx), Java’s JAR, WAR, and EAR files, and the EPUB book format are all just ZIPs in disguise. Most programming languages have built-in ZIP support, and macOS and Windows make creating ZIPs a one-click affair. I doubt there are many large software systems that don’t rely on ZIP in at least some hidden corner, from banking to interplanetary missions. It’s safe to say that many millions of ZIP files have been created and continue to exist somewhere on SSDs, magnetic disks or even tapes, and that number just keeps growing.
Why? ZIP is a practical solution for file archiving, i.e. packing multiple files into one. It keeps directory structures intact, optionally saves space with built-in compression, and allows selective extraction without decompressing everything. This flexibility helped it gain traction even in the UNIX world, despite the dominance of its older, native tar.
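To make those three properties concrete, here is a minimal sketch using Python's standard zipfile module. The member names and contents are made up for illustration; the archive is built in memory so the example leaves nothing on disk.

```python
import io
import zipfile


def build_archive() -> bytes:
    """Pack three illustrative files, with directory paths, into one ZIP."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("docs/readme.txt", "hello " * 1000)  # highly compressible
        zf.writestr("docs/notes/todo.txt", "write more tests")
        zf.writestr("data/values.csv", "a,b\n1,2\n")
    return buf.getvalue()


def read_one(archive: bytes, name: str) -> bytes:
    """Extract a single member; the other members are not decompressed."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read(name)


archive = build_archive()
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    names = zf.namelist()  # directory structure survives as path prefixes

todo = read_one(archive, "docs/notes/todo.txt")
print(names)
print(todo)
print(len(archive), "bytes for the whole archive")
```

The selective read works because the reader looks up the member in the archive's directory and seeks straight to its data, rather than streaming through everything that comes before it.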
One might assume such a fundamental format would be rigidly governed by a formal standards body like ISO or IETF.
Not quite. Instead, ZIP was the brainchild of Phil Katz and Gary Conway in the late 1980s BBS scene. While most of its rivals like ARC, PAK, and ARJ have faded into obscurity, ZIP thrived because Katz and Conway released its specification into the public domain. Calling it a “specification” is generous, though – the famous APPNOTE.TXT has always been more of a personal notebook than a rigorous standard. Despite some improvements over the years, it remains fragmentary. While it describes the on-disk data structures in some detail, it offers little guidance on correct usage and implementation. A good deal of the text simply describes obscure variations of the format, many of which have become irrelevant. To this day and long after Phil Katz’s death, PKWARE maintains this same document, and even the ISO/IEC 21320-1:2015 standard merely references it. Secondary sources, like Wikipedia and developer blogs, offer more insight but aren’t always reliable.
ZIP has its share of oddities, some of which are due to its origins in the late 8-bit home computer and early IBM PC world, while others are just questionable design choices. Metadata, like file names and sizes, is stored twice, yet the format isn’t particularly resilient to corruption. There are countless ways to construct ZIP files that are technically valid but still break various tools. “ZIP bombs” exploit this by creating tiny files that decompress into absurdly large ones. Even something as simple as nesting a ZIP within a ZIP can trigger unexpected results due to the way “signatures” mark file boundaries.
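Both the duplicated metadata and the nesting problem are easy to see in a small experiment. The sketch below uses Python's zipfile module to store one archive inside another without compression, then looks at the raw bytes. The helper names (make_zip, entries_in_eocd) and the file names are my own, chosen for illustration; the three signatures and the 22-byte layout of the end-of-central-directory record are the ones described in APPNOTE.TXT.

```python
import io
import struct
import zipfile

LOCAL_SIG = b"PK\x03\x04"    # local file header
CENTRAL_SIG = b"PK\x01\x02"  # central directory file header
EOCD_SIG = b"PK\x05\x06"     # end of central directory record
FIXED_TIME = (2020, 1, 1, 0, 0, 0)  # fixed timestamp keeps the bytes reproducible


def make_zip(members: dict) -> bytes:
    """Build an archive whose members are stored uncompressed."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:  # default compression is ZIP_STORED
        for name, data in members.items():
            zf.writestr(zipfile.ZipInfo(name, date_time=FIXED_TIME), data)
    return buf.getvalue()


def entries_in_eocd(archive: bytes, pos: int) -> int:
    """Decode the fixed 22-byte EOCD record at pos and return its entry count."""
    fields = struct.unpack("<4sHHHHIIH", archive[pos:pos + 22])
    return fields[4]  # total number of central directory entries


inner = make_zip({"inner.txt": b"inside"})
outer = make_zip({"nested.zip": inner, "outer.txt": b"outside"})

# Each file name is written twice: local header and central directory.
name_copies = outer.count(b"outer.txt")

# The stored inner archive contributes its own signatures to the outer bytes.
eocd_count = outer.count(EOCD_SIG)

# A naive forward scan hits the inner archive's record first.
first_hit = entries_in_eocd(outer, outer.find(EOCD_SIG))
# Searching backwards from the end finds the outer archive's record.
last_hit = entries_in_eocd(outer, outer.rfind(EOCD_SIG))

print(name_copies, eocd_count, first_hit, last_hit)
```

Here the outer archive contains two end-of-central-directory records, and a parser that scans forward for the first signature would conclude that the archive holds one file instead of two. Scanning backwards from the end, which is what Python's own reader does, gives the expected answer, but nothing in the byte stream itself marks the inner signatures as "just data".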
Different ZIP parsers interpret these quirks in their own ways, leading to compatibility headaches. Info-ZIP’s implementation, last updated in 2008 and originally built as an attempt to create a standard implementation without an actual standard, is still a common reference, along with the more recent 7-Zip. Standard libraries in Java, Python, and C# agree on the basics but diverge when handling ZIP’s many extensions. Even the ZIP64 extension – introduced more than 20 years ago to support large archives – isn’t uniformly implemented. It’s fascinating (and slightly unnerving) to browse the code of various open-source ZIP implementations and see how differently their authors interpret the format and how they struggle with the many small variations of ZIP files that exist.
And yet, despite all of this, ZIP keeps working. There’s enough overlap between implementations to keep the format useful in practice. Developers avoid obscure features, learn from each other’s code, and just work around the format’s warts. Built for a different era, never formally standardized, and quasi-maintained by a loosely connected global community, ZIP persists.
It may not be perfect, but for close to 40 years, it’s been good enough. Few pieces of software can make this claim.