ZIP files are often used to spread phishing scams, ransomware, and other types of malware. Because the format isn't usually executable (except for self-extracting ZIPs), it hasn't gotten as much attention as executable formats. This blog post looks at how the format can be abused and describes what we did to address it.
There are many compressed file formats, like tarballs (.tar.gz), RAR archives (.rar), and 7-Zip (.7z), but ZIP has become both the basis for many popular file formats and the generic term for compressing and bundling files. It underlies Microsoft Office Open XML files (docx, xlsx, and pptx file extensions), Java Archives (JAR), Android Packages (APK), and Electronic Publication (EPUB) files. ZIP structures can also be found in self-extracting EXEs, PDFs, and other, more obscure formats.
Because of the redundancies in the ZIP format and Postel's Law, which says, "Be conservative in what you send and liberal in what you accept," many different ZIPs are "acceptable." On top of this variety, ZIP parsers make their own choices, so a single ZIP file can produce different results depending on which parser reads it. In this post, we talk about some of the redundant information and the parts of the format that leave the meaning of a file up to the program that reads it. Adversaries can take advantage of these differing interpretations, and we'll give you a hint about what we're doing to look at these maliciously made ZIPs from all sides.
ZIP Mix-ups
The ZIP format has been illustrated by corkami and others, but for this example, I will try to explain it with just two pieces of data that can be used to name a file. Outside of symlinks and hard links, most filesystems give each file a single name. ZIP, on the other hand, does not.
If you look at corkami's ZIP 101 poster, you can see that, unlike many other formats, ZIP is meant to be read from the bottom up. But if a parser reads it in the other direction, strange things can happen.
The file name can be stored in two different places in the ZIP format: the Central Directory Entry and the Local File Header. The Central Directory is at the end of the file, right before the End of Central Directory structure, and each stored file is preceded by its own Local File Header. You can confuse parsers by giving the same data a different name in each structure. If a ZIP parser has been patched against the ZIP Slip vulnerability, this doesn't do much harm, but it's an excellent place to start looking at how different parsers handle badly formed ZIP files.
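To make the two name fields concrete, here is a minimal Python sketch. It is not the tooling discussed in this post: the helper names are hypothetical, the streaming reader is deliberately naive, and it assumes a single stored entry with sizes present in the Local File Header. It builds a ZIP with Python's zipfile module, renames the entry in the Central Directory only, and then lists the names both ways.

```python
import io
import struct
import zipfile

LOCAL_SIG = b"PK\x03\x04"    # Local File Header signature
CENTRAL_SIG = b"PK\x01\x02"  # Central Directory Entry signature


def build_mismatched_zip() -> bytes:
    """Create a one-file ZIP, then rename the entry in the Central Directory only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("foo", b"hello")
    data = bytearray(buf.getvalue())
    cd = data.find(CENTRAL_SIG)
    # The file name starts 46 bytes into a Central Directory Entry.
    assert data[cd + 46:cd + 49] == b"foo"
    data[cd + 46:cd + 49] = b"bar"  # same length, so no offsets change
    return bytes(data)


def names_from_central_directory(blob: bytes) -> list[str]:
    """Names as reported by zipfile, which lists entries from the Central Directory."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return zf.namelist()


def names_from_local_headers(blob: bytes) -> list[str]:
    """Naive streaming reader: walk Local File Headers starting at offset 0."""
    names, pos = [], 0
    while blob[pos:pos + 4] == LOCAL_SIG:
        csize, _usize, nlen, xlen = struct.unpack_from("<IIHH", blob, pos + 18)
        names.append(blob[pos + 30:pos + 30 + nlen].decode("cp437"))
        pos += 30 + nlen + xlen + csize  # skip header, name, extra field, data
    return names


blob = build_mismatched_zip()
print(names_from_central_directory(blob))
print(names_from_local_headers(blob))
```

The same bytes yield "bar" through the Central Directory and "foo" through the Local File Header, which is the whole trick.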
Two functions that look very similar can produce different results from the same file. The first reads the ZIP in a mode called "streaming," which relies only on the Local File Headers to find files. The second reads the ZIP from the Central Directory Entries once that data has been validated. The System Enhancement Associates (SEA) ARC format, popular in the early days of floppy shareware, only had file headers and no central directory.
SEA sued Phil Katz over his PKARC utility, which led him to create ZIP. From what I can discern from the original release documents, the central directory helped keep track of archives spread across multiple floppy disks. Floppies survive only as the save icon, but ZIP is still around.
Even though this isn't a threat in and of itself (or even something experienced Rubyists wouldn't expect), ZIPs can do more than give a file two possible names. File sizes, compression methods, duplicate names, and concatenated ZIPs can all make it hard for programmers to open them safely.
File Sizes
File sizes, both compressed and uncompressed, are stored in the Central Directory Entry and in the Local File Header. They can also be stored in ZIP64 extensions or in a structure called the Data Descriptor, which acts as a footer for the stored data. Depending on how the files are stored (for example, added to the ZIP uncompressed), differences in the recorded compressed size can lead to different extracted files.
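A small Python sketch of a size disagreement follows. The two reader functions are hypothetical stand-ins for real parsers, and they assume a single stored (uncompressed) entry with no ZIP64 extension or Data Descriptor. The Local File Header is patched to claim 4 bytes while the Central Directory still says 8.

```python
import io
import struct
import zipfile

CENTRAL_SIG = b"PK\x01\x02"  # Central Directory Entry signature


def build_size_mismatch() -> bytes:
    """Store 8 bytes, then make the Local File Header claim only 4."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("a.txt", b"AAAABBBB")
    data = bytearray(buf.getvalue())
    # Local header: compressed size at offset 18, uncompressed size at offset 22.
    struct.pack_into("<II", data, 18, 4, 4)
    return bytes(data)


def read_trusting_local_header(blob: bytes) -> bytes:
    """Use the size recorded in the first Local File Header."""
    csize, _usize, nlen, xlen = struct.unpack_from("<IIHH", blob, 18)
    start = 30 + nlen + xlen
    return blob[start:start + csize]


def read_trusting_central_directory(blob: bytes) -> bytes:
    """Use the size and local header offset recorded in the Central Directory."""
    cd = blob.find(CENTRAL_SIG)
    csize = struct.unpack_from("<I", blob, cd + 20)[0]
    header_offset = struct.unpack_from("<I", blob, cd + 42)[0]
    nlen, xlen = struct.unpack_from("<HH", blob, header_offset + 26)
    start = header_offset + 30 + nlen + xlen
    return blob[start:start + csize]


blob = build_size_mismatch()
print(read_trusting_local_header(blob))       # truncated view
print(read_trusting_central_directory(blob))  # full view
```

Which of the two views a real tool produces depends on which structure it trusts, so the sketch should be read as an illustration of the ambiguity rather than a model of any specific parser.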
Methods of Compression
The compression method records which algorithm was used to compress the data in the ZIP file, or that none was used because the file was too small to benefit or the ZIP creator chose to leave it uncompressed. The method has its own field in each of the structures above, but only DEFLATE is widely used. Other methods, like LZMA, are occasionally seen. Because DEFLATE blocks can themselves be uncompressed, the compression method combined with differences in size and offset can make it possible for one parser to extract only part of a file while another extracts the whole file.
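The point about uncompressed DEFLATE blocks can be illustrated with a few lines of Python. This sketch hand-builds a single "stored" DEFLATE block (block type 00 in RFC 1951) around a payload; the function name is hypothetical. A reader that inflates the stream gets the payload back, while the very same bytes also contain the payload verbatim after a 5-byte block header.

```python
import struct
import zlib


def stored_deflate_block(payload: bytes) -> bytes:
    """Wrap payload in one final, uncompressed (BTYPE=00) DEFLATE block."""
    if len(payload) > 0xFFFF:
        raise ValueError("a stored block holds at most 65535 bytes")
    n = len(payload)
    # 0x01 = final block, stored; then LEN and its one's complement NLEN.
    return b"\x01" + struct.pack("<HH", n, n ^ 0xFFFF) + payload


payload = b"this text passes through DEFLATE untouched"
stream = stored_deflate_block(payload)

inflated = zlib.decompress(stream, -15)  # -15 selects raw DEFLATE, as used in ZIP
print(inflated == payload)               # a DEFLATE-aware reader recovers the payload
print(stream[5:] == payload)             # a reader treating the bytes as stored sees it too
```

Since the data is readable under either interpretation, a mismatch between the declared method, size, and offset does not necessarily produce an error; it may just produce a different file.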
Duplicate Names
Similar to how the names foo and bar got mixed up in centralbar-localfoo.zip, ZIP files can be made with names that clash with each other in other ways. Multiple Local File Headers can have the same name, Central Directory Entries can report duplicate names, and nothing stops one entry from overwriting the other if the unzipping application writes files out naively.
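Duplicate names are easy to produce even with a standard library. The following Python sketch (file names and contents are made up for illustration) writes two entries called report.txt and then reads the archive back per entry and by name.

```python
import io
import warnings
import zipfile


def build_duplicate_zip() -> bytes:
    """Write two entries that share one name."""
    buf = io.BytesIO()
    with warnings.catch_warnings():
        # zipfile emits a UserWarning for duplicate names but still writes them.
        warnings.simplefilter("ignore", UserWarning)
        with zipfile.ZipFile(buf, "w") as zf:
            zf.writestr("report.txt", b"benign")
            zf.writestr("report.txt", b"malicious")
    return buf.getvalue()


blob = build_duplicate_zip()
with zipfile.ZipFile(io.BytesIO(blob)) as zf:
    listed = zf.namelist()                                # both entries are listed
    by_entry = [zf.read(info) for info in zf.infolist()]  # each entry read separately
    by_name = zf.read("report.txt")                       # lookup by name picks one

print(listed, by_entry, by_name)
```

Reading by entry recovers both contents, while the lookup by name returns only the later one. An extractor that writes each entry to disk under its name would leave a single file behind.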
Combining ZIP files
Concatenated ZIPs, like this NanoCore delivery method, are one of the easiest ways to evade detection. Considering the above problems, it's easy to see how two ZIP files that have been joined together could look like a single valid ZIP file to a parser. Which of the two gets extracted, however, depends on the parser. In all of these cases, we wanted a parser that wouldn't have an opinion but would give us every file that could be hiding inside a ZIP. Here's another example of how concatenated ZIPs can cause confusion.
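As a rough Python illustration of the concatenation problem, the sketch below joins two small archives and compares what zipfile reports with what a signature scan finds. The scanner is a hypothetical and deliberately naive stand-in for an "opinionless" parser: it can be fooled by file data that happens to contain the signature, and it is not the parser described in this post.

```python
import io
import struct
import zipfile

LOCAL_SIG = b"PK\x03\x04"  # Local File Header signature


def make_zip(name: str, content: bytes) -> bytes:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, content)
    return buf.getvalue()


def every_local_name(blob: bytes) -> list[str]:
    """Report the name in every Local File Header found by signature scan."""
    names, pos = [], blob.find(LOCAL_SIG)
    while pos != -1:
        nlen = struct.unpack_from("<H", blob, pos + 26)[0]
        names.append(blob[pos + 30:pos + 30 + nlen].decode("cp437"))
        pos = blob.find(LOCAL_SIG, pos + 4)
    return names


combined = make_zip("first.txt", b"decoy") + make_zip("second.txt", b"payload")

with zipfile.ZipFile(io.BytesIO(combined)) as zf:
    print(zf.namelist())           # zipfile starts from the last End of Central Directory
print(every_local_name(combined))  # the scan sees entries from both archives
```

Here zipfile only reports the second archive's entry, while a reader that starts at the top of the file would meet first.txt first.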
Central Directory Offset Confusion
The End of Central Directory (EOCD) is the last part of a well-formed ZIP file. It's a bit like the last page of a book telling you which page the index (the Central Directory) is on. In books, we usually think of page one as the first page, and in binary files, we typically think of offset zero as the first byte. But ZIP data can be embedded in other file formats, which means that "offset zero" can mean different things to different parsers.
Different parsers can make different decisions. Python's zipfile, Rust's zip-rs, and Info-ZIP (the standard unzip command on Linux and Mac) all treat the offset to the Central Directory as an offset from the beginning of the file. Go, by contrast, treats it as an offset from the start of the ZIP data. This means that you can have two different Central Directories within the same file, and depending on how the unzipping tool interprets the same offset, you can see two completely different sets of files.
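The two readings of the offset field can be sketched in Python. The function below is hypothetical and makes simplifying assumptions: no ZIP64, and the Central Directory sits immediately before the EOCD. It parses the EOCD and returns where the Central Directory would be under each interpretation.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"     # End of Central Directory signature
CENTRAL_SIG = b"PK\x01\x02"  # Central Directory Entry signature


def central_directory_candidates(blob: bytes) -> tuple[int, int]:
    """Return (offset from file start, offset from inferred start of ZIP data)."""
    eocd = blob.rfind(EOCD_SIG)
    if eocd < 0:
        raise ValueError("no End of Central Directory record found")
    (_sig, _disk, _cd_disk, _n_disk, _n_total,
     cd_size, cd_offset, _comment_len) = struct.unpack_from("<4sHHHHIIH", blob, eocd)
    from_file_start = cd_offset
    # Infer where the ZIP data begins from where the directory must end.
    zip_start = eocd - cd_size - cd_offset
    from_zip_start = zip_start + cd_offset
    return from_file_start, from_zip_start


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"hello")
prefix = b"\x00" * 16  # stand-in for a host format wrapped around the ZIP
blob = prefix + buf.getvalue()

absolute, relative = central_directory_candidates(blob)
print(blob[absolute:absolute + 4] == CENTRAL_SIG)  # absolute reading misses the directory
print(blob[relative:relative + 4] == CENTRAL_SIG)  # relative reading lands on it
```

With 16 bytes in front of the archive, the two candidates differ by exactly 16. An attacker who controls the bytes at both positions can place a different Central Directory at each.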
In addition to file names and EOCD offsets, there are many other ways to trick ZIP parsers. If an attacker knows what tool is used to read their ZIP payload, they can probably make a ZIP file that confuses that tool. Or they can make the ZIP file look like another kind of file so that it evades detection. For example, some threat actors have their stager download what looks like an image file but is really a ZIP file, either with the image format's magic bytes prepended or simply renamed.
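A quick way to spot the "image that is really a ZIP" trick is to check the head and the tail of the file independently. The Python sketch below uses hypothetical helper names and the PNG signature as the example image format; it flags a buffer that starts like a PNG but also carries an EOCD record near its end.

```python
import io
import zipfile

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # PNG file signature
EOCD_SIG = b"PK\x05\x06"          # End of Central Directory signature
EOCD_WINDOW = 22 + 0xFFFF         # EOCD record plus the largest possible comment


def looks_like_png(blob: bytes) -> bool:
    return blob.startswith(PNG_MAGIC)


def may_contain_zip(blob: bytes) -> bool:
    """Look for an EOCD signature in the tail, where a ZIP reader would search."""
    return EOCD_SIG in blob[-EOCD_WINDOW:]


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("payload.txt", b"not really a picture")
disguised = PNG_MAGIC + buf.getvalue()

print(looks_like_png(disguised), may_contain_zip(disguised))
with zipfile.ZipFile(io.BytesIO(disguised)) as zf:
    print(zf.namelist())  # zipfile still opens it despite the PNG header
```

A check based only on the first few bytes would classify this buffer as an image, which is why looking at a file from more than one angle matters.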