Growing Your Malware Corpus
If you’re writing YARA rules or doing other kinds of detection engineering, you’ll want a test bed to run your rules against. This is known as a corpus. Your corpus should include both goodware (known-good operating system files) and a library of malware files.
One good source for a large number of malware samples is VX-Underground. What I really appreciate about VX-Underground is that in addition to providing lots of malware samples, they also produce an annual archive of samples and papers. You can download a whole year’s worth of samples and papers for any year from 2010 to 2023.
Pandora’s Box
Just to understand the structure here, I have a USB device called “Pandora.” On the root of the drive is a folder called “APT”, and within that is a “Samples” directory. Inside the Samples directory is the .7z download for 2023 from VX-Underground. There’s also a Python script… we’ll get to that soon enough.
The first thing we’ll need to do is unzip the download with the usual password.
7zz x 2023.7z
Once the initial extraction is complete you can delete the original 2023.7z archive.
Within the archive for each year, there is a directory for each sample, with sub-directories of ‘Samples’ and ‘Papers.’ Every one of the samples is also a password-protected 7z archive.
This makes sense from a safety perspective, but it makes it impossible to scan all the files at once.
Python to the Rescue
We can utilize a Python script to recursively go through the contents of our malware folder and unzip all the password protected files, while keeping those files in their original directories.
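To make the idea concrete, here is a minimal sketch of what such a recursive extraction could look like. It is not the ExtractSamples.py used in this post, just an illustration under a few assumptions: the archives end in .7z, the 7-Zip binary is on your PATH as 7zz, and the password is VX-Underground's customary "infected". The function names (find_archives, build_extract_command, extract_all) are made up for this example.

```python
import subprocess
from pathlib import Path

# Assumption: VX-Underground's customary archive password; change it if yours differs.
PASSWORD = "infected"
# Assumption: the 7-Zip binary is on PATH under the name "7zz".
SEVEN_ZIP = "7zz"


def find_archives(root, pattern="*.7z"):
    """Return every matching archive under root, sorted for a predictable order."""
    return sorted(p for p in Path(root).rglob(pattern) if p.is_file())


def build_extract_command(archive, password=PASSWORD):
    """Build a 7-Zip command that extracts into the archive's own directory."""
    archive = Path(archive)
    return [
        SEVEN_ZIP,
        "x",                     # extract, keeping paths stored in the archive
        f"-p{password}",         # archive password
        f"-o{archive.parent}",   # write output next to the archive
        "-y",                    # answer yes to overwrite prompts
        str(archive),
    ]


def extract_all(root):
    """Extract each archive in place and return the ones 7-Zip failed on."""
    failed = []
    for archive in find_archives(root):
        result = subprocess.run(build_extract_command(archive), capture_output=True)
        if result.returncode != 0:
            failed.append(archive)
    return failed
```

Calling extract_all(".") from the Samples directory would walk everything beneath it in a single pass. Extracting next to each archive is what keeps the samples in their original directories, and returning the failures gives you a short list to investigate rather than a wall of output.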
You may have noticed in the first screenshot that I have a script called ExtractSamples.py in my APT directory.
We will use this for the recursive password protected extractions.
python ExtractSamples.py
A flurry of code goes by, and you congratulate yourself on your Python prowess. Now if we look again at our contents, we’ve got the extracted sample and the original 7z file.
Let’s get rid of all the 7z files as we don’t need them cluttering up the corpus.
We can start by running a find command to identify all the 7zip files.
find . -type f -name '*.7z' -print
After you’ve checked the output and verified the command above is only grabbing the 7z files you want to delete, we can update the command to delete the found files.
find . -type f -name '*.7z' -delete
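If you would rather keep the cleanup in Python alongside the extraction, the same preview-then-delete pattern can be sketched as below. The remove_archives function and its dry_run parameter are hypothetical names for this example. It defaults to listing only, so nothing is deleted until you opt in.

```python
from pathlib import Path


def remove_archives(root, pattern="*.7z", dry_run=True):
    """List matching files under root; delete them only when dry_run is False."""
    matches = sorted(p for p in Path(root).rglob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

A first call such as remove_archives(".") plays the role of find with -print. Once the returned list looks right, remove_archives(".", dry_run=False) plays the role of -delete.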
One more directory listing to verify:
Success. All the 7z files are removed and all the sample files are intact.
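For a corpus too large to eyeball, a quick count of files by extension gives the same reassurance. This is a small sketch with a hypothetical helper name, count_by_suffix. Samples that have no extension at all are counted under the empty string.

```python
from collections import Counter
from pathlib import Path


def count_by_suffix(root):
    """Count files under root by lowercase extension ('' means no extension)."""
    return Counter(p.suffix.lower() for p in Path(root).rglob("*") if p.is_file())
```

After the cleanup, count_by_suffix(".")[".7z"] should come back as 0, while the counts for the other extensions stay where they were before the delete.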
GitHub Link: ExtractSamples.py
Time to go write some new detections!