Malware Blues!

Editor’s Note: This article describes “GotMalware?”, a project that explores malware fingerprinting and visualization techniques. It has been developed with help from Malwarebytes and Lib13 Inc as part of the Cyber Defenders 2017 Program.

The Problem

Malware is malicious software that infects computer systems and mobile devices with the intent to obtain secured private information or to delete or modify important information. In our project, we want to identify the fingerprints of different malware by comparing the files on a computer before and after infection. Then, we want to learn how to visualize the effect malware has on a computer system. Eventually, we would like to apply these desktop malware analysis techniques to mobile phones, especially Android devices.

What are we trying to do?

Plenty of software can already detect malware and prevent computer systems from getting infected. Our objective in this project is different: to observe malware behavior and the footprints it may leave behind by comparing files and associated signals from a clean test bed environment with those from an infected test bed environment. The test bed will give us a platform for understanding malware detection better and for developing tools for it. We will test three different types of malware and compare the fingerprints they leave behind.


To accomplish this goal, we first surveyed the current research that exists regarding malware detection as well as the use of machine learning in malware detection.

Here are some of the papers we read:

  • _Idika, Nwokedi, and Aditya P. Mathur. “A survey of malware detection techniques.” Purdue University 48 (2007)._ This paper discusses two of the most common techniques used in malware detection: anomaly-based detection and signature-based detection. Link

  • _Ahmed, Faraz, et al. “Using spatio-temporal information in API calls with machine learning algorithms for malware detection.” Proceedings of the 2nd ACM workshop on Security and artificial intelligence. ACM, 2009._ These researchers ran malware and benign software in a sandbox environment, analyzed their behavior, and used different algorithms to classify each program as malware or non-malware. Link

  • _Liao, Ken. “Solution Corner: Malwarebytes Endpoint Protection.” Blog post. Malwarebytes Labs. Malwarebytes, 27 June 2017. Web. 30 June 2017._ This blog post explains how Malwarebytes already incorporates machine learning into its products. Link

  • _Siddiqui, Muazzam, Morgan C. Wang, and Joohan Lee. “A survey of data mining techniques for malware detection using file features.” Proceedings of the 46th annual southeast regional conference on xx. ACM, 2008._ This article surveys data mining techniques from 19 different studies. Link

  • _Alazab, Mamoun, et al. “Zero-day malware detection based on supervised learning algorithms of API call signatures.” Proceedings of the Ninth Australasian Data Mining Conference-Volume 121. Australian Computer Society, Inc., 2011._ This research group used machine learning to identify zero-day malware based on the frequency of its Windows API calls. Link

Our Approach

We are planning to take the following steps:

  1. Learn the required tools: Our internship includes a Java course, but because Python has much better libraries for data analysis and visualization, we decided to learn and use it for our project.

  2. Create a malware analysis test bed: We are writing a Python program that will index the files (make an organized list of all the files along with their sizes) on multiple virtual machines (software that emulates a separate computer inside your main computer). Then, it will compare the directories and generate a report that lists the file modifications caused by the malware.

  3. Infect the virtual machines with different types of viruses and compare the files between the infected machines and a clean machine.

  4. Extract meaningful features from our samples. These features will be the basis of our study. Features are the measurable properties that describe something; for example, the features of a house include its number of rooms, its area, and its price.

  5. Visualize data. Malware is a threat to anyone who uses a computer, but many people have only a vague idea of what it is and what its effects can be. We aim to write something that will help people clearly visualize the effect of malware on their computers.

  6. Use machine learning on the prepared dataset.
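To make step 2 more concrete, here is a minimal sketch of how indexing and comparing could work. It is an illustration under stated assumptions, not the project's actual program: the function names `index_tree` and `compare_indexes` are hypothetical, and it assumes each virtual machine's files are reachable as an ordinary directory tree.

```python
import os


def index_tree(root):
    """Map each file path (relative to root) to its size in bytes."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(full_path)
            except OSError:
                # Skip files that vanish or cannot be read while indexing.
                continue
            index[os.path.relpath(full_path, root)] = size
    return index


def compare_indexes(clean, infected):
    """Report files that were added, removed, or changed in size."""
    clean_paths = set(clean)
    infected_paths = set(infected)
    return {
        "added": sorted(infected_paths - clean_paths),
        "removed": sorted(clean_paths - infected_paths),
        "resized": sorted(
            path
            for path in clean_paths & infected_paths
            if clean[path] != infected[path]
        ),
    }
```

Calling `compare_indexes(index_tree(clean_dir), index_tree(infected_dir))` returns three sorted lists of relative paths. Comparing sizes alone will miss a file that was modified without changing its length; storing a hash of each file's contents next to its size would catch that case at the cost of a slower indexing pass.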

Why is it beneficial?

Malware is a serious, constantly changing threat. A program that identifies malware and shows people the effect it has on their systems will help them understand its practical consequences and make more informed decisions in the future.

Can this be done in a better way?

A bonus part of our project (if time permits) is to use machine learning techniques to identify malware. Because malware is constantly changing to avoid the latest detection techniques, machine learning can be crucial in identifying forms of malware that are not currently known, but are similar to already known strains.
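One simple way to picture "similar to already known strains" is a nearest-neighbor comparison: describe each sample as a list of numbers and label a new sample with the label of the closest known example. The sketch below is only an illustration of that idea, not the technique the project has chosen. The feature layout (files added, files removed, files resized), the labels, and all of the numbers are made up for the example.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_label(sample, labeled_examples):
    """Return (label, distance) of the known example closest to sample."""
    best_label, best_distance = None, math.inf
    for features, label in labeled_examples:
        distance = euclidean(sample, features)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label, best_distance


# Made-up numbers for illustration only.
# Each tuple is (files added, files removed, files resized).
KNOWN_EXAMPLES = [
    ((120, 3, 40), "strain_a"),
    ((2, 0, 500), "strain_b"),
    ((0, 0, 1), "benign"),
]

# A new, unlabeled sample whose footprint resembles the first example.
label, distance = nearest_label((110, 5, 35), KNOWN_EXAMPLES)
```

In practice the features would need to be scaled so that one large count does not dominate the distance, and a library such as scikit-learn offers tested implementations of this and other classifiers.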

What have we done until now?

Our team has worked with VirtualBox to set up Windows 10 virtual machines. We have also studied the Python programming language by taking the Introductory and Intermediate Python for Data Science courses on DataCamp.

This week, we began writing our code. So far, we have written two programs: one that indexes the files on two virtual machines, and another that compares these directories to determine which files have been changed by the virus.

We have also experimented with other file comparison programs, mainly ExamDiff Pro, to get an idea of how a file comparison program works and the footprints it might find. Specifically, we used Metasploit to make a malicious PDF and compared it with a benign PDF in ExamDiff. This will help us learn the behaviors of malware so that we know what results to expect when we run our own program.

Our next step is to find three viruses and infect the virtual machines with them.

Code Review — Please?

The following is some of the code we plan to use. Please review and advise:

  • Code we plan to use for line-by-line file comparison: Here

  • Code we plan to use to compare two directories and save the results to a text file: Here
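For readers who want a feel for what these two tasks involve, here is a minimal sketch that uses only Python's standard library (`difflib` and `filecmp`). It is not the code linked above; the function names `line_diff`, `dir_report_lines`, and `save_report` and the "clean" and "infected" wording in the report are assumptions made for this example.

```python
import difflib
import filecmp
import os


def line_diff(old_path, new_path):
    """Return a unified diff of two text files as a list of lines."""
    with open(old_path, encoding="utf-8", errors="replace") as fh:
        old_lines = fh.readlines()
    with open(new_path, encoding="utf-8", errors="replace") as fh:
        new_lines = fh.readlines()
    return list(
        difflib.unified_diff(old_lines, new_lines, fromfile=old_path, tofile=new_path)
    )


def dir_report_lines(comparison, prefix=""):
    """Walk a filecmp.dircmp result and describe every difference found."""
    lines = []
    for name in sorted(comparison.left_only):
        lines.append(f"only in clean: {os.path.join(prefix, name)}\n")
    for name in sorted(comparison.right_only):
        lines.append(f"only in infected: {os.path.join(prefix, name)}\n")
    for name in sorted(comparison.diff_files):
        lines.append(f"differs: {os.path.join(prefix, name)}\n")
    # Recurse into subdirectories that exist on both sides.
    for name, sub in sorted(comparison.subdirs.items()):
        lines.extend(dir_report_lines(sub, os.path.join(prefix, name)))
    return lines


def save_report(lines, report_path):
    """Write the report lines to a text file."""
    with open(report_path, "w", encoding="utf-8") as fh:
        fh.writelines(lines)
```

A typical call would be `save_report(dir_report_lines(filecmp.dircmp(clean_dir, infected_dir)), "report.txt")`. Note that `filecmp.dircmp` compares files shallowly by default, using their type, size, and modification time, so two files with identical metadata are reported as equal without their contents being read.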