A graph summarizing the results

Editor’s Note: Applying machine learning in practice is challenging. In this article, Polina Khapikova, Akshatha Muralidhar, Muhammad Qureshi, and Willie Santos outline their approach to using ML to classify malware applications on Android.

Introduction

This week, our group reached the final part of our research project: applying the machine learning algorithms we worked on previously to classify Android malware. In this post, we discuss the steps we took, and the challenges we encountered, in modifying our existing program to work with Android files. We also added data visualization at the end to get a more pictorial view of our results.

Methods and Challenges

First, we had to research what APK files are. Much like PE (.exe) files on the Windows operating system, .apk is the file format that the Android operating system uses to install and run applications.

To build our dataset, we collected clean APK files from sites such as apkmirror.com and apk-dl.com. For our malicious files, we used the GitHub repository https://github.com/ashishb/android-malware.

We identified several features of the files to use for the machine learning algorithms. Two features we considered were the size of the file and its certificate, but we later removed both. File sizes were distributed similarly across the malicious and non-malicious applications, so size did not help with classification. Likewise, we learned that Android requires a certificate for every application, so the presence of a certificate did not help the algorithms categorize the files either.

Thus, the main features we used were the permissions that an application requests from the system. Android splits app permissions into two categories: “normal” and “dangerous”. The normal category is made up of permissions that Android does not consider a security risk, such as checking whether the phone is connected to a wifi network. The dangerous category is made up of permissions that could pose a security risk by allowing the user’s privacy to be compromised or their data to be accessed or modified. This category includes permissions such as using the phone’s camera, recording audio, or reading text messages. (The dangerous permissions are the ones Android asks about when you download apps through the Play Store. For more information, see https://developer.android.com/guide/topics/permissions/requesting.html)
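As a rough illustration of the two categories, the sketch below partitions a list of requested permission names into dangerous and other permissions. The set of dangerous permissions is a small illustrative subset chosen for this example, not the full list from the Android documentation, and the function name split_permissions is hypothetical.

```python
# Illustrative subset only; the complete list of dangerous permissions
# is in the Android documentation linked above.
DANGEROUS_PERMISSIONS = {
    "android.permission.CAMERA",
    "android.permission.RECORD_AUDIO",
    "android.permission.READ_SMS",
    "android.permission.READ_PHONE_STATE",
}


def split_permissions(requested):
    """Partition requested permission names into (dangerous, other)."""
    dangerous = sorted(p for p in requested if p in DANGEROUS_PERMISSIONS)
    other = sorted(p for p in requested if p not in DANGEROUS_PERMISSIONS)
    return dangerous, other
```

For example, an app that requests the camera and the wifi state would have the camera permission placed in the first list and the wifi state permission in the second.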

We used a GitHub repository called apk_parse, which itself relies on the popular Android analysis tool Androguard, to extract features. However, most of the features were returned as strings, which were incompatible with the ML algorithms we were using. To avoid having to research and select new algorithms, we converted the majority of our features to binary values: whether or not the specified permission is requested by the application. (The one exception was the file size, which was returned as an integer.)
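The conversion from permission strings to binary features can be sketched as follows. This is a minimal example and not our exact code: it assumes the permission names have already been extracted into a list of strings, and both the FEATURE_PERMISSIONS list and the encode_permissions function are hypothetical names chosen for illustration.

```python
# Hypothetical, fixed feature order. Each position in the output vector
# corresponds to one permission in this list.
FEATURE_PERMISSIONS = [
    "android.permission.READ_PHONE_STATE",
    "android.permission.CAMERA",
    "android.permission.RECORD_AUDIO",
    "android.permission.READ_SMS",
]


def encode_permissions(requested, feature_permissions=FEATURE_PERMISSIONS):
    """Return a list of 0/1 flags, one per tracked permission.

    requested is an iterable of permission name strings for one APK.
    Permissions that are not in feature_permissions are ignored.
    """
    requested = set(requested)
    return [1 if p in requested else 0 for p in feature_permissions]


# An app that requests the camera and internet access:
vector = encode_permissions([
    "android.permission.CAMERA",
    "android.permission.INTERNET",
])
print(vector)  # [0, 1, 0, 0]
```

Keeping the feature order fixed means every APK produces a vector of the same length, which is what most classifiers expect as input.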

Another unexpected challenge was finding a reliable source of malicious APK files. The only source we were able to pull malicious files from was the GitHub repository mentioned earlier. As of right now, roughly 35% of our collected APK files are malicious and the rest are clean. We hope to find more malicious files, but for now we are working with the files we have already collected.

Another problem was that we had originally studied, and begun writing our code in, Python 3. However, the GitHub repositories we found used Python 2, so we had to convert all of our code to Python 2 to keep the different files compatible.

Visualization

We used the matplotlib Python library to make a bar graph (shown below) so we could visualize how the features correlated with the maliciousness of the application. From the bar graph produced by our Python code, we can infer that the percentage of non-malicious files requesting each permission was lower than the percentage of malicious files requesting it. This makes sense in theory, because malicious APK files want access to more of the user’s information in order to compromise their privacy than non-malicious APK files do.
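A minimal sketch of how such a comparison could be computed and plotted is given here. It is not our original plotting code: the function names, the shape of the samples list, and the chart styling are all assumptions made for illustration.

```python
def request_rates(samples, permissions):
    """Return {permission: (pct_clean, pct_malicious)}.

    samples is a list of (requested_permissions, is_malicious) pairs,
    where requested_permissions is an iterable of permission names.
    """
    clean = [set(p) for p, bad in samples if not bad]
    malicious = [set(p) for p, bad in samples if bad]

    def pct(group, permission):
        # Percentage of files in the group that request the permission.
        if not group:
            return 0.0
        hits = sum(1 for perms in group if permission in perms)
        return 100.0 * hits / len(group)

    return {perm: (pct(clean, perm), pct(malicious, perm))
            for perm in permissions}


def plot_request_rates(rates):
    """Draw a grouped bar chart from the output of request_rates."""
    # Imported here so request_rates can be used without matplotlib.
    import matplotlib.pyplot as plt

    labels = sorted(rates)
    positions = list(range(len(labels)))
    width = 0.4

    fig, ax = plt.subplots()
    ax.bar([x - width / 2 for x in positions],
           [rates[name][0] for name in labels], width,
           label="Non-malicious")
    ax.bar([x + width / 2 for x in positions],
           [rates[name][1] for name in labels], width,
           label="Malicious")
    ax.set_xticks(positions)
    ax.set_xticklabels(labels, rotation=45, ha="right")
    ax.set_ylabel("Files requesting permission (%)")
    ax.legend()
    fig.tight_layout()
    return fig
```

Calling plot_request_rates(request_rates(samples, permissions)) returns a matplotlib figure that can then be saved with fig.savefig or displayed with plt.show.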

To break down which permissions the malicious and non-malicious APK files asked for, we created a table.

Results

From the data in the table, we can see that 93.75% of the malicious APK files want to read the user’s phone state. The read phone state permission allows the app to read the user’s phone number and the device’s serial number. It can also detect when a call is active and the number being called. You can see why malicious applications would want this permission: it gives them access to your personal information.

Future Steps

Since our internship is coming to an end, we are trying to wrap up our project this week. However, we are looking into ways to extend this project if a later opportunity presents itself.

One way would be to experiment with different features, such as the providers, receivers, services, and activities that the app uses.

Another direction would be to use a framework such as Kivy to turn our program into an Android application. It could then be installed on a phone and work as an antivirus, using the machine learning techniques to scan the Android system and to detect and remove malware.