The main problem was that dex-oracle didn’t work “out of the box”. It took some “hacking” to make it work. Specifically, I modified an existing deobfuscation plugin to create two new plugins and also slightly modified the app. It’s really hard to make completely generalized deobfuscation tools, or any kind of advanced tool, so you’ll need to know how a tool works in order to modify it to suit your needs.
Here’s the SHA256:
|
|
I like to start with a decompilation just to get a high-level overview of the package structure. Here’s the class list:
Some class names have been ProGuard’ed (a, b, c, etc.) but some haven’t (Ceacabcbf). These unobfuscated classes are probably Android components (activity, service, broadcast receiver, etc.) which must be declared in the manifest. Thus, any tool which automatically renames them would also have to rename them in the manifest, which is hard. These may have been manually changed. The obfuscation is probably home-made and partially done by hand. This means it’s probably malicious, because a legit developer would probably pull a commercial obfuscator off the shelf and just use that. They wouldn’t waste time changing their class names to something indecipherable like Aeabffdccdac.
The code is obfuscated. Below is a class which shows the obfuscation:
You can’t see any strings or class names, which is really annoying. This looks like something Simplify can handle, but, spoilers, it fails miserably. That’s fine. I have many tricks up my sleeve. Let’s take a look at the Smali and see if anything jumps out.
The first type of obfuscation which jumped out at me was an “indexed string lookup” type obfuscation.
|
|
This pattern is found hundreds of times in the code. It takes a number, passes it to f.a(int)
, and gets a string back. This is some basic “level 1” style encryption. There’s probably a big method somewhere which builds an array of strings that the number indexes into.
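To make the pattern concrete, here’s a tiny Python model of this kind of lookup. The strings, constant, and function name below are hypothetical stand-ins for illustration, not the malware’s actual values — in the real sample the array is built at runtime and the lookup is f.a(int):

```python
# Hypothetical model of an "indexed string lookup" obfuscation.
# _STRINGS and _BIG_CONSTANT are invented for illustration only.
_STRINGS = ["http://c2.example.com", "android.permission.SEND_SMS"]
_BIG_CONSTANT = 0x320FB271

def lookup(obfuscated_index: int) -> str:
    # Call sites pass an opaque-looking constant; subtracting it from a
    # fixed value recovers the real index into the decrypted string array.
    return _STRINGS[_BIG_CONSTANT - obfuscated_index]
```

Call sites then read like lookup(0x320fb271) instead of a plain string literal, which is why nothing meaningful shows up in a decompilation.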
A second type of obfuscation hides class constants using an identical technique:
|
|
This code passes a number to g.c(int)
and gets back a class object (const-class
).
You may be thinking you’ll have to reverse engineer the lookup methods, and you’d be wrong. It’s cool and all to deep dive into the complex code and completely master it by writing a decryption routine. But honestly, fuck that. Speed is the name of the game, and I really don’t have time to fuck around with this malware author’s bullshit, home-made, amateur-hour obfuscation. Instead of reversing everything, consider that these “lookup” methods are both static. It should be possible to just execute them with the same inputs from the code to get back the decrypted output. For example, in the case of string decryption, I should be able to execute f.a(0x320fb26f) and get back the decrypted string.
The question is, of course, how do you execute just the target method code? It’s an APK. How can you execute just the method you want with the inputs you want? How do you harness the target methods? There are two paths you can go by:
As it happens, I’ve already created dex-oracle which does #2. I like #2 more than #1 because it doesn’t rely on decompilers, which often introduce subtle logic bugs. However, I’ve used #1 a few times in a pinch, so it’s worth mentioning. I went about adding support for this type of obfuscation to dex-oracle. The plugins were added in Add indexed string + class lookups.
The way dex-oracle works is pretty simple. It contains a collection of plugins which define regular expressions which pull out key bits of information – method calls and arguments. Then, it constructs real method calls with the arguments you pull out and passes them to a driver which executes the original DEX file on an emulator. Finally, the plugin defines how the driver output should be used to modify the method.
For example, the regular expression could look for “a const number, a call to a static method which takes a number and returns a string, and moves the result to a register”. Then, the driver executes that method with the number and returns the decrypted string. Finally, the original string lookup code is replaced with just the decrypted string. You can read more about how it works in the TetCon 2016 Android Deobfuscation Presentation.
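As a rough sketch of that match-execute-replace cycle (the real plugins are written in Ruby inside dex-oracle; the register names and smali layout below are simplified, though the class and method signature are from this sample):

```python
import re

# Find a const number fed into the static String-returning lookup,
# "execute" it via a supplied callback (dex-oracle's driver runs the real
# method on an emulator), and collapse the three instructions into one
# const-string holding the decrypted value.
PATTERN = re.compile(
    r'const v(\d+), (-?0x[0-9a-f]+)\s+'
    r'invoke-static \{v\1\}, '
    r'Lxjmurla/gqscntaej/bfdiays/f;->a\(I\)Ljava/lang/String;\s+'
    r'move-result-object v(\d+)'
)

def deobfuscate(smali: str, execute_lookup) -> str:
    def repl(m):
        decrypted = execute_lookup(int(m.group(2), 16))
        return 'const-string v%s, "%s"' % (m.group(3), decrypted)
    return PATTERN.sub(repl, smali)
```

Here execute_lookup stands in for the driver round-trip; in a test you can pass any function that maps the constant to a string.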
Unfortunately, even with the new plugins, dex-oracle fails. To keep things simple, I disable all plugins except IndexStringLookup and I only process the d
class from the example pictured above.
|
|
The Invalid date/time in zip entry
stuff is just noise. Maybe they tried obfuscating the timestamp in the ZIP? I dunno.
What concerns me is the Unsuccessful status: failure for Error executing 'static java.lang.String xjmurla.gqscntaej.bfdiays.f.a(int)' with 'I:839889519'
. The error tells me there’s a NullPointerException
when it executes f.a(int)
. Looks like every time it tried to call that method, it failed. So, let’s look at f.a(int)
.
|
|
The entire method is pretty small. Just subtracts the first argument from a big constant and uses that as an index into a string array, Lxjmurla/gqscntaej/bfdiays/f;->k:[Ljava/lang/String;
. Well, let’s look at how f;->k is initialized.
|
|
There’s only one sput-object
and it’s in xjmurla/gqscntaej/bfdiays/Ceacabcbf.smali
. By looking for this line in Ceacabcbf
, we find private Ceacabcbf;->a()V
. This is a big, long, complicated method which contains a HUGE string literal which is processed, chunked up, and stored in f;->k
. Hmm, our NullPointerException
is caused by this field not getting initialized. This means that Ceacabcbf;->a()V
is not getting called during execution of the string decryption method. Well, when is it called?
|
|
Ahh, it’s only called in Ceacabcbf
. Let’s find that.
|
|
It’s called in Ceacabcbf;->onCreate()V
. This class is a subclass of Application
. Without looking at the manifest, I’m pretty sure that when the app starts, this component is created, onCreate()V
is called, the decrypted string array is built, and most importantly f;->k
is initialized. Hmm, how can I make it so that dex-oracle calls this method when decrypting strings?
My first thought is to add a method call to Ceacabcbf;->a()V
in f;-><clinit>
. This ensures that when the string decryption class f
is loaded, it initializes the decrypted string array. BUT, a()V
is direct. WHAT TO DO?
Well, this is kind of dumb but it works sometimes. Just create a new public, static method called Ceacabcbf;->init_decrypt()V
and copy the code from Ceacabcbf;->a()V
. Then, add a line to call this method in f;-><clinit>
:
|
|
After making these changes, which hopefully work, I need to rebuild the DEX from the modified Smali and try dex-oracle on it.
|
|
No errors. Let’s see the decompilation.
|
|
Oh, hello there Mr. C&C domain! GET REKT BRO.
Ok, but that still leaves the class deobfuscation. That’s still annoying, right? Well, to keep this post short, dex-oracle fails when deobfuscating classes for the same reason it originally failed for strings. The same Ceacabcbf;->a()V
method needs to be called.
The same trick can be used – just call Ceacabcbf;->init_decrypt()V
in g;-><clinit>
. However, g
doesn’t have a <clinit>
so you’ll have to add one:
|
|
Now, rebuild and let dex-oracle do its thing:
|
|
Let’s see if the decompilation looks any different.
|
|
There’s not much difference for this method, but other methods have a lot more information, especially in the Smali where you can see lots of const-class
es. There’s still one call to g.c(int)
which isn’t deobfuscated. I found out that this is because the method call succeeds but returns null
. Maybe that’s why it’s in a try-catch? Maybe it’s trying to load a class which doesn’t exist on every Android API version?
One final test: run it against the entire DEX file.
|
|
It worked. Cool. Now there are lots of strings! This should also make it a lot easier for Simplify to work because there’s less code to execute and fewer places to fail.
Hopefully after reading this you have a better idea of how to bend dex-oracle to suit your needs. It’s pretty flexible and great when you can isolate the code you need to run to a single method. Sometimes you need to make changes to an Android app to help dex-oracle, but Smali is relatively easy to modify and a lot of malware doesn’t bother doing anti-tampering checks.
Recently I had far too much time on my hands and a Kext binary which seemed to pique my interest. After spending a bit of time analyzing the binary in IDA Pro, I wanted to prove out some theories I had by debugging it. A while back I had set up MacOS to be running as a QEMU/KVM machine - though I no longer had access to the hardware that I set this up on. The purpose of the previous use case was to have lots of instances up (fuzzing) as opposed to in depth debugging, and I had never actually wondered about debugging the kernel. Anyhoo - I decided to revisit setting up a virtualized instance of MacOS and decided to go the VMWare Fusion route. I had a license on the computer I had in front of me, wanted to continually do snapshots, and just assumed it would be easy to get it working locally. Well, I was sort of right?
The bulk of the VMWare fusion part was just following the knowledgebase article from VMWare - there really isn’t any magic to do there.
After getting the VM built and set up, all the sources I found online seem to point out you will need to disable SIP and get your host environment set up. Patrick Wardle documented this process quite well over on his blog, though it didn’t “just work” for me, and I kept being stumped as to why. Honestly, I still have no idea what the issue was, though I’ve been able to implement a workaround for the time being.
To summarize the steps from Patrick’s page, we need to do the following;
Disable SIP in the guest by running csrutil disable.
Enable Debugging in the Guest environment
After the VM reboots, open a terminal and change the boot-args
by doing the following;
|
|
Reboot the VM
This is the first step that didn’t work out quite the way I had hoped. According to most sources online, setting debug=0x141 should cause the system to prompt you with a Waiting for remote debugger connection. message while booting up. However, this never occurred for me. After Googling more and more, I couldn’t really find anyone who had mentioned this issue (which is the main motivation for writing this), so I pushed on until I found a better explanation of the boot args. According to the Apple Developer Documentation page, 0x141 = (DB_HALT | DB_ARP | DB_LOG_PI_SCREEN), so these are the correct flags for us to set. However, it would appear the DB_HALT option is non-functional at this point in time. If anyone knows the reasoning behind this, or if this is just a weird blunder on my part, feel free to comment here or shoot me a message. I cannot seem to find any real reasoning behind it no longer working.
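For reference, here’s how the flag arithmetic works out. The bit values below are taken from XNU’s open-source osfmk/kern/debug.h headers (double-check them against your kernel version):

```python
# debug= boot-arg flag bits (values from XNU's osfmk/kern/debug.h)
DB_HALT          = 0x001  # halt at boot and wait for a debugger to attach
DB_NMI           = 0x004  # enter the debugger on a non-maskable interrupt
DB_ARP           = 0x040  # allow the kernel debugger to ARP for the host
DB_LOG_PI_SCREEN = 0x100  # log panic info to the screen

print(hex(DB_HALT | DB_ARP | DB_LOG_PI_SCREEN))  # the documented 0x141
print(hex(DB_NMI | DB_ARP | DB_LOG_PI_SCREEN))   # the NMI-based variant, 0x144
```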
The workaround for this, which I assume everyone doing kernel debugging is using at this point, is to use the DB_NMI
flag, so the command we run to properly set up the boot-args
will be;
Then reboot the machine.
This allows us to have the debugger listen for Non-Maskable Interrupts (NMIs), which we can trigger at any time. These can be created by pressing Esc + Control + Option + Command at the same time - if on a laptop where you have turned on the “Use function keys as function keys” option, you’ll need to hold the fn key as well. This will overlay text on the top left of your screen indicating the IP address to connect to.
Start lldb on the host machine and point it at the kernel you just downloaded
|
|
Now, if you have your guest properly set up and waiting for the debugger, you could now attach lldb
directly to the ip address.
Voila! Well, sort of? It did work for a short time, approximately ~60 seconds or so. The debugger appeared to attach fine, and breakpoints could be set and hit. Though after the first minute or so, it seemed the remote connection would continuously drop. Neither lldb nor the guest environment would notice this or complain - every command would just seemingly either silently fail or error out for unknown Python reasons.
At this point I was getting a bit frustrated. I had to have done something wrong: The entire set up got trashed and I started again, checking every step to ensure I was doing it correctly. Though the resulting set up seemed to always have the same outcome - 60 or so seconds of debug time and then a reboot would be required to connect again. This clearly wasn’t a workable option. I blindly started tweeting some rage about how silly debugging kernel code on MacOS seemed to be, no documentation I could find correctly explained getting it working, and seemingly no one had ever run into this problem. Magically, complaining on twitter did something and a friend I met at Hoodsec mentioned something along the lines of “lldb kdb over udp is often laggy and not stable, use gdb”. Without attempting to start an emacs vs vim
style fight, I immediately loved the idea since I prefer gdb
over lldb
anyway - it just seems to be a comfort zone for me. Off to Google for more about using gdb to debug kexts, where I came across Snare’s post on the matter.
Not only is this post simple to understand, it is essentially the exact setup I was using. Turns out that VMWare made it pretty easy for us, since they have a debugStub
which can be enabled on any VM. Open up the VM config file - for me it was in ~/VMs/OSX10_11_5.vmwarevm/OSX10_11_5.vmx - and add the following lines at the bottom (while the VM is not running).
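Based on Snare’s write-up, the debugStub options for a 64-bit guest look roughly like the following. Treat the exact keys and port number as assumptions to adjust for your own setup (the port here matches the gdb-remote command mentioned at the end):

```
debugStub.listen.guest64 = "TRUE"
debugStub.listen.guest64.remote = "TRUE"
debugStub.port.guest64 = "8864"
debugStub.hideBreakpoints = "TRUE"
```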
This seems like it will work great, except Apple no longer ships gdb, nor any macros to assist debugging with gdb. Luckily someone has done all the work for us, thanks OSXreverser! Pedro wrote a great article a few years back about compiling gdb
which can be found on his blog. After that, go snag the repo gdbinit/kgmacros which contains the older macros which /mostly/ work for newer kernels. If you didn’t already have the .gdbinit
script from Pedro, you should also get and install that. After getting all this preparation work done, fire the VM back up and prepare gdb before connection. Target the kernel the guest machine is using, add the symbols for it and then load the helper macros and connect to the guest.
Awesome! Now we have a fully functional MacOS guest and a host connected with a debugger. Haven’t had any issues with disconnects yet while using gdb
. It also might be worth noting that many people have said you can also connect lldb
to this debugStub using its gdb-remote command: (lldb) gdb-remote localhost:8864.
Afterthoughts - something very wrong might be lurking in my set up and may have been causing the udp issues with the kernel debugger, especially since I can’t really find anyone else discussing this problem. I was also loaded on pain medication due to a motorcycle accident, so it is extremely likely that I misread something or came up with my solutions in backwards ways. Regardless, this seems to have worked. Discussing this on twitter and slack with a few people, it seems like many others rely on the VMWare debugStub - though @i0n1c disagrees with me and said there must be something wrong with my setup. He is probably correct. If I end up solving the underlying issue, I will post the solution here. This blog was primarily just to serve as a culmination of all the random things I ended up trying to get this to work so I don’t have to go through the pain again. Hopefully someone else finds this useful!
Special thanks to @tamakikusu, @OngEmil and all of @RedNagaSec for your assistance both in knowledge, editing and insults to keep the world humble.
The past few years have been interesting in terms of surveillance and nation state purchased malware. Gamma Team (FinFisher) got owned, followed by Hacking Team having all the source code for their implants being posted on GitHub. Just this year, Hacking Team lost their global license to sell spyware. I’m unsure how this really would affect their business. The linked article explains the situation better than I ever could. To quote the article, it means:
Hacking Team will have to apply for an individual [export] license for each country. It will then be up to the Italian authorities to approve or deny any requests.
Maybe someone can shed light on what this actually means? Does that mean that a license must be acquired for the country in which the implant is being deployed, or does it mean the license must exist for the country in which the buying entity exists? Regardless, it would seem that Hacking Team has recently had their global license reinstated. So, in theory none of this matters… Or does it?
The export licenses Hacking Team requires aren’t easy to look up, and victims of their implants aren’t coming forward publicly. Do they even know they’re infected? Do they just want to avoid publicly saying they got owned? It’s anyone’s guess.
In this post, we’ll describe what we believe to be active Hacking Team Android implants. We’ll also provide evidence that these implants were being actively developed such as the number of different versions and the incremental advances and changes between them. We hope that this analysis will be helpful to those who might come across it in the wild and that it’ll provide a starting point for the researcher community to piece together the full story of where these implants are being deployed or if Hacking Team’s export licenses are being abused.
Worst case scenario? This’ll be an interesting blog about some spyware that wasn’t too hard to reverse and it ends up being a bit more expensive to operate since all the AVs will detect it in a week or two.
TL;DR Don’t sell spyware even if it’s “regulated”. If you do, make it more fun to reverse next time please. Enough soapboxing, let’s start this post!
-diff
Caleb and I were recently contacted by someone claiming to have an “advanced malware” sample which had been deployed against one of their coworkers. This type of claim comes up more than you would think. Usually it’s just a very paranoid person who doesn’t know how to use Occam’s razor and has a computer glitch or mysterious reboot and they assume someone must be attacking them. We were understandably skeptical of the claim, so we followed up with a barrage of questions. Interestingly, the more answers we got back, the more it seemed we were dealing with a legitimate threat. At first, our contact thought it was FinFisher because they had looked at this malware family in the past and they looked similar.
Unlike a paranoid delusion, this claim was backed up by actual files for us to analyze! While we cannot release these files due to an agreement with our contact and an ongoing criminal investigation, we have been able to find several similar files in the wild through other public feeds which closely resemble the sample we were provided. The functionality hardly changes between versions and the obfuscation is the same. Since these other samples are already publicly available, we feel comfortable talking about this threat. While I often bash companies for pushing PR and marketing content without sharing binaries, I feel that this is different. I can’t share the specific sample we were given, but I do provide nearly identical samples and analysis of the techniques of the original sample. This will easily allow other researchers to reproduce the results, formulate their own blog posts, and most importantly, protect themselves and their customers. Also, since I don’t work for any anti-virus company, I’m not trying to push my product over anyone else’s right now! Hooray somewhat moral high ground! With all this in mind, the analysis is a little tailored since it was done twice: the visible part here was done on a new binary found in the wild and already available on VT, but I’ll be taking the same approach I took on the original binary.
First, let’s look at what’s inside the APK.
|
|
Nothing sticking out here. No native binaries to dig into. No hidden packages. No large high-entropy files without an extension floating around. A very vanilla looking Android application.
Now, let’s get some information about the signing certificate.
|
|
Nothing super interesting in the certificate except the common names (CN) are both ...
whereas a legitimate developer would use their name, the company’s name, or pretty much anything other than an ellipsis. Since every APK must be signed to be installed and most malware authors are lazy, they tend to use the same certificates between versions and even across malware families. You can search for other apps signed by the same certificate with Koodous.
These results show three other applications with the same certificate. This means these apps were likely created by the same person unless their private key was leaked. Sadly, none of them seem to have been analyzed much or voted on by anyone. If we look at the hashes of these files on VirusTotal, we also don’t see anyone talking about them and weak detection ratios which would indicate no one seems to know their significance.
When we dig into the Android Manifest, we see standard malware / spyware behavior: ask for absolutely every permission:
|
|
There are a few tidbits from the manifest which strike me as interesting right away, other than the inordinate amount of permissions being requested. First, the package name it.phonevoda.androidv1
seems interesting as many legitimate apps start with the default com.
prefix. Honestly, this could be nothing, or it could be attempting to look like something to do with Vodafone Italy. I’ve never personally seen anything from an Italian-specific phone, though, and the structure doesn’t ring any bells.
It’s also interesting to note that the class paths of the activities, services, and receivers do not match up with the package name. For example, there is a service with the namespace com.google.android.MainService
which sounds like it’s trying too hard to look like an official package Android package. Another service has the namespace com.package._p
and is simply labeled a System Service
. The MainActivity is com.google.android.system.MainActivity
but is also labeled Aggiornamento Android
which is Italian for Android Update
. Sounds legit.
To sum it up, we have an app requesting almost every permission possible, claiming to be an Android update, and purporting to have something to do with Vodafone APNs. These all seem… normal, right? Yea, not really…
Throwing the DEX file into IDA Pro and looking at a MainService.onCreate()
, we immediately see something somewhat interesting;
This clearly shows an encrypted / obfuscated string. Looking at the Strings tab, we see many more obfuscated strings.
As we back out to onCreate()
, we can see that the string decryption method is likely String com.google.gson.JsonNull.startsWith(String, int)
. Oh, that is cute. They’re attempting to hide their method signatures in plain sight by giving them legitimate looking names. Maybe this is to avoid “easy” signatures since a signature on this method name may false positive? Or maybe this is just a simple attempt to make a reverser’s life a bit harder.
The decryption method itself is actually quite easy to reverse:
It’s just a modified XOR cipher with a modifier being passed in as an argument. Translated to Python:
|
|
I wanted to dump all of decrypted strings to a file and also inline them as comments where they were being used. The decryptor.py
IDA plugin below works by looking in the Dalvik code for the opcodes const-string
paired with const/16
to get the encrypted string and XOR cipher mod argument. Then, it looks for the invoke-static
opcode with the method JsonNull.startsWith()
. If this pattern is matched, we can pass the arguments into our reversed decryption method to get the decrypted string. Finally, this string is added as a comment near the encrypted string. The process reuses some of the code for adding load strings for Go files described in a previous blog post.
It turned out there was more than just the JsonNull.startsWith()
decryption method. I saw the literal values of -0x5
and -0xB
change between the decryption methods. To support these other methods, I moved these out of the code and into the mod1
and mod2
arguments.
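Since the reversed Java can’t be shared here, the Python below is only a sketch of the cipher’s shape, not the sample’s real key schedule. The index/modifier arithmetic is invented; what it does show is how an XOR cipher parameterized by two per-method modifiers round-trips, which is the property the decryptor relies on:

```python
# Illustrative stand-in for the reversed JsonNull.startsWith() routine.
# Only the XOR structure (and therefore the encrypt/decrypt symmetry)
# mirrors the real cipher; the key derivation below is made up.
def decrypt(enc: str, mod1: int, mod2: int) -> str:
    out = []
    for i, c in enumerate(enc):
        # XOR each character with a key byte derived from its position
        # and the two per-method modifier literals (e.g. -0x5, -0xB)
        out.append(chr(ord(c) ^ ((i + mod1) & 0xFF) ^ (mod2 & 0xFF)))
    return "".join(out)

# XOR is its own inverse, so the same routine encrypts
encrypt = decrypt
```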
Please note that the way this code loads strings from the string table is annoying. After messaging IDA support about why it was so difficult, they informed me that there was a better way. Apparently, I should have used the DecodeInstruction
function. I’ll likely try to rework this code later to use this.
After a bit of movie magic, we end up with the code below.
|
|
After running this, we can see that we have comments for all the decrypted strings. Awesome!
After creating this code, Caleb also informed me that Simplify would also have worked. So many different ways to skin a cat!
After decrypting the strings, the rest of the behavior is easy to follow. The class names and most of the interesting method names are not obfuscated. We can see that this implant has the normal abilities of most spyware:
- Start MainService and set an alarm to keep it persistent
- Respond to SMS messages containing the code 873451679TRW68IO and reply or forward messages with device information

When data is exfiltrated, it’s serialized in an encrypted form to %SDCARD%/Android/data/__android.data. Naturally, I wanted to know exactly what data were being exfiltrated and what the C2s were, so I started to dig into the app’s decrypted strings. Because all the strings had previously been dumped to one file, it was easy to look for domains, IP addresses, or just http: or https:.
|
|
I’ve snipped the output here for brevity. The full output is in the Appendix for easy indexing by search engines. Some of the strings are unique across binaries and this may help people in the future.
While skimming the strings, it’s immediately interesting that there appear to be Italian phrases such as Servizi Google (Google Services) and Aggiornamento effettuato con successo (Update completed successfully). These strings are actually shown to the user and must be part of the app’s cover.
Looking at the strings shows two servers to dig into:
These are interesting as they are not using domains which could mean a few things. One is that these people are lazy. Another possibility is that they’re purposefully avoiding DNS to avoid getting detected by anyone smart enough to use passive DNS searching. Everyone in the information security space knows DNS is the (old) new hotness and maybe they realize this. Without telling a friend what exactly I was investigating, I shared these IP addresses. They plugged them into whatever information feeds they had and something popped up. Oh hey, these appear to be previously used HackingTeam C2s!
|
|
Reference: https://github.com/passivetotal/HT_infra/blob/master/68.233.232.104.passivetotal.pdns
|
|
Reference: https://github.com/passivetotal/HT_infra/blob/master/68.233.232.147.passivetotal.pdns
Granted, this could be a coincidence or a false flag of sorts and it’s hard to say for sure. But the 68.233.237.11
IP address is using an Italian SSL certificate which can be used to find other connections in passive datasets. I’ll just leave this here:
|
|
The last thing I wanted to do was to understand what the traffic actually looked like even though it’s going through SSL/TLS. My original thought was that performing a man-in-the-middle would require getting a device and installing a certificate or bypassing certificate pinning to allow Burp to intercept traffic. This turned out not to be the case. In fact, there is a “vulnerability” in this implant or maybe they are just lazy. If we dig into their custom SSL handling code (which is seemingly labeled as normal Android code) in com.google.android.common.HttpUtils.allowAllSSL()
we see rather boilerplate code for disabling SSL certificate checking.
Wait. What?
Why are they transporting information over SSL but explicitly not checking certificates? Here is the hand-decompiled pseudo-Java for allowAllSSL()
:
|
|
The code is similar to the StackOverflow answers for accessing untrusted certificates over SSL/HTTPS connections. No big deal. It’s not like this type of implant would ever be deployed to gather sensitive information, right? (hint: sarcasm) Technically this code does allow them to trust a self-signed or previously untrusted certificate, though it also lets the application accept any certificate at all. This means we don’t have to do anything special for man-in-the-middling; just literally be in the middle. So just fire up Burp, or whatever, start an interceptor, grab a .pcap, and look at the heartbeats going to the server:
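For comparison, here’s the Python equivalent of that trust-all pattern. A context configured this way will happily complete a TLS handshake with any certificate, forged or not, which is exactly why interception is trivial:

```python
import ssl

# Python equivalent of the implant's allowAllSSL(): disable both hostname
# verification and certificate-chain validation, so ANY certificate -
# including one minted on the fly by an intercepting proxy - is accepted.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
```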
From this point, it’s relatively easy to watch the traffic. There really isn’t much going on outside of the run-of-the-mill, boring, commercial spyware junk. The secret sauce is likely found after talking to the C2 server and getting the extra payloads. This would appear to be where exploits are being delivered, however it would seem these are set up and configured on the back end. Sadly I was unable to coerce the back end to give me anything worth analyzing. Since the pcap would all be encrypted, captured POSTs to and from the server have been added to the Appendix.
Honestly? I don’t have definitive proof though there is a decent amount of circumstantial evidence:
These could all be false flags, as I’ve stated before, so take it as you will. I did try to find a contact at HackingTeam. However they didn’t seem to want to reply to me – neither for confirmation that their implant is being used in the wild nor about the vulnerability in their code.
This implant has been floating around and can easily be downloaded for researchers but I don’t believe anyone has publicly spoken about these, which is why I’ve written this. My gut tells me if any AV companies had found this, they’d be foaming at their mouths to publish something for the PR value. Based on the VirusTotal detections of these samples, some people are (blindly?) flagging these files. So again, either they don’t know what they have, or maybe they don’t care to talk about it. Hopefully this brings some attention to it and boosts the detection on these implants and also aids researchers looking to understand these threats.
Special thanks to @ACKFlags, Caleb Fenton at SentinelOne, @jsoo and all of @RedNagaSec for your assistance on this one :D
|
|
Analyzed in this post:
|
|
Similar samples:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
To illustrate some of my examples I’m going to use an extremely simple ‘Hello, World!’ example and also reference the Rex malware. The code and Makefile are extremely simple;
|
|
|
|
Since I’m working on an OSX machine, the above GOOS
and GOARCH
variables are explicitly needed to cross-compile this correctly. The first line also adds the ldflags
option to strip the binary. This way we can analyze the same executable both stripped and without being stripped. Copy these files, run make
and then open up the files in your disassembler of choice, for this blog I’m going to use IDA Pro. If we open up the unstripped binary in IDA Pro we can notice a few quick things;
Well then - our 5 lines of code have turned into 2058 functions. With all the overhead of what appears to be a runtime, we also have nothing interesting in the main()
function. If we dig in a bit further we can see that the actual code we’re interested in is inside of main_main
;
This is, well, lots of code that I honestly don’t want to look at. The string loading also looks a bit weird - though IDA seems to have done a good job identifying the necessary bits. We can easily see that the string load is actually a set of three mov
s;
|
|
This isn’t exactly revolutionary, though I can’t off the top of my head say that I’ve seen something like this before. We’re also taking note of it as this will come in handy later on. The other tidbit of code which caught my eye was the runtime_morestack_context
call;
|
|
This style of code block appears to always be at the end of a function, and it also seems to always loop back up to the top of the same function. This is verified by looking at the cross-references to the function. Ok, now that we know IDA Pro can handle unstripped binaries, let’s load the same code again, but the stripped version this time.
Immediately we see some, well, let’s just call them “differences”. We have 1329 functions defined and now see some undefined code by looking at the navigator toolbar. Luckily IDA has still been able to find the string load we are looking for, however this function now seems much less friendly to deal with.
We now have no more function names; however, the function names appear to be retained in a specific section of the binary if we do a string search for `main.main` (which was represented as `main_main` in the previous screenshots due to how a `.` is interpreted by IDA);
|
|
Alright, so it would appear that there is something left over here. After digging into some of the Google results for `gopclntab` and tweeting about this, a friendly reverser, George (Egor?) Zaytsev, showed me his IDA Pro scripts for renaming functions and adding type information. After skimming these it was pretty easy to figure out the format of this section, so I threw together some functionality to replicate his script. The essential code is shown below. Very simply put, we look into the segment `.gopclntab` and skip the first 8 bytes. We then create a pointer (`Qword` or `Dword` depending on whether the binary is 64bit or not). The first piece of data actually gives us the size of the `.gopclntab` table, so we know how far to go into this structure. Now we can start processing the rest of the data, which appears to be the `function_offset` followed by the (function) `name_offset`. As we create pointers to these offsets and also tell IDA to create the strings, we just need to ensure we don’t pass `MakeString` any bad characters, so we use the `clean_function_name` function to strip out any badness.
|
|
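As a rough, self-contained sketch of the process just described, here is plain Python operating on raw section bytes rather than through IDA’s API. The exact table layout and the character filter are assumptions reconstructed from the description above, not the original script:

```python
import struct

def clean_function_name(name):
    """Replace characters IDA's MakeString/MakeName would choke on.

    The exact filter is a guess; the post only says it strips out
    'any badness'.
    """
    allowed = set("0123456789abcdefghijklmnopqrstuvwxyz"
                  "ABCDEFGHIJKLMNOPQRSTUVWXYZ_$.")
    return "".join(c if c in allowed else "_" for c in name)

def parse_gopclntab(data, is_64bit=True):
    """Parse a .gopclntab-style blob into (func_offset, name) pairs.

    Layout per the post's description: an 8-byte header, then a
    pointer-sized value bounding the table (treated here as an entry
    count for simplicity), then pairs of pointer-sized
    (function_offset, name_offset) values, where name_offset points
    at a NUL-terminated name inside the same blob.
    """
    ptr = 8 if is_64bit else 4
    fmt = "<Q" if is_64bit else "<I"
    pos = 8                                   # skip the 8-byte header
    count = struct.unpack_from(fmt, data, pos)[0]
    pos += ptr
    out = []
    for _ in range(count):
        func_off = struct.unpack_from(fmt, data, pos)[0]
        name_off = struct.unpack_from(fmt, data, pos + ptr)[0]
        pos += 2 * ptr
        end = data.index(b"\x00", name_off)   # NUL-terminated name
        name = data[name_off:end].decode("ascii", "replace")
        out.append((func_off, clean_function_name(name)))
    return out
```

In the real script the offsets would be fed to `MakePtr`/`MakeString`-style IDA calls rather than returned as a list.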
The above code won’t actually run yet (don’t worry, the full code is available in this repo) but it is hopefully simple enough to read through and understand the process. However, this still doesn’t solve the problem that IDA Pro doesn’t know about all the functions, so this is going to create pointers which aren’t referenced anywhere. We do now know the beginning of each function, however I ended up seeing (what I think is) an easier way to define all the functions in the application: utilizing the `runtime_morestack_noctxt` function. Since every function utilizes this (basically; there is an edge case, it turns out), if we find this function and traverse backwards through the cross-references to it, then we will know where every function exists. So what, right? We already know where every function starts from the segment we just parsed above, right? Ah, well, now we also know the end of each function: the next instruction after the call to `runtime_morestack_noctxt` gives us a jump back to the top of the function. This means we should quickly be able to give the start and end bounds of a function, which is required by IDA, while separating this from the parsing of the function names. If we open up the window for cross-references to `runtime_morestack_noctxt` we see there are many more undefined sections calling into it: 1774 things in total reference this function, up from the 1329 functions IDA has already defined for us, as highlighted by the image below;
After digging into multiple binaries we can see that `runtime_morestack_noctxt` will always call into `runtime_morestack` (with context). This is the edge case I was referencing before, so between these two functions we should be able to see cross-references to every other function used in the binary. Looking at the larger of the two functions, `runtime_morestack`, across multiple binaries reveals an interesting layout;
The part which stuck out to me was `mov large dword ptr ds:1003h, 0`; this appeared to be rather constant in all 64bit binaries I saw. After cross compiling a few more I noticed that 32bit binaries used `mov qword ptr ds:1003h, 0`, so we will be hunting for these patterns to create a “hook” for traversing backwards from. Lucky for us, I haven’t seen an instance where IDA Pro fails to define this specific function, so we don’t really need to spend much brain power mapping it out or defining it ourselves. So, enough talk, let’s write some code to find this function;
|
|
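A minimal sketch of that hunt, written as plain Python over (address, disassembly-text) pairs instead of IDA API calls. The marker strings are the two quoted above; whitespace is normalized since IDA pads mnemonics with spaces:

```python
def find_morestack(disasm_lines):
    """Locate runtime_morestack by its telltale instruction.

    disasm_lines is a list of (address, instruction_text) tuples,
    e.g. pulled out of IDA line by line. This is a heuristic based
    on the patterns observed in the post, not a guarantee.
    """
    markers = {
        "mov large dword ptr ds:1003h, 0",  # observed in 64bit builds
        "mov qword ptr ds:1003h, 0",        # observed in 32bit builds
    }
    for addr, text in disasm_lines:
        # collapse IDA's column padding before comparing
        if " ".join(text.split()) in markers:
            return addr
    return None
```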
After finding the function, we can recursively traverse backwards through all the function calls; anything which is not inside an already defined function we can now define. This is because the structure always appears to be;
|
|
The above snippet is a random undefined function I pulled from the stripped example application we compiled earlier. Essentially, by traversing backwards into every undefined function, we will land at something like line `0x0808994B`, which is the `call runtime_morestack`. From here we skip to the next instruction and ensure it is a jump to above where we currently are; if this is true, we can likely assume it is the start of a function. In this example (and almost every test case I’ve run) this is true. Jumping to `0x08089910` is the start of the function, so now we have the two parameters required by the `MakeFunction` function;
|
|
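The core of that bounds logic can be sketched in plain Python over a toy instruction map (addresses mapping to (mnemonic, operand) pairs). This mirrors the walk described above, not the exact original code:

```python
def find_function_bounds(instrs, xref_addrs):
    """Derive (start, end) function bounds from morestack call sites.

    instrs maps address -> (mnemonic, operand); xref_addrs are the
    addresses of the 'call runtime_morestack_noctxt' instructions
    found via cross-references. After each call we expect a jmp whose
    target is above the call: that target is the function start, and
    the jmp itself closes the function.
    """
    addrs = sorted(instrs)
    bounds = []
    for call_addr in xref_addrs:
        nxt = addrs[addrs.index(call_addr) + 1]   # insn after the call
        mnem, target = instrs[nxt]
        if mnem == "jmp" and target < call_addr:  # jumps back to the top
            bounds.append((target, nxt))
    return bounds
```

Each (start, end) pair is exactly what an IDA `MakeFunction(start, end)` call wants.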
That code bit is a bit lengthy, though hopefully the comments and the concept are clear enough. It likely isn’t necessary to explicitly traverse backwards recursively; however, I wrote this prior to understanding that `runtime_morestack_noctxt` (the edge case) is the only edge case I would encounter. This was originally handled by the `is_simple_wrapper` function. Regardless, running this style of code ended up finding all the extra functions IDA Pro was missing. We can see below that this creates a much cleaner and easier experience for reversing;
This can allow us to use something like Diaphora as well, since we can specifically target functions with the same names, if we care to. I’ve personally found this extremely useful for malware or other targets where you really don’t care about any of the framework/runtime functions; you can quite easily differentiate the custom code written for the binary, for example in the Linux malware “Rex” everything custom begins with that namespace! Now onto the last challenge I wanted to solve while reversing the malware: string loading! I’m honestly not 100% sure how IDA detects most string loads, potentially through idioms of some sort? Or maybe because it can detect strings based on the `\00` character at the end? Regardless, Go seems to use a string table of some sort, without requiring null terminators. The strings appear to be in alphanumeric order, grouped by string length as well. This means we see them all there, but often don’t come across them correctly asserted as strings, or we see them asserted as extremely large blobs of strings. The hello world example isn’t good at illustrating this, so I’ll pull open the `main.main` function of the Rex malware to show this;
I didn’t want to add comments to everything, so I only commented the first few lines and then pointed arrows to where there should be pointers to a proper string. We can see a few different use cases, and sometimes the destination registers change. However, there is definitely a pattern we can look for: a move of a pointer into a register, that register then being stored into a (d)word pointer, followed by a load of the length of the string. Cobbling together some Python to hunt for the pattern, we end up with something like the pseudo code below;
|
|
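A simplified, self-contained version of that pattern matcher might look like the following. Instructions are modeled as (mnemonic, destination, source) tuples; the operand encoding is an illustration of the idiom, not IDA’s actual API:

```python
def find_string_loads(instrs):
    """Hunt the three-instruction Go string-load idiom.

    Per the pattern described above: a string-table pointer moves
    into a register, that register is stored to the stack, then an
    immediate string length is stored next to it. Returns
    (pointer, length) pairs.
    """
    hits = []
    for i in range(len(instrs) - 2):
        m1, d1, s1 = instrs[i]       # mov reg, offset string_table
        m2, d2, s2 = instrs[i + 1]   # mov [esp+x], reg
        m3, d3, s3 = instrs[i + 2]   # mov [esp+x+4], length
        if (m1 == m2 == m3 == "mov"
                and isinstance(s1, int)        # pointer immediate
                and s2 == d1                   # same register stored
                and d2.startswith("[esp")
                and d3.startswith("[esp")
                and isinstance(s3, int)):      # length immediate
            hits.append((s1, s3))
    return hits
```

The real script walks IDA’s instruction stream instead of a list, but the shape of the check is the same.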
The above code could likely be optimized, however it was working for me on the samples I needed. All that’s left is to create another function which hunts through all the defined code segments looking for string loads. Then we can use the pointer to the string and the string length to define a new string using `MakeStr`. In the code I ended up using, you need to ensure that IDA Pro hasn’t mistakenly created the string already, as it sometimes tries to, incorrectly. This seems to happen when a string in the table contains a null character. However, after using the code above, this is what we are left with;
This is a much better piece of code to work with. After we throw together all these functions, we now have the golang_loader_assist.py module for IDA Pro. A word of warning though: I have only had time to test this on a few versions of IDA Pro for OSX, with the majority of testing on 6.95. There are also very likely optimizations which should be made, or at a bare minimum some reworking of the code. With all that said, I wanted to open source this so others could use it and hopefully contribute back. Also be aware that this script can be painfully slow depending on how large the `idb` file is; working on OSX El Capitan (10.11.6) with a 2.2 GHz Intel Core i7 on IDA Pro 6.95, the string discovery aspect itself can take a while. I’ve often found that running the different methods separately can prevent IDA from locking up. Hopefully this blog and the code prove useful to someone though, enjoy!
Why is this important? Any app which has had malware injected into it, or has been cracked or pirated, will probably have been disassembled and recompiled by dexlib. Also, there are very few reasons why a developer with access to the source code would use dexlib. Therefore, if you know an app has been modified by dexlib, it’s probably interesting to you if you’re worried about malware or app piracy. This is where APKiD comes in. In addition to detecting packers, obfuscators, and other weird stuff, it can also identify if an app was compiled by the standard Android compilers or by dexlib.
APKiD can look at an Android APK or DEX file and detect the fingerprints of several different compilers:
If any of the dexlib families have been used to create a DEX file, you can be fairly suspicious it has been cracked and it may have been injected with malware. For more info on how we used compiler fingerprinting to detect malware and cracks, check out our talk Android Compiler Fingerprinting.
The main way dx and dexmerge are identified is by looking at the ordering of the map types in the DEX file. This is a good place to identify different compilers because the order is not defined in the spec, so it’s up to the compiler how it wants to order these things.
In order to have something that’s copy / paste-able, here’s some Java code for the normal type order:
|
|
The dexmerge type order was derived by looking at DexMerger.java. I got the typeIds order by looking here.
|
|
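To make the idea concrete in executable form, here is a Python sketch of matching an observed map type order against compiler profiles. The type codes come from the published DEX format spec; the two profile orderings shown are illustrative placeholders, not APKiD’s actual rule data:

```python
# DEX map item type codes, from the published DEX format spec.
TYPE_HEADER_ITEM, TYPE_STRING_ID_ITEM = 0x0000, 0x0001
TYPE_TYPE_ID_ITEM, TYPE_PROTO_ID_ITEM = 0x0002, 0x0003
TYPE_FIELD_ID_ITEM, TYPE_METHOD_ID_ITEM = 0x0004, 0x0005
TYPE_CLASS_DEF_ITEM, TYPE_MAP_LIST = 0x0006, 0x1000

# Hypothetical profiles for illustration only; the real dx/dexlib
# orderings are longer and live in APKiD's rules.
PROFILES = {
    "dx": [TYPE_HEADER_ITEM, TYPE_STRING_ID_ITEM, TYPE_TYPE_ID_ITEM,
           TYPE_PROTO_ID_ITEM, TYPE_FIELD_ID_ITEM, TYPE_METHOD_ID_ITEM,
           TYPE_CLASS_DEF_ITEM, TYPE_MAP_LIST],
    "dexlib": [TYPE_HEADER_ITEM, TYPE_STRING_ID_ITEM, TYPE_TYPE_ID_ITEM,
               TYPE_PROTO_ID_ITEM, TYPE_FIELD_ID_ITEM,
               TYPE_METHOD_ID_ITEM, TYPE_MAP_LIST, TYPE_CLASS_DEF_ITEM],
}

def guess_compiler(map_type_order):
    """Match an observed map_list type order against known profiles."""
    for name, profile in PROFILES.items():
        if map_type_order == profile:
            return name
    return "unknown"
```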
In general, the format of a DEX file and the items inside are like this:
|
|
This list may be handy for ongoing research into fingerprinting different compilers.
This is the first library that allowed disassembling and compiling of DEX files without the source code. It was created by Ben “Jesus Freke” Gruver. It’s detected primarily by looking at the physical sorting of the strings.
The DEX format requires that the string table, which lists all the strings and their offsets into the file, must be sorted alphabetically, but the actual physical ordering of the strings in the file is not necessarily sorted. So while dx physically sorts strings alphabetically, even though it doesn’t have to, dexlib seems to sort them physically based on when they’re encountered during compilation.
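A sketch of that check in Python: given (file_offset, string) pairs from the string table, ask whether the physical order matches the alphabetical order (dx-like) or not (dexlib-like):

```python
def physically_sorted_alphabetically(string_table):
    """Check whether strings appear in the file in alphabetical order.

    string_table is a list of (file_offset, string) pairs taken from
    a DEX string_ids section. dx writes the string data out
    alphabetically, so file order and alphabetical order agree;
    dexlib writes strings in encounter order, so they usually don't.
    """
    by_offset = [s for _, s in sorted(string_table)]
    return by_offset == sorted(by_offset)
```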
A lot of commercial packers and obfuscators and certain malware families still use dexlib 1.x under the hood because it’s pretty solid and they’re too lazy to update.
Dexlib 1.x was rewritten into dexlib 2, and while it was in a beta release, we noticed that it did something weird with how it marked class interfaces.
You can see `AC 27 00 00` all over the file. That’s the offset to the “null” interface for classes which don’t implement any interface. It’s a good example of how flexible the DEX format is, because I would have figured this wouldn’t run at all, but it does. The dx compiler just uses `00`s to indicate that there’s no interface.
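A quick way to spot this tell is simply counting how often one non-zero `interfaces_off` value recurs in the file; a sketch, assuming the `0x27AC` offset from the sample above:

```python
import struct

def count_null_interface_refs(dex_bytes, offset=0x27AC):
    """Count occurrences of a given interfaces_off value in a DEX.

    dexlib 2 beta pointed every interface-less class at one shared
    'null' type_list (offset 0x27AC shows up as bytes AC 27 00 00 in
    the sample discussed), while dx writes interfaces_off as zero.
    A high count of one repeated non-zero offset is the tell. DEX
    offsets are little-endian 32-bit, hence the '<I' packing.
    """
    needle = struct.pack("<I", offset)
    return dex_bytes.count(needle)
```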
This was removed before dexlib 2.x was moved out of beta.
This compiler is also detected by looking at the map type order. Assembling a DEX file is complex, and there are a lot of tiny details you need to mimic to create an absolutely perfect facsimile. That’s a lot of extra work most developers don’t want to do.
As an aside, I spent a lot of time using this library and looking at its code while working on a generic Android deobfuscator called Simplify. And I’ve got to say, it’s some really impressive and clean code that I’ve learned a lot from. Kudos to Ben.
The usage of APKiD is quite simple. You just point it at folders, files, whatever, and it’ll try and find APKs and DEX files. It’ll also decompose APKs and try and find compressed APKs, DEX, and ELF files. Here’s output of an example run:
|
|
You can see that the test samples of DexGuard and DexProtector both use dexlib 1.x. APKiD also supports JSON output so it’s easier to integrate into other toolchains:
|
|
This post leaves out all of the Android XML fingerprinting details Tim researched that can identify tools like Apktool. We still need to add these fingerprints into APKiD.
There is also a library called ASMDEX which looks capable of creating DEX files. At the time of this original research a few years ago, I didn’t have time to look into it, and no one was talking about how to use it. A lot of the stuff was over my head, but I’ve since had a lot of practice using ASM to create Java class files, so I think I can manage now. It would be nice to add fingerprints for ASMDEX. Anything created by it would probably be pretty weird.
Here’s the full abstract:
Compiler fingerprinting is a technique for identifying the compiler used to create a binary. This is possible because there is some flexibility in file formats, and different compilers usually produce binaries with identical behaviors but with subtle differences in structure and organization. We developed a tool which can determine the compiler used to create Dalvik executables and Android binary XML files. This allowed us to distinguish between apps compiled from original source code and apps which had been modified using non-standard compilers such as dexlib. Our hypothesis was that the two primary reasons for modifying an Android app were 1.) cracking and piracy and 2.) injecting malicious code. We tested this assumption by comparing the compiler profiles of various app markets with varying tolerances for cracked and malicious apps to see if the percentage of modified apps was inversely proportional to how strict the store was about policing submissions. We found that strict markets such as Google Play and Amazon had significantly lower rates of modified apps compared to less strict markets such as Aptoide and BlapkMarket. Additionally, we analyzed ~138,000 benign apps and known malware samples to compare the rates of modification between both groups. We found much higher rates of modification within the malware sample set, with many malware families consisting entirely of modified apps.
This talk presents the history and evolution of various Android tools, introduces tools for fingerprinting compilers, summarizes the technical details for how the tools work, and reviews applications for using compiler fingerprinting to improve detection and classification of malware and pirated apps.
We’ll post the video of our talk as soon as it’s available!
This is the output when it’s run against our test files:
|
|