Wikipedia: Obfuscated code/Talk

Am I the only one who thinks this article has a few inaccuracies? Obfuscation of the source code makes no difference to the output of a decompiler. Variable names in the source code of any compiled language are reduced to addresses in the final object code. These addresses take the same space regardless of how long the variable name was in the first place.

I'm hacking this article until it's not wrong!

You are absolutely wrong. Your model is too simplistic, and will be even more so after the C++ committee meets in a few years. Your claims only hold true for a very few languages, like C.

I wrote these parts of the article from a position of experience, while your claims are from a position of theory. Therefore your theory does not consider enough variables. One phrase: Symbol tables. These tables are important for many forms of dynamic programming (such as Reflection) and are stored with the bytecode/executable/interpreted script in a great number of languages.

That is two articles of mine that have been deleted, out of three. One of them having been my user page. This is extremely frustrating. I am just deleting my entire entry until I have time to revert it with deeper explanations. -- forgotten gentleman

And furthermore, there is more to obfuscation than just simple renaming. I've heard that Microsoft in its early days inserted instructions that confused and broke common debuggers, to resist analysis of their programs. Obfuscation also covers the realm of destroying all structure that would lend a program to human-readability.

If you look at source code, it is well-structured. It is decomposed into functions for Structured Programming, objects for OOP, etc. Compilers tend to propagate this structure into compiled code. Obfuscators erase as much of this as possible. -- forgotten gentleman

The following sections of this article were removed (by "forgotten gentleman"); I'm putting them on this /Talk page for future reference.

Uses for obfuscation

There is generally little point in plain obfuscation of source code, although some cases include:

Attempting to "protect" the IP of something that has to be provided in source form (for platform portability)
Some code generators (e.g. SDL) generate hard-to-read code, on the basis that you should tweak the design, not the code.

When dealing with interpreted languages it could be argued that smaller (but less understandable) variable names will keep code size down. However, this is a false economy. (OK, that wasn't NPOV.)

Problems with obfuscation

Debugging

Obfuscated code is extremely difficult to debug. Variable names will no longer make sense, and the structure of the code itself will likely be modified into unrecognizability. This fact generally forces developers to maintain two builds: One that can be easily debugged, and another for release. Both builds should be tested to make sure they act identically.

Defective obfuscators

Occasionally an obfuscator may be buggy in a way that is difficult to reproduce. There is little one can do except find a newer version or fiddle with the inputs to the obfuscator until it works.

: You are absolutely wrong. Your model is too simplistic, and will be even more so after the C++ committee meets in a few years. Your claims only hold true for a very few languages, like C.

Well, what may happen in the future is irrelevant to an encyclopedia. My claim holds true for at least all the compiled languages I've dealt with. Bytecode I'm less familiar with, so I'll have to check my facts on that, although I'm sure it's very implementation-dependent.

: I wrote these parts of the article from a position of experience, while your claims are from a position of theory. Therefore your theory does not consider enough variables. One phrase: Symbol tables. These tables are important for many forms of dynamic programming (such as Reflection) and are stored with the bytecode/executable/interpreted script in a great number of languages.

I'm sorry, but symbol tables are an irrelevance in my experience. If you don't want people to know the names of your functions, you strip the symbol table. It's only of use to debuggers.

: That is two articles of mine that have been deleted, out of three. One of them having been my user page. This is extremely frustrating. I am just deleting my entire entry until I have time to revert it with deeper explanations.

I don't know what other pages have been deleted; I'm just arguing about this one. Welcome to Wikipedia.

: And furthermore, there is more to obfuscation than just simple renaming. I've heard that Microsoft in its early days inserted instructions that confused and broke common debuggers, to resist analysis of their programs. Obfuscation also covers the realm of destroying all structure that would lend a program to human-readability.

I'm sorry, but breaking debuggers is not the issue here. People have written plenty of hairy code to attempt to confuse debuggers (I know I've had to bypass some of them), but that's a whole different ball game (and completely pointless IMHO).

: If you look at source code, it is well-structured. It is decomposed into functions for Structured Programming, objects for OOP, etc. Compilers tend to propagate this structure into compiled code. Obfuscators erase as much of this as possible.

Unless you're talking about binary obfuscators, the type you're talking about must just wreck the code they touch. A linked list looks the same in assembly if its nodes are called wikjn and koip instead of n and p. Anything that fiddles with internal structures will just break stuff. If the binaries generated from the source code and its obfuscated equivalent are not identical, then it's defeated its own point.
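
The renaming claim above can be checked in a small way. The following is only a sketch, and it uses CPython bytecode rather than native object code, so it says nothing about any particular C compiler or commercial obfuscator. The two functions (both made up for this illustration) walk the same linked list and differ only in their local variable names. In CPython the instruction stream should come out identical, because locals are referenced by slot index; the names themselves are still kept alongside it in the code object, which is the symbol-table point raised elsewhere in this discussion.

```python
def sum_clear(n):
    # Readable version: walk the list and add up the values.
    total = 0
    p = n
    while p is not None:
        total += p["value"]
        p = p["next"]
    return total


def sum_renamed(wikjn):
    # Same logic, with every local renamed to something meaningless.
    zzq = 0
    koip = wikjn
    while koip is not None:
        zzq += koip["value"]
        koip = koip["next"]
    return zzq


# Instruction bytes: locals are addressed by index, so renaming does not show up here.
same_instructions = sum_clear.__code__.co_code == sum_renamed.__code__.co_code

# Local names are stored separately in the code object and remain readable.
names_clear = sum_clear.__code__.co_varnames
names_renamed = sum_renamed.__code__.co_varnames

print(same_instructions, names_clear, names_renamed)
```

Renaming only touches the name table here, which is why a pure renamer hides intent but not behaviour.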

I don't want this to become a "You're wrong" and "I'm right" argument, so I would welcome any other points of view.

Since neither of you two cares to sign your respective statements, I don't know which of yours is which, but I'm sure at least one of you is clearly not helping to write a useful article. The fact is, "obfuscators" are commonly used pieces of commercial software that perform a specific function for specific reasons, and your opinion of their worth is out of place here. This is an encyclopedia, not a chat room. If you think obfuscators are worthless (I happen to agree for different reasons), then don't write one or buy one, but don't interfere with someone trying to write a useful article on the topic. Symbol removal and substitution is one method (if you think symbol tables can just be stripped, then you obviously don't program in a language with reflection), as is code rearrangement, debugger fouling, and others. Commercial software has used these and other methods, and they should be documented here. --LDC

: Since neither of you two cares to sign your respective statements, I don't know which of yours is which, but I'm sure at least one of you is clearly not helping to write a useful article.

I think we both want a useful article. Hence I've moved the discussion into talk to gain consensus.

: The fact is, "obfuscators" are commonly used pieces of commercial software that perform a specific function for specific reasons, and your opinion of their worth is out of place here.

As you may have picked up, I don't think obfuscators are worth much at all, but I agree there should be an article about them. I just don't want it to be inaccurate.

: This is an encyclopedia, not a chat room.

I hope it's debate rather than chat.

: If you think obfuscators are worthless (I happen to agree for different reasons), then don't write one or buy one, but don't interfere with someone trying to write a useful article on the topic.

I'm venturing an opinion as to why I think the article is flawed. I attempted to correct it the Wikipedia way, found that the original author disagreed and reverted the change, so I am questioning the assumptions and statements in /Talk. Now we are debating to get a better article.

: Symbol removal and substitution is one method (if you think symbol tables can just be stripped, then you obviously don't program in a language with reflection)

The article already states you can't obfuscate code that uses reflection. Where mangling the names is allowed, where do you need the symbol table? Answer: you don't, and therefore you can strip it. The renamed functions aren't hiding anything in the underlying code. But the way the article is written, that's what is implied. See:

: This is wrong on both counts. You can use obfuscators on languages with reflection, and even on code that specifically uses reflection (though you have to account for it). Secondly, renaming variables and methods does hide information. Human-meaningful names are valuable information to a (human) debugger and reverse-engineer. I don't see how you can possibly argue otherwise. Renaming may not hide what the code actually does, but it does hide the programmer's intent, and sometimes that's enough to seriously impede reverse-engineering. --LDC

: Compilers tend to propagate this structure into compiled code. The job of a good obfuscator is to destroy as much as possible of this structure that lends a program to being human-readable.

A good obfuscator is one that can generate logically identical source code that creates identical binaries from textually different source code. Which is the exact opposite of what the article says.

: ...as is code rearrangement, debugger fouling, and others. Commercial software has used these and other methods, and they should be documented here. --LDC

No. They are different techniques for defeating crackers. They are different from obfuscating source code. Sure, document them in Wikipedia, but not in this article.

: We seem to be disagreeing about the definition here, so let's square that away first. An "obfuscator" (as the term is used by actual commercial software available today) is a program that makes reverse-engineering difficult. Changing source code into logically identical source code is one--and only one--method of doing that, and it is indeed the origin of the term. But even the source-only obfuscations are not purely textual; some are algorithmic (i.e., making the algorithm difficult to follow). Another method is changing the actual structure of the code to something functionally equivalent (but not logically identical). Another is applying transforms on the resulting object code. Have you actually browsed to see what kinds of obfuscators are available and what they actually do, or are you just talking theory from some textbook? --LDC
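
To make "functionally equivalent but not logically identical" concrete, here is a minimal sketch of one structural transform, often called control-flow flattening. It is a hand-written illustration, not the output of any real obfuscator, and the function names are made up. Both functions compute a greatest common divisor and agree on every input, but the second hides the loop behind a state-machine dispatcher, so the compiled code is no longer the same.

```python
def gcd_clear(a, b):
    # Euclid's algorithm with its loop structure plainly visible.
    while b != 0:
        a, b = b, a % b
    return a


def gcd_flattened(a, b):
    # Same black-box behaviour; the loop is replaced by a dispatcher over numbered states.
    state = 2
    while True:
        if state == 0:
            # Loop body of the original.
            a, b = b, a % b
            state = 2
        elif state == 1:
            # Exit of the original.
            return a
        elif state == 2:
            # Loop test of the original.
            state = 0 if b != 0 else 1


print(gcd_clear(48, 18), gcd_flattened(48, 18))
```

A transform like this changes the generated code on purpose, which is where the two definitions of "obfuscator" in this thread part ways.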

I accept blame if I've pushed this into a heated argument. I have unfortunately been in a foul mood for the last few days, and I know this has seeped through into this article. (I admit, it did not help for you to say that you simply couldn't stay within the NPOV, marring the article right in the middle to say that.) But let's get past that and get to writing good articles. I'll offer my input on a few points raised here.

: As you may have picked up, I don't think obfuscators are worth much at all, but I agree there should be an article about them.

I agree that they're white elephants. To keep NPOV, I limited myself to an enumeration of advantages and disadvantages.

However, on a project where code size reduction was absolutely crucial, it worked well. (FYI, this was in Java. The size reduction was definitely nontrivial, even accounting for compression.)

: I'm sorry, but breaking debuggers is not the issue here. People have written plenty of hairy code to attempt to confuse debuggers (I know I've had to bypass some of them), but that's a whole different ball game (and completely pointless IMHO).

I would suggest what you've encountered was an operation on the code which kept it to spec, but made it more difficult for you to "read and understand." From my perspective, having been on a project where they actually dedicated a programmer to parse obfuscated code to hand off to a fairly untrusted remote team... it fits all reasonable definitions of "obfuscated code" to me. In spirit and to the letter.

If your experience is mainly about the fun obfuscation contests, I can see how you would disagree.

: A good obfuscator is one that can generate logically identical source code that creates identical binaries from textually different source code. Which is the exact opposite of what the article says.

We have different conceptions of obfuscators. I believe you have your conception from the fun contests. In my conception, an obfuscator keeps the translated program conformant to a black-box spec. Since obfuscation has all sorts of side-effects (performance, code size, etc), it is not always possible to use an obfuscator which works well on most other programs.

: Where mangling the names is allowed, where do you need the symbol table? Answer: you don't, and therefore you can strip it.

I thought about explaining this by referring to lexical scoping, but I thought better of it. Now I wish I had.

Here, I believe you're thinking too quickly. Static languages promote a sense of determinism. Dynamic ones don't, because code from outside the system may happily interact with its internals. So, getting rid of the symbol table is a net loss of information that may have undesirable effects. One bad effect is to destroy the whole point of having a dynamic language.

Tell me quickly how to strip the symbol table for a Java program, using javac. You can't. The symbol table is not merely a debugging aid in dynamic languages; it's the point. However, an obfuscator may do this, fully or partially, depending on how much you need the dynamism.

: The article already states you can't obfuscate code that uses reflection.

That's definitely a valid misunderstanding of the article's earlier versions. I was very unclear.

You can configure an obfuscator to leave parts of your code unrenamed.

Anyway.

I wish this article was expanded much more to explain these contests. I want the boring IP aspects to be a smaller section. However, I can't really speak about the contests, since I've never participated. -- forgotten gentleman

Again, I am being extremely imprecise WRT reflection -- you can obfuscate code with reflection, you just have to make sure everything calls the obfuscated name. I don't know why I didn't mention this; it's always the main source of complexity when doing painfully obfuscated builds.

It just defeats the point, since oftentimes classnames are constructed according to a spec. It's not just a matter of "having a name." -- forgotten gentleman
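
The reflection problem described above can be sketched as follows. This is an illustration only: the classes, the `RENAMES` map, and both helper functions are hypothetical and do not reflect how any particular obfuscator stores or applies its renaming. A reflective call looks a method up by its name at run time, so once the method is renamed, every such lookup has to go through the obfuscated name (or the method has to be excluded from renaming).

```python
class Report:
    def render(self):
        return "report"


class ObfuscatedReport:
    # What a renamer might leave behind: render() has become a().
    def a(self):
        return "report"


# Hypothetical rename map recorded by the obfuscator: original name -> new name.
RENAMES = {"render": "a"}


def call_by_name(obj, method_name):
    # Reflective call: the method is found through its name at run time.
    return getattr(obj, method_name)()


def call_by_original_name(obj, method_name):
    # Translate the original name to the obfuscated one before the lookup.
    # Names missing from the map were left unrenamed and are used as-is.
    return getattr(obj, RENAMES.get(method_name, method_name))()


print(call_by_name(Report(), "render"))
print(call_by_original_name(ObfuscatedReport(), "render"))
```

Calling `call_by_name(ObfuscatedReport(), "render")` raises `AttributeError`, which is the kind of breakage that makes reflective code the main source of complexity in an obfuscated build. When names are constructed from a spec at run time, no fixed map like `RENAMES` covers them, so those names generally have to be left unrenamed.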

I've moved a few bits of text around, re-arranged the recreational obfuscation bit, and added an example from Usenet. I've not touched the second part, but I'm still unhappy with statements that refer to messing with the internal structure of the code. Maybe a category like "Protecting code from reverse engineering" would better cover some of these weird and wonderful binary obfuscators.

Does the Usenet example really have tremendously long lines, or did a couple of newlines get lost somewhere? --AxelBoldt

Actually, please modify as you wish. I promise to not throw a tantrum again. ;-) I was not myself; dealing with a very volatile person last week, I became irrational too.

I don't quite understand what you don't like about the internal structures part, so I'll just see what your modifications are. If you mean that breaking internal structure doesn't decrease codesize much with current obfuscators, maybe I'd better look more closely to see where the size savings come from.

Funny, I just now looked at the Jax homepage http://www.research.ibm.com/jax/ and they use language similar to this article's. I definitely should have done research, because they are more precise than my off-the-cuff revisions. -- forgotten gentleman

Very short and unimportant comment: is "attacker" really an appropriate name for somebody trying to do reverse engineering? --AN