On the difficulty of order constrained pattern matching with applications to feature matching based malware detection
Adiesha Liyanage, Braeden Sopp, Binhai Zhu
TL;DR
The paper studies the computational complexity of detecting malware via order-constrained feature matching, formalizing the General-OMDCI framework and its variants OMDCI and OMDCI+. It establishes strong hardness results: OMDCI+ is NP-complete, and deciding whether the optimal OMDCI solution has length $0$ is co-NP-hard via a reduction from the complement of the Hamiltonian Cycle problem. The authors introduce gadget-based reductions and analyze the structural constraints on matches to prove these results, highlighting fundamental limits for low-level malware detection using feature matching. The findings imply that both identifying malware and verifying its absence are hard in general, motivating future work on restricted parameters, cross-architecture settings, or architecture-aware semantics to enable practical detection methods.
Abstract
We formulate low-level malware detection using algorithms based on feature matching as Order-based Malware Detection with Critical Instructions (General-OMDCI): given a pattern in the form of a sequence $M$ of colored blocks, where each block contains a critical character (representing a unique sequence of critical instructions potentially associated with malware but without certainty), and a program $A$, represented as a sequence of $n$ colored blocks with critical characters, the goal is to find two subsequences, $M'$ of $M$ and $A'$ of $A$, with blocks matching in color and whose critical characters form a permutation of each other. When $M$ is a permutation in both colors and critical characters, the problem is called OMDCI. If we additionally require $M'=M$, then the problem is called OMDCI+; if in this case $d=|M|$ is used as a parameter, then the OMDCI+ problem is easily shown to be FPT. Our main (negative) results concern the cases when $|M|$ is arbitrary and are summarized as follows: OMDCI+ is NP-complete, which implies OMDCI is also NP-complete. For the special case of OMDCI, deciding if the optimal solution has length $0$ (i.e., deciding if no part of $M$ appears in $A$) is co-NP-hard. As a result, the OMDCI problem does not admit an FPT algorithm unless P=co-NP. In summary, our results imply that, for algorithms based on feature matching, both identifying malware and determining the absence of malware in a given low-level program are hard.
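To make the OMDCI+ decision problem concrete, the following is a minimal brute-force sketch under one reading of the definition above. It is not the authors' algorithm, and it is not the FPT algorithm mentioned in the abstract. It assumes that each block is encoded as a hypothetical `(color, critical_char)` pair, that "matching in color" means position-by-position equality of colors between $M$ and $A'$, and that "permutation of each other" means the two multisets of critical characters are equal. The function name `omdci_plus_bruteforce` is illustrative only.

```python
from collections import Counter
from itertools import combinations


def omdci_plus_bruteforce(M, A):
    """Return indices into A of a subsequence A' matching all of M, or None.

    M and A are lists of (color, critical_char) pairs (an assumed encoding).
    A' matches M when its colors equal M's colors position by position and
    its critical characters are a rearrangement of M's critical characters.
    """
    d = len(M)
    want_colors = [color for color, _ in M]
    want_chars = Counter(ch for _, ch in M)

    # Try every way of choosing d positions of A in increasing order.
    # This takes on the order of C(n, d) steps, so it is only practical
    # for very small inputs.
    for idx in combinations(range(len(A)), d):
        if [A[i][0] for i in idx] != want_colors:
            continue  # colors must agree in order
        if Counter(A[i][1] for i in idx) == want_chars:
            return idx  # critical characters agree as multisets
    return None


# Pattern: a red block with character x, then a green block with character y.
M = [("red", "x"), ("green", "y")]
# Program: the same colors appear in order, with the characters swapped.
A = [("red", "y"), ("blue", "z"), ("green", "x")]
print(omdci_plus_bruteforce(M, A))
```

In this toy instance the colors of blocks 0 and 2 of `A` follow the pattern's color order, and their critical characters are a rearrangement of the pattern's, so the search succeeds even though no block matches in both color and character at the same position. The exhaustive enumeration is consistent with the hardness results stated above for arbitrary $|M|$, but it says nothing about how the paper's reductions work.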
