This is a translation of the original article: https://www.evilsocket.net/2019/05/22/How-to-create-a-Malware-detection-system-with-Machine-Learning

In this post we'll talk about two topics I love, which have been the core elements of my (private) research for the last 7 years: machine learning and malware detection.

My education is quite empirical and definitely non-academic, so I know how it feels for a passionate developer to approach machine learning and struggle with formal definitions, linear algebra and the like. Therefore I'll keep this post as practical as possible, so that even less formally educated readers can follow along and perhaps start working with neural networks themselves.

Besides, most of the available resources focus on well-known problems such as handwritten digit recognition on the MNIST dataset (the "hello world" of machine learning), leaving it to the reader's imagination how more complex engineered systems are supposed to work - usually, how to deal with inputs that are not images.

TL;DR: I suck at math, MNIST is boring, and detecting malware is more fun :D

I'll also use this as an example use case for some of the new features of ergo, a project me and chiconara started some time ago to automate machine learning model creation, data encoding, training on GPUs, benchmarking and deployment at scale.

The source code related to this post can be found here.

Important note: this project alone does NOT constitute a valid substitute for a commercial antivirus.

Problem definition and dataset

Traditional malware detection engines rely on the use of signatures - unique values, manually selected by malware researchers, that identify the presence of malicious code while making sure there are no collisions in the set of non-malicious samples (that would be called a "false positive").

This approach has a few problems: it's usually easy to bypass (depending on the type of signature, changing a single bit or just a few bytes in the malicious code can make the malware undetectable), and it doesn't scale well when the number of researchers is orders of magnitude smaller than the number of unique malware families they would need to manually reverse engineer, identify and write signatures for.

Our goal is to teach a computer - more specifically, an artificial neural network - to detect Windows malware without relying on any explicit signature database we'd need to create, but simply by ingesting a dataset of malicious files we want to be able to detect, and learning from it how to distinguish malicious code, both inside the dataset itself and, most importantly, while processing new, unseen samples. The only knowledge we have is which files are malicious and which are not, but not what makes them so - we'll let the ANN (artificial neural network) do the rest.

In order to do this I gathered about 200,000 Windows PE samples, evenly divided into malicious (10+ detections on VirusTotal) and clean (known, and with 0 detections on VirusTotal). Since training and testing a model on the same dataset wouldn't make much sense (it could perform extremely well on the training set, yet be completely unable to generalize to new samples), this dataset will be automatically divided into 3 sub sets (a quick sketch of the idea follows the list):

  • A training set, with 70% of the samples, used for training.
  • A validation set, with 15% of the samples, used to benchmark the model at each training epoch.
  • A test set, with 15% of the samples, used to benchmark the model after training.
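
Just to make the idea concrete - ergo performs this split automatically, so this is only an illustrative sketch, assuming X and y are numpy arrays holding the vectors and their labels:

import numpy as np

# Illustrative sketch only - ergo does this split for you.
# Shuffle the rows once, then carve out 70% / 15% / 15% partitions.
def split_dataset(X, y, seed=0xdead):
    idx = np.random.default_rng(seed).permutation(len(X))
    n_train = int(len(X) * 0.70)
    n_val   = int(len(X) * 0.15)
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])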

Needless to say, the amount of (correctly labeled) samples in the dataset is key for the model's accuracy - that is, its ability to correctly separate the two classes and generalize to unseen samples - the more you can use during training, the better. Ideally, the dataset should also be periodically updated with newer samples and the model re-trained, in order to keep its accuracy high even when new and unique samples appear in the wild (ie: wget + crontab + ergo).

Due to the size of the specific dataset I used for this post, I can't share it without killing my bandwidth.

However, I uploaded the dataset.csv file on Google Drive - it's about 340MB extracted - and you can use it to reproduce the results of this post.

The Portable Executable format

The Windows PE format is abundantly documented and there are many good resources for understanding its internals, such as Ange Albertini's "Exploring the Portable Executable format" presentation from 44CON 2013 (from which I took the picture below), freely available online, so I won't spend too much time on the details.

The key facts we have to keep in mind are:

  • A PE has several headers describing its properties and various addressing details, such as the base address the PE will be loaded at in memory and where the entry point is located.
  • A PE has several sections, each one containing data (constants, global variables, etc.), code (in which case the section is marked as executable), or sometimes both.
  • A PE contains a declaration of which API are imported and from which system libraries - all three facts are easy to inspect programmatically, as sketched below.

(Image credits: Ange Albertini)
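
For instance, a few lines of LIEF (the same library we'll use later for feature extraction) are enough to dump all three for any PE - a quick illustrative sketch, not part of the project:

import lief

pe = lief.parse('/path/to/some.exe')
print("entrypoint : 0x%x" % pe.entrypoint)
print("image base : 0x%x" % pe.optional_header.imagebase)
for section in pe.sections:
    print("section    : %s (%d bytes)" % (section.name, section.size))
for library in pe.imports:
    print("imports    : %s (%d API)" % (library.name, len(library.entries)))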

For example, this is what the sections of the Firefox PE look like:

(Image credits: the "Machines Can Think" blog)

Although in some cases, if the PE has been processed with a packer such as UPX, its sections might look a bit different, as the main code and data sections are compressed and a code stub that unpacks them at runtime is added:

(Image credits: the "Machines Can Think" blog)

What we're going to do now is look at how to encode these values - which are inherently very different from each other (numbers of all sorts of intervals and strings of variable length) - into a vector of scalar numbers, each normalized into the interval [0.0, 1.0], and of constant length. This is the type of input our machine learning model is able to understand.

The process of determining which of the PE's characteristics to take into consideration is possibly the most important part of designing any machine learning system - it's called feature engineering, while the act of reading these values and encoding them is called feature extraction.

Feature engineering

After creating the project:

ergo create ergo-pe-av

I started implementing the feature extraction algorithm in the encode.py file - a very simple starting point (150 lines including comments and multi-line strings) that gives us enough information to reach interesting levels of accuracy, and that can easily be extended with additional features in the future.

cd ergo-pe-av
vim encode.py

The first 11 scalars of our vector encode a set of boolean properties that LIEF - the amazing library from QuarksLab I'm using - parses from the PE; each property is encoded to 1.0 if true, or 0.0 if false (a sketch of such an encoder follows the table):

Property                Description
pe.has_configuration    True if the PE has a Load Configuration.
pe.has_debug            True if the PE has a Debug section.
pe.has_exceptions       True if the PE is using exceptions.
pe.has_exports          True if the PE has any exported symbols.
pe.has_imports          True if the PE is importing any symbols.
pe.has_nx               True if the PE has the NX bit set.
pe.has_relocations      True if the PE has relocation entries.
pe.has_resources        True if the PE has any resources.
pe.has_rich_header      True if a rich header is present.
pe.has_signature        True if the PE is digitally signed.
pe.has_tls              True if the PE is using TLS.
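
The encode_properties function itself isn't shown in this post, but given the table above it doesn't need to be more complicated than this (a sketch consistent with the LIEF attributes listed):

# Encode the 11 boolean LIEF properties into 1.0 / 0.0 scalars.
def encode_properties(pe):
    properties = ( \
        pe.has_configuration, pe.has_debug, pe.has_exceptions,
        pe.has_exports, pe.has_imports, pe.has_nx,
        pe.has_relocations, pe.has_resources, pe.has_rich_header,
        pe.has_signature, pe.has_tls)
    return [1.0 if p else 0.0 for p in properties]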

Then 64 elements follow, representing the first 64 bytes of the PE entry point function, each normalized to [0.0, 1.0] by dividing by 255 - this will help the model detect executables with very distinctive entry points, which only vary slightly among different samples of the same family (you can think of it as a very basic signature):

ep_bytes = [0] * 64
try:
    ep_offset = pe.entrypoint - pe.optional_header.imagebase
    ep_bytes = [int(b) for b in raw[ep_offset:ep_offset+64]]
except Exception as e:
    log.warning("can't get entrypoint bytes from %s: %s", filepath, e)
# ...
# ...
def encode_entrypoint(ep):
    while len(ep) < 64: # pad
        ep += [0.0]
    return np.array(ep) / 255.0 # normalize
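
As a quick sanity check of the padding and normalization (not part of encode.py): the classic x86 prologue push ebp; mov ebp, esp encodes to 0.333…, 0.545…, 0.925… - exactly the values you can spot right after the 11 boolean properties in the example vector shown later in this post:

v = encode_entrypoint([0x55, 0x8B, 0xEC])  # push ebp; mov ebp, esp
assert len(v) == 64                        # padded with zeros up to 64 elements
print(v[:3])                               # [0.33333333 0.54509804 0.9254902 ]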

Then a histogram of the repetitions of each byte of the ASCII table (hence of size 256) in the binary file - this data point will encode basic statistical information about the raw contents of the file:

# the 'raw' argument holds the entire contents of the file
def encode_histogram(raw):
    histo = np.bincount(np.frombuffer(raw, dtype=np.uint8), minlength=256)
    histo = histo / histo.sum() # normalize
    return histo
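
Being a normalized histogram, the 256 bins always sum to 1.0 for any non-empty file - an easy property to verify (illustrative snippet, not part of encode.py):

with open('/path/to/some.exe', 'rb') as fp:
    histo = encode_histogram(fp.read())
assert histo.shape == (256,)
assert abs(histo.sum() - 1.0) < 1e-6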

The next thing I decided to encode in the feature vector is the imports table, as the API being used by the PE is pretty relevant information :D In order to do this, I manually selected the 150 most common libraries in my dataset; each API used by the PE will increment by one the column of the corresponding library, creating another histogram of 150 values, then normalized by the total amount of imported API:

# the 'pe' argument holds the PE object parsed by LIEF
def encode_libraries(pe):
    # 'libraries' is the global list holding the names of the 150 most common libraries
    global libraries

    imports = {dll.name.lower():[api.name if not api.is_ordinal else api.iat_address \
                           for api in dll.entries] for dll in pe.imports}

    libs = np.array([0.0] * len(libraries))
    for idx, lib in enumerate(libraries):
        calls = 0
        dll   = "%s.dll" % lib
        if lib in imports:
            calls = len(imports[lib])
        elif dll in imports:
            calls = len(imports[dll])
        libs[idx] += calls
    tot = libs.sum()
    return ( libs / tot ) if tot > 0 else libs # normalize
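
The post doesn't show how the global libraries list was built - as said, the 150 most common libraries were selected manually from the dataset. A possible way to derive such a list automatically would be (a hypothetical sketch, not the project's code):

from collections import Counter
import lief

# Count library names across a set of samples and keep the most common ones.
def top_libraries(sample_paths, n=150):
    counter = Counter()
    for path in sample_paths:
        pe = lief.parse(path)
        if pe is None:
            continue
        counter.update(dll.name.lower().replace('.dll', '') for dll in pe.imports)
    return [name for name, _ in counter.most_common(n)]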

We proceed by encoding the ratio of the PE's size on disk versus the size it will have in memory (its virtual size):

min(sz, pe.virtual_size) / max(sz, pe.virtual_size)

Next, we want to encode some information about the PE sections, such as the ratio of sections containing code versus the ones containing data, the ratio of sections marked as executable, the average Shannon entropy of each of them, and the average ratio of their size versus their virtual size - these data points will tell the model if and how the PE is packed/compressed/obfuscated:

def encode_sections(pe):
    sections = [{ \
        'characteristics': ','.join(map(str, s.characteristics_lists)),
        'entropy': s.entropy,
        'name': s.name,
        'size': s.size,
        'vsize': s.virtual_size } for s in pe.sections]

    num_sections = len(sections)
    max_entropy  = max([s['entropy'] for s in sections]) if num_sections else 0.0
    max_size     = max([s['size'] for s in sections]) if num_sections else 0.0 
    min_vsize    = min([s['vsize'] for s in sections]) if num_sections else 0.0
    norm_size    = (max_size / min_vsize) if min_vsize > 0 else 0.0

    return [ \
        # code_sections_ratio
        (len([s for s in sections if 'SECTION_CHARACTERISTICS.CNT_CODE' in s['characteristics']]) / num_sections) if num_sections else 0,
        # pec_sections_ratio
        (len([s for s in sections if 'SECTION_CHARACTERISTICS.MEM_EXECUTE' in s['characteristics']]) / num_sections) if num_sections else 0,
        # sections_avg_entropy
        ((sum([s['entropy'] for s in sections]) / num_sections) / max_entropy) if max_entropy > 0 else 0.0,
        # sections_vsize_avg_ratio
        ((sum([s['size'] / s['vsize'] for s in sections]) / num_sections) / norm_size) if norm_size > 0 else 0.0,
    ]

Finally, we glue all the pieces into a single vector of size 486 (11 properties + 64 entry point bytes + 256 histogram bins + 150 library columns + 1 size ratio + 4 section features):

v = np.concatenate([ \
    encode_properties(pe),
    encode_entrypoint(ep_bytes),
    encode_histogram(raw),
    encode_libraries(pe),
    [ min(sz, pe.virtual_size) / max(sz, pe.virtual_size)],
    encode_sections(pe)
    ])

return v

The only thing left to do is telling our model how to encode the input samples, by customizing the prepare_input function in the prepare.py file previously created by ergo - the following implementation supports the encoding of a file given its path, given its contents (uploaded as a file to the ergo API), or the evaluation of a raw vector of scalar features:

# used by `ergo encode <path> <folder>` to encode a PE in a vector of scalar features
# used by `ergo serve <path>` to parse the input query before running the inference
def prepare_input(x, is_encoding = False):
    # 'encoder' is the module implementing encode_pe() (the feature extraction shown above)
    # file upload
    if isinstance(x, werkzeug.datastructures.FileStorage):
        return encoder.encode_pe(x)
    # file path
    elif os.path.isfile(x):
        return encoder.encode_pe(x)
    # raw vector
    else:
        return x.split(',')

We now have everything we need to transform a PE file into a vector of numbers like this:

`0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.333333333333,0.545098039216,0.925490196078,0.41568627451,1.0,0.407843137255,0.596078431373,0.192156862745,0.250980392157,0.0,0.407843137255,0.188235294118,0.149019607843,0.250980392157,0.0,0.392156862745,0.63137254902,0.0,0.0,0.0,0.0,0.313725490196,0.392156862745,0.537254901961,0.145098039216,0.0,0.0,0.0,0.0,0.513725490196,0.925490196078,0.407843137255,0.325490196078,0.337254901961,0.341176470588,0.537254901961,0.396078431373,0.909803921569,0.2,0.858823529412,0.537254901961,0.364705882353,0.988235294118,0.41568627451,0.0078431372549,1.0,0.0823529411765,0.972549019608,0.188235294118,0.250980392157,0.0,0.349019607843,0.513725490196,0.0509803921569,0.0941176470588,0.270588235294,0.250980392157,0.0,1.0,0.513725490196,0.0509803921569,0.109803921569,0.270588235294,0.250980392157,0.870149739583,0.00198567708333,0.00146484375,0.000944010416667,0.000830078125,0.00048828125,0.000162760416667,0.000325520833333,0.000569661458333,0.000130208333333,0.000130208333333,8.13802083333e-05,0.000553385416667,0.000390625,0.000162760416667,0.00048828125,0.000895182291667,8.13802083333e-05,0.000179036458333,8.13802083333e-05,0.00048828125,0.001611328125,0.000162760416667,9.765625e-05,0.000472005208333,0.000146484375,3.25520833333e-05,8.13802083333e-05,0.000341796875,0.000130208333333,3.25520833333e-05,1.62760416667e-05,0.001171875,4.8828125e-05,0.000130208333333,1.62760416667e-05,0.00372721354167,0.000699869791667,6.51041666667e-05,8.13802083333e-05,0.000569661458333,0.0,0.000113932291667,0.000455729166667,0.000146484375,0.000211588541667,0.000358072916667,1.62760416667e-05,0.00208333333333,0.00087890625,0.000504557291667,0.000846354166667,0.000537109375,0.000439453125,0.000358072916667,0.000276692708333,0.000504557291667,0.000423177083333,0.000276692708333,3.25520833333e-05,0.000211588541667,0.000146484375,0.000130208333333,0.0001953125,0.00577799479167,0.00109049479167,0.000227864583333,0.000927734375,0.002294921875,0.000732421875,0.000341796875,0.000244140625,0.000276692708333,0.000211588541667,3.25520833333e-05,0.000146484375,0.00135091145833,0.000341796875,8.13802083333e-05,0.000358072916667,0.00193684895833,0.0009765625,0.0009765625,0.00123697916667,0.000699869791667,0.000260416666667,0.00078125,0.00048828125,0.000504557291667,0.000211588541667,0.000113932291667,0.000260416666667,0.000472005208333,0.00029296875,0.000472005208333,0.000927734375,0.000211588541667,0.00113932291667,0.0001953125,0.000732421875,0.00144856770833,0.00348307291667,0.000358072916667,0.000260416666667,0.00206705729167,0.001171875,0.001513671875,6.51041666667e-05,0.00157877604167,0.000504557291667,0.000927734375,0.00126953125,0.000667317708333,1.62760416667e-05,0.00198567708333,0.00109049479167,0.00255533854167,0.00126953125,0.00109049479167,0.000325520833333,0.000406901041667,0.000325520833333,8.13802083333e-05,3.25520833333e-05,0.000244140625,8.13802083333e-05,4.8828125e-05,0.0,0.000406901041667,0.000602213541667,3.25520833333e-05,0.00174153645833,0.000634765625,0.00068359375,0.000130208333333,0.000130208333333,0.000309244791667,0.00105794270833,0.000244140625,0.003662109375,0.000244140625,0.00245768229167,0.0,1.62760416667e-05,0.002490234375,3.25520833333e-05,1.62760416667e-05,9.765625e-05,0.000504557291667,0.000211588541667,1.62760416667e-05,4.8828125e-05,0.000179036458333,0.0,3.25520833333e-05,3.25520833333e-05,0.000211588541667,0.000162760416667,8.13802083333e-05,0.0,0.000260416666667,0.000260416666667,0.0,4.8828125e-05,0.000602213541667,0.000374348958333,3.25520833333e-05,0.0,9.765625e-05,0.0,0.000113932291667,0.000211588541667,0.000146484375,6.51041666667e-05,0.000667317708333,4.8828125e-05,0.000276692708333,4.8828125e-05,8.13802083333e-05,1.62760416667e-05,0.000227864583333,0.000276692708333,0.000146484375,3.25520833333e-05,0.000276692708333,0.000244140625,8.13802083333e-05,0.0001953125,0.000146484375,9.765625e-05,6.51041666667e-05,0.000358072916667,0.00113932291667,0.000504557291667,0.000504557291667,0.0005859375,0.000813802083333,4.8828125e-05,0.000162760416667,0.000764973958333,0.000244140625,0.000651041666667,0.000309244791667,0.0001953125,0.000667317708333,0.000162760416667,4.8828125e-05,0.0,0.000162760416667,0.000553385416667,1.62760416667e-05,0.000130208333333,0.000146484375,0.000179036458333,0.000276692708333,9.765625e-05,0.000406901041667,0.000162760416667,3.25520833333e-05,0.000211588541667,8.13802083333e-05,1.62760416667e-05,0.000130208333333,8.13802083333e-05,0.000276692708333,0.000504557291667,9.765625e-05,1.62760416667e-05,9.765625e-05,3.25520833333e-05,1.62760416667e-05,0.0,0.00138346354167,0.000732421875,6.51041666667e-05,0.000146484375,0.000341796875,3.25520833333e-05,4.8828125e-05,4.8828125e-05,0.000260416666667,3.25520833333e-05,0.00068359375,0.000960286458333,0.000227864583333,9.765625e-05,0.000244140625,0.000813802083333,0.000179036458333,0.000439453125,0.000341796875,0.000146484375,0.000504557291667,0.000504557291667,9.765625e-05,0.00760091145833,0.0,0.370786516854,0.0112359550562,0.168539325843,0.0,0.0,0.0337078651685,0.0,0.0,0.0,0.303370786517,0.0112359550562,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0561797752809,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0449438202247,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.25,0.25,0.588637653212,0.055703845605`

Assuming you have a folder containing malicious samples in the pe-malicious subfolder and clean ones in pe-legit (feel free to name them as you prefer, but the folder names will become the labels associated to the samples of each class), you can start the encoding process to a dataset.csv file that our model can use for training:

ergo encode /path/to/ergo-pe-av /path/to/dataset --output /path/to/dataset.csv

Grab a coffee and relax - depending on the size of the dataset and the speed of your storage disk, this process might take a while :)

Useful properties of vectors

While ergo is encoding our dataset, let's take a break to discuss an interesting property of these vectors and how to use it.

It should be clear by now that executables that are similar to each other, structurally or behaviorally, will have similar vectors, where the distance from one vector to another can be measured, for instance, by using the cosine similarity, defined as:
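
$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}} \, \sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

where A and B are the two feature vectors being compared.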

Among other things, this metric can be used to extract from the dataset (which, let me remind you, is a big set of files you don't really know much about, other than whether they're malicious or not) all the samples of a given family, given a known "pivot" sample. Let's say, for instance, that you have a Mirai sample for MIPS and you want to extract every Mirai variant, for any architecture, from a dataset of thousands of different unlabeled samples.

The algorithm I implemented in the sum database as the findSimilar "oracle" (a fancy name for stored procedures) is quite simple:

// Given the vector with id="id", return a list of
// other vectors which cosine similarity to the reference
// one is greater or equal than the threshold.
// Results are given as a dictionary of :
//      "vector_id => similarity"
function findSimilar(id, threshold) {
    var v = records.Find(id);
    if( v.IsNull() == true ) {
        return ctx.Error("Vector " + id + " not found.");
    }

    var results = {};
    records.AllBut(v).forEach(function(record){
        var similarity = v.Cosine(record);
        if( similarity >= threshold ) {
           results[record.ID] = similarity
        }
    });

    return results;
}
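
For reference, the v.Cosine(record) call in the oracle above boils down to a few lines of numpy (a standalone sketch, not part of the project):

import numpy as np

# Cosine similarity of two feature vectors - in [0, 1] here,
# since our feature vectors only contain non-negative values.
def cosine_similarity(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))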

But quite effective:

The ANN as a black box, and training

Meanwhile, our encoder should have finished its job and generated the dataset.csv file with all the labeled vectors extracted from each of the samples, which we can now use to train our model... but what does "training a model" actually mean? And what is this "model" in the first place?

The model we're using is a computational structure called an artificial neural network, which we're training using the Adam optimization algorithm. Online you'll find very detailed and formal definitions of both, but the bottom line is:

An ANN is a "box" containing hundreds of numerical parameters (the "weights" of the "neurons", organized in layers) that are multiplied with the inputs (our vectors) and combined to produce an output prediction. The training process consists in feeding the system the dataset, checking its predictions against the known labels, changing those parameters by small amounts, observing if and how those changes affect the model's accuracy, and repeating this process a given number of times (the epochs) until the overall performance reaches the minimum we require.

(Image credits: nature.com)

The main assumption is that a numerical correlation exists among the data points of our dataset - one that we don't know about, but that, if known, would allow us to divide the dataset into its output classes. What we do is ask this black box to ingest the dataset and approximate such a function by iteratively tweaking its internal parameters.

Inside the model.py file you'll find the definition of our ANN - a fully connected network with two hidden layers of 70 neurons each, ReLU as the activation function, and a 30% dropout rate during training:

from keras.models import Sequential
from keras.layers import Dense, Dropout

n_inputs = 486

# ... inside the model-building function generated by ergo:
return Sequential([
    Dense(70, input_shape=(n_inputs,), activation='relu'),
    Dropout(0.3),
    Dense(70, activation='relu'),
    Dropout(0.3),
    Dense(2, activation='softmax')
])
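
ergo takes care of compiling and fitting this network for us; conceptually, though, the training step boils down to something like the following Keras calls (a simplified sketch with made-up variable names, not ergo's actual code):

# Compile with the Adam optimizer mentioned above, then iterate over the dataset.
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, Y_train,
          validation_data=(X_val, Y_val),
          epochs=50, batch_size=64)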

We can now start the training process with:

ergo train /path/to/ergo-pe-av --dataset /path/to/dataset.csv

Depending on the total amount of vectors in the CSV file, this process might take from a few minutes, to hours, to days. In case you have GPUs on your machine, ergo will automatically use them instead of the CPU cores in order to speed the training up significantly (check out this post if you're curious about why).

Once done, you can inspect the model's performance statistics with:

ergo view /path/to/ergo-pe-av

This will show the training history, where we can verify that the model's accuracy indeed increased over time (in our case, it reached 97% accuracy around epoch 30), and the ROC curve, which tells us how effectively the model can distinguish between malicious or not (an AUC, or area under the curve, of 0.994 means the model is pretty good):

Moreover, the confusion matrices for each of the training, validation and test sets will be shown. The values on the diagonal from the top left (dark red) represent the number of correct predictions, while the other (pink) values are the wrong ones (our model has a 1.4% false positive rate on a test set of approximately 30,000 samples):

Considering how simple our feature extraction algorithm is, 97% accuracy on such a big dataset is a very interesting result. Many of the misdetections are caused by packers such as UPX (or even just self-extracting zip/msi archives) that affect some of the data points we're encoding - adding an unpacking strategy (such as emulating the unpacking stub until the real PE is in memory) and more features (a bigger entry point vector, dynamic analysis tracing the API being called... imagination is the limit!) is the key to reaching 99% :)

Conclusions

We can now remove the temporary files:

ergo clean /path/to/ergo-pe-av

load the model and use it as an API:

ergo serve /path/to/ergo-pe-av --classes "clean, malicious"

and ask for the classification of a file from a client:

curl -F "x=@/path/to/file.exe" "http://localhost:8080/"

You will receive a response like the following (the file being scanned here):

The model detected the sample as malicious with a confidence of over 99%.
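
If you'd rather script the scan than use curl, the same request can be made from Python (a hypothetical client sketch - the exact shape of the JSON response depends on the classes you passed to ergo serve):

import requests

# POST the PE file to the ergo API, same as the curl command above.
with open('/path/to/file.exe', 'rb') as fp:
    response = requests.post('http://localhost:8080/', files={'x': fp})

print(response.json())  # prediction for the 'clean' / 'malicious' classes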

You can now use the model to scan whatever you want - enjoy! :)
