Posted by:

David Greenwood, Chief of Signal


In this post I will describe some of the problems I wanted to solve building file2stix, the research that went into it, and jump into the basics of how intelligence reports are ingested.

More and more organisations are standardising the way they represent threat intelligence using the STIX 2.1 data model. Many of them already use our existing cve2stix API for CVE data represented as STIX 2.1 Objects.

As a result, an increasing number of SIEMs, SOARs, TIPs, etc. have native STIX 2.1 support.

However, authoring STIX 2.1 content can be laborious. I have seen analysts manually copy and paste data from reports, blogs, emails, and other sources into STIX 2.1 Objects.

In many cases these Observables (IOCs) can be automatically detected in plain text using defined patterns.

For example, an IPv4 observable has a specific pattern that can be identified using regular expressions. This regular expression will match an IPv4 observable;

^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$

Similarly, the following regular expression will capture URLs;

^(https?|ftp|file)://.+$

Both of these examples (here and here, respectively) are taken from the brilliant Regular Expressions Cookbook (2nd edition) by Jan Goyvaerts and Steven Levithan.
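
To make the idea concrete, here is a minimal sketch of how these two patterns could be applied to text. The function name and the token-by-token approach are illustrative only; because both patterns are anchored with ^ and $, they match whole strings, so each whitespace-separated token is tested individually:

```python
import re

# The two example patterns from the Regular Expressions Cookbook.
IPV4_RE = re.compile(r"^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$")
URL_RE = re.compile(r"^(https?|ftp|file)://.+$")

def extract_observables(text):
    """Test each whitespace-separated token against the patterns.

    The ^/$ anchors mean the patterns match whole strings, so we
    check token-by-token rather than scanning the full text.
    """
    found = {"ipv4": [], "url": []}
    for token in text.split():
        if IPV4_RE.match(token):
            found["ipv4"].append(token)
        elif URL_RE.match(token):
            found["url"].append(token)
    return found

print(extract_observables("C2 at 198.51.100.7 served https://example.com/payload"))
# {'ipv4': ['198.51.100.7'], 'url': ['https://example.com/payload']}
```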

Now this isn’t rocket science, and indeed there are already quite a few open source tools that contain regular expressions for extracting Observables in this way;

  • IoC extractor: An npm package for extracting common IoC (Indicator of Compromise)
  • IOC Finder: Simple, effective, and modular package for parsing Observables (indicators of compromise (IOCs), network data, and other, security related information) from text.
  • cacador: Indicator Extractor
  • iocextract: Defanged Indicator of Compromise (IOC) Extractor.
  • Cyobstract: A tool to extract structured cyber information from incident reports.

However, only one of them, ninoseki/ioc-extractor, supports STIX output, and it is somewhat limited in the STIX Objects it supports.

The idea for file2stix was born

I wanted a tool that takes the good parts of all of these products and builds them into a single API-based product that;

  1. takes a file input,
  2. parses out observables using customisable regular expressions,
  3. creates STIX 2.1 Objects (and STIX Bundles from all Objects in a report) from these extractions,
  4. allows users to retrieve these STIX 2.1 Objects via an API interface.

The main aim is to remove the tedious data entry from Intel Analysts' workloads, freeing them up to put their skills to work.

Some example use-cases file2stix will solve for include:

  • Automatically converting IoC feeds to STIX format
  • Identifying observables in webpages to ingest into downstream tools
  • Quickly identifying MITRE ATT&CK and MITRE CAPEC context from reports
  • Extracting detection rules from text (YARA and SIGMA)

file2stix is designed to generate threat intelligence that can be reviewed and worked on inside other tooling, like a threat intelligence platform.

Equally important is what file2stix is not (at least for the MVP):

  • Semantically aware: most extractions in file2stix are based on regular expressions. These work well for pattern matching, but not at all when dealing with logical semantics.
  • Able to detect complex relationships between more than one object. All extractions in file2stix have a single relationship back to the original report (vs. relationships between Indicator -> Malware -> Infrastructure objects, for example).

The logic of file2stix can be broken down into three parts for explanation over the following series of blog posts;

  1. Inputs (this post): the types of files file2stix will accept, and how they will be pre-processed
  2. Extractions: how data will be extracted from inputs and transformed into outputs (STIX 2.1 objects)
  3. Retrieval: how users obtain the created STIX 2.1 Objects for their input via an API.

In a fourth and final post I will also explain how we host the web version of file2stix.

To give a high-level overview of how file2stix works:

  1. User inputs a supported filetype (and sets configuration options)
  2. file2stix validates the input
  3. If checks pass, file2stix converts (if required) the input to a text file
  4. The text file is compared against regular expressions/lookup files to identify observables
  5. STIX 2.1 Objects are created for each observable identified in the text (other STIX 2.1 Objects are also created to represent the user input)
  6. A STIX 2.1 Bundle is written containing all Objects for the inputted report

I’ll explain each step in much more detail throughout this series of blog posts, but let’s start with inputs…

file2stix Inputs

You can upload a range of filetypes to file2stix.

Some filetypes are pre-processed to create a plaintext file before any data held inside them is reviewed for extraction as follows:

  • Plain text (.txt): Plain text files are not processed. All content in a .txt file is considered.
  • CSV (.csv): CSV files are not processed. All content in a .csv file is considered.
  • HTML (.html): All HTML tags are stripped from the parsed text before extractions are run. For example, in <a href="URL_INSIDE_HTML_TAG">PRINTED_URL</a>, only PRINTED_URL would remain for extraction pattern matching. Note, HTML can get very messy. We generally recommend using an HTML-to-PDF tool (e.g. printfriendly or similar) and uploading the page as a PDF (or whatever less messy file structure you convert to) for best results.
  • Markdown (.md, .markdown): Markdown can contain HTML. If HTML elements are detected, these are stripped (in the same way as for HTML inputs), and only content between HTML tags is considered. For example, in <a href="URL_INSIDE_HTML_TAG">PRINTED_URL</a>, only PRINTED_URL would remain for extraction pattern matching. Content that is not HTML is not pre-processed.
  • PDF (.pdf): Only printed text in a PDF is considered for matching.
  • XML (.xml): All XML tags are stripped from the parsed text before extractions are run. For example, in <url = "URL_INSIDE_XML_TAG">PRINTED_URL</url>, only PRINTED_URL would remain for extraction pattern matching.
  • JSON (.json): Only values are considered for JSON inputs. For example, in {"1.1.1.1": "SOME_IP"}, only SOME_IP would remain for matching.
  • Microsoft Word (.doc, .docx): Only printed text in a Word doc is considered for matching.
  • Microsoft Excel (.xls, .xlsx): Only printed text in an Excel doc is considered for matching (formulas and scripts are ignored).
  • YAML (.yml, .yaml): YAML files are not processed. All content in a YAML file is considered.
  • YARA (.yar, .yara): All content in YARA filetypes is considered. However, this input is designed for importing a single YARA rule only.
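
As a sketch of the JSON behaviour described above (values kept, keys dropped), assuming a simple recursive walk rather than whatever file2stix actually implements:

```python
import json

def json_values(obj):
    """Recursively collect only the values from parsed JSON,
    dropping all keys, as described for .json inputs above."""
    if isinstance(obj, dict):
        return [v for value in obj.values() for v in json_values(value)]
    if isinstance(obj, list):
        return [v for item in obj for v in json_values(item)]
    return [str(obj)]

print(json_values(json.loads('{"1.1.1.1": "SOME_IP"}')))  # ['SOME_IP']
```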

Pre-processing creates a text file that is then considered for extraction. There are many existing Python libraries that convert rich files into plain text, for example pdfplumber for turning PDFs into text files.
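
For HTML inputs, the tag-stripping step can be sketched with Python's standard-library html.parser (file2stix may well use a different library; this is only to illustrate the behaviour):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Keep only the text between tags; attributes (like href) are dropped."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(raw):
    stripper = TagStripper()
    stripper.feed(raw)
    return "".join(stripper.chunks)

print(strip_html('<a href="URL_INSIDE_HTML_TAG">PRINTED_URL</a>'))  # PRINTED_URL
```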

A note on images

Currently file2stix does not process any images found inside inputted documents, although this is clearly an improvement that could be made.

I’ve noticed that many blogs implement text content in image format, for example, tables inserted as image files. My assumption is that this is done to protect against exactly what I’m trying to do with file2stix – extract the content for use elsewhere.

There are lots of open-source OCR tools available that work fairly well for paragraph text; however, issues start to arise when data is formatted in other structured ways (e.g. tabular content).

That said, even this data is fairly easy to parse. The bottom line, though, is that many of these tools need a bit more refinement to ensure consistency, hence I decided to keep this feature out of scope for the minimum viable product.

A note on security

At this point it’s important to also consider the security implications of allowing file uploads. file2stix uses a range of security controls recommended by OPSWAT to reduce the security risks;

  • Only allows specified file types: file2stix only allows the supported filetypes described above
  • Verifies file types: checks that the filetype is actually correct and is not being masked
  • Sets a maximum name length and maximum file size: users can modify these in the file2stix configuration file, but there are default limits that will be accepted
  • Randomises uploaded file names: file2stix stores the original upload and the plaintext file (if one is created) using randomised filenames (so they cannot easily be remotely executed)
  • Stores uploaded files outside the web root folder: users can choose to store files remotely from the server (currently S3 is supported in file2stix)

Input file process: Defanging

Fanging obfuscates indicators into safer representations so that a user reading a report does not accidentally click on a malicious URL or inadvertently run malicious code. Many cyber threat intelligence reports shared electronically employ fanging.

Typical types of fanged Observables include IPv4 addresses (e.g. 1.1.1[.]1), IPv6 addresses (e.g. 2001:0db8:85a3:0000:0000:8a2e:0370[:]7334), domain names (e.g. example[.]com), URLs (e.g. https[:]//example.com/research/index.html), email addresses (e.g. example[@]example.com), file extensions (e.g. malicious[.]exe), and directory paths (e.g. [C:]\\Windows\\System32).

Unfortunately, there is no universal standard for fanging, although there are some common methods. Some samples of fanging I have observed include the following:

  • Wrapping one or more special characters in [ ]
    • e.g. www[.]example[.]com
    • e.g. http[:]//example.com
    • e.g. http[://]example.com
    • e.g. 1.1.1.1[/]24
  • Wrapping one or more special characters in { }
  • Wrapping one or more special characters in ( )
  • Prefixing one or more special characters with [
    • e.g. www[.example[.com
    • e.g. http[://example.com
  • Prefixing one or more special characters with \
  • Replacing http and hxxp
    • e.g. hxxps://google.com
  • Replacing . with ` dot `
  • Replacing . with [dot] (or (dot), or {dot})
  • Replacing @ with ` at `
    • e.g. example at example.com
  • Replacing @ with [at] (or (at), or {at})
    • e.g. example[at]example.com

A combination of the above techniques is also commonly implemented. For example, replacing . with ` dot ` and replacing @ with ` at ` for an email like so; fanged = example at example dot com, defanged = example@example.com

Another example uses even more fanging technique combinations for a URL; fanged = hxxps[:]//test\.example[.)com[/]path, defanged = https://test.example.com/path

As a result of this complexity, file2stix does not defang the data by default. A user can explicitly set file2stix to defang data. Technically, file2stix achieves this in a simple way – it uses find and replace in the following order:

  • replaces the text {dot} with [dot]
  • replaces the text [dot] with [.]
  • replaces the text {at} with [at]
  • replaces the text [at] with [@]
  • removes the square bracket characters ([ and ])
    • with one exception; for ipv6 and port observables, the representation is slightly different. An IPv6 and port is usually denoted as follows; [2001:0db8:85a3:0000:0000:8a2e:0370:7334]:80. As such, when an ipv6 is detected in this format, the removal of the square brackets is not carried out for the ipv6.

When set to defang mode, all STIX SCO Objects (covered later) will contain the property "defanged": true. If not set, the defanged property will not be included.

For example;

{
    "type": "email-addr",
    "spec_version": "2.1",
    "id": "email-addr--0e81edfb-25a5-5cb0-8cb0-fe9c5efb6257",
    "value": "example@example.com",
    "defanged": true
}
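
If you want to generate such an SCO yourself, a minimal sketch looks like this. It uses the STIX 2.1 rule that SCO ids are UUIDv5 values computed over the JSON-serialised id-contributing properties (just value for email-addr) under the OASIS-defined namespace. The helper name is illustrative, and the serialisation is a simplified json.dumps rather than the full JSON canonicalisation the spec calls for:

```python
import json
import uuid

# OASIS-defined namespace for deterministic STIX 2.1 SCO ids.
SCO_NAMESPACE = uuid.UUID("00abedb4-aa42-466c-9c01-fed23315a9b7")

def email_addr_sco(value, defanged=False):
    # For email-addr, the only id-contributing property is "value".
    contributing = json.dumps({"value": value}, separators=(",", ":"))
    sco = {
        "type": "email-addr",
        "spec_version": "2.1",
        "id": f"email-addr--{uuid.uuid5(SCO_NAMESPACE, contributing)}",
        "value": value,
    }
    if defanged:  # only added when defang mode is set, as described above
        sco["defanged"] = True
    return sco

print(json.dumps(email_addr_sco("example@example.com", defanged=True), indent=4))
```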

file2stix Extraction

file2stix extracts observables from text and translates them into STIX 2.1 Objects.

There are two main extraction types performed;

  1. Default extractions
    • Regular expressions
    • Lookups
  2. Custom extractions
    • Lookups

I’ll explain more about these in the next post. First, however, I will walk through the generic STIX 2.1 Objects created for all inputted files (or in file2stix terms, reports)…

Identities (identity) SDOs

file2stix assigns a created_by_ref property to all SDOs and SROs.

By default, a file2stix Identity is used.

{
    "type": "identity",
    "spec_version": "2.1",
    "id": "identity--acf55024-6bbe-486f-a27a-7967559324f4",
    "created_by_ref": "identity--a99a2297-3044-4011-9a7e-2ff15e056b65",
    "created": "2022-01-01T00:00:00.000Z",
    "modified": "2022-01-01T00:00:00.000Z",
    "name": "file2stix",
    "description": "https://github.com/signalscorps/file2stix/",
    "identity_class": "organization",
    "sectors": [
        "technology"
    ],
    "contact_information": "https://www.signalscorps.com/contact/",
    "object_marking_refs": [
        "marking-definition--613f2e26-407d-48c7-9eca-b8e91df99dc9"
    ]
}

Report SDOs (report)

All individual data sources ingested or uploaded are represented as a unique STIX Report SDO that takes the following structure;

{
    "type": "report",
    "spec_version": "2.1",
    "id": "report--<GENERATED BY STIX2 LIBRARY>",
    "created_by_ref": "identity--<SIGNALS CORPS IDENTITY ID>",
    "created": "<ITEM INGEST DATE>",
    "modified": "<ITEM INGEST DATE>",
    "name": "File converted: <FILENAME>",
    "published": "<ITEM INGEST DATE>",
    "report_types": ["threat-report"],
    "object_marking_refs": [
        "marking-definition--<TLP LEVEL SET>"
    ],
    "labels": [
        "<LABELS ADDED BY USER>"
    ],
    "object_refs": ["<LIST OF ALL EXTRACTED OBJECTS>"],
    "external_references": [
        {
            "source_name": "file2stix",
            "external_id": "report--<REPORT OBJECT ID>",
            "description": "This object was created using file2stix from the Signals Corps for report--<REPORT OBJECT ID>, filename <FILENAME>.",
            "url": "https://<HOST>/reports/report--<REPORT OBJECT ID>"
        }
    ]
}

Note, object_refs contains references to all objects in the report (SDOs, SROs, and SCOs). This includes extracted objects (i.e. Indicator SDOs, Vulnerability SDOs, Software SCOs, Relationship SROs, etc.).

Relationship SROs (relationship)

There are two types of extractions in file2stix; default and custom (linked to extraction types briefly mentioned earlier).

Default extractions (Report and SDO)

In the case of default extractions, a Relationship between the extracted Object and the Report SDO is created with the relationship_type equal to default-extract-from.

The created and modified dates match those in the linked Report Object.

Here is the structure of the SRO for default extractions;

{
    "type": "relationship",
    "spec_version": "2.1",
    "id": "relationship--<GENERATED BY STIX2 LIBRARY>",
    "created_by_ref": "identity--<SIGNALS CORPS IDENTITY ID>",
    "created": "<REPORT CREATED DATE>",
    "modified": "<REPORT CREATED DATE>",
    "relationship_type": "default-extract-from",
    "source_ref": "<EXTRACTED STIX OBSERVABLE ID>",
    "target_ref": "report--<REPORT OBJECT>",
    "object_marking_refs": [
        "marking-definition--<TLP LEVEL SET>"
    ],
    "external_references": [
        {
            "source_name": "file2stix",
            "external_id": "report--<REPORT OBJECT ID>",
            "description": "This object was created using file2stix from the Signals Corps for report--<REPORT OBJECT ID>, filename <FILENAME>.",
            "url": "https://<HOST>/reports/report--<REPORT OBJECT ID>"
        }
    ]
}

Custom extractions (Report and SDO)

Custom extractions have slightly different Relationship Objects created, where the relationship_type is equal to custom-extract-from, as follows;

{
    "type": "relationship",
    "spec_version": "2.1",
    "id": "relationship--<GENERATED BY STIX2 LIBRARY>",
    "created_by_ref": "identity--<SIGNALS CORPS IDENTITY ID>",
    "created": "<REPORT CREATED DATE>",
    "modified": "<REPORT CREATED DATE>",
    "relationship_type": "custom-extract-from",
    "source_ref": "<EXTRACTED STIX OBSERVABLE ID>",
    "target_ref": "report--<REPORT OBJECT>",
    "object_marking_refs": [
        "marking-definition--<TLP LEVEL SET>"
    ],
    "external_references": [
        {
            "source_name": "file2stix",
            "external_id": "report--<REPORT OBJECT ID>",
            "description": "This object was created using file2stix from the Signals Corps for report--<REPORT OBJECT ID>, filename <FILENAME>.",
            "url": "https://<HOST>/reports/report--<REPORT OBJECT ID>"
        }
    ]
}

SCO and SDO relationships

Many extractions that create Indicator Objects also create one or more SCOs. These are joined like so;

{
    "type": "relationship",
    "spec_version": "2.1",
    "id": "relationship--<GENERATED BY STIX2 LIBRARY>",
    "created_by_ref": "identity--<SIGNALS CORPS IDENTITY ID>",
    "created": "<REPORT CREATED DATE>",
    "modified": "<REPORT CREATED DATE>",
    "relationship_type": "pattern-contains",
    "source_ref": "indicator--<EXTRACTED STIX INDICATOR ID>",
    "target_ref": "<EXTRACTED STIX SCO ID>",
    "object_marking_refs": [
        "marking-definition--<TLP LEVEL SET>"
    ],
    "external_references": [
        {
            "source_name": "file2stix",
            "external_id": "report--<REPORT OBJECT ID>",
            "description": "This object was created using file2stix from the Signals Corps for report--<REPORT OBJECT ID>, filename <FILENAME>.",
            "url": "https://<HOST>/reports/report--<REPORT OBJECT ID>"
        }
    ]
}

A note on TLPs

The default TLP level assigned to all objects created using file2stix is TLP:WHITE. This will mark all objects generated using file2stix (except SCOs, the file2stix Identity SDO, and any imported ATT&CK or CAPEC STIX objects) with the corresponding STIX marking definition (marking-definition--613f2e26-407d-48c7-9eca-b8e91df99dc9).

file2stix also allows users to specify TLP:WHITE, TLP:GREEN, TLP:AMBER, or TLP:RED when adding Reports. Setting a TLP level manually will change the object_marking_refs for SDOs and SROs as follows;

  • TLP:WHITE (marking-definition--613f2e26-407d-48c7-9eca-b8e91df99dc9)
  • TLP:GREEN (marking-definition--34098fce-860f-48ae-8e50-ebd3cc5e41da)
  • TLP:AMBER (marking-definition--f88d31f6-486f-44da-b317-01333bde0b82)
  • TLP:RED (marking-definition--5e57c739-391a-4eb3-b6be-7d15ca92d5ed)
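
These mappings are easy to capture in a lookup table. A minimal sketch (the ids are the standard STIX 2.1 TLP marking-definitions listed above; the helper name is illustrative):

```python
# Standard STIX 2.1 TLP marking-definition ids, keyed by TLP level.
TLP_MARKINGS = {
    "TLP:WHITE": "marking-definition--613f2e26-407d-48c7-9eca-b8e91df99dc9",
    "TLP:GREEN": "marking-definition--34098fce-860f-48ae-8e50-ebd3cc5e41da",
    "TLP:AMBER": "marking-definition--f88d31f6-486f-44da-b317-01333bde0b82",
    "TLP:RED": "marking-definition--5e57c739-391a-4eb3-b6be-7d15ca92d5ed",
}

def object_marking_refs(tlp="TLP:WHITE"):
    """Return the object_marking_refs list for a given TLP level."""
    return [TLP_MARKINGS[tlp]]
```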

A note on Confidence scoring

file2stix also optionally allows users to add the confidence property to the original Report SDO created. Users can set a confidence between 1 (lowest) and 100 (highest).

Consult the STIX 2.1 Specification for more information if you’re unfamiliar with confidence scoring.

A note on labels

file2stix also optionally allows users to add labels (aka tags) to the original Report SDO created. Users can add labels of up to 30 characters using the characters a-z, 1-9, _, and -. Labels are always converted to lowercase, so label inputs can be considered case-insensitive.
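
A sketch of that label normalisation and validation, mirroring the rules as stated above (the function name is illustrative, not the file2stix implementation):

```python
import re

# Allowed characters and length, mirroring the rules stated above.
LABEL_RE = re.compile(r"^[a-z1-9_-]{1,30}$")

def normalise_label(label):
    """Lowercase a label, then validate it against the documented rules."""
    label = label.lower()
    if not LABEL_RE.match(label):
        raise ValueError(f"invalid label: {label!r}")
    return label

print(normalise_label("Malware_Report-1"))  # malware_report-1
```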

Up next: Extractions

Now that you’re familiar with the general concepts of which file types can be used with file2stix and the basics of how they’re processed, in the next post I will jump into the deep end with extractions and extraction logic.



