Posted by: David Greenwood, Chief of Signal

In this post I will describe some of the problems I wanted to solve building file2stix, the research that went into it, and a few tips to get up and running with the tool.

More and more organisations are standardising the way they represent threat intelligence using the STIX 2.1 data model.

As a result, an increasing number of SIEMs, SOARs, TIPs, etc. have native STIX 2.1 support.

However, authoring STIX 2.1 content can be laborious. I have seen analysts manually copy and paste data from reports, blogs, emails, and other sources into STIX 2.1 Objects.

In many cases these Observables (IOCs) can be automatically detected in plain text using defined patterns.

For example, an IPv4 observable has a specific pattern that can be identified using regular expressions. This regular expression will match an IPv4 observable:
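A minimal sketch of such a pattern in Python, assuming a common dotted-quad expression that limits each octet to 0-255. It is not necessarily the exact expression from the Cookbook, and the names `IPV4_RE` and `find_ipv4` are illustrative only.

```python
import re

# One octet, limited to the range 0-255.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"

# Four octets separated by dots; \b keeps the match from starting or
# ending in the middle of a longer run of digits.
IPV4_RE = re.compile(rf"\b{OCTET}(?:\.{OCTET}){{3}}\b")


def find_ipv4(text: str) -> list[str]:
    """Return every IPv4-looking string in text, in order of appearance."""
    return IPV4_RE.findall(text)
```

Calling `find_ipv4` on a sentence that mentions `192.168.1.10` returns that address, while an out-of-range value such as `999.1.1.1` is skipped.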


Similarly, the following regular expression will capture URLs:
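As a rough illustration in Python, here is a deliberately simple URL pattern. It assumes only `http` and `https` schemes and is not necessarily the Cookbook's expression; `URL_RE` and `find_urls` are hypothetical names.

```python
import re

# Scheme, "://", then a run of characters that cannot appear unescaped
# in a URL embedded in prose. The final character class stops trailing
# sentence punctuation (".", ",", ")" and so on) from being captured.
URL_RE = re.compile(r'\bhttps?://[^\s<>"]+[^\s<>".,;:!?)\]]', re.IGNORECASE)


def find_urls(text: str) -> list[str]:
    """Return every http(s) URL found in text, in order of appearance."""
    return URL_RE.findall(text)
```

A stricter pattern would validate the host and path separately; this sketch trades precision for readability.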


Both of these examples (here and here, respectively) are taken from the brilliant Regular Expressions Cookbook (2nd edition) by Jan Goyvaerts and Steven Levithan.

Now this isn’t rocket science, and indeed there are already quite a few open source tools that contain regular expressions for extracting Observables in this way:

  • IoC extractor: An npm package for extracting common IoC (Indicator of Compromise)
  • IOC Finder: Simple, effective, and modular package for parsing Observables (indicators of compromise (IOCs), network data, and other security-related information) from text.
  • cacador: Indicator Extractor
  • iocextract: Defanged Indicator of Compromise (IOC) Extractor.
  • Cyobstract: A tool to extract structured cyber information from incident reports.

However, only one of them, IoC extractor (ninoseki/ioc-extractor), supports STIX output, and it is somewhat limited in the STIX objects it supports.

Introducing file2stix

file2stix aims to take the good parts of all of these products and build them into a single command-line tool that:

  1. takes a file input,
  2. parses out observables using customisable regular expressions, and
  3. creates STIX 2.1 Objects (and STIX Bundles from all Objects in a report) from these extractions
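The three steps above can be sketched in a few lines of Python. This is an illustration of the idea under stated assumptions, not file2stix's actual implementation: it handles IPv4 addresses only, builds plain dictionaries rather than using a STIX library, and all function names are hypothetical.

```python
import re
import uuid
from pathlib import Path

OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
IPV4_RE = re.compile(rf"\b{OCTET}(?:\.{OCTET}){{3}}\b")


def extract_ipv4(text: str) -> list[str]:
    # dict.fromkeys de-duplicates while keeping first-seen order.
    return list(dict.fromkeys(IPV4_RE.findall(text)))


def to_ipv4_object(value: str) -> dict:
    # Simplification: a random UUIDv4 is used for the id. The STIX 2.1
    # spec recommends deterministic UUIDv5 ids for cyber-observables.
    return {
        "type": "ipv4-addr",
        "spec_version": "2.1",
        "id": f"ipv4-addr--{uuid.uuid4()}",
        "value": value,
    }


def text_to_bundle(text: str) -> dict:
    # Step 2 (extract) and step 3 (create objects and wrap in a bundle).
    objects = [to_ipv4_object(v) for v in extract_ipv4(text)]
    return {"type": "bundle", "id": f"bundle--{uuid.uuid4()}", "objects": objects}


def file_to_bundle(path: str) -> dict:
    # Step 1: take a file input.
    return text_to_bundle(Path(path).read_text(encoding="utf-8"))
```

Passing the result of `file_to_bundle` to `json.dumps` gives a document that can be loaded into tooling that accepts STIX 2.1 bundles.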

file2stix offers 3 modes:

  1. analysis (default): used during research to create STIX 2.1 Objects from general threat research
  2. observed: same as analysis mode, but also allows you to keep track of how many times an extraction is observed in a report
  3. sighting: used to denote that extractions from a report represent real instances of an observable being seen in your environment (generally used for log file inputs)
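To make the idea behind observed mode concrete, the sketch below counts how often each extraction appears and records the count in a STIX 2.1 Observed Data object via its `number_observed` property. How file2stix itself represents these counts is not described here, so treat the mapping and the function names as assumptions.

```python
import uuid
from collections import Counter
from datetime import datetime, timezone


def stix_timestamp(dt: datetime) -> str:
    # STIX timestamps are UTC, formatted here with millisecond precision.
    # dt is expected to be timezone-aware.
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"


def count_extractions(values: list[str]) -> Counter:
    # Number of times each extracted value appears in the report.
    return Counter(values)


def observed_data(sco_id: str, count: int, seen_at: datetime) -> dict:
    # Assumption: the whole report is treated as a single observation
    # window, so first_observed and last_observed are the same instant.
    ts = stix_timestamp(seen_at)
    return {
        "type": "observed-data",
        "spec_version": "2.1",
        "id": f"observed-data--{uuid.uuid4()}",
        "created": ts,
        "modified": ts,
        "first_observed": ts,
        "last_observed": ts,
        "number_observed": count,
        "object_refs": [sco_id],
    }
```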

Out of the box, file2stix supports over 30 unique observable extraction regular expressions, including IP addresses, cryptocurrency, and MITRE ATT&CK STIX 2.1 Objects.

For MITRE knowledge bases, the following ATT&CK data types from the Enterprise, Mobile and ICS matrices are supported:

  • Techniques (attack-pattern)
  • Sub-Technique (attack-pattern)
  • Tactic (x-mitre-tactic)
  • Course of Action (course-of-action)
  • Intrusion Set (intrusion-set)
  • Malware (malware)
  • Tool (tool)
  • Data Sources (x-mitre-data-source)

And for CAPEC:

  • CAPEC (attack-pattern)
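One way to spot references like these in text is to match on their published ID formats (for example `T1059`, `T1059.001`, `TA0002`, `CAPEC-98`). The sketch below assumes ID-based matching; whether file2stix matches on IDs, names, or both is not stated here, and the names used are hypothetical.

```python
import re

# (pattern, STIX object type, human-readable kind)
PATTERNS = [
    # Sub-technique: T + 4 digits + "." + 3 digits.
    (re.compile(r"\bT[0-9]{4}\.[0-9]{3}\b"), "attack-pattern", "sub-technique"),
    # Technique: T + 4 digits, not followed by a sub-technique suffix.
    (re.compile(r"\bT[0-9]{4}\b(?!\.[0-9])"), "attack-pattern", "technique"),
    # Tactic: TA + 4 digits.
    (re.compile(r"\bTA[0-9]{4}\b"), "x-mitre-tactic", "tactic"),
    # CAPEC entry: CAPEC- + number.
    (re.compile(r"\bCAPEC-[0-9]+\b"), "attack-pattern", "capec"),
]


def find_knowledge_base_ids(text: str) -> list[tuple[str, str, str]]:
    """Return (id, stix_type, kind) tuples in order of appearance."""
    hits = []
    for pattern, stix_type, kind in PATTERNS:
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group(), stix_type, kind))
    hits.sort()
    return [(ext_id, stix_type, kind) for _, ext_id, stix_type, kind in hits]
```

Each ID found this way could then be looked up in the corresponding MITRE STIX dataset to retrieve the full object.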

You can also add your own custom extractions and map them to STIX 2.1 Objects using keyword matches.

One of the key features of file2stix is the ability to assign a TLP (Traffic Light Protocol) level to extracted objects so that the resulting objects are shared correctly. file2stix uses the default STIX 2.1 TLP marking definitions to do this.
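In STIX 2.1, a TLP level is attached by referencing one of the predefined marking definitions in an object's `object_marking_refs` property. A minimal sketch, using the marking-definition IDs published in the STIX 2.1 specification (the helper name `apply_tlp` is hypothetical and not part of file2stix):

```python
# Predefined TLP marking-definition ids from the STIX 2.1 specification.
TLP_MARKINGS = {
    "white": "marking-definition--613f2e26-407d-48c7-9eca-b8e91df99dc9",
    "green": "marking-definition--34098fce-860f-48ae-8e50-ebd3cc5e41da",
    "amber": "marking-definition--f88d31f6-486f-44da-b317-01333bde0b82",
    "red": "marking-definition--5e57c739-391a-4eb3-b6be-7d15ca92d5ed",
}


def apply_tlp(objects: list[dict], level: str) -> list[dict]:
    """Return copies of objects marked with the given TLP level.

    Note: this replaces any marking references already on the object.
    """
    marking_id = TLP_MARKINGS[level.lower()]
    return [{**obj, "object_marking_refs": [marking_id]} for obj in objects]
```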

In many cases, input reports contain defanged data (e.g. 1[.]1[.]1[.]1). Defanging obfuscates indicators into a safer representation so that a user reading a report does not accidentally click on a malicious URL or inadvertently run malicious code. Unfortunately, there is no universal standard for defanging, although there are some common methods. file2stix does not convert (and thus extract) defanged data by default, but can be instructed to do so.
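Converting obfuscated indicators such as 1[.]1[.]1[.]1 back to their original form before extraction can be sketched as a handful of substitutions. The conventions covered below (`[.]`, `(.)`, `[dot]`, `[:]`, `hxxp`) are common ones chosen for illustration; they are an assumption, not the list file2stix uses.

```python
import re

# Applied in order: the "[:]" rule must run before the "hxxp" rule so
# that "hxxps[:]//" becomes "hxxps://" and is then recognised.
_RULES = [
    (re.compile(r"\[\.\]|\(\.\)|\[dot\]", re.IGNORECASE), "."),
    (re.compile(r"\[:\]"), ":"),
    (re.compile(r"\bhxxp(s?)://", re.IGNORECASE), r"http\1://"),
]


def restore_indicators(text: str) -> str:
    """Undo a few common obfuscation conventions in text."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text
```

Running the extraction patterns over the output of `restore_indicators` rather than the raw text lets the same regular expressions match both plain and obfuscated indicators.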

It’s also possible to assign a confidence score to extracted data to convey the confidence in the report’s findings. The STIX 2.1 confidence property is used to do this.
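The STIX 2.1 `confidence` property is an integer from 0 to 100. A small sketch of attaching it to a STIX Domain Object represented as a dictionary (the helper `with_confidence` is hypothetical; how file2stix accepts the score is not shown here):

```python
def with_confidence(obj: dict, score: int) -> dict:
    """Return a copy of obj with a validated STIX confidence value."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 100:
        raise ValueError("STIX confidence must be an integer from 0 to 100")
    return {**obj, "confidence": score}
```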

What file2stix is not

file2stix was designed to remove tedious data entry from an intel analyst’s workload, freeing them up to put their skills to work.

This approach isn’t perfect; it will often extract benign values (false positives). To counter this, file2stix allows for the use of MISP Warning Lists and custom warning lists to flag erroneous extractions.
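The idea of checking extractions against warning lists can be sketched as follows. The sample lists loosely mirror the shape of a MISP warning list (a `type` plus a `list` of entries), but their contents are made up for illustration, only the `cidr` and `string` types are handled, and the function name is hypothetical.

```python
import ipaddress

# Illustrative data only, not real MISP warning list contents.
SAMPLE_LISTS = [
    {"name": "Example public DNS resolvers", "type": "cidr",
     "list": ["8.8.8.0/24", "1.1.1.1/32"]},
    {"name": "Example popular domains", "type": "string",
     "list": ["google.com", "microsoft.com"]},
]


def matching_lists(value: str, warning_lists: list[dict]) -> list[str]:
    """Return the names of the warning lists that contain value."""
    hits = []
    for wl in warning_lists:
        if wl["type"] == "cidr":
            try:
                addr = ipaddress.ip_address(value)
            except ValueError:
                continue  # value is not an IP address
            if any(addr in ipaddress.ip_network(net, strict=False) for net in wl["list"]):
                hits.append(wl["name"])
        elif wl["type"] == "string":
            if value.lower() in (entry.lower() for entry in wl["list"]):
                hits.append(wl["name"])
    return hits
```

An extraction with a non-empty result is a candidate false positive that an analyst can review rather than something to discard automatically.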

All extractions in file2stix are based on regular expressions. This works well for pattern matching, but not at all for understanding the meaning of the text.

There are a couple of open-source tools out there that solve this problem; MITRE TRAM is a good example. TRAM takes a report as an input and uses NLP to identify the ATT&CK Tactics and Techniques being discussed.

Similarly, file2stix is not smart enough to understand complex relationships between more than one object. All extractions in file2stix have single relationships back to the original report (vs. relationships between indicator and malware objects, for example).

In short, file2stix is not designed to generate threat intelligence ready for dissemination. It does, however, generate threat intelligence that can be reviewed and worked with in other tooling.

Try file2stix now

Hopefully this post has given you a little insight into why I built file2stix and how you can use it.

To really understand the power of file2stix, take a look at the user documentation here.

file2stix is available to download on GitHub here.

I hope you find it useful, and I am always very happy to receive feedback, either via GitHub issues or directly via our Slack community.
