Posted by: David Greenwood, Chief of Signal

In this post I will show you how Signals Corps products use ArangoDB to store STIX 2.1 Objects.

As I demonstrated in our STIX 2.1 tutorial, STIX is a very comprehensive way to represent cyber threat intelligence. Not only does it offer a lot of options to model data, it is also a widely used standard.

However, both of these factors bring downsides. Being a defined standard means I cannot be too creative (to ensure maximum downstream compatibility). Similarly, the flexibility to model relationships and connect STIX Objects can make for a complex graph of information.

When it comes to storing and retrieving STIX 2.1 data in an easy and efficient manner, there are a number of considerations when selecting a database to use.

In building Signals Corps products (which are all built around the STIX 2.1 specification), I reviewed various relational databases, document-oriented databases, graph databases and hybrid graph/document databases.

I finally settled on ArangoDB. ArangoDB is a native multi-model database system supporting three data models: key/value, documents and graphs.

The team at Sekoia went through a similar process and have done a great job detailing their decision to use ArangoDB to store STIX Objects too.

In this post, I will show you how to get started storing and retrieving STIX 2.1 Objects using ArangoDB.

You can download ArangoDB here.

For this tutorial, I will rely heavily on the ArangoDB Web UI to demonstrate the concepts.

Storing STIX 2.1 Objects

Firstly, I need to create two Collections;

  1. A Document Collection stix_objects to store the SDOs and SCOs. Document Collections are used to store vertex documents.
  2. An Edge Collection stix_relationships to store the SROs and embedded relationships (e.g. object_refs). Edges are special documents used for connecting other documents into a graph. An edge describes the connection between two documents using the internal attributes: _from and _to.

Create STIX Collection in ArangoDB

STIX Collections in ArangoDB

Using these two collections I will also need a Graph, stix_graph:

  • stix_relationships as the Edge Definition (an edge definition defines a relation of the graph)
  • stix_objects for both the fromCollections (Collections that contain the start vertices of the relation) and toCollections (Collections that contain the end vertices of the relation)

STIX Graph in ArangoDB
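If you would rather script this setup than click through the Web UI, the same two Collections and the Graph can be described as request bodies for ArangoDB's HTTP API. The sketch below only builds the JSON payloads and does not send them. The helper names are my own, and the endpoints mentioned in the comments should be checked against the HTTP API documentation for your ArangoDB version.

```python
import json

# Collection type codes used by ArangoDB's HTTP API: 2 = document, 3 = edge.
DOCUMENT_COLLECTION = 2
EDGE_COLLECTION = 3


def collection_payload(name, is_edge=False):
    """Request body for creating a collection (POST /_api/collection)."""
    return {"name": name, "type": EDGE_COLLECTION if is_edge else DOCUMENT_COLLECTION}


def graph_payload(name, edge_collection, vertex_collection):
    """Request body for creating a named graph (POST /_api/gharial)."""
    return {
        "name": name,
        "edgeDefinitions": [
            {
                # The Edge Collection that holds the relation.
                "collection": edge_collection,
                # Collections containing the start and end vertices.
                "from": [vertex_collection],
                "to": [vertex_collection],
            }
        ],
    }


payloads = [
    collection_payload("stix_objects"),
    collection_payload("stix_relationships", is_edge=True),
    graph_payload("stix_graph", "stix_relationships", "stix_objects"),
]

for payload in payloads:
    print(json.dumps(payload))
```

Keeping the payloads as plain dictionaries means they can be sent with whichever HTTP client or ArangoDB driver you already use.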

Now the Collections and Graph are defined I can start populating the database with some STIX 2.1 data.

In this example I will create two SDOs and link them together with an SRO (taken from this OASIS STIX example bundle).

ArangoDB provides its own query language, the ArangoDB Query Language (AQL), which will allow me to do this:

AQL is mainly a declarative language, meaning that a query expresses what result should be achieved but not how it should be achieved. AQL aims to be human-readable and therefore uses keywords from the English language.[…] Further design goals of AQL were the support of complex query patterns and the different data models ArangoDB offers.

The syntax of AQL queries is different to SQL, even if some keywords overlap. Nevertheless, AQL should be easy to understand for anyone with an SQL background.

I will start by creating two SDOs, one Indicator (indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f) and one Malware (malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b);

LET objects = [
  {
    "_key": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
    "type": "indicator",
    "id": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
    "created": "2016-04-06T20:03:48.000Z",
    "modified": "2016-04-06T20:03:48.000Z",
    "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
    "labels": [
      "malicious-activity"
    ],
    "name": "Poison Ivy Malware",
    "description": "This file is part of Poison Ivy",
    "pattern": "[ file:hashes.'SHA-256' = '4bac27393bdd9777ce02453256c5577cd02275510b2227f473d03f533924f877' ]",
    "valid_from": "2016-01-01T00:00:00Z"
  },
  {
    "_key": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
    "type": "malware",
    "id": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
    "created": "2016-04-06T20:07:09.000Z",
    "modified": "2016-04-06T20:07:09.000Z",
    "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
    "name": "Poison Ivy"
  }
]
FOR object IN objects
 INSERT object INTO stix_objects

In the query I use the STIX id attribute as the document primary key (_key) so I can easily retrieve the documents later.

STIX SDO AQL Write ArangoDB

As you can see in the screenshot above, this successfully executes two writes to the stix_objects Document Collection.
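In practice the STIX Objects usually arrive as plain JSON without a _key, so the key has to be added before the INSERT. Here is a minimal sketch of that step in Python. The function name to_document is hypothetical, and it assumes every incoming object carries a STIX id property.

```python
def to_document(stix_object):
    """Return a copy of a STIX object with _key set to its STIX id."""
    document = dict(stix_object)  # copy, so the received object is left untouched
    document["_key"] = stix_object["id"]
    return document


malware = {
    "type": "malware",
    "id": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
    "created": "2016-04-06T20:07:09.000Z",
    "modified": "2016-04-06T20:07:09.000Z",
    "name": "Poison Ivy",
}

document = to_document(malware)
print(document["_key"])
```

The resulting dictionaries can be passed to the AQL query as the objects list.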

Now I need to store the SRO relationship between these Objects in the Edge Collection stix_relationships. Three attributes must be set in the AQL query for this: _key, _from and _to.

For the _key I am going to use the STIX SRO id, in the same way I did for the SDOs. For the _from and _to fields I will use the STIX SRO properties source_ref and target_ref respectively. The format to use for the _from and _to fields is <collection>/<_key>, where <collection> is the Document Collection name (stix_objects) and <_key> is the _key of the referenced SDO (I set _key to the STIX Object id, so the two values are identical).

To demonstrate, the AQL request for this example is as follows:

INSERT {
  "_key": "relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad",
  "_from": "stix_objects/indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
  "_to": "stix_objects/malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
  "type": "relationship",
  "id": "relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad",
  "created": "2016-04-06T20:06:37.000Z",
  "modified": "2016-04-06T20:06:37.000Z",
  "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
  "relationship_type": "indicates",
  "source_ref": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
  "target_ref": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b"
} IN stix_relationships

STIX SRO AQL Write ArangoDB
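The mapping from an SRO to an edge is mechanical, so it is easy to automate. This is a rough sketch, assuming the SRO id is used as the _key and that both referenced Objects live in the stix_objects Collection. The function name sro_to_edge is my own.

```python
def sro_to_edge(sro, collection="stix_objects"):
    """Return a copy of a STIX relationship object with the edge attributes added."""
    edge = dict(sro)
    edge["_key"] = sro["id"]
    # Edge endpoints use the <collection>/<_key> format.
    edge["_from"] = f"{collection}/{sro['source_ref']}"
    edge["_to"] = f"{collection}/{sro['target_ref']}"
    return edge


sro = {
    "type": "relationship",
    "id": "relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad",
    "relationship_type": "indicates",
    "source_ref": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
    "target_ref": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
}

edge = sro_to_edge(sro)
print(edge["_from"], "->", edge["_to"])
```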

STIX also has embedded relationships defined in each SDO, e.g. created_by_ref, sighting_of_ref, observed_data_refs, object_refs, etc.

An SDO might have more than one embedded relationship property (identified where the property name ends with _ref or _refs). Properties ending in _refs contain lists of relationships. Here is an example of a Report SDO with three embedded relationships under the object_refs property:

{
  "type": "report",
  "spec_version": "2.1",
  "id": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
  "created": "2015-12-21T19:59:11.000Z",
  "modified": "2015-12-21T19:59:11.000Z",
  "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
  "name": "The Black Vine Cyberespionage Group",
  "description": "A simple report with an indicator and campaign",
  "published": "2015-12-21T19:59:11.000Z",
  "report_types": ["campaign"],
  "object_refs": [
    "indicator--26ffb872-1dd9-446e-b6f5-d58527e5b5d2",
    "campaign--83422c77-904c-4dc1-aff5-5c38f3a2c55c",
    "relationship--f82356ae-fe6c-437c-9c24-6b64314ae68a"
  ]
}

These embedded relationships need to be represented as edges too, but in a slightly different way to native STIX Objects. In the previous examples I added pure STIX Objects into ArangoDB (as they would be received). Embedded relationships, however, require a custom (non-STIX) object to define the relationship, and that object needs to be parsed out of the STIX Object that contains the property.

In our implementation, the _key of an embedded relationship edge contains both STIX Object IDs in the relationship (source first, then target), joined with a +. The edges also contain two unique properties; 1) type, which is always embedded-relationship, and 2) relationship_description, which is equal to the name of the property in the STIX Object (e.g. object_refs).

Let me demonstrate for clarity. First I would create the Report SDO as a Document;

INSERT {
  "_key": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
  "type": "report",
  "id": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
  "created": "2015-12-21T19:59:11.000Z",
  "modified": "2015-12-21T19:59:11.000Z",
  "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
  "name": "The Black Vine Cyberespionage Group",
  "description": "A simple report with an indicator and campaign",
  "published": "2015-12-21T19:59:11.000Z",
  "report_types": ["campaign"],
  "object_refs": [
    "indicator--26ffb872-1dd9-446e-b6f5-d58527e5b5d2",
    "campaign--83422c77-904c-4dc1-aff5-5c38f3a2c55c",
    "relationship--f82356ae-fe6c-437c-9c24-6b64314ae68a"
  ]
} IN stix_objects

This report has four embedded relationships;

  • created_by_ref: identity--c2aceda2-0e46-431d-be40-7b4a4e797f81
  • object_refs: indicator--26ffb872-1dd9-446e-b6f5-d58527e5b5d2
  • object_refs: campaign--83422c77-904c-4dc1-aff5-5c38f3a2c55c
  • object_refs: relationship--f82356ae-fe6c-437c-9c24-6b64314ae68a

I will assume these SDOs already exist in the stix_objects document collection.

As such, all that is required is to create the four embedded relationship edges in the stix_relationships Edge Collection.

LET embedded_relationships = [
  {
    "_key": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3+identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
    "_from": "stix_objects/report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
    "_to": "stix_objects/identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
    "type": "embedded-relationship",
    "relationship_description": "created_by_ref",
    "deprecated": false
  },
  {
    "_key": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3+indicator--26ffb872-1dd9-446e-b6f5-d58527e5b5d2",
    "_from": "stix_objects/report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
    "_to": "stix_objects/indicator--26ffb872-1dd9-446e-b6f5-d58527e5b5d2",
    "type": "embedded-relationship",
    "relationship_description": "object_refs",
    "deprecated": false
  },
  {
    "_key": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3+campaign--83422c77-904c-4dc1-aff5-5c38f3a2c55c",
    "_from": "stix_objects/report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
    "_to": "stix_objects/campaign--83422c77-904c-4dc1-aff5-5c38f3a2c55c",
    "type": "embedded-relationship",
    "relationship_description": "object_refs",
    "deprecated": false
  },
  {
    "_key": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3+relationship--f82356ae-fe6c-437c-9c24-6b64314ae68a",
    "_from": "stix_objects/report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
    "_to": "stix_objects/relationship--f82356ae-fe6c-437c-9c24-6b64314ae68a",
    "type": "embedded-relationship",
    "relationship_description": "object_refs",
    "deprecated": false
  }
]
FOR embedded_relationship IN embedded_relationships
 INSERT embedded_relationship INTO stix_relationships
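Writing these edges by hand gets tedious quickly, so here is a rough sketch of how they could be parsed out of a STIX Object. The function name is hypothetical, and the sketch makes two simplifying assumptions: it only looks at top-level properties ending in _ref or _refs, and it is only meant for SDOs and SCOs (an SRO's source_ref and target_ref would otherwise be picked up as well).

```python
def embedded_relationship_edges(stix_object, collection="stix_objects"):
    """Build one embedded-relationship edge per top-level _ref/_refs target."""
    source_id = stix_object["id"]
    edges = []
    for prop, value in stix_object.items():
        if prop.endswith("_ref"):
            targets = [value]      # single reference
        elif prop.endswith("_refs"):
            targets = list(value)  # list of references
        else:
            continue
        for target_id in targets:
            edges.append({
                "_key": f"{source_id}+{target_id}",
                "_from": f"{collection}/{source_id}",
                "_to": f"{collection}/{target_id}",
                "type": "embedded-relationship",
                "relationship_description": prop,
                "deprecated": False,
            })
    return edges


report = {
    "type": "report",
    "id": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
    "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
    "object_refs": [
        "indicator--26ffb872-1dd9-446e-b6f5-d58527e5b5d2",
        "campaign--83422c77-904c-4dc1-aff5-5c38f3a2c55c",
        "relationship--f82356ae-fe6c-437c-9c24-6b64314ae68a",
    ],
}

edges = embedded_relationship_edges(report)
for edge in edges:
    print(edge["relationship_description"], edge["_to"])
```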

Now I can filter the graph by type, either equal to relationship for SROs or embedded-relationship for relationships embedded in STIX SDOs.

As described in the 105 Versioning tutorial, STIX 2.1 Objects can go through minor updates. A minor update can contain updated, removed or new property field/values.

With each UPDATE to a document, ArangoDB tracks its revision under a _rev (Document Revision) property. However, relying on this alone does not allow for the ability to easily track revision history (e.g. for an audit trail of how the document changed).

To ensure an audit trail is kept, we can add a new field, _version, whose value is the time of the INSERT/UPDATE. This means the latest version of the object will always have the latest time.

For example, let's start by creating a new report object with the _version property;

INSERT {
  "_key": "report--02ee5fc1-6321-4007-b6b8-c3c5c8d5e1a1",
  "_version": "2020-01-01T00:00:00.000Z",
  "type": "report",
  "id": "report--02ee5fc1-6321-4007-b6b8-c3c5c8d5e1a1",
  "created": "2021-01-01T00:00:00.000Z",
  "modified": "2021-01-01T00:00:00.000Z",
  "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
  "name": "Demoing version",
  "published": "2021-01-01T00:00:00.000Z",
  "report_types": ["campaign"]
} IN stix_objects

Then I will update it, with a new _version time, along with the additional fields;

UPDATE {
  "_key": "report--02ee5fc1-6321-4007-b6b8-c3c5c8d5e1a1",
  "_version": "2022-04-08T09:23:23.000Z",
  "type": "report",
  "id": "report--02ee5fc1-6321-4007-b6b8-c3c5c8d5e1a1",
  "created": "2021-01-01T00:00:00.000Z",
  "modified": "2022-01-01T00:00:00.000Z",
  "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
  "name": "Demoing version",
  "description": "Adding a new field",
  "published": "2021-01-01T00:00:00.000Z",
  "report_types": ["campaign"]
} IN stix_objects

So now I have two revisions (_revs) of the object with "_key": "report--02ee5fc1-6321-4007-b6b8-c3c5c8d5e1a1". By default the latest revision will be returned. Though there’s a problem with this approach;

On disk, it is therefore possible that multiple revisions of the same document (as identified by the same _key value) exist at the same time. However, stale revisions are not accessible.

Source: ArangoDB docs

Therefore a new object needs to be created to ensure the old STIX objects for the ID are still accessible.

In addition to making the update we need to write a new object to the database to represent the old object version that has just been overwritten. To do this, the key can take the form id+new_version_value.

For example, the original version of the object above would be recreated as follows;

INSERT {
  "_key": "report--02ee5fc1-6321-4007-b6b8-c3c5c8d5e1a1+2022-04-08T09:23:23.000Z",
  "_version": "2020-01-01T00:00:00.000Z",
  "type": "report",
  "id": "report--02ee5fc1-6321-4007-b6b8-c3c5c8d5e1a1",
  "created": "2021-01-01T00:00:00.000Z",
  "modified": "2021-01-01T00:00:00.000Z",
  "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
  "name": "Demoing version",
  "published": "2021-01-01T00:00:00.000Z",
  "report_types": ["campaign"]
} IN stix_objects
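The two writes involved in an update (the overwrite and the archived copy) can be prepared together. The sketch below is only an illustration of the key scheme described above. The function names are hypothetical, and it assumes the currently stored document has been read from the database first.

```python
def archive_copy(stored, new_version):
    """Copy of the stored document, re-keyed as id+new_version_value."""
    # _id and _rev are assigned by ArangoDB, so they are dropped from the copy.
    archived = {k: v for k, v in stored.items() if k not in ("_id", "_rev")}
    archived["_key"] = f"{stored['id']}+{new_version}"
    return archived  # keeps the old _version value


def updated_copy(new_object, new_version):
    """New state of the object, keyed by the plain STIX id."""
    document = dict(new_object)
    document["_key"] = new_object["id"]
    document["_version"] = new_version
    return document


stored = {
    "_key": "report--02ee5fc1-6321-4007-b6b8-c3c5c8d5e1a1",
    "_rev": "_example",
    "_version": "2020-01-01T00:00:00.000Z",
    "id": "report--02ee5fc1-6321-4007-b6b8-c3c5c8d5e1a1",
    "name": "Demoing version",
}
new_object = {
    "id": "report--02ee5fc1-6321-4007-b6b8-c3c5c8d5e1a1",
    "name": "Demoing version",
    "description": "Adding a new field",
}
new_version = "2022-04-08T09:23:23.000Z"

archived = archive_copy(stored, new_version)  # goes into an INSERT
latest = updated_copy(new_object, new_version)  # goes into the UPDATE
print(archived["_key"])
print(latest["_key"])
```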

This same logic can be applied to any object, including relationships (because the latest version of an object always keeps the plain STIX id as its _key).

Similarly, if any new embedded relationship (_ref or _refs) properties are added to the updated object in the Document Collection, these new relationships will be created in ArangoDB in the same way as described earlier.

If an embedded relationship property is removed, the embedded relationship object in the edge collection is updated so that the deprecated property equals true.
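To work out which embedded relationship edges need creating or deprecating, the old and new versions of the object can be compared. This is a minimal sketch under the same assumptions as before (top-level _ref and _refs properties only, edge keys of the form source+target). The function names are my own.

```python
def embedded_targets(stix_object):
    """Set of (property name, target id) pairs for top-level _ref/_refs properties."""
    pairs = set()
    for prop, value in stix_object.items():
        if prop.endswith("_ref"):
            pairs.add((prop, value))
        elif prop.endswith("_refs"):
            pairs.update((prop, target) for target in value)
    return pairs


def diff_embedded(old_object, new_object):
    """Return (edge keys to mark deprecated, edge keys to create)."""
    old = embedded_targets(old_object)
    new = embedded_targets(new_object)
    source_id = new_object["id"]
    to_deprecate = sorted(f"{source_id}+{target}" for _, target in old - new)
    to_create = sorted(f"{source_id}+{target}" for _, target in new - old)
    return to_deprecate, to_create


old_report = {
    "id": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
    "object_refs": [
        "indicator--26ffb872-1dd9-446e-b6f5-d58527e5b5d2",
        "campaign--83422c77-904c-4dc1-aff5-5c38f3a2c55c",
    ],
}
new_report = {
    "id": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
    "object_refs": [
        "indicator--26ffb872-1dd9-446e-b6f5-d58527e5b5d2",
        "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
    ],
}

to_deprecate, to_create = diff_embedded(old_report, new_report)
print("deprecate:", to_deprecate)
print("create:", to_create)
```

The keys in to_deprecate identify the edges whose deprecated property should be set to true, and the keys in to_create identify the new edges to insert.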

Historic STIX 2.1 Relationships Objects (NOT embedded relationships) in the Edge Collection do not need to be updated as they should always point to the latest version but retain their original created and modified times. If a user wants to add, update or remove these, they should do so explicitly using a separate UPDATE.

Querying STIX 2.1 Objects

Now these STIX Objects and relationships have successfully been written, I can start to explore the database.

The ArangoDB UI contains a graph visualisation tool that is useful for browsing visually (Graphs > Your Graph). Here is what the stix_graph I created earlier looks like for the Objects I have added to the database;

ArangoDB Graph UI STIX example

As you can see from the first example, the two stix_objects in the Document Collection are represented as individual nodes on the graph, and the stix_relationships in the Edge Collection as an edge.

This is about the simplest example of a STIX Object Graph. Nodes (STIX SDO/SCOs) can have multiple edges (SROs), making for much more complex graphs. Similarly, relationships can be defined in lists in STIX SDO properties (e.g. object_refs).

In these cases, it is likely you will want to filter the information being returned using AQL queries. AQL is really well documented on the ArangoDB website, but I will cover some basic queries for this.

For example, here is a query using the first example to get the IDs (RETURN item.id) of the five (LIMIT 5) most recently created Objects (SORT item.created DESC) of type “indicator” (FILTER item.type == "indicator") in the stix_objects collection.

FOR item IN stix_objects
 FILTER item.type == "indicator"
 SORT item.created DESC
 LIMIT 5
 RETURN item.id

Which returns;

[
  "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f"
]

ArangoDB AQL Query last 5 STIX Indicator SDOs
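When queries like this are run from code rather than the Web UI, it is safer to pass the type and limit as bind parameters than to build the query string by hand. The sketch below builds a request body of the shape accepted by ArangoDB's cursor endpoint (POST /_api/cursor). It does not send anything, and the function name and parameter names are my own.

```python
import json


def latest_ids_query(stix_type, limit=5, collection="stix_objects"):
    """Build an AQL query plus bind variables for the most recently created objects of a type."""
    return {
        "query": (
            "FOR item IN @@collection "
            "FILTER item.type == @stix_type "
            "SORT item.created DESC "
            "LIMIT @max_results "
            "RETURN item.id"
        ),
        "bindVars": {
            # Collection names use a double @ in the query and a single @ prefix here.
            "@collection": collection,
            "stix_type": stix_type,
            "max_results": limit,
        },
    }


payload = latest_ids_query("indicator")
print(json.dumps(payload, indent=2))
```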

Earlier I showed you the out-of-the-box graph view in the UI (which I used to create the Graph stix_graph); behind the scenes this runs an automatically generated graph traversal query. It is also possible to write these queries yourself to customise the nodes and edges returned and modelled in the graph. For example;

FOR vertex, edge, path IN 1..5
 OUTBOUND 'stix_objects/indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f'
 GRAPH 'stix_graph'
 RETURN path

Here is how this query is formed;

  • The FOR takes three variables: the vertex in the traversal, the edge in the traversal and the current path. The current path contains two members, vertices and edges. Here I am asking to return the paths matching the request, which is why the edges and vertices keys appear in the result.
  • IN 1..5 specifies the minimum and maximum depth for the traversal. A minimum depth of 0 would also return the start vertex itself.
  • OUTBOUND specifies the direction to follow. Possible values are OUTBOUND, INBOUND or ANY. The object I used as the start vertex for the traversal is the source in the relationship object. For this reason INBOUND would not return any results in our example.
  • stix_objects/{_key} defines the vertex where the traversal originates from.
  • GRAPH 'stix_graph' identifies the named graph to use for the traversal.

Running the query gives us the following JSON output (which can also be modelled as a graph):

[
  {
    "edges": [
      {
        "_key": "relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad",
        "_id": "stix_relationships/relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad",
        "_from": "stix_objects/indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
        "_to": "stix_objects/malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
        "_rev": "_ep_8O9C---",
        "type": "relationship",
        "id": "relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad",
        "created": "2016-04-06T20:06:37.000Z",
        "modified": "2016-04-06T20:06:37.000Z",
        "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
        "relationship_type": "indicates",
        "source_ref": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
        "target_ref": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b"
      }
    ],
    "vertices": [
      {
        "_key": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
        "_id": "stix_objects/indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
        "_rev": "_ep_1Eeq---",
        "type": "indicator",
        "id": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
        "created": "2016-04-06T20:03:48.000Z",
        "modified": "2016-04-06T20:03:48.000Z",
        "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
        "labels": [
          "malicious-activity"
        ],
        "name": "Poison Ivy Malware",
        "description": "This file is part of Poison Ivy",
        "pattern": "[ file:hashes.'SHA-256' = '4bac27393bdd9777ce02453256c5577cd02275510b2227f473d03f533924f877' ]",
        "valid_from": "2016-01-01T00:00:00Z"
      },
      {
        "_key": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
        "_id": "stix_objects/malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
        "_rev": "_ep_1Eeq--_",
        "type": "malware",
        "id": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
        "created": "2016-04-06T20:07:09.000Z",
        "modified": "2016-04-06T20:07:09.000Z",
        "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
        "name": "Poison Ivy"
      }
    ]
  }
]
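The path output is verbose, so it often helps to reduce it to simple source, relationship, target triples once it is back in application code. This is a minimal sketch, assuming the result has been parsed into Python lists and dictionaries with the shape shown above. The function name path_triples is hypothetical.

```python
def path_triples(paths):
    """Return unique (source id, relationship label, target id) triples from traversal paths."""
    seen = set()
    triples = []
    for path in paths:
        for edge in path["edges"]:
            # SROs carry relationship_type; embedded relationships carry relationship_description.
            label = edge.get("relationship_type") or edge.get("relationship_description")
            source = edge["_from"].split("/", 1)[1]  # strip the "<collection>/" prefix
            target = edge["_to"].split("/", 1)[1]
            triple = (source, label, target)
            if triple not in seen:  # longer paths repeat the edges of shorter ones
                seen.add(triple)
                triples.append(triple)
    return triples


paths = [
    {
        "edges": [
            {
                "_from": "stix_objects/indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
                "_to": "stix_objects/malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
                "type": "relationship",
                "relationship_type": "indicates",
            }
        ],
        "vertices": [],  # trimmed for brevity
    }
]

triples = path_triples(paths)
for source, label, target in triples:
    print(source, label, target)
```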

As the graphs grow, these AQL queries prove very efficient at returning STIX 2.1 data with complex relationship structures.

Hopefully this post has given those building cyber threat intelligence tools who are new to ArangoDB enough information to give them a few ideas, and avoid making some of the errors I have in the past – it can be complex to store and query STIX 2.1 Objects in regular databases without degrading performance.

If you want to dive a bit deeper, I suggest taking a look at the tutorials in the ArangoDB documentation which helped me a lot to create this post.

I am still learning ArangoDB, but if you have any questions about the content covered in these posts (or have feedback), please do not hesitate to drop me a message on Slack.

CONGRATULATIONS!

You have made it to the end of this short course.

Hopefully the last three months have given you a deeper understanding of modelling cyber threat intelligence using STIX 2.1 and a few tools that will help you turn this theory into reality.

Here are some useful links to bookmark following this course, some I have covered, some I have not, that I find useful when working with STIX 2.1:

If you have any questions about the content in this tutorial, please do not hesitate to drop us a message on Slack.

