Graphing crypto

Merkle Science is a Singapore-headquartered blockchain data analysis firm that helps crypto companies, financial institutions, and government agencies detect, investigate, and prevent illegal activities involving cryptocurrencies. Over the past couple of years, Merkle Science has been working to visualise its data and make it easier to understand through the use of a graph database.

Unlike relational databases, which store data in tables of rows and columns, graph databases are composed of dots (the nodes) and the lines that connect them (the edges, or relationships). This structure makes it easier to traverse complex relationships and answer unanticipated questions.
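To make the contrast concrete, here is a minimal sketch in plain Python (not graph-database code; the addresses and amounts are made up): the same transfer data stored as relational-style rows versus as a graph, where following a chain of payments is just walking from node to node.

```python
# Relational view: each transfer is an independent row; following a chain
# of payments would require repeated self-joins on the table.
rows = [
    {"sender": "A", "receiver": "B", "amount": 5},
    {"sender": "B", "receiver": "C", "amount": 3},
]

# Graph view: each address is a node, each transfer an edge.
edges = {}
for r in rows:
    edges.setdefault(r["sender"], []).append((r["receiver"], r["amount"]))

# Who did A's funds reach? Walk the edges directly, hop by hop.
one_hop = [dst for dst, _ in edges.get("A", [])]
two_hop = [dst2 for dst in one_hop for dst2, _ in edges.get(dst, [])]
# one_hop -> ["B"], two_hop -> ["C"]
```

The deeper the question ("where did these funds end up five transfers later?"), the more the graph representation pays off, since each extra hop is another edge walk rather than another join.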

Enter TigerGraph, a software company that specialises in graph analytics. The company claims that its software suite – also named TigerGraph – runs 40 to 300 times faster than its competitors, and delves five to 10 connections deep into the data without running into scale issues. That performance is why Merkle Science chose TigerGraph’s analytics to construct a cryptocurrency network graph, which will help it pre-empt and prevent financial crime.

As cryptocurrency and blockchain become increasingly mainstream, the ecosystem that surrounds and supports them will develop rapidly, and will become an ever more prominent target for financial crime. Risk management is therefore key.

The cryptocurrency network graph that Merkle Science built using TigerGraph’s graph analytics lets customers run queries that identify the percentage of funds sent to or received from different types of actors (such as darknet markets, exchanges, scams and smart contracts) for a specific location or address, which can help detect and prevent potential criminal activity.
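A sketch of the kind of exposure query described above, in plain Python rather than GSQL. The labels and amounts here are invented placeholders; in Merkle Science's product the actor labels come from its own intelligence data.

```python
from collections import defaultdict

# Hypothetical inputs: actor labels per counterparty address, and the
# incoming transfers (sender address, amount) for the address under review.
labels = {"addr1": "exchange", "addr2": "darknet", "addr3": "exchange"}
incoming = [("addr1", 60.0), ("addr2", 25.0), ("addr3", 15.0)]

# Sum received funds per actor type, then convert to percentages.
totals = defaultdict(float)
for sender, amount in incoming:
    totals[labels.get(sender, "unknown")] += amount

grand_total = sum(totals.values())
exposure = {actor: 100 * amt / grand_total for actor, amt in totals.items()}
# exposure -> {"exchange": 75.0, "darknet": 25.0}
```

An address with, say, 25 per cent of its inflows attributable to darknet actors would be flagged very differently from one funded entirely by regulated exchanges.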

Merkle Science’s cryptocurrency network graph, which currently contains over 2.5 TB of data and consists of 5 billion vertices and 36 billion edges, supports a complete extract, transform and load (ETL) run each day that is said to take under an hour.

Connecting data across silos

TigerGraph’s analytics platform utilises a distributed database architecture with parallel processing, storing data in a compressed format. The Redwood City-based company claims that its graph database, which uses a C++ engine, is the only one capable of providing real-time deep link analytics across multiple nodes or hops for large datasets.

Here’s a use case from one of TigerGraph’s fintech clients: they have a relational database management system from Oracle on-site, as they are unable to move their critical data to the cloud for regulatory reasons. They also keep some of their less critical data in a cloud data warehouse (in this case, Snowflake). Finally, they have an on-premise Hadoop cluster, running on physical and virtual servers they purchased, with the intention of slowly moving that data to the cloud at some stage.

Data is added to these three stores daily, but instead of analysing it in three silos, TigerGraph’s ecosystem of connectors helps the client amalgamate and analyse all three datasets together, and uncover relationships across them.
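Conceptually, amalgamating the silos means stitching records from the three stores together on a shared key before analysis. The sketch below is a simplified illustration with invented record shapes and field names, not the client's actual schema or TigerGraph's connector API.

```python
# Hypothetical extracts from each silo, all sharing a customer id.
oracle_rows = [{"cust_id": 1, "name": "Acme"}]          # critical, on-site
snowflake_rows = [{"cust_id": 1, "account": "ACC-9"}]   # cloud warehouse
hadoop_rows = [{"cust_id": 1, "txn": "T-100"}]          # on-premise cluster

# Merge the three views of each customer into a single record.
merged = {}
for row in oracle_rows + snowflake_rows + hadoop_rows:
    merged.setdefault(row["cust_id"], {}).update(
        {k: v for k, v in row.items() if k != "cust_id"}
    )
# merged[1] -> {"name": "Acme", "account": "ACC-9", "txn": "T-100"}
```

In a graph database the merged entity would become a vertex, with its accounts and transactions linked as neighbouring vertices, so relationships that span silos can be queried directly.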

Exit SQL, Enter GSQL

Nirmal Aryath Koroth, Co-founder and Chief Technology Officer, Merkle Science. Image courtesy of Merkle Science.

According to Nirmal Aryath Koroth, Co-founder and Chief Technology Officer at Merkle Science, other graph databases weren’t able to process Merkle Science’s data fast enough to generate graphs in real time. He said that with TigerGraph, they can now do batch and streaming load simultaneously.

Nirmal AK added that TigerGraph’s GSQL query language also helps them implement complex graph algorithms, which he said would take longer to build with incumbent alternatives.

The implementation of TigerGraph gives Merkle Science the ability to analyse over 2.5 TB of data in real time. The new multi-hop feature on Merkle Science’s platform delves five to 10 hops into the data to surface relationships that previously could not be connected, unlocking deeper, wider, and operational analytics at scale.
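A multi-hop query of this kind is, at its core, a bounded traversal outward from a starting address. Here is a minimal, illustrative breadth-first version in plain Python (not TigerGraph's implementation), assuming a simple adjacency-list view of fund transfers:

```python
from collections import deque

def k_hop_neighbours(edges, start, max_hops):
    """Collect every address reachable from `start` within `max_hops` transfers.

    edges: dict mapping an address to the list of addresses it sent funds to.
    """
    seen = {start}
    frontier = deque([(start, 0)])  # (address, hops taken so far)
    reached = set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # don't expand beyond the hop limit
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                reached.add(nxt)
                frontier.append((nxt, depth + 1))
    return reached

graph = {"A": ["B"], "B": ["C"], "C": ["D"]}
# Within 2 hops of A: B and C, but not D.
```

At billions of edges, a naive traversal like this would be far too slow, which is the point of the 5 to 10 hop scaling claim: the graph engine distributes and parallelises exactly this kind of walk.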

Merkle Science previously relied on SQL databases. Nirmal AK’s team tried other incumbent graph solutions at first, but these did not perform well on important factors such as scalability and high-speed performance.

“Unfortunately, with blockchain, the process of money laundering is simple, as funds transfer to multiple addresses is happening concurrently and at scale. On the SQL database, it was impossible to visualise the five to 10 hops depth of deep-link queries necessary to track where funds in question originated from, as opposed to limited visibility in recent transactions,” says Nirmal AK.

Merkle Science needed a database that could handle the large quantities of data it possessed. The team evaluated multiple competitors, but while these appeared to fare well at smaller scales, they suffered from poor ETL performance and, most importantly, streaming writes that were expensive and time-consuming.

“On the other hand, TigerGraph has proven capabilities to handle 2.5 TB of data and more, with no signs of it approaching a bottleneck anytime soon,” reports Nirmal AK. “TigerGraph’s schema creation and evolution has proved to be very simple and elegant, which fit most of our requirements.”

“We utilised TigerGraph’s local loading capabilities to load all our data in under 24 hours. We had multiple loading jobs which were quick and simple to implement, with a wide variety of file types that met most of our needs,” says Nirmal AK.

At a daily, streaming level, Merkle Science uses the loading-job endpoints of RESTPP, TigerGraph’s custom REST server.
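As a rough sketch of what streaming rows into a loading job over REST looks like, the snippet below builds such a request in Python. The URL shape follows TigerGraph's documented pattern for RESTPP loading (`POST /ddl/{graph}?tag={loading_job}&filename={file_variable}`), but the host, graph name, job name and data here are placeholders, not Merkle Science's actual configuration.

```python
from urllib.parse import urlencode

def build_loading_request(host, graph, job, file_var, rows):
    """Assemble the URL and CSV body for a RESTPP-style loading-job POST."""
    query = urlencode({"tag": job, "filename": file_var, "sep": ",", "eol": "\n"})
    url = f"{host}/ddl/{graph}?{query}"
    body = "\n".join(",".join(str(v) for v in row) for row in rows)
    return url, body  # POST `body` to `url` with any HTTP client

url, body = build_loading_request(
    "http://localhost:9000", "crypto", "load_transfers", "f1",
    [("addrA", "addrB", 5)],
)
```

The same loading job then handles both bulk files and streamed batches, which is what allows batch and streaming load to run side by side.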

GSQL is user-friendly, highly expressive and Turing-complete, and allows queries to be developed significantly faster. According to Nirmal AK, the right query language for commercial use needs to express and solve real-world business problems across a vast variety of industries.

“TigerGraph supports this by being able to define graph schemas, and also supports complex data types,” he says. “Finally, GSQL’s flexibility allows us to model sufficiently complex algorithms through a Turing complete query language. Other graph query languages face issues with speed, difficulty in fulfilling such tangible business needs, or require far larger development times. This prevents the formulation of analytics in real time, which is a core asset for fraud detection.”

Joe Lee, Vice President of Asia Pacific & Japan, TigerGraph. Image courtesy of TigerGraph.

According to Joe Lee, Vice President of Asia Pacific & Japan at TigerGraph, GSQL can run any advanced analytics within the TigerGraph database itself, and this can simplify a client’s workflow by avoiding the unnecessary transfer of large quantities of data.

“The GSQL language offers commands for data loading which perform many of the same data conversion, mapping, filtering, and merging operations that are found in enterprise ETL systems. In essence, we can work with ETL solutions as well for loading data or use our own built-in SQL-like language to perform the functions,” said Lee.

Crypto and graphs

Bitcoin (like any other cryptocurrency based on a public blockchain) is a network structure, with groups of transactions “chained” together in blocks. This structure captures the flow of funds between sender and receiver addresses.

With TigerGraph, Merkle Science was able to create schemas that better capture the different attributes of the blockchain, such as the connections between addresses and transactions, and make them relevant for its clients’ business use cases.
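One natural way to model this, sketched below in plain Python with invented names rather than Merkle Science's actual schema, is to treat both addresses and transactions as vertex types, with edges recording which addresses fund a transaction and which receive from it:

```python
from dataclasses import dataclass, field

@dataclass
class Address:
    addr: str

@dataclass
class Transaction:
    tx_id: str
    inputs: list = field(default_factory=list)   # Address objects sending funds
    outputs: list = field(default_factory=list)  # (Address, amount) pairs

# Flow of funds: sender-addr -> tx-1 -> receiver-addr
tx = Transaction("tx-1")
tx.inputs.append(Address("sender-addr"))
tx.outputs.append((Address("receiver-addr"), 0.5))

senders = [a.addr for a in tx.inputs]
receivers = [a.addr for a, _ in tx.outputs]
```

Because a transaction sits between its sending and receiving addresses as its own vertex, queries can attach risk attributes to either the counterparties or the transfers themselves.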

According to Nirmal AK, cryptocurrency businesses offer a classic use case for graph analytics, where the data involved is unstructured and constantly changing. “Crypto and blockchain will continue scaling, and as it grows, the ecosystem around it will multiply. Derivatives, credit, and insurance are all industries in which companies will need similar types of risk management,” says Nirmal AK.

“Graph technologies are relatively nascent, having been around for less than 15 years. We are at the cusp of its possibilities, and personally, I’m excited for further developments in this space as data generation shows no signs of stopping,” he concludes.