Real Time Machine Learning Architecture & Sentiment Analysis

The first section of this blog post explains the business case and the value chain for a sentiment analysis application in Finance and Trading, while the second section proposes a real-time big data, deep and machine learning architecture to perform sentiment analysis at scale.

The schema below gives an overview of how sentiment analysis is used in Trading and Finance.

Value Chain Sentiment Analysis - Big Data, Deep and Machine Learning Architecture

1. Access to News / News management

  • Visualization tools
  • Filtering tools
  • On demand view
  • Feed from multiple sources:
      • Social Media
      • Web based content
      • Private sources
      • Internal data
  • News Tag Cloud
  • Filtering news feed with Social media blotter, news blotter
  • Search Engine on demand

2. News Content Alerts based on sentiment indicator

Provide accurate information from the Big Data environment and push it in front of users in real time for risk management.

  • Topics detection
  • Rumours alerts
  • News qualification per importance

3. Dashboard

  • Consolidated Dashboard
  • Portfolio Alerts
  • Relevant information from single screen
  • Automatic Alert
  • Integrated to OMS

4. Actionable indicators

  • Users receive news signals for trading, hedging and risk management based on the sentiment indicator.
  • Provide relevant news analytics indicators for hedging or trade idea generation.

5. Algo Trading / Robo Trading

  • Real-time algorithmic trading driven by the sentiment indicator and news analytics.
  • News analytics signals fully integrated into algorithmic trading strategies.

A News Analytics Case

Information Extraction of Text

Let’s look at a news item. What information would you highlight in it?

NER Reuter News

This is what I see in the news:

The news was published by Reuters on Oct 21. It reports an acquisition. The two entities mentioned are AT&T Inc and Time Warner Inc. The reporter's tone about the event is captured by the phrase “boldest move”. You can probably spot more in the article.

This is the term-based way to view a news item. It is extremely useful when one tries to monitor a set of events, companies and so on.
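To make the term-based view concrete, here is a minimal sketch that highlights companies and event types with plain dictionary lookups. The watch lists, the `highlight` function and the sample sentence are all assumptions made up for this illustration; a production system would more likely rely on a trained named-entity recognition model.

```python
import re

# Hypothetical watch lists; a real system would use a trained NER model
# or a much larger curated dictionary.
COMPANIES = ["AT&T Inc", "Time Warner Inc"]
EVENT_TERMS = {"acquire": "acquisition", "acquisition": "acquisition", "merger": "merger"}


def highlight(text):
    """Return the watched companies and event types mentioned in a text."""
    lowered = text.lower()
    companies = [name for name in COMPANIES if name.lower() in lowered]
    words = set(re.findall(r"[a-z]+", lowered))
    events = sorted({label for term, label in EVENT_TERMS.items() if term in words})
    return {"companies": companies, "events": events}


# Invented sentence for illustration only; it is not the Reuters text.
sample = "AT&T Inc agreed to acquire Time Warner Inc, its boldest move yet."
result = highlight(sample)
print(result)
```

Simple lookups like this miss name variants and new companies, which is why they are usually only a starting point for monitoring.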

Text Feature Extraction for Machine Learning Classification

Another way is to look at the news as a whole.

For a computer to understand a news item, we have to represent it with numbers.

VSM 1

The vector space model (VSM) is a popular approach that uses a vector to represent a document. The basic idea is to use the vector of term frequencies to represent a document.

This is a minimal example:

VSM 2

We start with a vocabulary that contains the universe of words. For each document in the set, we count how often each vocabulary term occurs and put the counts in a vector. So, for a machine, one document is a term-frequency (TF) vector.
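The counting step can be sketched in a few lines of Python. The whitespace tokenizer, the function names and the two toy documents are assumptions for illustration, not part of our production pipeline.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; real pipelines do far more."""
    return text.lower().split()


def build_vocabulary(documents):
    """Sorted list of every distinct term in the document set."""
    return sorted({term for doc in documents for term in tokenize(doc)})


def tf_vector(document, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokenize(document))
    return [counts[term] for term in vocabulary]


# Toy documents invented for this example.
docs = ["stocks rally as earnings beat", "earnings miss sends stocks lower"]
vocab = build_vocabulary(docs)
vectors = [tf_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Every document ends up with a vector of the same length, one position per vocabulary term, which is what makes documents comparable.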

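Plain term frequency puts too much emphasis on terms that appear in almost every document of the corpus. To correct for this, each term frequency is weighted by the inverse document frequency (IDF), which shrinks the weight of terms found across most of the corpus.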

TFIDF

Each document can then be represented by a TF-IDF vector.
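As a rough sketch, the weighting can be computed as below. It assumes the common variant tf × log(N / df), where N is the number of documents and df is the number of documents containing the term; other variants add smoothing or normalization. The function name and toy documents are made up for this example.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) using tf * log(N / df) weighting."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df[term]) for term in vocabulary}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[term] * idf[term] for term in vocabulary])
    return vocabulary, vectors


# Toy documents invented for this example.
docs = ["the bank raised rates", "the bank cut rates", "the merger closed"]
vocab, vectors = tfidf_vectors(docs)
for term, weight in zip(vocab, vectors[0]):
    print(f"{term}: {weight:.3f}")
```

With this variant, a word such as "the" that occurs in every document gets a weight of zero, while a word found in a single document gets the highest weight.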

Such vectors are then ready for machine learning applications.
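To show what "ready for machine learning" means, here is a small sketch of a classifier that works directly on document vectors. It uses a nearest-centroid rule with cosine similarity purely as a stand-in; it is not the model we run in production, and the vocabulary, vectors and labels are invented.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def train_centroids(vectors, labels):
    """Average the training vectors of each label into one centroid."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))


# Toy vectors over a hypothetical vocabulary [merger, acquire, rates, inflation].
train_x = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 3]]
train_y = ["m&a", "m&a", "macro", "macro"]
model = train_centroids(train_x, train_y)
print(predict(model, [1, 1, 0, 0]))
print(predict(model, [0, 0, 1, 1]))
```

Any classifier that accepts fixed-length numeric vectors could take the place of the centroid rule here.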

Let me summarize the process.

News comes from all kinds of sources, such as Bloomberg, blogs, and social media. We apply feature extraction to each article in real time, then apply machine learning models trained on 15 years of historical articles, and end up with topic labels for the article.

We can also calculate the sentiment of each article in real time. The sentiment is then either aggregated over time or indexed by instrument, sector, or emotion. In addition, we can scan the news for the information it mentions and highlight it by company, person, event, and region.
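The aggregation step can be sketched as follows, assuming each article already carries a sentiment score and a list of instruments. The field names, the hourly bucket, the placeholder tickers and the scores are assumptions chosen for this example.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_sentiment(scored_articles):
    """Average sentiment per (instrument, hour) bucket."""
    totals = defaultdict(lambda: [0.0, 0])
    for article in scored_articles:
        hour = article["time"].replace(minute=0, second=0, microsecond=0)
        for instrument in article["instruments"]:
            bucket = totals[(instrument, hour)]
            bucket[0] += article["sentiment"]
            bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Invented scores in [-1, 1]; "AAA" and "BBB" are placeholder tickers.
articles = [
    {"time": datetime(2024, 1, 2, 9, 5), "instruments": ["AAA", "BBB"], "sentiment": 0.5},
    {"time": datetime(2024, 1, 2, 9, 40), "instruments": ["AAA"], "sentiment": -0.25},
    {"time": datetime(2024, 1, 2, 10, 15), "instruments": ["BBB"], "sentiment": 0.75},
]
index = aggregate_sentiment(articles)
for (instrument, hour), score in sorted(index.items()):
    print(instrument, hour.isoformat(), round(score, 3))
```

Swapping the instrument key for a sector or an emotion label gives the other indexes mentioned above.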

After these calculations, we need to render the results or send alerts to users in real time.

Architecture Requirements and Big Data Tools Applied

To implement a real-time big data, deep and machine learning architecture, here are 5 things that we’d like to consider for our applications:

  1. Guaranteed data processing: we cannot afford to lose any information.
  2. Scalability: the analytics engine should grow with little effort as the data grows bigger and bigger.
  3. Fault tolerance: the system has to keep working when servers fail.
  4. Higher-level abstractions are preferred, so that the programming workload is minimal.
  5. Models produced by batch training can be loaded into the real-time layer to achieve real-time classification and predictive analytics.
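To illustrate the first requirement, here is a minimal sketch of at-least-once processing: a message is only forgotten once its handler has succeeded, and failed messages are replayed. This is a toy model of the idea, not the API of any streaming framework; the function names and the flaky handler are invented.

```python
from collections import deque


def process_at_least_once(messages, handler, max_attempts=3):
    """Replay each message until the handler succeeds or attempts run out."""
    pending = deque((msg, 1) for msg in messages)
    done, dead = [], []
    while pending:
        msg, attempt = pending.popleft()
        try:
            handler(msg)
            done.append(msg)  # acknowledged: safe to forget
        except Exception:
            if attempt < max_attempts:
                pending.append((msg, attempt + 1))  # replay later
            else:
                dead.append(msg)  # kept for inspection, never dropped silently
    return done, dead


# Hypothetical handler that fails the first time it sees "b".
seen = set()


def flaky_handler(msg):
    if msg == "b" and msg not in seen:
        seen.add(msg)
        raise RuntimeError("transient failure")


done, dead = process_at_least_once(["a", "b", "c"], flaky_handler)
print(done, dead)
```

Because a replayed message may be handled more than once, handlers in such a design should be safe to run repeatedly on the same input.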

This is the final solution we have:

Apache Spark Storm

We combine Apache Spark and Apache Storm to form our news analytics engine.

Apache Spark 1

Apache Spark 2

The role of Spark is to produce the models from massive historical text data. These models are then loaded into Storm to produce the real-time analytics.

First, when should you use distributed data processing tools? I would say when your data and computation no longer fit on a single machine.

In our case they do not: we have too much data, and the computation we are trying to do is beyond the capacity of one machine.

Hadoop is the previous generation of solutions to this problem.

Since our purpose is to build a real-time news analytics system, we definitely need a real-time computation system. In the end, we chose Storm.

Apache Storm 1

Apache Storm 2

One reason is that it is really fast. Its official website claims that it processes one million 100-byte messages per second per node on fairly common hardware. That may not match the speeds hedge funds work at, but it is more than enough for our news analytics application.
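A quick back-of-the-envelope calculation puts the quoted figure in perspective. The per-node numbers come from the claim above; the news feed volume and article size are purely hypothetical values chosen for the comparison.

```python
MESSAGES_PER_SECOND = 1_000_000  # figure quoted above
MESSAGE_SIZE_BYTES = 100         # figure quoted above

bytes_per_second = MESSAGES_PER_SECOND * MESSAGE_SIZE_BYTES
megabytes_per_second = bytes_per_second / 1_000_000

# Hypothetical feed: 50,000 articles per day at about 5 KB each.
ARTICLES_PER_DAY = 50_000
ARTICLE_SIZE_BYTES = 5_000
SECONDS_PER_DAY = 86_400
feed_bytes_per_second = ARTICLES_PER_DAY * ARTICLE_SIZE_BYTES / SECONDS_PER_DAY

print(f"Quoted node throughput: {megabytes_per_second:.0f} MB/s")
print(f"Hypothetical news feed: {feed_bytes_per_second / 1000:.1f} KB/s")
```

Under these assumptions the raw message rate is far from being the bottleneck; the cost of the analytics run on each article matters much more.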

The second reason we chose it is that it is a really robust system. Remember our 5 requirements: Storm satisfies them.

Architecture that combines all

Let’s have a look at the architecture of the whole application:

The producers publish news to the Kafka message queue. These messages are passed to the Storm cluster, where many analytics run in real time. For topic classification, for example, a machine learning model is loaded from HDFS into Storm, and results come out as soon as a news item arrives. The results are then published to the web app or written to the database and search engine. That’s the real-time layer.
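The flow of the real-time layer can be mimicked in plain Python. In this sketch a `queue.Queue` stands in for the Kafka topic, a dictionary of keywords stands in for the model loaded from HDFS, and a list stands in for the web app, database and search engine. None of this is Kafka or Storm code, and all names and headlines are invented.

```python
import queue


def classify(model, text):
    """Pick the topic whose keyword set overlaps the text the most."""
    words = set(text.lower().split())
    return max(model, key=lambda topic: len(words & model[topic]))


def run_realtime_layer(inbox, model, publish):
    """Drain the queue, classify each message, and publish the result."""
    while True:
        try:
            news = inbox.get_nowait()
        except queue.Empty:
            break  # a real consumer would keep waiting for new messages
        publish({"headline": news, "topic": classify(model, news)})


# Stand-ins for the message queue, the loaded model and the output sinks.
inbox = queue.Queue()
for headline in ["Telecom giant to acquire media group", "Central bank holds rates"]:
    inbox.put(headline)
model = {"m&a": {"acquire", "merger", "takeover"}, "macro": {"rates", "inflation", "bank"}}
published = []
run_realtime_layer(inbox, model, published.append)
print(published)
```

The point of the design is that the model is loaded once and then applied to each message as it arrives, so the latency per news item stays small.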

In the batch layer, we train and test the models with Apache Spark on our massive historical data, which sits either in a database or in the distributed file system (DFS). The resulting models are then dumped to the DFS.
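The hand-off between the two layers boils down to "train offline, serialize, load online". Here is a minimal sketch of that idea, assuming a temporary directory in place of the DFS, JSON as the model format and a deliberately crude keyword model; none of these choices reflect our actual Spark jobs.

```python
import json
import tempfile
from collections import Counter, defaultdict
from pathlib import Path


def train_topic_model(labelled_articles, top_n=3):
    """Keep the most frequent words per topic as a very crude model."""
    counts = defaultdict(Counter)
    for text, topic in labelled_articles:
        counts[topic].update(text.lower().split())
    return {topic: [w for w, _ in c.most_common(top_n)] for topic, c in counts.items()}


def dump_model(model, path):
    """Batch layer: write the trained model to shared storage."""
    Path(path).write_text(json.dumps(model))


def load_model(path):
    """Real-time layer: read the model back before classifying."""
    return json.loads(Path(path).read_text())


# Toy "historical" data invented for this example.
history = [
    ("merger talks merger approved", "m&a"),
    ("rates decision rates unchanged", "macro"),
]
with tempfile.TemporaryDirectory() as shared_dir:  # stand-in for the DFS
    model_path = Path(shared_dir) / "topic_model.json"
    dump_model(train_topic_model(history), model_path)
    loaded = load_model(model_path)

print(loaded)
```

Keeping the model in a shared location lets the batch layer retrain on its own schedule while the real-time layer simply picks up the latest file.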

This deep and machine learning architecture is not only a fit for our news analytics application. It also suits others, for example analysis pipelines at scale, live stats, recommendations, predictions, real-time analytics, and online machine learning. It all depends on the messages fed to the producers and the algorithms running in Storm.

News analytics | Alternative data

Send us a mail : contact@infotrie.com

Check our market blog or our AI and NLP blog

Want to access our data? Try our API on api.infotrie.com or access our FinSentS platform.



InfoTrie Admin
