In separate but related work, Tim Loughran and Bill McDonald of the University of Notre Dame explored sentiment analysis of 10-Ks, finding in a 2011 paper that the existing dictionary of words used to gauge the sentiment of a text was not well suited to the financial field. For example, words such as liability, cost, and tax are rated as negative by the traditional dictionary, but they are not necessarily negative when used in a financial context. Loughran and McDonald in turn created a sentiment dictionary adapted to finance.
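To make the idea concrete, here is a minimal sketch of dictionary-based tone scoring in the spirit of that approach; the tiny word sets below are illustrative stand-ins, not the actual Loughran-McDonald dictionary, which contains thousands of terms.

```python
# Minimal sketch of dictionary-based tone scoring. The word sets below are
# tiny illustrative stand-ins, NOT the actual Loughran-McDonald dictionary.
import re

NEGATIVE = {"loss", "impairment", "litigation", "adverse", "restated"}
POSITIVE = {"improved", "strong", "profitable", "exceeded", "gains"}

def tone(text: str) -> float:
    """Return (positive - negative) word count, scaled by document length."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)

excerpt = "We exceeded guidance despite litigation costs and an impairment charge."
print(f"tone score: {tone(excerpt):+.3f}")  # negative words outnumber positive here
```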
Other researchers have developed new techniques for analyzing textual data. Tarek Alexander Hassan of Boston University, Stephan Hollander of Tilburg University, Laurence van Lent of the Frankfurt School of Finance and Management, and Ahmed Tahoun of the London Business School (then a researcher at Booth) published research in 2019 using a simple algorithm to assess political risk in earnings-call transcripts. They counted bigrams (two-word combinations such as the Constitution or public opinion) that appeared in conjunction with the words risk and uncertainty, or their synonyms, to identify potential risks for businesses. The higher the count, the greater the political risk for the company, according to the study. The paper and subsequent work spawned a startup, NL Analytics, which works with central banks and international organizations to apply these methods to economic monitoring.
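A rough sketch of that counting idea might look like the following; the bigram list, risk words, and window size are invented placeholders rather than the authors' actual lists.

```python
# Rough sketch of bigram-based political-risk counting: count two-word
# political phrases that appear near a risk synonym in a transcript.
# POLITICAL_BIGRAMS, RISK_WORDS, and WINDOW are illustrative assumptions.
import re

POLITICAL_BIGRAMS = {("the", "constitution"), ("public", "opinion"), ("trade", "policy")}
RISK_WORDS = {"risk", "risks", "uncertainty", "uncertain"}
WINDOW = 10  # how many tokens on either side count as "near"

def political_risk_score(transcript: str) -> float:
    tokens = re.findall(r"[a-z']+", transcript.lower())
    hits = 0
    for i in range(len(tokens) - 1):
        if (tokens[i], tokens[i + 1]) in POLITICAL_BIGRAMS:
            nearby = tokens[max(0, i - WINDOW): i + 2 + WINDOW]
            if any(w in RISK_WORDS for w in nearby):
                hits += 1
    return hits / max(len(tokens), 1)  # normalize by transcript length

call = "We see real uncertainty around trade policy and how public opinion shifts."
print(political_risk_score(call))
```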
Leaps that led to deeper understanding
Finance and accounting have long sought to learn from text. Economists originally relied on a “bag of words” model, which counts the frequency of words in a text: for example, how many times does a document include the words capital and expenses? The more frequent those words, the more likely it is that the document discusses the company’s spending policies.
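A bag-of-words representation can be built in a few lines; this sketch assumes a simple regex tokenizer and just tallies word counts, discarding everything else.

```python
# A bag-of-words representation: reduce a document to word counts,
# discarding grammar and word order. Assumes a simple regex tokenizer.
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    return Counter(re.findall(r"[a-z']+", text.lower()))

doc = "Capital expenses rose this year; capital spending and related expenses will stay high."
counts = bag_of_words(doc)
print(counts["capital"], counts["expenses"])  # 2 2
```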
This method is simple: in 1963, the late Frederick Mosteller and David L. Wallace used it to argue that James Madison, not Alexander Hamilton, wrote the 12 essays, out of the 85 Federalist Papers, whose authorship was disputed. By counting commonly used words in texts known to be by Madison and by Hamilton, they could compare those frequencies with the counts of the same words in the disputed Federalist essays.
However, the method is also limited. It does not take into account potentially important information such as grammar or the order in which words appear. As a result, it captures little of a document’s context. A company’s 10-K filing may state that “increased transportation costs offset our revenue gains,” and a bag of words may interpret that as a positive statement; after all, the words increased and revenue gains can look upbeat on their own. But the model misses the fact that increased paired with costs is negative, and that offset changes the meaning of revenue gains.
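A toy example of that failure mode: a naive word-count scorer with invented positive and negative lists rates the sentence as upbeat because it sees increased and gains but ignores how costs and offset flip the meaning.

```python
# Toy illustration of the failure mode: the scorer sees "increased" and
# "gains" as positive and misses that "costs" and "offset" flip the meaning.
# Both word lists are invented for the example.
POSITIVE = {"increased", "gains", "revenue", "growth"}
NEGATIVE = {"decreased", "loss", "decline"}

sentence = "increased transportation costs offset our revenue gains"
tokens = sentence.split()
score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
print(score)  # +3: a negative statement scored as positive
```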
Google researchers took a big step toward incorporating this context in 2013 when the company introduced word2vec, a neural-network-based model that learns vector representations of words and captures semantic relationships between them. Vectorization has enabled ML models to process and understand text in a more meaningful way. Given three related words, such as man, king, and woman, word2vec can find the word most likely to complete the group, queen, by measuring the distances between the vectors assigned to each word.
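The analogy can be reproduced with pretrained embeddings; this sketch assumes the gensim library and its downloadable Google News word2vec vectors (a large download), though any pretrained vectors exposing most_similar would behave similarly.

```python
# Sketch of the king - man + woman ≈ queen analogy using gensim and the
# pretrained Google News word2vec vectors (roughly a 1.6 GB download).
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # returns KeyedVectors

# Vector arithmetic: start from "king", subtract "man", add "woman".
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # [('queen', ~0.71)]
```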
And in a 2017 paper, a team of researchers led by Ashish Vaswani, then at Google Brain, introduced what deep-learning practitioners call the transformer architecture. Transformers form the basis of the large language models we know today and represent a significant improvement over previous architectures in their ability to understand and generate human language, something word-based models could not do.
An important LLM, BERT (Bidirectional Encoder Representations from Transformers), is used to understand the context of words but was not designed to generate text. It works by considering the words that appear before and after a particular word to decipher its meaning.
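The Hugging Face transformers library exposes this masked-word objective directly; the short sketch below (which downloads the bert-base-uncased weights on first use) asks the model to fill in a masked word using context from both sides.

```python
# Sketch of BERT's masked-word prediction via the Hugging Face transformers
# library (model weights are downloaded on first use).
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# BERT looks at the words on both sides of [MASK] to infer the missing word.
for guess in fill("Higher fuel costs [MASK] our revenue gains this quarter.", top_k=3):
    print(guess["token_str"], round(guess["score"], 3))
```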
Meanwhile, GPT (generative pre-trained transformer) is able to predict the most likely next word in a sequence based on the text preceding it. For example, finish this sentence: “Why did the chicken cross the _____?” Your brain automatically fills in the blank with road as the most likely next word, although many other words would work here, including street, highway, or maybe even court. GPT does the same thing. Its parameters can, however, be set so that it does not always choose the most likely word, which allows for more creativity in the text it generates.
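The same library can illustrate next-word prediction and the sampling behavior mentioned above; this sketch uses the small GPT-2 model as a stand-in for larger GPT systems.

```python
# Sketch of next-word prediction with a small GPT-style model (GPT-2 here,
# via Hugging Face transformers). Raising the temperature makes the model
# less likely to always pick the single most probable next word.
from transformers import pipeline, set_seed

generate = pipeline("text-generation", model="gpt2")
set_seed(0)  # for repeatable sampling

prompt = "Why did the chicken cross the"
greedy = generate(prompt, max_new_tokens=1, do_sample=False)
sampled = generate(prompt, max_new_tokens=5, do_sample=True, temperature=1.2)

print(greedy[0]["generated_text"])   # most likely continuation, e.g. "... road"
print(sampled[0]["generated_text"])  # a more varied continuation
```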
Today, these LLMs are also being applied to finance, allowing researchers and practitioners in the field to extract increasingly valuable information from data of all kinds.