
February 15, 2024

Stock Analysis: Creating a reliable LLM with Bavest and OpenAI

Author:
Bavest
Engineering
Introduction

The advent of large language models (LLMs) has marked the onset of a new era of innovation in natural language processing (NLP), promising transformative capabilities across diverse domains. Finance in particular has emerged as a focal point of exploration, as LLMs offer the potential to revolutionize how financial data is analyzed and interpreted. First, we will look at the tech stack and architecture in general; then we will walk through a concrete example that you can try yourself.

However, there are two big challenges regarding equity analysis:

  1. The LLM must hallucinate as little as possible.
  2. The system needs access to complete and up-to-date financial data from listed companies, including all reported KPIs such as sustainability data.

These challenges can be addressed with retrieval-augmented generation (RAG), an AI technique that tackles the persistent problem of hallucinations. By supplementing an existing model with external information beyond its training data, RAG enables it to handle specific tasks far more reliably.

RAG therefore needs to be integrated with an operational data store capable of converting queries and documents into numerical vectors, essentially forming a vector database. This vector database lets the model incorporate new information and adjust in real time, continuously enhancing its comprehension. In the case of equity analysis, we give the system access to companies' annual reports, quarterly reports and sustainability reports, so the RAG pipeline can draw on the latest financials and other KPIs as well as older data points.
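The retrieval idea at the heart of this setup can be illustrated in a few lines of Python. This is a minimal sketch, assuming the query and the report chunks have already been turned into vectors by some embedding model; in production the nearest-neighbour lookup would be handled by a vector database such as Pinecone or Redis rather than plain NumPy.

```python
# Minimal sketch of the retrieval idea behind RAG: the query and each report
# chunk are embedded as vectors, and the closest chunks are fetched as context.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_chunks(query_vec: np.ndarray, chunk_vecs: list[np.ndarray], k: int = 3) -> list[int]:
    # Rank stored report chunks by similarity to the query and return the
    # indices of the k best matches.
    scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```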

Collecting all of a company's reports since its IPO is a big challenge in itself. At Bavest, we provide every report published since the IPO. Our AI uses reasoning to download the reports from the context of company websites: it recognizes newly published reports and can therefore automatically collect both historical and the latest reports for over 60,000 stocks.

Strategy for Internal LLM Development

Developing an internal RAG system requires a systematic approach, starting with a clear understanding of the project's objectives and requirements. It's essential to assemble a multidisciplinary team comprising domain experts, data scientists, and software engineers to ensure a holistic approach to development. Additionally, establishing a well-defined project roadmap with clear milestones and deliverables is crucial for managing expectations and ensuring alignment with organizational goals. Continuous feedback loops and iterative development processes should be implemented to address evolving user needs and technical challenges effectively.

Tech Stack Overview

Building a RAG system involves utilizing a diverse set of technologies and libraries to handle various tasks, including data acquisition, text extraction, retrieval, generation, and user interface development. Commonly used libraries and frameworks include PyPDF2, PDFMiner, or Textract for text extraction from PDF documents, and Elasticsearch or Apache Solr for document indexing and retrieval. For natural language processing tasks, pre-trained language models such as GPT (e.g., GPT-3) can be fine-tuned on domain-specific datasets to enable generation of insightful responses.

Data Acquisition

Utilizing Bavest, which provides access to equities worldwide, historically since IPO, is a strategic choice for acquiring annual and quarterly reports. The Bavest API offers a comprehensive database of reports for large, mid, and small-cap companies, facilitating a broad analysis of the market landscape. By leveraging our API, your software engineers and data scientists can streamline the process of fetching the latest reports, ensuring access to up-to-date information for analysis.
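As a rough illustration, fetching reports might look like the sketch below. The endpoint, parameters and response shape are assumptions made for this example, not the documented Bavest API; please refer to the actual API documentation (or contact us) for the real interface.

```python
# Hypothetical sketch of fetching company reports via the Bavest API.
# Endpoint, parameters and response shape are assumptions for illustration only.
import requests

BAVEST_API_KEY = "YOUR_BAVEST_API_KEY"

def fetch_reports(symbol: str) -> list[dict]:
    response = requests.get(
        "https://api.bavest.co/v0/reports",       # assumed endpoint
        params={"symbol": symbol},
        headers={"x-api-key": BAVEST_API_KEY},
    )
    response.raise_for_status()
    return response.json()                        # assumed: report metadata incl. PDF URLs

def download_pdf(url: str, path: str) -> None:
    # Save the report PDF locally so it can be passed to the text-extraction step.
    with open(path, "wb") as f:
        f.write(requests.get(url).content)
```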

Text Extraction

Once the reports are retrieved from the Bavest API, the next step is to extract the relevant text content from the PDF documents. Libraries such as PyPDF2, PDFMiner, or Textract can be employed for this purpose. These libraries enable developers to extract text while handling challenges such as formatting inconsistencies and non-textual elements present in the documents.
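A minimal sketch of this step with PyPDF2 (the file name is just a placeholder):

```python
# Extract raw text from a downloaded report PDF using PyPDF2 (pip install PyPDF2).
from PyPDF2 import PdfReader

def extract_text(pdf_path: str) -> str:
    reader = PdfReader(pdf_path)
    # Some pages (e.g. scanned charts) may return None, hence the "or ''".
    return "\n".join(page.extract_text() or "" for page in reader.pages)

report_text = extract_text("nvidia_quarterly_report.pdf")  # placeholder file name
```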

Retrieval

Implementing a robust retrieval system is essential for efficiently identifying relevant documents based on user queries. Techniques such as BM25 or TF-IDF can be used for document retrieval, with indexing performed using Elasticsearch or Apache Solr. By indexing the preprocessed text data, the retrieval system can quickly retrieve reports that match the user's search criteria, enhancing the overall usability of the RAG system.
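As a simple stand-in for a full Elasticsearch or BM25 setup, the sketch below scores report chunks against a query with TF-IDF using scikit-learn; the idea is the same, only the indexing backend changes in production.

```python
# Minimal TF-IDF retrieval sketch: score each report chunk against the user
# query and return the most relevant ones.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(chunks + [query])    # last row is the query
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    best = scores.argsort()[::-1][:k]                      # indices of the k best chunks
    return [chunks[i] for i in best]
```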

Generation

The generation component of the RAG system involves leveraging pre-trained language models such as GPT to generate insights and responses based on the retrieved reports. Fine-tuning the language model on a dataset containing questions and answers related to annual reports enables the model to provide accurate and contextually relevant responses to user queries. This capability enhances the system's ability to analyze and interpret complex financial information, empowering users to make informed decisions.
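A minimal sketch of the generation step, assuming the legacy openai Python client (openai<1.0) and the text-davinci-003 completion model used later in this article; newer client versions and chat models expose a different interface.

```python
# Sketch of the generation step: retrieved report chunks are passed as context
# together with the user's question (legacy openai client, openai<1.0).
import openai

openai.api_key = "YOUR_OPENAI_API_KEY"

def answer(question: str, context_chunks: list[str]) -> str:
    prompt = (
        "Answer the question using only the report excerpts below.\n\n"
        + "\n\n".join(context_chunks)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    completion = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=256,
        temperature=0,
    )
    return completion.choices[0].text.strip()
```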

Integration/UI

Developing a user-friendly interface for portfolio managers to interact with the RAG system is crucial for maximizing its utility and adoption. Whether through a web-based dashboard or a command-line interface, the interface should enable users to easily submit queries, view relevant reports, and access generated insights. Integration of the retrieval and generation components ensures a seamless user experience, allowing portfolio managers to extract actionable insights efficiently.

Evaluation & testing phase

During the evaluation and testing phase, the performance of the RAG system is assessed using a combination of test cases and real-world usage scenarios. Feedback from portfolio managers and other stakeholders is collected to identify areas for improvement and refine the system's capabilities. Iterative testing and validation help ensure that the system meets the needs of users and delivers value in a practical setting.

Deployment

Deploying the RAG system to a production environment involves ensuring scalability, reliability, and security. Continuous monitoring of system performance and user feedback enables iterative improvements to be made post-deployment. By leveraging cloud-based infrastructure and DevOps practices, developers can streamline the deployment process and maintain the system effectively over time.

Creating a chatbot for equity analysis with Bavest & OpenAI

As explained earlier, OpenAI's models work well for this use case, so we'll be using OpenAI's LLM APIs. However, the amount of text allowed in a single request is limited: the model only supports a maximum of about 4,000 tokens (for simplicity, think of each word of a text as one token). As a result, it is necessary to divide a large text into smaller portions. Steps 1, 2, and 3 relate to creating an index.

1. Prerequisite
  • An OpenAI API key is required. Please create one here: https://openai.com/blog/openai-api
  • You can obtain a test API key from Bavest for the company reports database: write to us at support@bavest.co and tell us your project goal, or make an appointment if you would like to find out more about how we can help you implement your internal LLM: https://calendly.com/ramtin-babaei/bavest-demo-en

2. Data processing

Step 1: The extensive text is divided into smaller segments. This segmentation ensures compliance with the 4,096-token limit when using the text-davinci-003 API.
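A naive word-based chunking sketch, using the simplification above that one word roughly corresponds to one token; the chunk size and overlap are arbitrary example values.

```python
# Split the report text into segments small enough to stay within the model's
# token limit, with a small overlap so context isn't cut mid-thought.
def chunk_text(text: str, max_words: int = 500, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        start += max_words - overlap
    return chunks
```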

Step 2: Generate embeddings for every text segment using OpenAI's cheaper Ada API. Embeddings serve as numeric representations of text.

Please note that the code uses the “text-embedding-ada-002” model, which is the most economical option available.
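A sketch of this step, again assuming the legacy openai Python client (openai<1.0):

```python
# Create an embedding for each chunk with the text-embedding-ada-002 model.
import openai

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=chunks,                     # the API accepts a list of strings
    )
    return [item["embedding"] for item in response["data"]]
```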

Step 3: Once the index is created, we store it in index storage. There are a few options for production-ready systems, such as Pinecone or Redis. To keep things simple, the example below is saved in a JSON file. This was the last step in preparing the data; the steps under "Chat with documents" below describe how we query our data.
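Taken together, steps 1 to 3 can be handled by the LlamaIndex library used in the example below. This sketch follows an older llama_index release (GPTSimpleVectorIndex with JSON persistence); newer versions use VectorStoreIndex and a StorageContext instead, so treat the exact class and method names as version-dependent.

```python
# Build and persist the index with an older llama_index release.
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("reports").load_data()   # folder with the report PDFs
index = GPTSimpleVectorIndex.from_documents(documents)     # chunks, embeds, and indexes the text
index.save_to_disk("index.json")                           # step 3: persist the index as JSON
```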

Chat with documents

Step 1: Load the index from the JSON file and give it a question.

Step 2: Based on the question, a similarity search is performed in the index to identify the relevant sections of text.

Step 3: Similar to a ChatGPT Q&A, the question is sent to OpenAI's text-davinci-003 API together with the context (the relevant chunks of text).

Step 4: OpenAI's text-davinci-003 API returns an answer derived from the context that was submitted to it.

In the example below, a PDF file that contains information about Nvidia's quarterly report is used for the questions and answers, with the LlamaIndex Python library doing most of the work.
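An end-to-end sketch of this Q&A flow, again using the older llama_index interface referenced above (the question is just an example):

```python
# Load the persisted JSON index, run the similarity search, and let the
# completion model answer from the retrieved context.
from llama_index import GPTSimpleVectorIndex

index = GPTSimpleVectorIndex.load_from_disk("index.json")
response = index.query("What was Nvidia's revenue in the latest quarter?")
print(response)
```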

Conclusion

In summary, the development of a retrieval-augmented generation system offers significant potential for improving portfolio analysis and decision-making in the financial sector. By using advanced NLP techniques and APIs such as the Bavest API, companies can automate the retrieval and analysis of company reports so that portfolio managers can efficiently gain actionable insights. As the technology advances, RAG systems are poised to play a central role in driving innovation and efficiency in asset management and investment strategies.
