Stock tweet topic modelling

Unveiling Market Sentiments - Analyzing Stock Tweets with Advanced NLP Techniques





Introduction

The world of finance is dynamic, and staying ahead requires not just an understanding of market trends but also an insight into the sentiments that drive these trends. In this era of information overload, social media, particularly Twitter, has become a treasure trove of opinions and sentiments related to the stock market. Leveraging advanced Natural Language Processing (NLP) techniques, I embarked on a journey to extract valuable insights from a dataset of stock-related tweets.



1. Data Collection

To kickstart the project, I gathered a dataset from Kaggle containing 80k+ stock-related tweets. This dataset served as the foundation for uncovering sentiments and trends within the stock market.



2. Basic Preprocessing

Cleaning and preparing the data is a crucial step in any NLP project. I ran a series of preprocessing steps to improve the dataset’s quality and consistency: removing unwanted patterns, Twitter handles, special characters, numbers, and punctuation. I also dropped short words and stop words and normalized whitespace to prepare the text for the subsequent analysis.
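
A minimal sketch of what such a cleaning function could look like, using only the standard library. The stop-word list, the minimum word length of 3, and the choice to treat URLs as the "patterns" to strip are assumptions for illustration, not the exact settings used in this project.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real run would
# likely use a fuller list, for example from NLTK or spaCy).
STOP_WORDS = {"the", "and", "for", "are", "this", "that", "with", "from", "will", "was"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # assumed "pattern" to remove
HANDLE_RE = re.compile(r"@\w+")                 # Twitter handles such as @user
NON_LETTER_RE = re.compile(r"[^a-z\s]")         # digits, punctuation, symbols


def clean_tweet(text: str, min_len: int = 3) -> str:
    """Lowercase a tweet and strip URLs, handles, non-letters, short words and stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = HANDLE_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    # split() with no arguments also collapses repeated whitespace.
    tokens = [t for t in text.split() if len(t) >= min_len and t not in STOP_WORDS]
    return " ".join(tokens)


print(clean_tweet("@trader123 $AAPL is up 5% today!!! https://t.co/xyz  The rally continues"))
# prints: aapl today rally continues
```

Removing URLs and handles before stripping non-letters matters here: otherwise fragments such as "https" or "trader" would survive as ordinary words.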



3. Topic Modeling with BERTopic

Using BERTopic, a library for topic modeling, I clustered the tweets to identify the key topics in the dataset, with HDBSCAN as the clustering algorithm. This helped uncover latent patterns and prevalent themes among the stock-related tweets. The resulting topics are clearly uneven in size, though; I deal with that imbalance in Step 6.



topic-clusters.png



4. Refinement with Llama 2 Model

The journey didn’t end with BERTopic; I delved deeper into topic refinement using the Llama 2 model and prompt engineering. This step aimed to enhance the quality of topics extracted, providing a more nuanced understanding of the sentiments expressed in the tweets.



5. Topic Labeling

With well-defined topics at hand, each tweet was marked with its respective topic. This not only organized the dataset but also laid the groundwork for subsequent supervised learning steps.
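
As a rough illustration of this step, the sketch below attaches a readable topic name to each tweet given the per-tweet topic IDs from the clustering step. The function name, the example topic names, and the decision to drop outliers are assumptions; BERTopic with HDBSCAN conventionally assigns the ID -1 to tweets that fit no cluster.

```python
def label_tweets(tweets, topic_ids, topic_names, outlier_id=-1):
    """Pair each tweet with its topic name, skipping outlier tweets.

    tweets      : list of cleaned tweet strings
    topic_ids   : list of integer topic IDs, one per tweet
    topic_names : dict mapping topic ID -> human-readable label (hypothetical)
    """
    labeled = []
    for tweet, topic_id in zip(tweets, topic_ids):
        if topic_id == outlier_id:
            continue  # assumption: unclustered tweets are left out of training
        labeled.append((tweet, topic_names[topic_id]))
    return labeled


# Hypothetical example data.
names = {0: "earnings", 1: "crypto"}
pairs = label_tweets(["aapl beats estimates", "lol", "btc rally"], [0, -1, 1], names)
print(pairs)
```

The resulting (text, label) pairs are the form the later balancing and fine-tuning steps work with.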



6. Balancing the dataset

Initially the data was not well balanced, so I removed classes with fewer than 300 records and then used random oversampling to balance the remaining classes.



tweet-topic-modelling-initial-dist.png

Initial class distribution



tweet-topic-modelling-mid-dist.png

Class distribution after removing classes with fewer than 300 samples



tweet-topic-modelling-final-dist.png

Final class distribution after random oversampling
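
The two balancing operations described in this step can be sketched in plain Python as follows. The 300-record threshold comes from the text above; the function name, the seed, and the choice to oversample every class up to the size of the largest one are assumptions.

```python
import random
from collections import Counter, defaultdict


def balance(records, min_count=300, seed=42):
    """Drop rare classes, then randomly oversample the rest to equal size.

    records: list of (text, label) pairs.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in records:
        by_label[label].append((text, label))

    # Keep only classes that have at least `min_count` records.
    kept = {lab: rows for lab, rows in by_label.items() if len(rows) >= min_count}
    if not kept:
        return []

    target = max(len(rows) for rows in kept.values())
    balanced = []
    for rows in kept.values():
        balanced.extend(rows)
        # Sample with replacement until the class reaches the target size.
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: two large classes and one rare class.
data = (
    [(f"e{i}", "earnings") for i in range(500)]
    + [(f"c{i}", "crypto") for i in range(350)]
    + [(f"m{i}", "misc") for i in range(10)]
)
print(Counter(label for _, label in balance(data)))
```

Random oversampling only duplicates existing tweets; it adds no new information, so it evens out the class counts without changing what the minority classes look like.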



7. Fine-Tuning with DistilBERT

I took a DistilBERT base model from the Hugging Face Model Hub and fine-tuned it on the labeled, balanced dataset. This step aimed to adapt the model to the nuances of stock-related language so that it can assign tweets to the topics identified earlier.



8. Train-Test Split

To evaluate the model’s performance, I split the dataset 70/30 into training and test sets. Holding out a test set does not prevent overfitting by itself, but it shows how well the model generalizes to tweets it was not trained on.
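
A minimal sketch of a 70/30 split in plain Python. Splitting per class (stratification) and the fixed seed are assumptions added for illustration; the text only states the split ratio.

```python
import random
from collections import defaultdict


def stratified_split(records, test_size=0.3, seed=42):
    """Split (text, label) pairs into train and test, preserving class ratios."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[1]].append(rec)

    train, test = [], []
    for rows in by_label.values():
        rows = rows[:]          # copy so the caller's data is not reordered
        rng.shuffle(rows)
        n_test = round(len(rows) * test_size)
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])

    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Hypothetical toy data: 100 tweets in each of two classes.
data = [(f"a{i}", "earnings") for i in range(100)] + [(f"b{i}", "crypto") for i in range(100)]
train, test = stratified_split(data)
print(len(train), len(test))
```

One design point worth keeping in mind: if random oversampling runs before the split, copies of the same tweet can land in both the training and the test set, which tends to inflate test accuracy. Oversampling only the training portion after splitting avoids that.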



9. Model Evaluation

The real test came with evaluating the model on the held-out test set. The fine-tuned model reached a loss of 0.0388 and an accuracy of 99.25%, which indicates that it assigns stock-related tweets to their topic labels very reliably.



tweet-topic-modelling-accuracy.png
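
For readers who want to see how the two reported numbers are defined, here is a small sketch. It assumes the loss is the mean cross-entropy over the test set, which is the usual loss for single-label sequence classification; the function names and the toy predictions are illustrative only.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted class equals the true class."""
    if not y_true:
        raise ValueError("y_true must not be empty")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_entropy(y_true, probs, eps=1e-12):
    """Mean negative log-probability the model assigns to the true class.

    probs: one list of class probabilities per example.
    """
    if not y_true:
        raise ValueError("y_true must not be empty")
    total = sum(math.log(max(p[t], eps)) for t, p in zip(y_true, probs))
    return -total / len(y_true)


# Hypothetical predictions for four tweets over three topic classes.
y_true = [0, 1, 2, 1]
probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.50, 0.30],
    [0.10, 0.70, 0.20],
]
y_pred = [max(range(len(p)), key=p.__getitem__) for p in probs]  # argmax per row
print(accuracy(y_true, y_pred))   # prints: 0.75
print(cross_entropy(y_true, probs))
```

A low loss alongside a high accuracy means the model is not only picking the right topic but also doing so with high confidence.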



10. Model Deployment

To make this powerful tool accessible to the wider community, I deployed the trained model to the Hugging Face Model Hub. The model is now available for use and exploration by fellow enthusiasts and analysts.





Conclusion

In the ever-evolving landscape of finance, understanding market sentiments is as crucial as analyzing raw data. The amalgamation of advanced NLP techniques, from topic modeling with BERTopic to fine-tuning with DistilBERT, has empowered me to uncover intricate sentiments within the realm of stock-related tweets. The deployment of the model to the Hugging Face Model Hub and the creation of a user-friendly app further democratize access to these insights, inviting enthusiasts and professionals alike to explore the dynamic world of market sentiments. This journey showcases the power of NLP in transforming raw data into actionable intelligence, bringing us one step closer to unraveling the mysteries of the stock market.