I'm always on the lookout for new and interesting technologies to play around with on the side. Lately, I've been digging into OpenAI's language models, and I thought it would be cool to use them to solve a problem that artists, creators, and influencers might face: analyzing their online reputation on Twitter.
In this post, I'll walk you through a project I created that monitors your Twitter mentions, analyzes them using OpenAI's GPT-3, and groups them by topic, sentiment, and total number of tweets. This can help you identify areas of strength and weakness in your online presence, as well as track how your reputation changes over time.
I'll cover everything from setting up authentication for the Twitter and OpenAI APIs, to preprocessing the text of the tweets, to using KMeans or DBSCAN clustering to group similar mentions together, to visualizing the results in a Pandas dataframe or Markdown table.
I hope this project inspires you to tinker with OpenAI and explore new ways of using language models to solve interesting problems!
Prerequisites
To follow this tutorial, you'll need the following:
- A Twitter account with a Developer App and API keys
- An OpenAI API key
- Python 3.7 or later with the following packages installed (install command just after this list):
  - tweepy
  - openai
  - pandas
  - matplotlib
  - scikit-learn
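You can grab everything with pip; python-dotenv, nltk, and tabulate also come up later in the post, so I'll install them here too:

```bash
pip install tweepy openai pandas matplotlib scikit-learn python-dotenv nltk tabulate
```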
Overview
Here's an overview of the steps we'll take in this tutorial:
- Set up authentication for the Twitter and OpenAI APIs.
- Retrieve recent mentions of your Twitter account.
- Clean and preprocess the text of the tweets.
- Use OpenAI's GPT-3 to analyze the sentiment and topics of the tweets.
- Group the tweets by topic using KMeans or DBSCAN clustering.
- Analyze the sentiment of each group of tweets.
- Visualize the results in a Pandas dataframe or Markdown table.
Step 1: Set up authentication for the APIs
To use the Twitter and OpenAI APIs, you'll need to authenticate using your API keys. You can create a .env file to store your keys and load them into your Python script with the python-dotenv package.
Here's an example of how to load your API keys from a .env file:
```python
import os

from dotenv import load_dotenv

load_dotenv()

TWITTER_API_KEY = os.getenv("TWITTER_API_KEY")
TWITTER_API_SECRET = os.getenv("TWITTER_API_SECRET")
TWITTER_ACCESS_TOKEN = os.getenv("TWITTER_ACCESS_TOKEN")
TWITTER_ACCESS_SECRET = os.getenv("TWITTER_ACCESS_SECRET")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
```
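For reference, the .env file itself is just key=value pairs; the values below are placeholders for your own credentials:

```
TWITTER_API_KEY=your-api-key
TWITTER_API_SECRET=your-api-secret
TWITTER_ACCESS_TOKEN=your-access-token
TWITTER_ACCESS_SECRET=your-access-secret
OPENAI_API_KEY=your-openai-key
```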
Step 2: Retrieve recent mentions of your Twitter account
Once you've authenticated with the Twitter API, you can retrieve recent mentions of your account using the tweepy package.
```python
import tweepy

auth = tweepy.OAuth1UserHandler(
    TWITTER_API_KEY,
    TWITTER_API_SECRET,
    TWITTER_ACCESS_TOKEN,
    TWITTER_ACCESS_SECRET,
)
api = tweepy.API(auth)

mentions = api.mentions_timeline()
```
Step 3: Clean and preprocess the text of the tweets
To prepare the text of the tweets for analysis, you can clean and preprocess it: strip URLs with a regular expression, remove punctuation, lowercase everything, drop stopwords, and stem or lemmatize the words (the code below stems with NLTK's Snowball stemmer).
```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

nltk.download('stopwords')  # one-time download of NLTK's stopword list

stemmer = SnowballStemmer('english')
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = text.lower()                  # Lowercase
    words = text.split()
    words = [word for word in words if word not in stop_words]  # Remove stopwords
    words = [stemmer.stem(word) for word in words]              # Stem words
    return " ".join(words)

preprocessed_mentions = [preprocess_text(mention.text) for mention in mentions]
```
Step 4: Use OpenAI's GPT-3 to analyze the sentiment and topics of the tweets
To analyze the sentiment and topics of the preprocessed tweets, we can use OpenAI's GPT-3 API. We'll wrap each preprocessed tweet in a prompt and ask the davinci language model for a response that includes the tweet's sentiment and topic.
```python
import openai
from collections import defaultdict

openai.api_key = OPENAI_API_KEY

mentions_summary = defaultdict(list)
mention_sentiments = []  # per-mention scores, in the same order as preprocessed_mentions

for mention, preprocessed_mention in zip(mentions, preprocessed_mentions):
    prompt = f"Analyze the sentiment and topics of this tweet: {preprocessed_mention}"
    response = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        max_tokens=1024,
        n=1,
        stop=None,
        temperature=0.5,
    )
    summary = parse_model_response(response)
    mentions_summary[summary["topic"]].append(summary["sentiment"])
    mention_sentiments.append(summary["sentiment"])
```
In this code, parse_model_response is a function that extracts the topic and a numeric sentiment score from the text the GPT-3 model generates. The mentions_summary dictionary is a defaultdict that groups sentiment scores by topic, and mention_sentiments keeps a per-tweet score that we'll line up with the cluster labels in step 6.
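I left parse_model_response out above, so here's a minimal sketch of one. It assumes the prompt also instructs the model to answer in a fixed `topic: <topic>; sentiment: <score>` format, with the score a number between -1 and 1; davinci is a raw completion model, so in practice you'd want a few-shot prompt to coax it into that shape:

```python
def parse_model_response(response):
    """Parse a completion of the form 'topic: <topic>; sentiment: <score>'.

    Hypothetical helper: it assumes the prompt told the model to reply in
    exactly that format and that the sentiment is numeric.
    """
    text = response.choices[0].text.strip().lower()
    topic, sentiment = "unknown", 0.0
    for part in text.split(";"):
        if ":" not in part:
            continue
        key, value = part.split(":", 1)
        key, value = key.strip(), value.strip()
        if key == "topic":
            topic = value
        elif key == "sentiment":
            try:
                sentiment = float(value)
            except ValueError:
                pass  # keep the neutral default if the score isn't numeric
    return {"topic": topic, "sentiment": sentiment}
```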
Step 5: Group the tweets by topic using KMeans or DBSCAN clustering
To group similar tweets together, we can use a clustering algorithm such as KMeans or DBSCAN. We'll use the scikit-learn package to apply these algorithms to the preprocessed tweets.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

vectorizer = TfidfVectorizer()
vectorized_mentions = vectorizer.fit_transform(preprocessed_mentions)

kmeans = KMeans(n_clusters=5, random_state=0).fit(vectorized_mentions)
cluster_labels = kmeans.labels_
```
This code applies the TfidfVectorizer to the preprocessed mentions to create a matrix of TF-IDF-weighted features, and then applies the KMeans algorithm with 5 clusters to group the mentions together. The resulting cluster_labels array contains the label of the cluster for each mention.
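I keep mentioning DBSCAN as the alternative, so here's a sketch of that path if you'd rather not pick the number of clusters up front. The eps and min_samples values are guesses you'd tune against your own mentions:

```python
from sklearn.cluster import DBSCAN

# Cosine distance tends to suit sparse TF-IDF vectors better than Euclidean.
# DBSCAN infers the number of clusters itself and labels outliers as -1.
dbscan = DBSCAN(eps=0.7, min_samples=3, metric="cosine").fit(vectorized_mentions)
cluster_labels = dbscan.labels_
```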
Step 6: Analyze the sentiment of each group of tweets
To analyze the sentiment of each group of tweets, we can iterate over the clusters and calculate the average sentiment score for each one.
```python
from collections import defaultdict

cluster_sentiments = defaultdict(list)
for cluster_label, sentiment in zip(cluster_labels, mention_sentiments):
    cluster_sentiments[cluster_label].append(sentiment)

for cluster_label, sentiments in cluster_sentiments.items():
    average_sentiment = sum(sentiments) / len(sentiments)
    print(f"Cluster {cluster_label}: {average_sentiment}")
In this code, we pair each cluster label with the matching per-mention sentiment score from step 4, group the scores by cluster label in the cluster_sentiments dictionary, and then calculate the average sentiment for each cluster by summing its scores and dividing by the number of mentions in it.
Step 7: Visualize the results in a Pandas dataframe or Markdown table
To visualize the results of the analysis, we can create a Pandas dataframe or Markdown table that summarizes the sentiment and topic of the mentions.
```python
import pandas as pd

data = []
for topic, sentiments in mentions_summary.items():
    total_tweets = len(sentiments)
    average_sentiment = sum(sentiments) / total_tweets
    data.append({
        "topic": topic,
        "sentiment": average_sentiment,
        "total_tweets": total_tweets,
    })

df = pd.DataFrame(data)
print(df)
```
This code creates a Pandas dataframe with three columns: "topic", "sentiment", and "total_tweets". The "topic" column contains the topic of the mentions, the "sentiment" column contains the average sentiment score for the topic, and the "total_tweets" column contains the total number of tweets for the topic.
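matplotlib is in the prerequisites, so here's a quick sketch of turning that same dataframe into a bar chart of average sentiment per topic:

```python
import matplotlib.pyplot as plt

# One bar per topic, bar height = average sentiment score.
df.plot(kind="bar", x="topic", y="sentiment", legend=False)
plt.ylabel("average sentiment")
plt.tight_layout()
plt.show()
```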
We can also use the tabulate package to generate a Markdown table from the dataframe:
```python
from tabulate import tabulate

print(tabulate(df, headers='keys', tablefmt='pipe'))
```
This code generates a Markdown table with the same columns as the Pandas dataframe.
Conclusion
I had a blast using OpenAI's GPT-3 to analyze the sentiment and topics of my Twitter mentions. By preprocessing the text of the tweets, applying KMeans or DBSCAN clustering to group similar mentions together, and visualizing the results in a Pandas dataframe or Markdown table, I was able to gain insights into some interesting conversations. Check out the examples below!
While this project was just for fun, I believe there are many real-world applications of this kind of analysis, especially for artists, creators, and influencers who need to track their online presence and reputation.
I hope this post inspires you to take a closer look at OpenAI and explore how you can use language models to solve interesting problems of your own!