A Python package for Fuzzy Topic Models
This is the Python code to train Fuzzy Latent Semantic Analysis (FLSA)-based topic models. The details of the original FLSA model can be found here. With my group, we formulated two alternative topic modeling algorithms, 'FLSA-W' and 'FLSA-V', which are derived from FLSA. Once the paper is published (it has been accepted), we will place a link here too.
Topic modeling is a popular task within the domain of Natural Language Processing (NLP). It is a type of statistical modeling for discovering the latent 'topics' occurring in a collection of documents. While humans typically describe the topic of something with a single word, topic modeling algorithms describe topics as a probability distribution over words.
Various topic modeling algorithms exist, and one thing they have in common is that they all output two matrices: the probability of each word given a topic, P(W|T), and the probability of each topic given a document, P(T|D).
From the first matrix, the top-n words per topic are taken to represent that topic.
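For illustration, here is how the top-n words per topic could be read off a P(W|T) matrix. The vocabulary and probability values below are invented for the example; in practice the matrix comes from a trained model:

```python
import numpy as np

# Hypothetical P(word | topic) matrix: rows = words, columns = topics.
vocabulary = ["cell", "protein", "gene", "market", "stock", "price"]
prob_word_given_topic = np.array([
    [0.30, 0.02],
    [0.25, 0.03],
    [0.35, 0.05],
    [0.04, 0.30],
    [0.03, 0.28],
    [0.03, 0.32],
])

def top_n_words(matrix, vocab, n=3):
    """Return the n most probable words for each topic (column)."""
    topics = []
    for topic_idx in range(matrix.shape[1]):
        order = np.argsort(matrix[:, topic_idx])[::-1][:n]
        topics.append([vocab[i] for i in order])
    return topics

print(top_n_words(prob_word_given_topic, vocabulary))
```

Each topic is then represented by its list of most probable words (here, topic 0 looks biological and topic 1 financial).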
On top of finding the latent topics in a text, topic models can also be used for more explainable text classification. In that case, documents can be represented as a 'topic embedding': a c-length vector in which each cell corresponds to a topic and contains a number indicating the extent to which that topic is represented in the document. These topic embeddings can then be fed to machine learning classification models. Some machine learning classification models can show the weights they assigned to the input variables, based on which they made their decisions. The idea is that if the topics are interpretable, then the weights assigned to the topics reveal why a model made its decisions.
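A minimal sketch of this idea, with an invented P(T|D) matrix and invented (untrained) classifier weights, just to show how per-topic weights make a decision traceable:

```python
import numpy as np

# Hypothetical P(topic | document) matrix: rows = documents, columns = topics.
prob_topic_given_document = np.array([
    [0.7, 0.2, 0.1],   # document 0 mostly covers topic 0
    [0.1, 0.8, 0.1],   # document 1 mostly covers topic 1
])

# Each row is that document's c-length topic embedding (here c = 3).
embedding = prob_topic_given_document[0]

# A linear classifier over these embeddings learns one weight per topic.
# The weights below are illustrative, not trained.
weights = np.array([1.5, -0.3, 0.2])

# If the topics are interpretable, the per-topic contributions
# explain why the classifier scored the document the way it did.
contributions = embedding * weights
score = contributions.sum()
```

Here the contribution of topic 0 dominates the score, so the prediction can be traced back to that topic.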
The general approach to the algorithm(s) can be explained as follows:
1. Create a document-term matrix with local term weights from the corpus.
2. Apply a global term-weighting scheme.
3. Project the weighted matrix into a lower-dimensional space with singular value decomposition (SVD).
4. Run a fuzzy clustering algorithm on the projected space.
5. Combine the cluster memberships with Bayes' theorem to obtain P(W|T) and P(T|D).
The original FLSA approach aims to find clusters in the projected space of documents.
Documents might contain multiple topics, which makes them difficult to cluster. Therefore, it can make more sense to cluster on words instead of documents. That is what we do with FLSA-W(ords).
Another variant trains a Word2Vec word embedding on the corpus and then clusters in this embedding space to find topics.
FLSA-W clusters on a projected space of words and implicitly assumes that the projection places related words near each other. However, no optimization step ensures that this is the case. With FLSA-V(os), we use the output of VOSviewer as input to our model. VOSviewer is an open-source software tool for bibliographic mapping that optimizes its projections so that related words are located near each other. Using VOSviewer's output, FLSA-V's calculations start at step 4 (although step 1 is still used for calculating some probabilities).
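As a rough illustration of the first steps of this pipeline (local term weights, global weighting, SVD projection), here is a toy sketch with an invented count matrix and an IDF-style weighting; it is not the package's implementation:

```python
import numpy as np

# Step 1: a toy document-term count matrix (local term weights).
# Rows = documents, columns = vocabulary words.
counts = np.array([
    [2, 1, 0, 0],
    [1, 2, 1, 0],
    [0, 0, 2, 3],
], dtype=float)

# Step 2: a simple IDF-style global weighting (the package offers
# several schemes via its word_weighting parameter).
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
weighted = counts * np.log(n_docs / doc_freq)

# Step 3: project into a lower-dimensional space with truncated SVD.
svd_factors = 2
u, s, vt = np.linalg.svd(weighted, full_matrices=False)
projected_docs = u[:, :svd_factors] * s[:svd_factors]    # FLSA clusters these rows
projected_words = vt[:svd_factors].T * s[:svd_factors]   # FLSA-W clusters these rows

# Step 4 would run a fuzzy clustering algorithm on one of these projected
# spaces; step 5 combines the memberships into P(W|T) and P(T|D).
```

FLSA clusters the projected documents, FLSA-W the projected words, and FLSA-V replaces the SVD projection with VOSviewer's word coordinates.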
Many parameters have default settings, so the algorithms can be called by setting only the following two variables:
- `input_file`: the data on which you want to train the topic model.
- `num_topics`: the number of topics you want the topic model to find.
Suppose your data (a list of lists of strings) is called `data` and you want to run a topic model with 10 topics. Run the following code to get the two matrices:
```python
from FuzzyTM import FLSA  # assuming FLSA is importable from the installed package

flsa_model = FLSA(input_file=data, num_topics=10)
prob_word_given_topic, prob_topic_given_document = flsa_model.get_matrices()
```
To see the words and probabilities corresponding to each topic, run:
```python
flsa_model.show_topics()
```
Below is a description of the other parameters per algorithm.
- `num_words`: the number of words (top n) per topic used to represent that topic.
- `word_weighting`: the method used for global term weighting (as described in step 2 of the algorithm).
- `cluster_method`: the (fuzzy) clustering method to be used.
- `svd_factors`: the number of dimensions onto which the data are projected.
FLSA-V also requires the word coordinates produced by VOSviewer, in the following format:

id | x | y |
---|---|---|
word_one | -0.4626 | 0.8213 |
word_two | 0.6318 | -0.2331 |
... | ... | ... |
word_M | 0.9826 | 0.184 |
- `num_words`: the number of words (top n) per topic used to represent that topic.
- `cluster_method`: the (fuzzy) clustering method to be used.
The map file produced by VOSviewer (`map_file.txt`) can be loaded as follows:

```python
import pandas as pd

map_file = pd.read_csv('<DIRECTORY>/map_file.txt', delimiter="\t")
```
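Assuming the map file is tab-separated with `id`, `x`, and `y` columns as in the table above, its structure can also be parsed with the standard library alone. The file contents below are a small invented stand-in:

```python
import csv
import io

# A tiny in-memory stand-in for VOSviewer's tab-separated map_file.txt.
map_file_text = (
    "id\tx\ty\n"
    "word_one\t-0.4626\t0.8213\n"
    "word_two\t0.6318\t-0.2331\n"
)

# Parse each row into a word -> (x, y) coordinate mapping.
reader = csv.DictReader(io.StringIO(map_file_text), delimiter="\t")
coordinates = {row["id"]: (float(row["x"]), float(row["y"])) for row in reader}
```

This makes the expected columns explicit; any missing or renamed column would surface as a `KeyError` here.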
numpy == 1.19.2
pandas == 1.3.3
sparsesvd == 0.2.2
scipy == 1.5.2
pyfume == 0.2.0