180317_algolit

180317_algolit_word2vec __NOPUBLISH__

On word2vec
a way to get from words to numbers
17 March 2018, Varia - Rotterdam

as part of the Algolit workgroup http://algolit.net/ (also the pads of previous sessions can be found there)
in the context of Algologs http://varia.zone/en/algologs.html
Notes of Friday evening: https://pad.vvvvvvaria.org/algologs Notes of Friday evening: https://pad.vvvvvvaria.org/algologs

You should know a word by the company it keeps. John Rupert Firth

Algolit
- short intro

Algolit recipes
- following OuLiPo
- the recipe should be manageable within a day
- examples of recipes from our previous meeting: https://pad.constantvzw.org/p/180301_algolit_datasets (lines 134 to 226)

word embeddings
> techniques

* different types of machine learning models

rule based, supervised, unsupervised

* different frequency counting techniques

a. count vectors
b. tf-idf vectors - it takes into account not just the occurrence of a word in a single document but in the entire corpus
c. co-occurance vectors - similar words tend to occur together and will have similar context, co-occurances and context windows

unsupervised + co-occurance vectors > word-embeddings
a prediction based technique
Using co-occurance vectors to make prediction / to find most similar words

a. continuous bag of words
b. skip-gram model

> history

word embedding versus distributional semantic model (see article below)
vs distributed representation vs semantic vector space vs word space
'word embedding' as a term gains popularity with the arrival of word2vec, but the concept exists beforehand
Introduction:
https://medium.com/deeper-learning/glossary-of-deep-learning-word-embedding-f90c3cec34ca
landmark papers:

2003: A Neural Probabilistic Language Model - Bengio et al. * http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf ('word embedding' is first mentioned)
2008: A unified architecture for natural language processing - Collobert and Weston * https://ronan.collobert.com/pub/matos/2008_nlp_icml.pdf (first to show the utility of pre-trained vectors)
2013: Distributed Representations of Words and Phrases and their Compositionality - Mikolov et al. * https://arxiv.org/pdf/1310.4546.pdf (word2vec appears for the first time-word2vec is a toolkit that allows for training or using pre-trained word embeddings)
2014: GloVe: Global Vectors for Word Representation Pennington et al. * https://www.aclweb.org/anthology/D14-1162 (Glove appears for the first time-GloVe is a similar toolkit)

How to take position towards word-embeddings? Can we log the internal logic of word-embeddings and write recipes in response to their way of functioning?

use cases of word-embeddings
- Perspective API https://www.perspectiveapi.com/ (worked together with the Wikimedia Detox project to put together a dataset https://meta.wikimedia.org/wiki/Research:Detox )
- word-embeddings in machine translation, to find translations of words in different languages - https://arxiv.org/pdf/1309.4168.pdf
- word-embeddings in query expansion (trigger more words with a specific search query) - https://ie.technion.ac.il/~kurland/p1929-kuzi.pdf & http://www.aclweb.org/anthology/P16-1035 (has a few nice graphs/tables)
- word-embeddings applied to other types of data (non-linguistic)

- biomedical research
- recommendation systems such as playlists2vec - https://github.com/mattdennewitz/playlist-to-vec (Spotify uses this)

- but they are also used as a sort of self-inspection tool within the NLP field: Understanding what other biases word embeddings capture and finding better ways to remove theses biases will be key to developing fair algorithms for natural language processing. Very sensitive subject AND a very tough one ... (via: Word-embedding trends in 2017 http://ruder.io/word-embeddings-2017/ )

word2vec
- touring together, word2vec step-by-step
- word2vec demo https://rare-technologies.com/word2vec-tutorial/#app

Speak about, analyse together: constraints / substructures / foundations / assumptions that condition word-embeddings (make word-embeddings possible)

(counting) exercises:
- Hungarian sorting dance: https://www.youtube.com/watch?v=lyZQPjUT5B4
- Abecedaire rules http://algolit.constantvzw.org/index.php/Abecedaire_rules, created by An Mertens
- Littérature définitionnelle, where each word in a sentence gets replaced by the definition from the online dictionary dataset Wordnet. A proces that can be reiterated infinitely on the transformed text. - https://gitlab.constantvzw.org/algolit/algolit/raw/master/algoliterary_encounter/oulipo/litterature_definitionelle.txt, created by Algolit
- Emmett Williams' Counting songs, Fluxus

A staple part of Fluxus festivals in the 1960s were Emmett Williams’ Counting Songs (1962) which consisted of the artists on stage counting the audience members one by one. Aside from being early pieces of performance art and poetry, minimal music and concept art, they also served the pragmatic purpose of obtaining “an exact head count to make sure that the management [of the festival venues] wasn’t cheating us”. from: http://cramer.pleintekst.nl/essays/crapularity_hermeneutics/#fnref4

Algolit scripts:
https://gitlab.constantvzw.org/algolit/algolit/tree/master/algologs
- gensim-word2vec - a python wrapper for word2vec, an easy start to work with word2vec (training, saving models, reversed algebra with words)
- one-hot-vector - two scripts created during an Algolit session to create a co-occurance matrix
- word2vec - a word2vec_basic.py script from the Tensorflow package, accompanied with Algolit logging functions, a script that allows to look a bit further into the trainingprocess
- word2vec-reversed - a first attempt of a script to reverse engineer the creation of word-embeddings, looking at shared context words of two words

installing homebrew https://brew.sh/
installing tensorflow https://www.tensorflow.org/install/

Further reading:
- History of word embeddings https://www.gavagai.se/blog/2015/09/30/a-brief-history-of-word-embeddings/
- Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings https://arxiv.org/pdf/1607.06520.pdf
- Sebastian Ruder's three-piece blog on word embeddings http://ruder.io/word-embeddings-1/
- The Arithmetic of Concepts http://modelingliteraryhistory.org/2015/09/18/the-arithmetic-of-concepts-a-response-to-peter-de-bolla/

Notes:

Previous session on datasets, notes of that session: https://pad.constantvzw.org/p/180301_algolit_datasets

A deathrow dataset
A dataset on special trees in Brussel cfr Terra0 (legalaty of non-human entities as autonomous)

how to get from words to numbers?
We will focuss on word embeddings a technique which is prediction based technique
Word embeddings developed by Google

The idea of receipies is coming from Oulipo where you set up of constructions () rewrite in different styles.
What kind of recipes we can write ?

Georges Perec - a book without a letter e
Stijloefeningen by Raymond Queneau

Word embeddings
unsupervised learning
a lot of data, clustering words together by looking at direct context words, variable windows (how many words to the left and right of it are included that keep the central word 'company')
no more "bag of words" (how often a word appears in a text) or letter similarity
based on the words that keep the center word company
a way to find 'more similar' words
ex of dataset: Glove https://nlp.stanford.edu/projects/glove/ developed by Stanford University, based on Common Crawl (NGO) (download of 75% of internet)
companies like Google , FB (Fast texts https://research.fb.com/fasttext/ ) create their own

calculations within the textual space:
word2vec demo https://rare-technologies.com/word2vec-tutorial/#app
simple neural network technique

similar?
what does it mean?

tour through steps of word embeddings
based on basic example script of Tensorflow (Google framework)
and Frankenstein the novel
words with similar function cluster (colours, numbers, names of places...)
no stopwords : common short words, 'a', 'to, 'in'....
0. word embedding graphs of Frankenstein
'human/fellow', 'must/may' appear together... it makes sense for our eyes...
all steps are part of same script
vs Gensim word-to-vec: 1 line of code 'call model'
1. plain text of Frankenstein
all punctuation taken out, all lower case
2. bag of words, list
all spaces are replaced, list of words (tokenized)
3. dictionary of counts
words are counted
word can have 2 semantical meaning, but counted as same word
4. dictionary of index nr: positioning the words
order is the same, but nr of countes is replaced by index nr
zero element: 'UNK' (unknown)
dictionary is made with vocabulary size, how many words you want to include
in this case: 5000 words, every word that falls out, is replaced by UNK
efficiency for process - otherwise the calculation becomes too heavy
cfr Glove: 22 million words
-> try to do the same with less frequent/rejected words?
might not work because you need a lot of examples
notion of 'working properly': might not work but might be interesting
making a 'memory' (every word that's not remembered is thrown out)
x. reversed Frankenstein: side step
list of words that are thrown out
5. data.txt: rewriting 1
novel represented by index numbers
6. one-hot-vector batches
starts with small sample of words
needs to sort out context words
window frame: 1
computer reads index numbers
batch 1: center words
batch 2: context words
'any kind of pleasure I perceived'
center word 1: 'kind' (connected to left), 'kind' (connected to right)
x. one-hot-vector matrix
3 sentences
in order to calculate similarity of sentences, look at individual words and the company words
all 5000 words are center words
everything is initialized by 0, filled in after
7. training
list of words, starting with most common + 20 random numbers (you chose the amount, also called coordinates)
amount of coordinates depend on CPU of computer
creating a multidimensional vector space with 20 dimensions, filled in at next step
representation graph: brought back to 2 dimensions
dimensionality reduction ex principal component analysis : information is lost
8. embedding matrix
the in between step is too difficult to visualise
takes word, compares to negative other word, looks if they're similar
if they are, the 2 rows are put closer to each other
ex 'the' - 'these': compare on-hot-vector row, if window words show similarity (threshold for similarity), the rows are put lcoser to each other by changing coordinates so they're closer to each other in the matrix
similarity is calculated in
these are final word embeddings
x. logfile
while training script prints training process
loss value: goes down while training, threshold you define, a function that controls results of each step
from 0 to 100.000 steps: a parameter you can define
every time you run it, results of similarity are different
why???
-> what 2 words are compared, probability counts, .... levels of uncertainty makes that it changes all the time
result is relationships, not the position
different set of random numbers, values are different
making tests with more data? more similar cases
'peeling off the onion', 'opening the blackbox, finding another blackbox inside'
'making excuses' (Google, facial recognition; we'll increase dataset, train better etc)

side discussion:

We're losing information all the time when you take a picture
what would it mean if we tried to name other dimensions?
"once it's proven in mathematics, it starts to exist in the world"
but doesn't it also work the other way around?

matrix as a thinking tool: a group of vectors with 20 (in this case) dimensions

Results are never the same when you train the embeddings
What is causing this? Layers of probabilty calculations ? Moments of uncertainty in the calculations
If you train two models on the same input text, are the final embeddings the same?

- are the models the same?
- are the results the same? (the answers to the question of similar words)

An joined an international workshop in Leiden
issue of discrimination in datasets and/or process
what is a bias?
sociologists (history of researching discrimination) &
talking about black is about black man, and discussions around women are about white woman
a black hole for black women
We can adjust our biases, but algorithms cannot tweak themselves.
'shapeshifters' by continuing to training the model, one face becomes the other

ex. Perspective API
- Perspective API https://www.perspectiveapi.com/ (worked together with the Wikimedia Detox project to put together a dataset https://meta.wikimedia.org/wiki/Research:Detox)
determine level of toxicity in comments (NY Times, Guardian, Wikimedia)
automatically intervene & remove comment or have human moderator
based on word2vec
in Nov: giving odd results
nice: you can trace back the comments they used, the ratings (who rated? volunteers of Wikimedia / 30y, white, male, single)
ex I went to a gipsy shop. gives 0.24
ex I believe in Isis. 0.40
ex I believe in AI 0.01
die in hell = Sorry! Perspective needs more training data to work in this language.

'data-ethnography': who was creating the data
https://medium.com/ethnography-matters/why-big-data-needs-thick-data-b4b3e75e3d7

Examples of exercises
Sorting algorithm
- Hungarian sorting dance: https://www.youtube.com/watch?v=lyZQPjUT5B4
by doing it, you experience the different behaviour of humans (optimized/intuition) and machines (repetition)
Fluxus http://cramer.pleintekst.nl/essays/crapularity_hermeneutics/#fnref4
Emmett Williams’ Counting Songs
described by Florian Cramer in http://cramer.pleintekst.nl/essays/crapularity_hermeneutics/#fnref4

Run word2vec_basic.py on mac
- python3 https://www.python.org/ftp/python/3.6.5/python-3.6.5rc1-macosx10.6.pkg & https://www.macworld.co.uk/how-to/mac/coding-with-python-on-mac-3635912/
- brew (package manager) https://brew.sh/
- pip > brew install python3
- tensorflow https://www.tensorflow.org/install/install_mac
- sublime
- dependencies for

pip3 install nltk numpy sklearn matplotlib scipy
if error <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed[nltk_data] (_ssl.c:661)>
when trying to run nltk('punkt') use /Applications/Python\ 3.6/Install\ Certificates.command

Installing nltk in linux/mac
sudo pip install -U nltk

And then:
    python
    import nltk
    nltk.download('punkt')

Installing scikit-learn for plotting the valuest (Mac OS)
sudo pip install -U scikit-learn

----------------------------------------------
important word2vec_basic parameters

batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a label.

# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16     # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64    # Number of negative examples to sample.

----------------------------------------------

to print a full (numpy) matrix:
numpy.set_printoptions(threshold=numpy.nan) numpy.set_printoptions(threshold=numpy.nan)

thinking of recipes...

A recipe to formulate word relations.
0. use a dataset (txt file)
1. pick two words
2. find out what context words they have in common (word2vec reversed)
3. read the words
4. use the words to formulate a description in one sentence

0. dataset: about.more (scraped about/more/ pages of mastodon instances)
1. instance & community
2. ['their', 'this', 'or', 'a', 'will', ',', ';', 'the', 'run', 'and', 'by', 'but', 'our', 'any', 'in', 'with', 'of', 'where', '!', 'inclusive', 'open', 'small', 'that', '.', ')', 'other', 'moderated', 'financially', 'local', 'as', 'for', 'international', ':']
3.
4. run a small moderated or inclusive local and international this

0. frankenstein
1. human & fellow
2. ['beings', 'creatures', 'my', 'the', 'a', 'creature', 'mind', ',']
3.
4. my creatures, the creature mind beings

"beings": {
        "fellow": {
            "freq": 4,
            "sentences": [
                "It was to be decided whether the result of my curiosity and lawless devices would cause the death of two of my fellow beings : one a smiling babe full of innocence and joy , the other far more dreadfully murdered , with every aggravation of infamy that could make the murder memorable in horror .",
                "I had begun life with benevolent intentions and thirsted for the moment when I should put them in practice and make myself useful to my fellow beings .",
                "These bleak skies I hail , for they are kinder to me than your fellow beings .",
                "They were my brethren , my fellow beings , and I felt attracted even to the most repulsive among them , as to creatures of an angelic nature and celestial mechanism ."
            ]
        },
        "human": {
            "freq": 6,
            "sentences": [
                "I had gazed upon the fortifications and impediments that seemed to keep human beings from entering the citadel of nature , and rashly and ignorantly I had repined .",
                "I had often , when at home , thought it hard to remain during my youth cooped up in one place and had longed to enter the world and take my station among other human beings .",
                "The picture appeared a vast and dim scene of evil , and I foresaw obscurely that I was destined to become the most wretched of human beings .",
                "I saw few human beings besides them , and if any other happened to enter the cottage , their harsh manners and rude gait only enhanced to me the superior accomplishments of my friends .",
                "The sleep into which I now sank refreshed me ; and when I awoke , I again felt as if I belonged to a race of human beings like myself , and I began to reflect upon what had passed with greater composure ; yet still the words of the fiend rang in my ears like a death-knell ; they appeared like a dream , yet distinct and oppressive as a reality .",
                "In other places human beings were seldom seen , and I generally subsisted on the wild animals that crossed my path ."
            ]
        }
    },

Memory of the World
for An Mertens

https://www.memoryoftheworld.org/blog/2014/12/08/how-to-bookscanning/

Michael Winkler

http://www.winklerwordart.com/

text cleaning: while trying to clean the text for further processing, we notice that it is hard to make a decision regarding 's':
lady catherine s unjustifiable endeavours
my aunt s intelligence
receipt that s he does (original text: receipt that s/he does)
by u s federal laws ((original text: by u.s. federal laws)

Tutorial Rob Speer: How to make a racist AI without really trying
word-embeddings + supervised learning layer
https://blog.conceptnet.io/2017/07/13/how-to-make-a-racist-ai-without-really-trying/

Varia archive: https://vvvvvvaria.org/archive/

Machine Learning for Artists on Youtube by Gene Kogan: good tutorials!

These pages are generated with Distribusi. 🌐 https://varia.zone