
Basic R Automation


Last Wednesday, a small presentation was given at the RBelgium meetup in Brussels on Basic R Automation. For those of you who could not attend, here are the slides of that presentation which showed the use of the cronR and taskscheduleR R packages for automating basic R scripts.

If you are interested in setting up a project for more advanced ways on how to automate your R processes for your specific environment, get in touch.
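For readers who want a quick taste before opening the slides: below is a minimal sketch of scheduling an R script with cronR on Linux/Mac (taskscheduleR offers the equivalent on Windows). The script path and job id are hypothetical - point them at one of your own scripts.

library(cronR)
## path and id below are made up for illustration
cmd <- cron_rscript("/home/user/myscript.R")
cron_add(cmd, frequency = "daily", at = "7AM",
         id = "myscript_daily", description = "Daily run of myscript.R")
cron_njobs()  ## show the number of scheduled jobs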

{aridoc engine="pdfjs" width="100%" height="550"}images/bnosac/blog/Basic_R_Automation.pdf{/aridoc}


Upcoming public courses on Text mining with R, Statistical machine learning with R, Applied Spatial Modelling with R, Advanced R programming, Computer Vision and Image Recognition


I'm happy to announce that the following list of courses for R users is ready to be booked. All courses are face-to-face courses held in Belgium.

  • 08-09/10/2018: Text mining with R. Brussels (Belgium). http://di-academy.com/bootcamp + send mail to training@di-academy.com
  • 15-16/10/2018: Statistical machine learning with R. Leuven (Belgium). Subscribe here
  • 05-06/11/2018: Text mining with R. Leuven (Belgium). Subscribe here
  • 19-20/12/2018: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
  • 21-22/02/2019: Advanced R programming. Leuven (Belgium). Subscribe here
  • 13-14/03/2019: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
  • 15/03/2019: Image Recognition with R and Python. Subscribe here
  • 01-02/04/2019: Text Mining with R. Leuven (Belgium). Subscribe here

For a list of all R courses which can be given in-house at your site, visit http://www.bnosac.be/index.php/training and get in touch to schedule the course.


How to detect hatespeech in plain text #schildnvrienden


Yesterday there was a pretty controversial Pano TV documentary called 'Wie is Schild & Vrienden echt' ('Who are Schild & Vrienden really') on the national television channel 'één' (https://www.vrt.be/vrtnu/a-z/pano/2018/pano-s2018a10). The documentary revealed the internal communication of a right-wing group from Belgium, called #schildnvrienden.

After that, there was a show by Van Gils & gasten where a representative of the police explained, or tried not to explain, how the police can or cannot monitor online private groups. It was pretty hilarious how she managed not to say anything about the internal online monitoring system they apparently have.

That reminded me that, a few years ago, I created an R package which can easily detect hate speech. I finally put it online on GitHub today. You can find it at https://github.com/weRbelgium/hatespeech.dutch. The R package uses a dictionary made available by the University of Antwerp, which I think is the basis of the hate speech detection algorithms that the police in Belgium are currently running.

Example

How does that hate speech detection system work? Pretty simple: a dictionary of hate speech terminology and hate speech regular expressions is set up, you provide some text to it, the text is cut up into words and the system checks which words are part of the dictionary. As an example, let's try it out below on a message by the leader of that #schildnvrienden group to see if it is considered hate speech.

[image: screenshot of a tweet by dvanlangenhove, 2018-09-01]

library(udpipe)
library(hatespeech.dutch)
detect_hatespeech("Europa wordt élke dag geteisterd door geweld van illegalen.
Zowel voor mensen die zich zorgen maken over dit geweld als voor mensen
 die zich zorgen maken over de boze reactie van Europeanen òp dit geweld zou
 oplossing duidelijk moeten zijn: alle illegalen opsporen en deporteren.",
 type = "udpipe")
    Neutral-Country   Neutral-Migration Neutral-Nationality 
                  0                   1                   0 
   Neutral-Religion  Neutral-Skin_color      Racist-Animals 
                  0                   0                   0 
     Racist-Country        Racist-Crime      Racist-Culture 
                  0                   0                   0 
    Racist-Diseases    Racist-Migration  Racist-Nationality 
                  0                   0                   0 
        Racist-Race     Racist-Religion   Racist-Skin_color 
                  0                   0                   0 
 Racist-Stereotypes 
                  0

So apparently the dictionary logic considers this statement as Neutral-Migration. Hope the police have improved a bit on the natural language processing, such that they have incorporated a bit more than just word lookup and regular expressions. Feel free to try the hate speech detector out on your own text using the R package made available at https://github.com/weRbelgium/hatespeech.dutch. Or visit the website to see the dictionaries which are used to detect hate speech.

Training on Text Mining

If you are interested in how text mining techniques work, you might be interested in the following data science courses that are held in the coming months.

  • 08-09/10/2018: Text mining with R. Brussels (Belgium). http://di-academy.com/bootcamp + send mail to training@di-academy.com
  • 15-16/10/2018: Statistical machine learning with R. Leuven (Belgium). Subscribe here
  • 05-06/11/2018: Text mining with R. Leuven (Belgium). Subscribe here
  • 19-20/12/2018: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
  • 21-22/02/2019: Advanced R programming. Leuven (Belgium). Subscribe here
  • 13-14/03/2019: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
  • 15/03/2019: Image Recognition with R and Python. Subscribe here
  • 01-02/04/2019: Text Mining with R. Leuven (Belgium). Subscribe here

udpipe version 0.7 for Natural Language Processing (#NLP) alongside #tidytext, #quanteda, #tm


[image: udpipe logo]

This blog post announces the release of the udpipe R package version 0.7 on CRAN. udpipe is an R package which does tokenization, parts of speech tagging, lemmatization, morphological feature tagging and dependency parsing. Its main feature is that it is a lightweight R package which works on more than 50 languages and gives you rich NLP output out of the box.

The package was updated mainly in order to work more easily with the crfsuite R package, which does entity/intent recognition and chunking. The user-visible changes are that udpipe now has a shorthand for working with text in the TIF format and that it now also allows you to indicate the location of each token inside the original text. Next to this, version 0.7 also caches the udpipe models.

Example

Using udpipe (version >= 0.7) works as follows. First download the model of your language and next do the annotation.

library(udpipe)
udmodel <- udpipe_download_model(language = "dutch")
x <- udpipe("De federale regering besliste vandaag dat er een neutrale verpakking voor roltabak en sigaretten komt",
object = udmodel)

Since version 0.7, you can also directly indicate the language. This will download the udpipe annotation model if it has not been downloaded already. Please inspect the help of udpipe_download_model for more details on the available languages and their licenses.

x <- udpipe("Je veux qu’on me juge pour ce que je suis et non pour ce qu’était mon père", "french")
x <- udpipe("Europa lança taxas sobre navios para tirar lixo do fundo do mar.", "portuguese")
x <- udpipe("आपके इस स्नेह्पूर्ण और जोरदार स्वागत से मेरा हृदय आपार हर्ष से भर गया है। मैं आपको दुनिया के सबसे पौराणिक भिक्षुओं की तरफ से धन्यवाद् देता हूँ। मैं आपको सभी धर्मों की जननी कि तरफ से धन्यवाद् देता हूँ, और मैं आपको सभी जाति-संप्रदाय के लाखों-करोड़ों हिन्दुओं की तरफ से धन्यवाद् देता हूँ। मेरा धन्यवाद् उन वक्ताओं को भी जिन्होंने ने इस मंच से यह कहा कि दुनिया में शहनशीलता का विचार सुदूर पूरब के देशों से फैला है।", "hindi")
x <- udpipe("The economy is weak but the outlook is bright", "english")
x <- udpipe("Maxime y su mujer hicieron que nuestra estancia fuera lo mas comoda posible", "spanish")
x <- udpipe("A félmilliárdos MVM-támogatásból 433 milliót négy nyúlfarknyi imázsvideóra költött", "hungarian")
x <- udpipe("同樣,施力的大小不同,引起的加速度不同,最終的結果也不一樣,亦可以從向量的加成性來看", "chinese")

The result is a data.frame with one row per doc_id and term_id, containing all the tokens in the data, the lemma, the parts of speech tags, the morphological features, the dependency relationships between the tokens and the location where the token is found in the original text.

[image: udpipe annotation of the Chinese example]
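Both of the user-visible additions mentioned above can be sketched in a few lines: a data.frame in the TIF format (columns doc_id and text) can be passed straight to udpipe, and the resulting annotation contains the position of every token in the original text.

library(udpipe)
## TIF format: a data.frame with at least the columns doc_id and text
tif <- data.frame(doc_id = "doc1",
                  text = "The economy is weak but the outlook is bright",
                  stringsAsFactors = FALSE)
x <- udpipe(tif, "english")
## start/end give the location of each token in the original text
x[, c("doc_id", "token", "start", "end")]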

Use alongside other R packages

R has a rich NLP ecosystem. If you want to use udpipe alongside other R packages, below we show some basic possibilities for easily extracting lemmas and the text of the parts of speech tags you are interested in. We show how to use udpipe alongside the 3 popular R packages tidytext, quanteda and tm, on the following data.frame in TIF format.

rawdata <- data.frame(doc_id = c("doc1", "doc2"),
                      text = c("The economy is weak but the outlook is bright.",
                               "Natural Language Processing has never been more easy than this."),
                      stringsAsFactors = FALSE)

Using tidytext

In this code, we let tidytext do tokenisation and use udpipe to enrich the token list. Next we subset the data.frame of tokens by extracting only proper nouns, nouns and adjectives.

library(tidytext)
library(udpipe)
library(dplyr)
x <- unnest_tokens(rawdata, input = text, output = word)
x <- udpipe(split(x$word, x$doc_id), "english")
x <- filter(x, upos %in% c("PROPN", "NOUN", "ADJ"))

Using quanteda

In the code below, we let udpipe do tokenisation and provide the lemmas back in quanteda's tokens element.

library(quanteda)
library(udpipe)
x <- corpus(rawdata, textField = "text")
tokens <- udpipe(texts(x), "english")
x$tokens <-  as.tokenizedTexts(split(tokens$lemma, tokens$doc_id))

Using tm

Below, we get only the lemmas of the nouns, proper nouns and adjectives and apply this using the tm_map functionality from tm.

library(tm)
library(udpipe)
x <- VCorpus(VectorSource(rawdata$text))
x <- tm_map(x, FUN=function(txt){
  data <- udpipe(content(txt), "english")
  data <- subset(data, upos %in% c("PROPN", "NOUN", "ADJ"))
  paste(data$lemma, collapse = " ")
})

UDPipe already uses deep learning techniques (e.g. a GRU network) for the tokenisation, and the dependency parsing was enhanced in 2018 by incorporating TensorFlow. On the roadmap for a next release is the integration of the future UDPipe enhancements (which got 3rd place at the CoNLL 2018 shared task), including these TensorFlow components.

Training on Text Mining

If you are interested in how text mining techniques work, you might be interested in the following data science courses that are held in the coming months.

  • 08-09/10/2018: Text mining with R. Brussels (Belgium). http://di-academy.com/bootcamp + send mail to training@di-academy.com
  • 15-16/10/2018: Statistical machine learning with R. Leuven (Belgium). Subscribe here
  • 05-06/11/2018: Text mining with R. Leuven (Belgium). Subscribe here
  • 19-20/12/2018: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
  • 21-22/02/2019: Advanced R programming. Leuven (Belgium). Subscribe here
  • 13-14/03/2019: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
  • 15/03/2019: Image Recognition with R and Python. Subscribe here
  • 01-02/04/2019: Text Mining with R. Leuven (Belgium). Subscribe here

Last call for the course on text mining of next week


Last call for the 2-day course on Text Mining with R, held next week (08-09 October 2018) in Brussels, Belgium. Subscribe at https://www.eventbrite.co.uk/e/dsb2018-text-mining-with-r-jan-wijffels-bnosac-session-03-04-tickets-50586501588

During that course you'll learn the following:

  • Cleaning of text data, regular expressions
  • String distances
  • Graphical displays of text data
  • Natural language processing: stemming, parts-of-speech tagging, tokenization, lemmatisation, dependency parsing, noun phrase detection and keyword extraction
  • Entity recognition & chunking using Conditional Random Fields
  • Sentiment analysis
  • Statistical topic detection modelling (latent dirichlet allocation)
  • Visualisation of correlations & topics
  • Automatic classification using predictive modelling based on text data
  • Word and Text embeddings
  • Document similarities & Text alignment

Training on Text Mining

If you are interested in how text mining techniques work, you might be interested in the following data science courses that are held in the coming months.

  • 08-09/10/2018: Text mining with R. Brussels (Belgium). Subscribe at https://www.eventbrite.co.uk/e/dsb2018-text-mining-with-r-jan-wijffels-bnosac-session-03-04-tickets-50586501588
  • 15-16/10/2018: Statistical machine learning with R. Leuven (Belgium). Subscribe here
  • 05-06/11/2018: Text mining with R. Leuven (Belgium). Subscribe here
  • 19-20/12/2018: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
  • 21-22/02/2019: Advanced R programming. Leuven (Belgium). Subscribe here
  • 13-14/03/2019: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
  • 15/03/2019: Image Recognition with R and Python. Subscribe here
  • 01-02/04/2019: Text Mining with R. Leuven (Belgium). Subscribe here

crfsuite for natural language processing


A new R package called crfsuite supported by BNOSAC landed safely on CRAN last week. The crfsuite package (https://github.com/bnosac/crfsuite) is an R package specific to Natural Language Processing and allows you to easily build and apply models for

  • named entity recognition
  • text chunking
  • part of speech tagging
  • intent recognition or
  • classification of any category you have in mind

The focus of the implementation is on allowing the R user to build such models on their own data, with their own categories. The R package is an Rcpp interface to the popular crfsuite C++ package, which is used a lot in all kinds of chatbots.

In order to facilitate creating training data on your own text, a shiny app is made available in this R package which allows you to easily tag your own chunks of text with your own categories, which can next be used to build a crfsuite model. The package also plays nicely together with the udpipe R package (https://CRAN.R-project.org/package=udpipe), which you need in order to extract predictive features (e.g. parts of speech tags) for your words to be used in the crfsuite model.

On a side note: if you are working in the area of NLP, you might also be interested in the upcoming ruimtehol R package, which is a wrapper around the excellent StarSpace C++ code providing word/sentence/document embeddings, text-based classification, content-based recommendation and similarities, as well as entity relationship completion.

[image: screenshot of the shiny annotation app]

You can get going with the crfsuite package as follows. Have a look at the package vignette; it shows you how to construct and apply your own crfsuite model.

## Install the packages
install.packages("crfsuite")
install.packages("udpipe")

## Look at the vignette
library(crfsuite)
library(udpipe)
vignette("crfsuite-nlp", package = "crfsuite")

More details at the development repository https://github.com/bnosac/crfsuite where you can also provide feedback.

Training on Text Mining 

If you are interested in how text mining techniques work, you might be interested in the following data science courses that are held in the coming months.

  • 20-21/11/2018: Text mining with R. Leuven (Belgium). Subscribe here
  • 19-20/12/2018: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
  • 21-22/02/2019: Advanced R programming. Leuven (Belgium). Subscribe here
  • 13-14/03/2019: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
  • 15/03/2019: Image Recognition with R and Python. Subscribe here
  • 01-02/04/2019: Text Mining with R. Leuven (Belgium). Subscribe here

Starspace for NLP #nlproc


Our recent addition to the NLP R universe is an R package called ruimtehol, which is open sourced at https://github.com/bnosac/ruimtehol. This R package is a wrapper around Starspace, which provides a neural embedding model for doing the following on text:

  • Text classification
  • Learning word, sentence or document level embeddings
  • Finding sentence or document similarity
  • Ranking web documents
  • Content-based recommendation (e.g. recommend text/music based on the content)
  • Collaborative filtering based recommendation (e.g. recommend text/music based on interest)
  • Identification of entity relationships

[image: ruimtehol logo]

If you are an R user interested in NLP techniques, feel free to test out the framework and provide feedback at https://github.com/bnosac/ruimtehol/issues. The package is not on CRAN yet, but can easily be installed with the command devtools::install_github("bnosac/ruimtehol", build_vignettes = TRUE).

Below is an example of how the package can be used for multi-label classification on questions asked in Belgian parliament. Each question in parliament was labelled with one or more of the 1785 categories in this dataset.

library(ruimtehol)
data(dekamer, package = "ruimtehol")

## Each question in parliament was labelled with more than 1 category. There are 1785 categories in this dataset
dekamer$question_themes <- strsplit(dekamer$question_theme, " +\\| +")
## Plain text of the question in parliament
dekamer$text <- strsplit(dekamer$question, "\\W")
dekamer$text <- sapply(dekamer$text, FUN=function(x) paste(x, collapse = " "))
dekamer$text <- tolower(dekamer$text)
## Build starspace model
model <- embed_tagspace(x = dekamer$text,
                        y = dekamer$question_themes,
                        dim = 50,
                        ngram = 3, loss = "hinge", similarity = "cosine", adagrad = TRUE,
                        early_stopping = 0.8, minCount = 2,
                        thread = 4)
## Get embeddings of the dictionary of words as well as the categories
embedding_words  <- as.matrix(model, type = "words")
embedding_labels <- as.matrix(model, type = "label")
## Find closest labels / predict
embedding_combination <- starspace_embedding(model, "federale politie patrouille", type = "document")
embedding_similarity(embedding_combination,
                     embedding_labels,
                     top_n = 3)

                      term1                      term2 similarity rank
federale politie patrouille           __label__POLITIE  0.8480641    1
federale politie patrouille          __label__OPENBARE  0.6919607    2
federale politie patrouille __label__BEROEPSMOBILITEIT  0.6907637    3
predict(model, "de migranten komen naar europa, in asielcentra ...")
$input
"de migranten komen naar europa, in asielcentra ..."
$prediction
                label               label_starspace similarity
 VLUCHTELINGENCENTRUM __label__VLUCHTELINGENCENTRUM  0.7075160
          VLUCHTELING          __label__VLUCHTELING  0.6253517
             ILLEGALE             __label__ILLEGALE  0.5997692
       MIGRATIEBELEID       __label__MIGRATIEBELEID  0.5939595
           UITWIJZING           __label__UITWIJZING  0.5376520

The list of R packages regarding text mining with R provided by BNOSAC has been steadily growing. This is the current list of R packages maintained by BNOSAC:

  • udpipe: tokenisation, lemmatisation, parts of speech tagging, dependency parsing, morphological feature extraction, sentiment scoring, keyword extraction, NLP flows
  • crfsuite: named entity recognition, text classification, chunking, sequence modelling
  • textrank: text summarisation
  • ruimtehol: text classification, word/sentence/document embeddings, document/label similarities, ranking documents, content-based recommendation, collaborative filtering-based recommendation

More details of ruimtehol at the development repository https://github.com/bnosac/ruimtehol where you can also provide feedback.

Training on Text Mining 

If you are interested in how text mining techniques work, you might be interested in the following data science courses that are held in the coming months.

  • 19-20/12/2018: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
  • 21-22/02/2019: Advanced R programming. Leuven (Belgium). Subscribe here
  • 13-14/03/2019: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
  • 15/03/2019: Image Recognition with R and Python. Subscribe here
  • 01-02/04/2019: Text Mining with R. Leuven (Belgium). Subscribe here

You did a sentiment analysis with tidytext but you forgot to do dependency parsing to answer WHY is something positive/negative


A small note for the growing list of users of the udpipe R package. In the last month of 2018, we've updated the package on CRAN with some notable changes:

  • The default models downloaded with the function udpipe_download_model are now built on Universal Dependencies 2.3 (released on 2018-11-15)
  • This means udpipe now has models for 60 languages. That's right! And they provide tokenisation, parts of speech tagging, lemmatisation and dependency parsing built on all of these treebanks: afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, persian-seraji, polish-lfg, polish-sz, portuguese-bosque, portuguese-br, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, sanskrit-ufal, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb.
  • Although this was not intended originally, we added a sentiment scoring function in the latest release (version 0.8 on CRAN). Combined with the output of the dependency parsing, this allows you to answer questions like 'WHAT IS CAUSING A NEGATIVE SENTIMENT'. An example is shown below.
  • If you want to use the udpipe models for commercial purposes, we have some nice extra pretrained models available for you - get in touch if you are looking for this.

Below we will showcase the new features of the R package by finding out what is causing a negative sentiment.

When I see users of the tidytext sentiment R package, I always wonder if they do sentiment scoring for the love of building reports, as the main thing they seem to report is the frequency of occurrence of words which are part of a positive or negative dictionary, while their manager probably asked them: "Yeah, but why is the sentiment negative or positive?"
You can answer this managerial question using dependency parsing, and that is exactly what udpipe provides (amongst other NLP annotations). Dependency parsing links each word to another word, allowing us to find out which words are linked to negative words, giving you the context of why something is negative and what needs to be improved in your business. Let's show how to get this easily done in R.

Below we get a sample of 500 AirBnb customer reviews in French, annotate it with udpipe (using a French model built on top of the Rhapsodie French treebank), and use the new sentiment scoring function txt_sentiment, which is available in the new udpipe release, with an online dictionary of positive/negative terms for French. Next we use the udpipe dependency parsing output, looking at the adjectival modifier 'amod' in the dep_rel udpipe output, and visualise all words which are linked to the negative terms of the dictionary. The result is this graph, showing words of the dictionary in red and words which are linked to those words in another colour.

[image: words linked to negative terms, based on sentiment scoring and dependency parsing]

Full code showing how this is done is shown below.

library(udpipe)
library(dplyr)
library(magrittr)
data(brussels_reviews, package = "udpipe")
x <- brussels_reviews %>%
  filter(language == "fr") %>%
  rename(doc_id = id, text = feedback) %>%
  udpipe("french-spoken", trace = 10)
##
## Get a French sentiment dictionary lexicon with positive/negative terms, negators, amplifiers and deamplifiers
##
load(file("https://github.com/sborms/sentometrics/raw/master/data-raw/FEEL_fr.rda"))
load(file("https://github.com/sborms/sentometrics/raw/master/data-raw/valence-raw/valShifters.rda"))
polarity_terms <- rename(FEEL_fr, term = x, polarity = y)
polarity_negators <- subset(valShifters$valence_fr, t == 1)$x
polarity_amplifiers <- subset(valShifters$valence_fr, t == 2)$x
polarity_deamplifiers <- subset(valShifters$valence_fr, t == 3)$x
##
## Do sentiment analysis based on that open French lexicon
##
sentiments <- txt_sentiment(x, term = "lemma",
                            polarity_terms = polarity_terms,
                            polarity_negators = polarity_negators,
                            polarity_amplifiers = polarity_amplifiers,
                            polarity_deamplifiers = polarity_deamplifiers)
sentiments <- sentiments$data
  • Nothing fancy happened here above. We use udpipe for NLP annotation (tokenisation, lemmatisation, parts of speech tagging and dependency parsing). The sentiment scoring not only does a join with the sentiment dictionary but also looks for neighbouring words which might change the sentiment.
  • The resulting dataset looks like this

[image: the udpipe-enriched dataset with sentiment scores]

Now we can answer the question: why is something negative?

This is done by using the dependency relationship output of udpipe to find out which words are linked to negative words from our sentiment dictionary. If you are unfamiliar with dependency relationships, have a look at the definitions of the possible tags for the dep_rel field at https://universaldependencies.org/u/dep/index.html. In this case we only take 'amod', meaning we are looking for adjectives modifying a noun.

## Use cbind_dependencies to add the parent token to which the keyword is linked
reasons <- sentiments %>%
  cbind_dependencies() %>%
  select(doc_id, lemma, token, upos, sentiment_polarity, token_parent, lemma_parent, upos_parent, dep_rel) %>%
  filter(sentiment_polarity < 0)
head(reasons)
  • Now, instead of making a plot showing which negative words appear, which tidytext users seem to be so keen on, we can make a plot showing the negative words and the words which these negative terms are linked to, indicating the context of the negative term.
  • We select the lemmas of the negative words and the lemma of the parent word and calculate how many times they occur together
reasons <- filter(reasons, dep_rel %in% "amod")
word_cooccurences <- reasons %>%
  group_by(lemma, lemma_parent) %>%
  summarise(cooc = n()) %>%
  arrange(-cooc)
vertices <- bind_rows(
  data_frame(key = unique(reasons$lemma)) %>% mutate(in_dictionary = if_else(key %in% polarity_terms$term, "in_dictionary", "linked-to")),
  data_frame(key = unique(setdiff(reasons$lemma_parent, reasons$lemma))) %>% mutate(in_dictionary = "linked-to"))
  • The following makes the visualisation using ggraph.
library(magrittr)
library(ggraph)
library(igraph)
cooc <- head(word_cooccurences, 20)
set.seed(123456789)
cooc %>%  
  graph_from_data_frame(vertices = filter(vertices, key %in% c(cooc$lemma, cooc$lemma_parent))) %>%
  ggraph(layout = "fr") +
  geom_edge_link0(aes(edge_alpha = cooc, edge_width = cooc)) +
  geom_node_point(aes(colour = in_dictionary), size = 5) +
  geom_node_text(aes(label = name), vjust = 1.8, col = "darkgreen") +
  ggtitle("Which words are linked to the negative terms") +
  theme_void()

This generated the image shown above, showing context of negative terms. Now go do this on your own data.

If you are interested in the techniques shown above, you might also be interested in our recent open-sourced NLP developments:

  • textrank: text summarisation
  • crfsuite: entity recognition, chunking and sequence modelling
  • BTM: biterm topic modelling on short texts (e.g. survey answers / twitter data)
  • ruimtehol: neural text models on top of Starspace (neural models for text categorisation, word/sentence/document embeddings, document recommendation, entity link completion and entity embeddings)
  • udpipe: general NLP package for tokenisation, lemmatisation, parts of speech tagging, morphological annotations, dependency parsing, keyword extraction and NLP flows

Enjoy!

 


Neural Text Modelling with R package ruimtehol


Last week the R package ruimtehol was released on CRAN (https://github.com/bnosac/ruimtehol) allowing R users to easily build and apply neural embedding models on text data.

It wraps the 'StarSpace' library https://github.com/facebookresearch/StarSpace allowing users to calculate word, sentence, article, document, webpage, link and entity 'embeddings'. By using the 'embeddings', you can perform text based multi-label classification, find similarities between texts and categories, do collaborative-filtering based recommendation as well as content-based recommendation, find out relations between entities, calculate graph 'embeddings' as well as perform semi-supervised learning and multi-task learning on plain text. The techniques are explained in detail in the paper: 'StarSpace: Embed All The Things!' by Wu et al. (2017), available at https://arxiv.org/abs/1709.03856.
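As a small taste of the API before diving into the presentation, here is a sketch which learns word embeddings on the dekamer example data shipped with the package and queries the nearest neighbours of a word. The training settings are demo values, not tuned ones.

library(ruimtehol)
data(dekamer, package = "ruimtehol")
## learn word embeddings on the plain text of questions asked in Belgian parliament
model <- embed_wordspace(tolower(dekamer$question), dim = 25, epoch = 5, minCount = 5)
## which words are closest to 'migratie' in the embedding space?
starspace_knn(model, "migratie", k = 5)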

You can get started with some common text analytical use cases by using the presentation we have built below. Enjoy!

{aridoc engine="pdfjs" width="100%" height="550"}images/bnosac/blog/R_TextMining_Starspace.pdf{/aridoc}

If you like it, give it a star at https://github.com/bnosac/ruimtehol and if you need commercial support on text mining, get in touch.

Upcoming training schedule

Note also that you might be interested in the following courses held in Belgium

  • 21-22/02/2019: Advanced R programming. Leuven (Belgium). Subscribe here
  • 13-14/03/2019: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
  • 15/03/2019: Image Recognition with R and Python. Subscribe here
  • 01-02/04/2019: Text Mining with R. Leuven (Belgium). Subscribe here

 

An overview of the NLP ecosystem in R (#nlproc #textasdata)


At BNOSAC, R is used a lot to perform text analytics, as it is an excellent tool that provides anything a data scientist needs to perform data analysis on text in a business setting. For users unfamiliar with all the possibilities that the wealth of R packages offers regarding text analytics, we've made this small mindmap showing a list of techniques and R packages that are used frequently in text mining projects set up by BNOSAC. Download the image and let your eyes zoom in on the different topics. Hope it broadens your idea of what is possible. Want to learn more or get hands-on: http://www.bnosac.be/index.php/training

[image: mindmap of the NLP ecosystem in R]

Making thematic maps for Belgium

For people from Belgium working in R with spatial data: you can find excellent workshop material on creating thematic maps for Belgium at https://workshop.mhermans.net/thematic-maps-r/index.html. The workshop was given by Maarten Hermans from HIVA - Onderzoeksinstituut voor Arbeid en Samenleving.
The plots are heavily based on BelgiumMaps.StatBel - an R package from bnosac released 2 years ago (more info at http://www.bnosac.be/index.php/blog/55-belgiummaps-statbel-r-package-with-administrative-boundaries-of-belgium).
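If you want to start from the administrative boundaries directly, below is a minimal sketch. It assumes the BelgiumMaps.StatBel package is installed as described in the blog post linked above and that BE_ADMIN_MUNTY is the municipality-level dataset; check the package documentation if the names differ.

library(BelgiumMaps.StatBel)
library(sp)
## municipality boundaries of Belgium
data("BE_ADMIN_MUNTY", package = "BelgiumMaps.StatBel")
plot(BE_ADMIN_MUNTY, main = "Belgian municipalities")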
 
[image: thematic maps of Belgium]

Human Face Detection with R


Doing human face detection with computer vision is probably something you only do once, unless you work for a police department, in the surveillance industry or for the Chinese government. In order to reduce the time you lose on that small exercise, bnosac created a small R package (source code available at https://github.com/bnosac/image) which wraps the weights of a Single Shot Detector (SSD) convolutional neural network trained with the Caffe deep learning kit. That network allows you to detect human faces in images. An example is shown below (tested on Windows and Linux).

install.packages("magick")
install.packages("image.libfacedetection", repos = "https://bnosac.github.io/drat")
library(magick)
library(image.libfacedetection)
image <- image_read("http://bnosac.be/images/bnosac/blog/wikipedia-25930827182-kerry-michel.jpg")
faces <- image_detect_faces(image)
faces
plot(faces, image, border = "red", lwd = 7, col = "white")

[image: faces detected with image.libfacedetection]

What you get out of this is, for each face, the x/y location and the width and height of the face. If you want to extract only the faces, loop over the detected faces and crop them out of the image as shown below.

allfaces <- Map(
    x      = faces$detections$x,
    y      = faces$detections$y,
    width  = faces$detections$width,
    height = faces$detections$height,
    f = function(x, y, width, height){
      image_crop(image, geometry_area(x = x, y = y, width = width, height = height))
    })
allfaces <- do.call(c, allfaces)
allfaces

Hope this gains you some time when doing what seems like the t-test of computer vision. Want to learn more on computer vision? Next time just follow our course on Computer Vision with R and Python: https://lstat.kuleuven.be/training/coursedescriptions/ComputervisionwithRandPython

Koning Filip lijkt op ... (King Filip resembles ...)


Last call for the course on Text Mining with R, held next week in Leuven, Belgium on April 1-2. Viewing the course description and subscribing can be done at https://lstat.kuleuven.be/training/coursedescriptions/text-mining-with-r

Some things you'll learn ... is that King Filip of Belgium is similar to public expenses, if we just look at open data from questions and answers in Belgian parliament (retrieved from http://data.dekamer.be). Proof is below. See you next week.

[image: Koning Filip similarity plot]

library(ruimtehol)
library(data.table)
library(lattice)
library(latticeExtra)
data("dekamer", package = "ruimtehol")
dekamer$x <- strsplit(dekamer$question, "\\W")
dekamer$x <- lapply(dekamer$x, FUN = function(x) setdiff(x, ""))
dekamer$x <- sapply(dekamer$x, FUN = function(x) paste(x, collapse = " "))
dekamer$x <- tolower(dekamer$x)
dekamer$y <- strsplit(dekamer$question_theme, split = ",")
dekamer$y <- lapply(dekamer$y, FUN=function(x) gsub(" ", "-", x))
set.seed(321)
model <- embed_tagspace(x = dekamer$x, y = dekamer$y,
                        early_stopping = 0.8, validationPatience = 10,
                        dim = 50,
                        lr = 0.01, epoch = 40, loss = "softmax", adagrad = TRUE,
                        similarity = "cosine", negSearchLimit = 50,
                        ngrams = 2, minCount = 2)
embedding_words  <- as.matrix(model, type = "words")
embedding_labels <- as.matrix(model, type = "labels", prefix = FALSE)
embedding_person <- starspace_embedding(model, tolower(c("Theo Francken")))
embedding_person <- starspace_embedding(model, tolower(c("Koning Filip")))
similarities <- embedding_similarity(embedding_person, embedding_words, top = 9)
similarities <- subset(similarities, !term2 %in% c("koning", "filip"))
similarities$term <- factor(similarities$term2, levels = rev(similarities$term2))
plt1 <- barchart(term ~ similarity | term1, data = similarities,
         scales = list(x = list(relation = "free"), y = list(relation = "free")),
         col = "darkgreen", xlab = "Similarity", main = "Koning Filip lijkt op ...")similarities <- embedding_similarity(embedding_person, embedding_labels, top = 7)
similarities$term <- factor(similarities$term2, levels = rev(similarities$term2))
plt2 <- barchart(term ~ similarity | term1, data = similarities,
         scales = list(x = list(relation = "free"), y = list(relation = "free")),
         col = "darkgreen", xlab = "Similarity", main = "Koning Filip lijkt op ...")
c(plt1, plt2)

Transfer learning and semi-supervised learning with ruimtehol


Last week the R package ruimtehol was updated on CRAN, giving R users who perform Natural Language Processing the possibility to

  • Do semi-supervised learning (learning where you have both text and labels, but not always both of them for the same document identifier).
  • Do transfer learning by passing on an embedding matrix (e.g. obtained via fasttext, GloVe or the like) and keep on training based on that matrix, or just use the embeddings in your Natural Language Processing flow (a sketch follows below).
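A rough sketch of the transfer learning route is shown below. One hedge: passing the pretrained vectors through an argument called embeddings is our reading of the package vignette, so double-check vignette("ground-control-to-ruimtehol") before relying on it. In practice you would load your own fasttext/GloVe matrix where we reuse a quickly trained one as a stand-in.

library(ruimtehol)
data(dekamer, package = "ruimtehol")
## stand-in for pretrained word vectors (one row per word); load fasttext/GloVe here instead
pretrained <- as.matrix(embed_wordspace(tolower(dekamer$question), dim = 25, epoch = 5))
## keep on training a classifier starting from these embeddings
## (the argument name 'embeddings' is an assumption - see the package vignette)
model <- embed_tagspace(x = tolower(dekamer$question),
                        y = strsplit(dekamer$question_theme, " +\\| +"),
                        embeddings = pretrained,
                        dim = 25, epoch = 5)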

More information can be found in the package vignette shown below, which you can also obtain by installing the package and opening the vignette with the following R code. Enjoy!

install.packages("ruimtehol")
vignette("ground-control-to-ruimtehol", package = "ruimtehol")

{aridoc engine="pdfjs" width="100%" height="550"}images/bnosac/blog/ground-control-to-ruimtehol.pdf{/aridoc}

update of udpipe


I'm happy to announce that the R package udpipe was updated recently on CRAN. CRAN now hosts version 0.8.3 of udpipe. The main features incorporated in the update include

  • parallel NLP annotation across your CPU cores
  • default models now use models trained on Universal Dependencies 2.4, allowing to do annotation in 64 languages, based on 94 treebanks from Universal Dependencies. We now have models built on afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, classical_chinese-kyoto, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, estonian-ewt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, italian-vit, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-alksnis, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, old_russian-torot, persian-seraji, polish-lfg, polish-pdb, polish-sz, portuguese-bosque, portuguese-br, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, sanskrit-ufal, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb, wolof-wtb
  • some fixes as indicated in the NEWS file

What does parallel NLP annotation look like right now? Let's do some annotation in French.

library(udpipe)
data("brussels_reviews", package = "udpipe")
x <- subset(brussels_reviews, language %in% "fr")
x <- data.frame(doc_id = x$id, text = x$feedback, stringsAsFactors = FALSE)
anno <- udpipe(x, "french-gsd", parallel.cores = 1, trace = 100)
anno <- udpipe(x, "french-gsd", parallel.cores = 4) ## this will be 4 times as fast if you have 4 CPU cores
View(anno)

[image: the udpipe documentation website]

Note that udpipe works particularly well in combination with the other bnosac NLP packages (textrank, crfsuite, BTM, ruimtehol).

And nothing stops you from using the R packages tm / tidytext / quanteda or text2vec alongside it!

Upcoming training schedule

If you want to know more, come attend the course on text mining with R or text mining with Python. Here is a list of scheduled upcoming public courses which BNOSAC is providing each year at the KULeuven in Belgium.

  • 2019-10-17&18: Statistical Machine Learning with R: Subscribe here
  • 2019-11-14&15: Text Mining with R: Subscribe here
  • 2019-12-17&18: Applied Spatial Modelling with R: Subscribe here
  • 2020-02-19&20: Advanced R programming: Subscribe here
  • 2020-03-12&13: Computer Vision with R and Python: Subscribe here
  • 2020-03-16&17: Deep Learning/Image recognition: Subscribe here
  • 2020-04-22&23: Text Mining with R: Subscribe here
  • 2020-05-05&06: Text Mining with Python: Subscribe here

dependency parsing with udpipe


We have been blogging about udpipe several times now.

Dependency parsing

A point which we haven't touched upon much yet is dependency parsing. Dependency parsing is an NLP technique which links each word in a sentence to another word in the sentence, called its syntactical head. This link between each pair of words furthermore has a certain type of relationship, giving you further details about it.

The R package udpipe provides such a dependency parser. With the output of dependency parsing, you can answer questions like

  1. What is the nominal subject of a text
  2. What is the object of a verb
  3. Which word modifies a noun
  4. Which words are linked to negative words
  5. Which words are compound statements
  6. What are the noun phrases and verb phrases in the text

Examples

In the following sentence:

His speech about marshmallows in New York is utter bullshit

you can see this dependency parsing in action in the graph below. You can see the compound statement 'New York', that the word speech is linked to the word bullshit with the relationship nominal subject, that the two nominals marshmallows and speech are linked as nominal noun modifiers, and that the word utter is an adjective which modifies the noun bullshit.

[image: dependency parsing example]

Obtaining such relationships in R is pretty simple nowadays. Running this code will provide you the dependency relationships among the words of the sentence in the columns token_id, head_token_id and dep_rel. The possible values in the field dep_rel are defined at https://universaldependencies.org/u/dep/index.html.

library(udpipe)
x <- udpipe("His speech about marshmallows in New York is utter bullshit", "english")

[image: udpipe annotation output]

R is excellent at visualisation. For visualising the relationships between the words which were found, you can just use the ggraph R package. Below we create a basic function which selects the right columns from the annotation and puts them into a graph.

library(igraph)
library(ggraph)
library(ggplot2)
plot_annotation <- function(x, size = 3){
  stopifnot(is.data.frame(x) & all(c("sentence_id", "token_id", "head_token_id", "dep_rel",
                                     "token", "lemma", "upos", "xpos", "feats") %in% colnames(x)))
  x <- x[!is.na(x$head_token_id), ]
  x <- x[x$sentence_id %in% min(x$sentence_id), ]
  edges <- x[x$head_token_id != 0, c("token_id", "head_token_id", "dep_rel")]
  edges$label <- edges$dep_rel
  g <- graph_from_data_frame(edges,
                             vertices = x[, c("token_id", "token", "lemma", "upos", "xpos", "feats")],
                             directed = TRUE)
  ggraph(g, layout = "linear") +
    geom_edge_arc(ggplot2::aes(label = dep_rel, vjust = -0.20),
                  arrow = grid::arrow(length = unit(4, 'mm'), ends = "last", type = "closed"),
                  end_cap = ggraph::label_rect("wordswordswords"),
                  label_colour = "red", check_overlap = TRUE, label_size = size) +
    geom_node_label(ggplot2::aes(label = token), col = "darkgreen", size = size, fontface = "bold") +
    geom_node_text(ggplot2::aes(label = upos), nudge_y = -0.35, size = size) +
    theme_graph(base_family = "Arial Narrow") +
    labs(title = "udpipe output", subtitle = "tokenisation, parts of speech tagging & dependency relations")
}

We can now call the function as follows to get the plot shown above:

plot_annotation(x, size = 4)

Let us see what it gives with the following sentence.

The economy is weak but the outlook is bright

x <- udpipe("The economy is weak but the outlook is bright", "english")
plot_annotation(x, size = 4)

[image: dependency parsing example]

You can see that with dependency parsing you can now answer the question 'What is weak?': it is the economy. 'What is bright?': it is the outlook, as these nouns relate to the adjectives through the nominal subject relationship. That is a lot richer information than just looking at wordclouds.
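A minimal sketch of answering that question programmatically from the annotation x above: for every nominal subject, look up the head token it points to.

## for each nominal subject, find the word it is the subject of
subjects <- subset(x, dep_rel %in% "nsubj")
subjects$what <- x$token[match(paste(subjects$doc_id, subjects$sentence_id, subjects$head_token_id),
                               paste(x$doc_id, x$sentence_id, x$token_id))]
subjects[, c("token", "what")]
##     token   what
##   economy   weak
##   outlook bright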

Hope this has shown beginning users of natural language processing that there is a myriad of NLP options beyond mere frequency-based word counting. Enjoy!

 

upcoming AI-related courses


I forgot to do some marketing for the following upcoming AI-related courses which will be given in Leuven, Belgium by BNOSAC

  • 2019-10-17&18: Statistical Machine Learning with R: Subscribe here
  • 2019-11-14&15: Text Mining with R: Subscribe here
  • 2019-12-17&18: Applied Spatial Modelling with R: Subscribe here
  • 2020-02-19&20: Advanced R programming: Subscribe here
  • 2020-03-12&13: Computer Vision with R and Python: Subscribe here
  • 2020-03-16&17: Deep Learning/Image recognition: Subscribe here
  • 2020-04-22&23: Text Mining with R: Subscribe here
  • 2020-05-06&07: Text Mining with Python: Subscribe here

Hope to see you there.

Neural Network Machine Learning for NLP


Last week, we updated the package ruimtehol on CRAN. The package provides an easy interface for R users to Starspace, a general purpose neural embedding model for text data.

Notable changes are that the package now also builds fine on Mac OS and runs fine on all CRAN platforms. If you are interested to see what the package can do, have a look at the presentation below or visit the package vignette at https://cran.r-project.org/web/packages/ruimtehol/vignettes/ground-control-to-ruimtehol.pdf

[image: screenshot of the presentation]

If you like it, give it a star at https://github.com/bnosac/ruimtehol and if you need commercial support on text mining, get in touch.

Upcoming training schedule 

Interested in NLP? Then you might also be interested in the following courses provided in Belgium. Hope to see you there!

  • 2020-02-19&20: Advanced R programming: Subscribe here
  • 2020-03-12&13: Computer Vision with R and Python: Subscribe here
  • 2020-03-16&17: Deep Learning/Image recognition: Subscribe here
  • 2020-04-22&23: Text Mining with R: Subscribe here
  • 2020-05-06&07: Text Mining with Python: Subscribe here

Last call for the course on Advanced R programming


Next week we will hold our yearly course on Advanced R programming at LStat, Leuven. If you are interested in learning one of the following techniques, don't hesitate to subscribe at https://lstat.kuleuven.be/training/coursedescriptions/AdvancedprogramminginR.html

  • Functions, the apply family of functions, parallelisation, advanced data manipulation with R
  • S3 programming
  • Building reports with markdown / Sweave
  • Build an R package

Interested in other trainings? Visit http://bnosac.be/index.php/training


See you next week!

Corona in Belgium


I lost a few hours this afternoon digging into the Corona virus data, mainly caused by reading an article which gives a nice view on how to be aware of potential issues which can arise when collecting data and of hidden factors, and which also shows Belgium.

  • As a Belgian, I was interested to see how Corona might impact our lives in the next weeks, and out of curiosity I wanted to see how we are doing compared to other countries regarding containment of the spread of the Corona virus - especially since we still do not have a government in Belgium, one year after the elections.
  • In what follows, I'll show some graphs using data available at https://github.com/CSSEGISandData/COVID-19 (it provides up-to-date statistics on Corona cases). If you want to reproduce this, pull the repository and execute the R code shown.

Data

Let's first see if the data matches exactly what is shown on our national television.

library(data.table)
library(lattice)
x <- list.files("csse_covid_19_data/csse_covid_19_daily_reports/", pattern = ".csv", full.names = TRUE)
x <- data.frame(file = x, date = substr(basename(x), 1, 10), stringsAsFactors = FALSE)
x <- split(x$file, x$date)
x <- lapply(x, fread)
x <- rbindlist(x, fill = TRUE, idcol = "date")
x$date <- as.Date(x$date, format = "%m-%d-%Y")
x <- setnames(x, 
              old = c("date", "Country/Region", "Province/State", "Confirmed", "Deaths", "Recovered"),
              new = c("date", "region", "subregion", "confirmed", "death", "recovered"))
x <- subset(x, subregion %in% "Hubei" |
region %in% c("Belgium", "France", "Netherlands", "Spain", "Singapore", "Germany", "Switzerland", "Italy"))

x$area <- ifelse(x$subregion %in% "Hubei", x$subregion, x$region)
x <- x[!duplicated(x, by = c("date", "area")), ]
x <- x[, c("date", "area", "confirmed", "death", "recovered")]
subset(x, area %in% "Belgium" & confirmed > 1)

Yes, the data from https://github.com/CSSEGISandData/COVID-19 looks correct indeed. Same numbers as reported on Belgian television.

date       area    confirmed death recovered
2020-03-01 Belgium         2     0         1
2020-03-02 Belgium         8     0         1
2020-03-03 Belgium        13     0         1
2020-03-04 Belgium        23     0         1
2020-03-05 Belgium        50     0         1
2020-03-06 Belgium       109     0         1
2020-03-07 Belgium       169     0         1
2020-03-08 Belgium       200     0         1
2020-03-09 Belgium       239     0         1
2020-03-10 Belgium       267     0         1
2020-03-11 Belgium       314     3         1

Exponential number of cases of Corona

  • Now is the outbreak really exponential? Let's make some graphs.

What is clear when looking at the plots is that infections indeed happen at an exponential scale, except in Singapore, where the government managed to completely isolate the Corona cases. In Belgium and other European countries, the government missed the opportunity to isolate the Corona cases and we are now in a phase of trying to slow down the spread and reduce the impact.

[image: confirmed cases of Corona per area]

You can reproduce the plot as follows

trellis.par.set(strip.background = list(col = "lightgrey"))
xyplot(confirmed ~ date | area, data = x, type = "b", pch = 20, 
scales = list(y = list(relation = "free", rot = 0), x = list(rot = 45, format = "%A %d/%m")), 
layout = c(5, 2), main = sprintf("Confirmed cases of Corona\n(last date in this graph is %s)", max(x$date)))

Compare to other countries - onset

It is clear that the onset of Corona differs per country. Let's define the onset (day 0) as the day on which more than 75 persons were confirmed to have Corona in the country. That allows us to compare the different countries. In Belgium we started to have more than 75 patients with Corona on Friday 2020-03-06. In the Netherlands that was one day earlier.

date       area        confirmed
2020-01-22 Hubei             444
2020-02-17 Singapore          77
2020-02-23 Italy             155
2020-02-29 Germany            79
2020-02-29 France            100
2020-03-01 Spain              84
2020-03-04 Switzerland        90
2020-03-05 Netherlands        82
2020-03-06 Belgium           109

Reproduce as follows:

x <- x[order(x$date, x$area, decreasing = TRUE), ]
x <- x[, days_since_case_onset := as.integer(date - min(date[confirmed > 75])), by = list(area)]
x <- x[, newly_confirmed := as.integer(confirmed - shift(confirmed, n = 1, type = "lead")), by = list(area)]
onset <- subset(x, days_since_case_onset == 0, select = c("date", "area", "confirmed"))
onset[order(onset$date), ]

Compare to other countries - what can we expect?

  • Now are we doing better than other countries in the EU?

The following plot shows the log of the number of people diagnosed with Corona since the onset date defined above. It looks like Belgium has learned a bit from the issues in Italy, but it still hasn't found a way to deal with the virus outbreak the way e.g. Singapore has (a country which learned from the SARS outbreak).

Based on the blue line, we can expect Belgium to have next week somewhere between roughly 1100 confirmed cases (log(1100)=7) and, if we follow the trend of France, roughly 3000 patients with Corona (log(3000)=8). We hope it is only the former.
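The back-of-the-envelope arithmetic behind those two numbers uses the natural logarithm:

exp(7)  ## 1096.6, so log(1100) is roughly 7
exp(8)  ## 2981.0, so log(3000) is roughly 8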

[image: log of confirmed cases since onset]

Reproduce as follows:

xyplot(log(confirmed) ~ days_since_case_onset | "Log(confirmed cases) of Corona since onset of sick person nr 75", 
groups = area,
data = subset(x, days_since_case_onset >= 0 &
area %in% c("Hubei", "France", "Belgium", "Singapore", "Netherlands", "Italy")),
xlab = "Days since Corona onset (confirmed case 75)", ylab = "Log of number of confirmed cases",
auto.key = list(space = "right", lines = TRUE),
type = "b", pch = 20, lwd = 2) 

Compared to the Netherlands

  • Now, are we doing better than The Netherlands?

Currently it looks like we are, but time will tell. Given the trend shown above, I can only hope everyone in Belgium follows the government guidelines as strictly as possible.

[image: newly confirmed cases, Belgium vs the Netherlands]

 

Reproduce as follows:

xyplot(newly_confirmed ~ date | "Newly confirmed cases of Corona", groups = area,
data = subset(x, area %in% c("Belgium", "Netherlands") & date > as.Date("2020-03-01")),
xlab = "Date", ylab = "Number of new Corona cases",
scales = list(x = list(rot = 45, format = "%A %d/%m", at = seq(as.Date("2020-03-01"), Sys.Date(), by = "day"))),
auto.key = list(space = "right", lines = TRUE),
type = "b", pch = 20, lwd = 2)