Linguistic landscapes

1 Linguistic landscapes

As part of the seminar, we had to wander in the 13th arrondissement of Paris and take photos of text written on walls, displays, buildings, shops and try to get a sense of multilingualism in the neighborhood.

The course it not about NLP but since that’s what I do, I wanted to profit of the occasion to try new methods, like extracting texts from photos.

The different photos I took are located in the images/photos folder. I searched online and it seems like tesseract is a nice package to do just this. So let’s try it.

2 Libraries

Display code

library(tidyverse)
library(tesseract)
library(quanteda)
library(magick)
library(quanteda.textstats)
library(quanteda.textplots)
library(reactable)

3 Extract text from photos

Display code

setwd("images/photos/")
files <- list.files()
sentence <- list()
tesseract_download("fra")

[1] "C:\\Users\\Lucile\\AppData\\Local\\tesseract5\\tesseract5\\tessdata/fra.traineddata"

Display code

fr <- tesseract("fra")
t1 <- Sys.time()

for (photo in files) {
  cat("Photo processing of", photo, "\n")
  text <- magick::image_read(photo) %>%
  tesseract::ocr_data(engine = fr) %>%
  filter(confidence > 50)
  
  sentence[[photo]] <- text
  #texts[[photo]] <- bind_rows(texts,texts[[photo]])
}

Photo processing of agence.jpg 
Photo processing of ascenseur.jpg 
Photo processing of magasin.jpg 
Photo processing of porte.jpg 
Photo processing of porte2.jpg 
Photo processing of poubelle.jpg 
Photo processing of poubelle2.jpg 
Photo processing of vehicule.jpg 
Photo processing of vehicule2.jpg 
Photo processing of voisins.jpg 
Photo processing of voisins2.jpg

Display code

Sys.time()-t1

Time difference of 15.87275 secs

Display code

all_texts <- bind_rows(sentence, .id = "n_photo")

#image <- image_read("PXL_20230902_075947721.MP.jpg") %>%
  #image_ocr()

Display code

all_texts <- all_texts %>%
  filter(word != "|") %>%
  mutate(word = str_replace_all(word, "[[:punct:]]|\\d+|[[:cntrl:]]", ""))


text_df <- all_texts %>%
  group_by(n_photo) %>%
  summarize(sentence = paste(word, collapse = " "))

reactable(text_df, striped = TRUE)

4 Co-occurences network

Display code

set.seed(100)
toks <- text_df %>%
  pull(sentence) %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove("|")
fcmat <- fcm(toks, context = "window", tri = FALSE)
feat <- names(topfeatures(fcmat, 30))
fcm_select(fcmat, pattern = feat) %>%
  textplot_network(min_freq = 1, vertex_labelsize = 1.5 * rowSums(.)/min(rowSums(.)))

5 Co-occurences network without stopwords

Display code

set.seed(100)
toks <- text_df %>%
  pull(sentence) %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(pattern = c(stopwords("french"),"|"), padding = FALSE)
fcmat <- fcm(toks, context = "window", tri = FALSE)
feat <- names(topfeatures(fcmat, 30))
fcm_select(fcmat, pattern = feat) %>%
  textplot_network(min_freq = 1, vertex_labelsize = rowSums(.)/min(rowSums(.)))

6 Wordcloud

Display code

set.seed(10)
dfmat1 <- dfm(corpus(text_df$sentence),
              remove = c(stopwords("french"),"|"), remove_punct = TRUE) %>%
   dfm_trim(min_termfreq = 1)
    
# basic wordcloud
textplot_wordcloud(dfmat1)

7 Collocations

Display code

text_tokens <- tokens(text_df$sentence, remove_punct = TRUE)
# extract collocations
text_coll <- textstat_collocations(text_tokens, size = 2, min_count = 1)
# inspect
text_coll[1:6, 1:6]

    collocation count count_nested length   lambda        z
1      merci de     4            0      2 4.641181 4.390324
2        ne pas     4            0      2 7.968666 3.875207
3 bonne entente     2            0      2 7.393263 3.522138
4         de ne     2            0      2 3.685669 3.327393
5   entente des     1            0      2 4.679040 3.303471
6    vos verres     1            0      2 4.679040 3.303471

8 Sentiment analysis using DistilCamemBERT-Sentiment

Display code

from transformers import pipeline

sentiment = pipeline(
    task='text-classification',
    model="cmarkea/distilcamembert-base-sentiment",
    tokenizer="cmarkea/distilcamembert-base-sentiment"
)
result = sentiment(
    "Ne pas déposer miroirs et vitres verres à boire vaisselle Pour la tranquillité des riverains merci de ne pas jeter vos verres entre heures et heures Attention à ne pas vous coincer les doigts",
    top_k=None
  )
result

[{'label': '5 stars', 'score': 0.254049152135849}, {'label': '4 stars', 'score': 0.24574197828769684}, {'label': '3 stars', 'score': 0.19883131980895996}, {'label': '1 star', 'score': 0.15715321898460388}, {'label': '2 stars', 'score': 0.14422431588172913}]

--- title: "Linguistic landscapes" title-block-banner: true subtitle: "Paris' 13th arrondissement" author: - name: Olivier Caron email: olivier.caron@dauphine.psl.eu affiliations: name: "Paris Dauphine - PSL" city: Paris state: France date : "last-modified" toc: true number-sections: true number-depth: 5 format: html: theme: light: yeti dark: darkly code-fold: true code-summary: "Display code" code-tools: true #enables to display/hide all blocks of code code-copy: true #enables to copy code grid: body-width: 1000px margin-width: 100px toc: true toc-location: left execute: echo: true warning: false message: false editor: visual fig-align: "center" highlight-style: ayu css: styles.css reference-location: margin --- ## Linguistic landscapes As part of the seminar, we had to wander in the 13th arrondissement of Paris and take photos of text written on walls, displays, buildings, shops and try to get a sense of multilingualism in the neighborhood. The course it not about NLP but since that's what I do, I wanted to profit of the occasion to try new methods, like extracting texts from photos. The different photos I took are located in the `images/photos` folder. I searched online and it seems like tesseract is a nice package to do just this. So let's try it. ## Libraries ```{r} #| label: libraries #| message: false library(tidyverse) library(tesseract) library(quanteda) library(magick) library(quanteda.textstats) library(quanteda.textplots) library(reactable) ``` ## Extract text from photos ```{r} setwd("images/photos/") files <- list.files() sentence <- list() tesseract_download("fra") fr <- tesseract("fra") t1 <- Sys.time() for (photo in files) { cat("Photo processing of", photo, "\n") text <- magick::image_read(photo) %>% tesseract::ocr_data(engine = fr) %>% filter(confidence > 50) sentence[[photo]] <- text #texts[[photo]] <- bind_rows(texts,texts[[photo]]) } Sys.time()-t1 all_texts <- bind_rows(sentence, .id = "n_photo") #image <- image_read("PXL_20230902_075947721.MP.jpg") %>% #image_ocr() ``` ```{r} all_texts <- all_texts %>% filter(word != "|") %>% mutate(word = str_replace_all(word, "[[:punct:]]|\\d+|[[:cntrl:]]", "")) text_df <- all_texts %>% group_by(n_photo) %>% summarize(sentence = paste(word, collapse = " ")) reactable(text_df, striped = TRUE) ``` ## Co-occurences network ```{r} set.seed(100) toks <- text_df %>% pull(sentence) %>% tokens(remove_punct = TRUE) %>% tokens_tolower() %>% tokens_remove("|") fcmat <- fcm(toks, context = "window", tri = FALSE) feat <- names(topfeatures(fcmat, 30)) fcm_select(fcmat, pattern = feat) %>% textplot_network(min_freq = 1, vertex_labelsize = 1.5 * rowSums(.)/min(rowSums(.))) ``` ## Co-occurences network without stopwords ```{r} set.seed(100) toks <- text_df %>% pull(sentence) %>% tokens(remove_punct = TRUE) %>% tokens_tolower() %>% tokens_remove(pattern = c(stopwords("french"),"|"), padding = FALSE) fcmat <- fcm(toks, context = "window", tri = FALSE) feat <- names(topfeatures(fcmat, 30)) fcm_select(fcmat, pattern = feat) %>% textplot_network(min_freq = 1, vertex_labelsize = rowSums(.)/min(rowSums(.))) ``` ## Wordcloud ```{r} set.seed(10) dfmat1 <- dfm(corpus(text_df$sentence), remove = c(stopwords("french"),"|"), remove_punct = TRUE) %>% dfm_trim(min_termfreq = 1) # basic wordcloud textplot_wordcloud(dfmat1) ``` ## Collocations ```{r} text_tokens <- tokens(text_df$sentence, remove_punct = TRUE) # extract collocations text_coll <- textstat_collocations(text_tokens, size = 2, min_count = 1) # inspect text_coll[1:6, 1:6] ``` ## Sentiment analysis using DistilCamemBERT-Sentiment ```{python} from transformers import pipeline sentiment = pipeline( task='text-classification', model="cmarkea/distilcamembert-base-sentiment", tokenizer="cmarkea/distilcamembert-base-sentiment" ) result = sentiment( "Ne pas déposer miroirs et vitres verres à boire vaisselle Pour la tranquillité des riverains merci de ne pas jeter vos verres entre heures et heures Attention à ne pas vous coincer les doigts", top_k=None ) result ```