Pretraining of a Swiss Long Legal BERT Model

We will scrape legal text in German, French and Italian to pretrain a Swiss Long Legal BERT model that performs better on NLP tasks in the Swiss legal domain.

Factsheet

Situation

We see a clear research gap: BERT models capable of handling long multilingual text are currently underexplored (gap 1). Additionally, to the best of our knowledge, there is no multilingual legal BERT model available yet (gap 2). Tay et al. (2020b) present a benchmark for evaluating BERT-like models capable of handling long input and conclude preliminarily that BigBird (Zaheer et al., 2020) is currently the best performing variant.

Course of action

We thus propose to pretrain a BERT-like model (likely BigBird) on multilingual long text to fill the first research gap. To fill the second gap, we propose to further pretrain this model on multilingual legal text, following the domain-adaptive pretraining approach of Gururangan et al. (2020), as sketched below.
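A minimal sketch of such a further-pretraining step, assuming the Hugging Face transformers and datasets libraries: the checkpoint name and the corpus file swiss_legal_corpus.txt are placeholders (the actual base model would need to be multilingual and long-input capable), and the masked-language-modelling setup shown here is one standard way to continue pretraining, not the project's final training recipe.

```python
# Hedged sketch: domain-adaptive (further) pretraining with masked language modelling.
# The checkpoint and corpus file names below are illustrative placeholders only.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    BigBirdForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "google/bigbird-roberta-base"  # placeholder long-input checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = BigBirdForMaskedLM.from_pretrained(MODEL_NAME)

# One legal document per line; the file name is hypothetical.
raw = load_dataset("text", data_files={"train": "swiss_legal_corpus.txt"})

def tokenize(batch):
    # BigBird's sparse attention allows sequences of up to 4096 tokens.
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT-style objective: mask 15% of tokens and predict them.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="swiss-legal-bigbird",
        per_device_train_batch_size=1,
        num_train_epochs=1,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```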

This project contributes to the following SDGs

  • 9: Industry, innovation and infrastructure
  • 16: Peace, justice and strong institutions