Text categorization problems (e.g., indexing, filtering, routing) are recurrent in various natural language processing tasks. This presentation focuses on text style to answer questions such as author profiling (determining an author's gender, age or psychological traits) and author verification. In this talk, we will cover the following aspects. First, according to the target application, a text representation must be determined. We will show that a relatively large variability exists in how to achieve a good text representation; some representations may appear unusual but can be useful for cross-lingual or cross-domain applications. Second, feature selection can be applied to extract the best subset of predictors. Third, a machine learning (ML) model must be selected and employed, with the choice between a simple ML model and a deep learning solution. Fourth, an evaluation procedure is used to measure the effectiveness of the different solutions. But this is not the final point. The data analyst must then answer questions such as: which predictors are the most influential for the model? How would the predictions change if a given predictor increases? Which features differentiate the writing style of men vs. women? We will show that explainable machine learning models can provide a comprehensive answer to these questions.
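As an illustration of that last point, the sketch below (our own simplified example, not the presenter's material) trains a logistic regression on character n-gram features with scikit-learn and inspects the most influential predictors; the two-document corpus and the gender labels are placeholders.

```python
# Minimal sketch (not the presenter's actual pipeline): character n-gram
# features + logistic regression for author profiling, then inspection of
# the most influential predictors. Corpus and labels are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["first example document ...", "second example document ..."]
labels = ["male", "female"]  # e.g. author gender, purely illustrative

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vectorizer.fit_transform(texts)

clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Which predictors push the model towards each class?
features = vectorizer.get_feature_names_out()
weights = clf.coef_[0]
top = sorted(zip(weights, features), key=lambda t: abs(t[0]), reverse=True)[:10]
for w, f in top:
    print(f"{f!r}: {w:+.3f}")
```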
Large models have drastically changed the natural language processing landscape. Nowadays, pre-trained language models represent the de-facto standard for implementing NLP models, even in areas where labelled data is scarce and hard to obtain. Although this works well for general-purpose NLP applications, it proves to be complex in highly specific domains. Additionally, using those models for server-side inference is challenging, and deploying them for on-device applications on constrained devices is completely impractical due to their size and inference cost. In this talk, we focus on alternatives to transformer-based architectures and on efficient NLP, and show that weight-efficient models can reach competitive performance on several tasks with model sizes in the order of a few megabytes.
Handwriting on digital surfaces, done either with a finger or a stylus, is one of the many ways to use digital ink. In this talk we focus on two tasks – handwriting recognition and digital ink synthesis – that together give stylus users access to functionality that matches and exceeds that of keyboard users. We show that handwriting recognition is far from a solved task and discuss the complexities of solving it in resource-constrained environments and in multilingual settings. Furthermore, we discuss recent advances in both tasks and the connections between them. Finally, we explore the connections to the specificities of the Swiss language ecosystem.
This talk will provide an overview of the evolving adoption of natural language processing techniques by Swiss companies and other organizations during the last 15 years. Indeed, while the first encounters with NLP in the 2000s were mainly in terms of efficient document management (information extraction and classification), several organizations took an interest in conversational interfaces around the 2010s. Chatbots have been seen as an entry point into deeper customer insights and more efficient knowledge access, but somehow addressed only part of the business needs – oftentimes with disappointing results.
During the last five years, the advent of large language models and efficient representation learning have offered new impact opportunities – even in very human-centered fields such as the job marketplace and humanitarian activity.
The most recent renaissance of conversational language models, joining key aspects of previous promising technologies, offers yet new perspectives that the Swiss market is eager to take. This goes hand in hand with an increased awareness of the value of explainability and the issue of trust.
The continued growth of Large Language Models (LLMs) and their wide-scale adoption in commercial applications such as ChatGPT make it increasingly important to (a) develop ways to source their training data in a more transparent way, and (b) to investigate it, both for research and for ethical issues. This talk will discuss the current state of affairs, the ongoing developments in LLM regulation and data documentation, and also present some data governance lessons learned from BigScience, an open-source effort to train a multilingual LLM.
Newspapers or magazines currently often publish articles in print or online in almost identical versions, maybe just adding links to related content or a user forum online. For further enriching the original article, we investigated methods to create different types of content automatically with state-of-the-art generative language models: (1) Keynotes, (2) a summary, and (3) a content-related quiz (i.e., a question with answers).
Our study is based on 2.5k German articles from the Swiss daily newspaper “Tages-Anzeiger”. Our evaluation with a professional editor showed that content generated by ChatGPT from OpenAI outperformed traditional NLP methods as well as human-generated content. At the venue, we plan to showcase a Web-based demo application and report on our experience with using the OpenAI API for automatic text generation in comparison with other NLP methods and human performance.
Public services rely on data to perform their functions and to document their decisions. They are legally required to archive such data. Archives, in return, are legally required to make these data accessible to the public, while protecting the privacy of people and legal entities. Anonymization provides a well-established trade-off between transparent governance and the right to privacy.
Depending on the data type, anonymization presents various levels of difficulty. It can be achieved with relative ease when dealing with structured statistical tables. When dealing with textual data, elements identifying people and legal entities are not immediately available, but must be located among innocuous words and phrases prior to any attempt at redaction. To ensure economic efficiency, public services seek to automate this process. Available solutions resort to named entity recognition (NER) driven by machine learning (ML). Such tools have shown their efficiency in many cases, and many pre-trained NER models already exist. But it is difficult to use such models on metadata such as document titles. In effect, such titles rarely form natural sentences, which means that they cannot be processed by pretrained models, while they can still contain sensitive personal information. For instance, if a person’s name occurs in the title of a document whose position in the archival tree implies legal prosecution, this mere appearance already gives away sensitive information, even without access to the document itself. Hence the need for the Swiss Federal Archives (SFA) to anonymize a substantial part of its metadata in order to be able to publish it, in accordance with federal law.
The SFA has commissioned the Text Crunching Center (TCC) of the University of Zurich to develop a system capable of detecting such sensitive metadata. Experience has shown that a satisfying level of automatic detection can be achieved with less than 2000 manually annotated metadata entries, using a combination of rule-based detection and an ensemble system consisting of three NER algorithms: Conditional Random Fields, Support Vector Machines and a Multilayer Perceptron. This tool allowed the SFA to publish metadata for 1.4 million additional records on its web portal.
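The ensemble idea can be illustrated with a small, purely schematic majority vote over per-token label sequences; the actual SFA/TCC system (see the GitHub repository mentioned below) combines its three NER algorithms and the rule-based detection in its own way.

```python
# Illustrative sketch only: combine per-token label sequences from three
# NER models (e.g. CRF, SVM, MLP) by simple majority vote, falling back to
# the first model's label when all three disagree.
from collections import Counter

def majority_vote(*label_sequences):
    combined = []
    for labels in zip(*label_sequences):        # labels for one token
        winner, count = Counter(labels).most_common(1)[0]
        combined.append(winner if count > 1 else labels[0])
    return combined

crf_out = ["B-PER", "I-PER", "O", "O"]
svm_out = ["B-PER", "O",     "O", "B-LOC"]
mlp_out = ["B-PER", "I-PER", "O", "B-LOC"]
print(majority_vote(crf_out, svm_out, mlp_out))
# ['B-PER', 'I-PER', 'O', 'B-LOC']
```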
In our presentation, we will discuss the legal context, as well as the training and evaluation process of this tool. We will also give a live demonstration. The tool is now published on GitHub (https://github.com/SwissFederalArchives/tcc-metadata-anonymization) in accordance with the open-source software development policy of the SFA.
VIAN-DH is a recent web application developed by the Linguistic Research Infrastructure of the University of Zurich, intended for the integrated analysis of audiovisual elements of human communication, such as speech, gesture, facial expressions and more. VIAN-DH aims at bridging two approaches to the analysis of human communication: conversation analysis/interactional linguistics (IL), so far a dominantly qualitative field, and computational/corpus linguistics with its quantitative and automated methods.
Contemporary IL investigates the systematic organization of conversations and interactions composed of speech, gaze, gestures, and body positioning, among others. This highly integrated multimodal behavior is analyzed based on video data, with the aim of uncovering so-called “multimodal gestalts”: patterns of linguistic and embodied conduct that recur in specific sequential positions and are employed for specific purposes.
Multimodal analyses (and other disciplines working with video) have so far depended on time- and resource-intensive manual transcription of each component of the video materials. Automating these tasks requires advanced programming skills, which are often not within the scope of IL. Moreover, the use of different tools makes the integration and analysis of different formats challenging. Consequently, IL research often deals with relatively small samples of annotated data, which are suitable for qualitative analysis but not sufficient for making generalized empirical claims derived quantitatively.
VIAN-DH aims to create a workspace where many annotation layers required for multimodal analysis of videos can be created, processed, and combined. VIAN-DH will provide a graphical interface that operates state-of-the-art tools for automating parts of the data processing. The integration of tools from computational linguistics and computer vision facilitates data processing, speeds up the overall research process, and enables the processing of large amounts of data. The main features to be introduced are automatic speech recognition for the transcription of language, pose estimation for the extraction of gestures and other visual cues, as well as grammatical annotation for adding morphosyntactic information to the verbal content.
In order to view and search the data, VIAN-DH will provide a unified format and enable the import of the main existing formats of annotated video data and the export to other formats used in the field, while integrating different data source formats in a way that they can be combined in research. VIAN-DH will adapt querying methods from corpus linguistics to enable parallel search of many annotation levels, combining token-level and chronological search for various types of data.
VIAN-DH strives to bring crucial innovation to the fields analyzing human communicative behavior from video materials. It will allow large amounts of data to be processed automatically and enable the implementation of quantitative analyses, combined with the qualitative approach. It will facilitate the investigation of correlations between linguistic patterns (lexical or grammatical) and conversational aspects (turn-taking or gestures). Users will be able to automatically transcribe and annotate visual, spoken and grammatical information from videos, to correlate those different levels, and to perform queries and analyses.
With the rise of Generative Large Language Models (LLM), companies across a variety of sectors are looking into the opportunities proffered by this new technology, in its multifaceted incarnations. One area of particular interest is the automated handling of customer requests – received through channels such as email, chat, social media, etc. – using the institutional knowledge at hand. Such knowledge is typically represented in a mix of documents, and, to respond to requests with accuracy, a system often needs to combine information from multiple sources, with the entire knowledge base as a “context”. An implication of the above is that, when using an LLM for that purpose, one needs to “adapt” it to the domain at hand and to the factual knowledge of the individual organization. It also means that the model will need to have access, in one way or another, to privileged, non-public information in the company’s knowledge base. To satisfy these requirements, one would need to have models that: (a) can be trained / fine-tuned / otherwise adapted with reasonable resources; and (b) can be prepared and used on-premise, to avoid sending private information outside the company’s infrastructure. Towards these goals, we are exploring how open-source language models can be customized for handling customer requests by utilizing prompt engineering, the ReAct paradigm, and model fine-tuning. Our primary focus is on the German language. At present, we are evaluating the BLOOM and LLaMA models and their derivatives. Given the rapid advances in the field, the scope of the work may be adjusted between the time of writing and the SwissText conference to take into consideration new developments. In this talk we will discuss the different approaches we have investigated, the necessary data and hardware requirements for implementing them, and share the results of our research.
New drugs are risky and costly to develop. “Drug repositioning” or “drug repurposing” describes the well-known practice in identifying new uses for already existing drugs or active compounds. Using a case study, this paper describes ongoing research about the exploration of the potential in using NLP techniques on publicly available data sources to identify drugs for glioblastoma therapy not documented in established standardized databases.
In this work, a pretrained instance of RoBERTa is used to extract named entities from resumes. These named entities can be used to form a knowledge base, summarize the resumes, evaluate whether the resumes contain the prerequisites mentioned in the job offers and finally to pre-fill application forms during a job application. After transfer learning performed on a corpus of 112 resumes, the model achieves an average F1-Score of 0.72 across the 22 named entity classes.
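For illustration, a fine-tuned token-classification model of this kind could be applied to a resume snippet with the Hugging Face pipeline API as sketched below; the checkpoint name is a hypothetical placeholder, not the model described in this work.

```python
# Hedged sketch: applying a fine-tuned token-classification model to a
# resume snippet with the Hugging Face pipeline API. The model name is a
# placeholder, not the model described in the abstract.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="your-org/roberta-resume-ner",   # hypothetical checkpoint
    aggregation_strategy="simple",
)
resume_snippet = "2018-2021: Data Engineer at Example AG, Zurich. Skills: Python, SQL."
for entity in ner(resume_snippet):
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))
```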
Transformers pretrained on unfiltered corpora are known to contain biases. The biases presented in this work appear when the model is used to extract the first name contained in an artificially generated sentence. These first names are associated with ethnic groups and genders (male or female). In particular, the model more easily extracts first names associated with white men. Some suggestions to mitigate the bias are presented in the discussion section.
Recent breakthroughs in NLP have largely increased the presence of ASR systems in our daily lives. However, for many low-resource languages, ASR models still need to be improved, due in part to the difficulty of acquiring pertinent data. This project aims to help advance research in ASR models for Swiss German dialects by providing insights into the performance of state-of-the-art ASR models on recently published Swiss German speech datasets. We propose a novel loss that takes into account the semantic distance between the predicted and the ground-truth labels. We outperform current state-of-the-art results by fine-tuning OpenAI’s Whisper model on Swiss German datasets.
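As a minimal illustration of the starting point, an off-the-shelf Whisper checkpoint can be run on a Swiss German clip via the Hugging Face pipeline; this is only a baseline sketch, not the fine-tuned model from this work, and the audio file name is a placeholder.

```python
# Baseline sketch: transcribe a Swiss German audio clip with an off-the-shelf
# Whisper checkpoint. The audio path is a placeholder.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
print(asr("swiss_german_sample.wav")["text"])
```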
The growing capabilities of transformer models pave the way for solving increasingly complex NLP tasks. A key to supporting application-specific requirements is the ability to fine-tune. However, compiling a fine-tuning dataset tailored to complex tasks is tedious and results in large datasets, limiting the ability to control transformer output. We present an approach in which complex tasks are divided into simpler subtasks. Multiple transformer models are fine-tuned to one subtask each, and lined up to accomplish the complex task. This simplifies the compilation of fine-tuning datasets and increases overall controllability. Using the example of reducing gender bias as a complex task, we demonstrate our approach and show that it performs better than using a single model.
Text summarization is an important downstream natural language processing (NLP) task that challenges both the understanding and generation capabilities of language models. Thanks to large language models (LLMs) and techniques for fine-tuning models in machine learning, significant progress has been made in automatically summarizing short texts such as news articles, often leading to very satisfactory machine-generated results. In contrast, summarizing long documents still remains a major challenge. This is partly due to the complex nature of contextual information in long texts, but also due to the lack of open-source benchmarking datasets and the corresponding evaluation frameworks that can be used to develop and test model performance. In this work, we use ChatGPT, the latest breakthrough in the field of NLP and LLMs, together with the extractive summarization model C2F-FAR (Coarse-to-Fine Facet-Aware Ranking) to propose a hybrid extraction and summarization pipeline for long documents such as business articles and books. We work with the world-renowned company getAbstract AG and leverage their expertise and experience in professional book summarization. A practical study has shown that machine-generated summaries can perform at least as well as human-written summaries when evaluated using current automated evaluation metrics. However, a closer examination of the texts generated by ChatGPT through human evaluations has shown that there are still critical issues in terms of text coherence, faithfulness, and style. Overall, our results show that the use of ChatGPT is a promising but not yet mature approach for summarizing long documents and can at best serve as an inspiration for human editors. We anticipate that our work will inform NLP researchers about the extent to which ChatGPT’s capabilities for summarizing long documents overlap with practitioners’ needs. Further work is needed to test such a hybrid summarization pipeline, in particular involving GPT-4, and to propose a new evaluation framework tailored to the task of summarizing long documents.
Prompting is used to guide or steer a language model in generating an appropriate response that is consistent with the desired outcome.
Chaining is a concept that is applied to break down more complex tasks into smaller, manageable tasks.
We use prompt chaining for long legal document classification tasks, since they are difficult to process at once due to the complex domain-specific language as well as the length.
First, we create a concise summary of the original document.
Second, we use semantic search to retrieve related exemplar texts with corresponding annotations from a training corpus.
Finally, we prompt for a label – based on the task – to assign, by leveraging the in-context learning from the few-shot prompt.
We show that prompt chaining improves performance over zero-shot prompting, and that with smaller models it even outperforms zero-shot ChatGPT in terms of micro-F1 score.
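A conceptual sketch of the three chained steps is given below; `complete()` and `retrieve_examples()` are placeholders for an LLM call and a semantic-search lookup, not the authors' implementation.

```python
# Conceptual sketch of the three chained prompts; `complete()` stands for any
# LLM completion call and `retrieve_examples()` for a semantic-search lookup
# in the training corpus -- both are placeholders, not the authors' code.

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API / local model here")

def retrieve_examples(summary: str, k: int = 3) -> list[tuple[str, str]]:
    raise NotImplementedError("semantic search over annotated training texts")

def classify_document(document: str) -> str:
    # Step 1: condense the long legal document.
    summary = complete(f"Summarize the following legal document concisely:\n{document}")

    # Step 2: fetch labelled exemplars similar to the summary.
    exemplars = retrieve_examples(summary)
    shots = "\n\n".join(f"Text: {t}\nLabel: {l}" for t, l in exemplars)

    # Step 3: few-shot prompt for the final label.
    prompt = f"{shots}\n\nText: {summary}\nLabel:"
    return complete(prompt).strip()
```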
We present SwissBERT, a masked language model created specifically for processing Switzerland-related text. SwissBERT is a pre-trained model that we adapted to news articles written in the national languages of Switzerland – German, French, Italian, and Romansh. We evaluate SwissBERT on natural language understanding tasks related to Switzerland and find that it tends to outperform previous models on these tasks, especially when processing contemporary news and/or Romansh Grischun. Since SwissBERT uses language adapters, it may be extended to Swiss German dialects in future work. The model and our open-source code are publicly released at <url anonymized for review>.
We test similarity-based word alignment models (SimAlign and awesome-align) in combination with word embeddings from mBERT and XLM-R on parallel sentences in German and Romansh. Since Romansh is an unseen language, we are dealing with a zero-shot setting. Using embeddings from mBERT, both models reach an alignment error rate of 0.22, which outperforms fast_align, a statistical model, and is on par with similarity-based word alignment for seen languages. We interpret these results as evidence that mBERT contains information that can be meaningful and applicable to Romansh.
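A minimal usage sketch, following SimAlign's published Python interface, is shown below; the German-Romansh sentence pair is illustrative only.

```python
# Sketch of zero-shot word alignment with SimAlign and mBERT embeddings,
# following SimAlign's published Python interface; sentences are illustrative.
from simalign import SentenceAligner

aligner = SentenceAligner(model="bert", token_type="bpe", matching_methods="mai")

src = "Der Kanton Graubünden informiert regelmässig".split()
trg = "Il chantun Grischun infurmescha regularmain".split()

alignments = aligner.get_word_aligns(src, trg)
print(alignments["itermax"])   # list of (source_index, target_index) pairs
```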
To evaluate performance, we also present a new trilingual corpus, which we call the DERMIT (DE-RM-IT) corpus, containing press releases made by the Canton of Grisons in German, Romansh and Italian in the past 25 years. The corpus contains 4 547 parallel documents and approximately 100 000 sentence pairs in each language combination. We additionally present a gold standard for German-Romansh word alignment. The data is available at https://github.com/eyldlv/DERMIT-Corpus.
Automatic Text Simplification (ATS) aims at simplifying texts by reducing their linguistic complexity while retaining their meaning. While being an interesting task from a societal and computational perspective, the lack of monolingual parallel data prevents an agile implementation of ATS models, especially in less resource-rich languages than English. For these reasons, this paper investigates how to create a general-language parallel simplification dataset for French using a method to extract complex-simple sentence pairs from comparable corpora like Wikipedia and its simplified counterpart, Vikidia. By using a two-step automatic filtering process, we sequentially address the two primary conditions that must be satisfied for a simplified sentence to be considered valid: (1) preservation of the original meaning, and (2) simplicity gain with respect to the source text. Using this approach, we provide a dataset of parallel sentence simplifications (WiViCo) that can be later used for training French sequence-to-sequence general-language ATS models.
We introduce Parallel Paraphrasing (Para-both), an augmentation method for translation metrics making use of automatic paraphrasing of both the reference and hypothesis. This method counteracts the typically misleading results of speech translation metrics such as WER, CER, and BLEU if only a single reference is available. We introduce two new datasets explicitly created to measure the quality of metrics intended to be applied to Swiss German speech-to-text systems. Based on these datasets, we show that we are able to significantly improve the correlation with human quality perception if our method is applied to commonly used metrics.
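The intuition can be sketched as follows, in a simplified form whose aggregation may differ from the exact procedure used in Para-both: paraphrase both sides and keep the most favourable pairing. `paraphrase()` is a placeholder for any automatic paraphraser.

```python
# Illustrative sketch of the idea behind multi-reference scoring with
# paraphrases of both reference and hypothesis; not the paper's exact method.
import jiwer

def paraphrase(text: str) -> list[str]:
    raise NotImplementedError("plug in an automatic paraphrasing model")

def para_both_wer(reference: str, hypothesis: str) -> float:
    refs = [reference] + paraphrase(reference)
    hyps = [hypothesis] + paraphrase(hypothesis)
    return min(jiwer.wer(r, h) for r in refs for h in hyps)
```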
In Switzerland, two thirds of the population speak Swiss German, a primarily spoken language with no standardised written form. It is widely used on Swiss TV, for example in news reports, interviews and talk shows, and captions are required for people who do not understand this spoken language. This paper focuses on the second part of a cascade approach for the automatic Standard German captioning of spoken Swiss German. We apply a multilingual pre-trained model to translate automatic speech recognition of Swiss German into Standard German suitable for captioning. Results of several evaluations, both human and automatic, show that the system succeeds in improving the content, but is currently not capable of producing entirely correct Standard German.
We present 20 Minuten, a dataset for abstractive text summarisation of German news articles. The corpus fills a gap in summarisation resources for the German language and includes multiple professionally written and stylistically distinct summaries, along with captioned images and document-level reading times. In this paper, we conduct baseline experiments with mT5 and compare the performance of fine-tuning in both single and multi-task settings on six different downstream tasks supported by our dataset. Our results reveal that dedicated models are preferable, especially for the more distinct tasks. We make our dataset available for research and provide the code to facilitate handling and model training.
The Battle for NLP Ideas is a collaborative event where participants team up to brainstorm innovative ideas based on the latest NLP technologies. The best ideas are presented on stage and shared with the whole conference.
Reranking the k best hypothesis parse trees from an existing parser makes it possible to take into account more of the information that the model has gathered during training than simply decoding the most likely dependency tree. In this paper, we first investigate whether state-of-the-art parsers can still benefit from reranking in low-resource languages. As part of this analysis, we deliver new insights concerning rerankability. Second, we propose a novel approach to mixture reranking by using a reject option for the reranker, which paves the way for designing interpretable reranking-based parsing systems in the future.
This paper presents a deep learning based model to detect the completeness and correctness of a sentence. It is designed specifically for detecting errors in speech recognition systems and takes several typical recognition errors into account, including false sentence boundaries, missing words, repeated words and false word recognition. The model can be applied to evaluate the quality of recognized transcripts, and the optimal model reports over 90.5% accuracy in detecting whether the system completely and correctly recognizes a sentence.
This paper presents a review of different text classification models, both traditional ones and state-of-the-art models. The simple models under review were Logistic Regression, naïve Bayes, k-Nearest Neighbors, C-Support Vector Classifier, Linear Support Vector Machine Classifier, and Random Forest. On the other hand, the state-of-the-art models used were classifiers that include pretrained embedding layers, namely BERT or GPT-2. Results are compared among all of these classification models on two multiclass datasets, ‘Text_types’ and ‘Digital’, addressed later in the paper. While BERT was tested both as a multiclass and as a binary model, GPT-2 was used as a binary model on all the classes of a certain dataset. In this paper we showcase the most interesting and relevant results. The results show that for the datasets at hand, BERT and GPT-2 perform best, though the BERT model outperforms GPT-2 by one percentage point in terms of accuracy. It should be borne in mind that these two models were tested on a binary case, whereas the other ones were tested on a multiclass case. The models that performed best on the multiclass case are the C-Support Vector Classifier and BERT. To establish the absolute best classifier in the multiclass case, further research is needed that would deploy GPT-2 on a multiclass case.
Recently, online customer reviews have surged in popularity, placing additional demands on businesses to respond to these reviews.
Conditional text generation models, trained to generate a response given an input review, have been proposed to facilitate human authors in composing high-quality responses.
However, this approach has been shown to yield rather unsatisfying, generic responses while, in practice, responses are required to address reviews specifically and individually.
We hypothesise that this issue could be tackled by changing the alignment paradigm and using sentence-aligned training data instead of document-aligned. Yet, finding correct sentence alignments in the review-response document pairs is not trivial.
In this paper we investigate methods to align sentences based on computing the surface and semantic similarity between source and target pairs and benchmark performance for this rather challenging alignment problem.
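A minimal sketch of such a combined score, assuming a multilingual sentence-embedding model and a simple character-level ratio (not the paper's exact setup), could look like this:

```python
# Minimal sketch (not the paper's system): score candidate review/response
# sentence pairs by combining a surface similarity (character-level ratio)
# with a semantic similarity from multilingual sentence embeddings.
from difflib import SequenceMatcher
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def pair_score(src: str, tgt: str, alpha: float = 0.5) -> float:
    surface = SequenceMatcher(None, src.lower(), tgt.lower()).ratio()
    semantic = util.cos_sim(model.encode(src), model.encode(tgt)).item()
    return alpha * surface + (1 - alpha) * semantic

print(pair_score("The room was very clean.",
                 "Thank you, we are glad you found the room clean."))
```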
Voice assistants understanding dialects would help especially elderly people. Automatic Speech Recognition (ASR) performs poorly on dialects due to the lack of sizeable datasets. We propose three adaptation strategies which allow to improve an ASR model trained for German language to understand Swiss German spoken by a target speaker using as little as 1.5 hours of speaker data. Our best result was a word error rate (WER) of 0.27 for one individual.
We propose a novel type of document representation that preserves textual, visual, and spatial information without containing any sensitive data. We achieve this by transforming the original visual and textual data into simplified encodings. These pieces of non-sensitive information are combined into a tensor to form the NonDisclosureGrid (NDGrid). We demonstrate its capabilities on information extraction tasks and show that our representation matches the performance of state-of-the-art representations and even outperforms them in specific cases.
In supervised classification tasks, a machine learning model is provided with an input, and after the training phase, it outputs one or more labels from a fixed set of classes. Recent developments of large pre-trained language models (LLMs), such as BERT, T5 and GPT-3, gave rise to a novel approach to such tasks, namely prompting.
In prompting, there is usually no further training required (although fine-tuning is still an option); instead, the input to the model is extended with an additional text specific to the task – a prompt. Prompts can contain questions about the current sample, examples of input-output pairs or task descriptions. Using prompts as clues, an LLM can infer the intended outputs from its implicit knowledge in a zero-shot fashion.
Legal prompt engineering is the process of creating, evaluating, and recommending prompts for legal NLP tasks. It would enable legal experts to perform legal NLP tasks, such as annotation or search, by simply querying LLMs in natural language.
In this presentation, we investigate prompt engineering for the task of legal judgement prediction (LJP). We use data from the Swiss Federal Supreme Court and the European Court of Human Rights, and we compare various prompts for LJP using multilingual LLMs (mGPT, GPT-J-6B, etc.) in a zero-shot manner. We find that our approaches achieve promising results, but the long documents in the legal domain are still a challenge compared to single sentence inputs.
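A toy zero-shot LJP prompt in the spirit of this setup might look as follows; the label set, the facts excerpt and the `complete()` call are invented placeholders, not the courts' data or the evaluated models.

```python
# Toy zero-shot prompt for legal judgement prediction; the label set, the
# excerpt and the complete() call are placeholders, not the talk's setup.
def complete(prompt: str) -> str:
    raise NotImplementedError("query a multilingual LLM here")

labels = ["approval", "dismissal"]
facts = "The appellant challenges the lower court's decision on ..."

prompt = (
    "You are given the facts of a court case.\n"
    f"Possible judgements: {', '.join(labels)}.\n"
    f"Facts: {facts}\n"
    "Judgement:"
)
# prediction = complete(prompt)   # label inferred zero-shot
```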
For the compliance and legal profession, the exponential growth of data is both a threat and a promise. A threat, because finding crucial facts buried in hundreds of thousands of documents is hard. A promise, because proper management, analysis, and interpretation of data provides a competitive advantage and full transparency during all stages of investigations.
In this talk, we present our novel text analytics platform Herlock.ai to leverage these possibilities.
Herlock.ai finds mentions of persons, dates, and locations in the corpus and makes these findings available to the user. In order to achieve this, several hurdles have to be overcome. Paper documents need to become machine readable. Even when the digital version exactly replicates the paper, the data is not available for analysis because of human inconsistencies and errors.
Herlock.ai fixes these problems and provides excellent content. In order to support the user in their work, Herlock.ai needs to be easy to use and understand, e.g. by splitting documents into meaningful parts, by comparing different variants, and by marking textual anomalies.
We will give a live demo of Herlock.ai. The platform has been used in a recent Swiss legal case that has received high media coverage. 500 federal folders that physically fill entire walls of shelves were an unprecedented challenge for the involved parties. Quick search and navigation proved to be a key tool, and the analytics we provided were used for official submissions to the court.
The project “Schweizer Dialektsammlung” (“Swiss Dialect Collection”) has been running since spring 2021. Its goal is to collect a large dataset with Swiss German audio samples and their transcriptions to Standard German text. So far, we have crowdsourced 200 hours of audio from nearly 4000 volunteers via a web recording platform, equivalent to over 150’000 text prompts. The dataset is called SDS-200 and will be released for research purposes.
In a related project funded by the Schweizer Nationalfonds (SNF), we are using SDS-200 together with parallel dialect data to find out how Swiss German Speech-to-text (STT) systems can better recognise dialects for which little annotated data is available. Initial experimental results show that including SDS-200 as part of the training data significantly enhances STT performance: the BLEU score on the All Swiss German Dialects Test Set improves from 48 to 65 when we add SDS-200.
We are also planning the next phase of “Schweizer Dialektsammlung”, where users can form teams and compete for prizes.
We will
– present the project and the data collected so far
– discuss our Speech-to-Text experiments and results
– talk about lessons learnt
– provide an outlook of planned future activities in data collection and systems development
Transcribing Swiss German speech automatically into Standard German text is a challenging Speech-to-Text (STT) translation task. For the last two years, we at FHNW have worked on the development of an end-to-end system to solve this task. In cooperation with ZHAW, we also created a 35-hour test corpus which contains 7 × 5 hours of audio with transcripts of Swiss newspaper articles spoken in 7 Swiss German dialects (Basel, Bern, Graubünden, Innerschweiz, Ostschweiz, Wallis, and Zürich). Thus, for each region, we collected a total of 3600 spoken sentences from at least 10 different speakers.
We use this test set to objectively quantify the quality of our STT system and compare it to two commercial STT services for Swiss German to Standard German.
We evaluated all three STT systems using our test set and we present a fair comparison using our carefully designed test corpus. We discuss weaknesses and strengths of the three models in terms of the different dialects and other aspects.
Agenda – Introduction about: Who we are – Our Firm – The Vision Starts – Our Purpose & Activities – Beyond Expectations – Spoiler Alert: /bs.com/career – Conversational Banking – the project
Daniel Mazzolini Head of UBS-BSC Manno
Vladislav Matic Lugo Product Owner Advanced Analytics
Drawing on our network of around 200 branches and 4,600 client advisors, complemented by modern digital banking services and customer service centers, we are able to reach approximately 80% of Swiss wealth.
Conversational Banking
Vision: Bring natural language as a new way of interacting with digital clients and employees, enhancing user experience and increasing efficiency
Ambition: For our clients: offer digital services via conversational interfaces
For our employees: provide a virtual assistant for knowledge workers, call agents and client advisors along the most important business domains
Why Conversational Banking – In 2020, more than 3 million requests were raised to our support units – ~40% of queries are trivial in nature and have an associated self-service option or info materials – Common request types are information, navigation, update, order – Common questions are: What is it? Where can I find it? How to do it?
Leveraging cloud cognitive services for Conversational Banking use cases
Cloud-native application in Switzerland – leveraging Microsoft Cognitive Services in the cloud
Large scale transformer models have become the de-facto standard in academia. However, until now few (non-tech) companies have actually developed and globally scaled transformer-based data products, leading to a dearth of industry case studies. That said, at Zurich a data science team has developed a general-purpose transformer-based document extraction solution that was first piloted in 2018–2019 and later scaled to over 10 markets globally, enabling the automated processing of millions of highly complex and diverse input documents (emails, PDFs, scans, voice-to-text messages, etc.).
In this presentation the team will outline the opportunities and challenges of scaling such models in the financial services industry, outlining key technology and business considerations to successfully deploy and scale them in an industry setting. The importance of research collaborations with universities will also be covered in this talk.
Swiss software company WellD and its SaaS hotel tech spinoff TellTheHotel wish to develop a closed-domain task-oriented conversational agent for the hospitality industry.
As part of the Innosuisse-funded project TACO “Closed-domain task-oriented conversational agents with embedded intelligence”, SUPSI is helping WellD and TellTheHotel leverage the state of the art of NLP to build a custom conversational agent. The main objective is the development of a multi-language, multichannel, digital concierge that enables end users to complete a hotel reservation as well as ancillary activities through a natural conversational flow.
The conversational agent is based on customized cutting-edge NLP techniques and the RASA framework. Basic hotel reservation requests are handled with multi-language intent detection, which is carried out through the application of BERT-based cross-lingual sentence embeddings, with substantial benefits compared to translation-based systems. User queries that go beyond room reservations are handled with a BERT model fine-tuned for question answering.
From an architectural point of view, the project is developed based on micro-services and deployed as a Kubernetes cluster to ensure scalability.
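A simplified sketch of the cross-lingual intent-detection idea is shown below; the multilingual sentence-embedding checkpoint and the example intents are assumptions, not the TACO implementation.

```python
# Sketch only (not the TACO implementation): multilingual intent detection by
# comparing a user query against example utterances per intent using
# cross-lingual sentence embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

intent_examples = {
    "book_room": ["I would like to book a double room", "Vorrei prenotare una camera"],
    "ask_breakfast": ["When is breakfast served?", "A che ora è la colazione?"],
}

def detect_intent(query: str) -> str:
    q = model.encode(query)
    scores = {
        intent: max(util.cos_sim(q, model.encode(ex)).item() for ex in examples)
        for intent, examples in intent_examples.items()
    }
    return max(scores, key=scores.get)

print(detect_intent("Ich möchte ein Zimmer für zwei Nächte reservieren"))
```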
Every day we write text. We try to write grammatically correctly and politely, and spelling and grammar checkers support us in doing so. But many people struggle to ensure that their writing does not contain deterring words. Research shows that using non-inclusive language (especially gendered, ageist, and racist words) strongly contributes to missing out on a large share of potential talent in the labor market. We developed a smart tool, Witty, that can assist in automatically detecting and suggesting alternatives to deterring language, enabling inclusive writing.
The core Witty algorithm is based on advanced Natural Language Processing (NLP) technologies. We combine a rule-based approach with a modern transformer architecture. We created our own glossaries (German, English) of inclusive and non-inclusive words with the help of our highly trained language specialists, based on studies and research in the field. We use pre-trained NLP models for German and English, respectively (spaCy). We transform the words in the text into dictionary-like forms, perform linguistic analysis, extract linguistic features from the text, and also perform named entity recognition to extract geographic locations, organization names, people, and numbers from the user text if needed.
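A much-simplified sketch of this rule-based part, assuming spaCy's German pipeline and a toy one-entry glossary (the real glossaries are far larger), is given below.

```python
# Simplified sketch of the rule-based part: lemmatize the text with spaCy and
# flag lemmas that appear in a (tiny, illustrative) non-inclusive glossary.
import spacy

nlp = spacy.load("de_core_news_sm")
non_inclusive = {"geschäftsmann": "Geschäftsperson"}   # toy glossary entry

doc = nlp("Wir suchen einen erfahrenen Geschäftsmann für unser Team.")
for token in doc:
    suggestion = non_inclusive.get(token.lemma_.lower())
    if suggestion:
        print(f"'{token.text}' -> consider '{suggestion}'")

# Named entities (locations, organisations, persons) are also available:
print([(ent.text, ent.label_) for ent in doc.ents])
```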
Currently, we are implementing transformer models (BERT, via Hugging Face) to properly identify the correct meaning of words, classify job-related text, and perform sentiment analysis.
Libraries of technical expert information have been written in free text. Such texts are usually authored by experts in a semi-formal style. How can this valuable information be extracted into structured and useful representations?
The human touch of these texts renders rule-based approaches useless. Annotating enough samples for an ML model might be too expensive. We show an approach that essentially combines both worlds by using an off-the-shelf dependency parser together with tree-based extraction rules.
Any syntax tree – even if it happens to be incorrect, as in the illustration below (only visible in the PDF version) – with its phrases and their relations has the right format for rule-based extraction in this context. Rules detect syntactic relations and adpositions and build the structured output. Enumerations can easily be determined and extracted as lists.
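A minimal sketch of such tree-based rules, using spaCy's English pipeline and an invented example sentence rather than CRB's actual catalog text and rule set:

```python
# Minimal sketch of tree-based extraction rules on top of an off-the-shelf
# dependency parser (spaCy here); the rule itself is illustrative, not CRB's.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The wall is built of reinforced concrete with a thickness of 25 cm.")

facts = []
for token in doc:
    # adposition rule: a preposition with its object, attached to a head word
    if token.dep_ == "prep":
        for child in token.children:
            if child.dep_ == "pobj":
                facts.append((token.head.lemma_, token.text,
                              " ".join(w.text for w in child.subtree)))
print(facts)
```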
We demonstrate the effectiveness of this methodology on a catalog from CRB, which has been standardizing processes in the Swiss construction industry for over 60 years. Expert authors write books of detailed definitions on walls, tunnels, canalisation etc., specifying which types of concrete can be used, for which purpose a wall is meant, and what dimensions it should have. This is then used by contractors to formulate clear offers. For the next generation of standards, a structured representation of the legacy catalogs is extracted with the methodology described above.
We present a joint project between the Laboratory for Web Science at the FFHS (Fernfachhochschule Schweiz) and a start-up company, Skills Finder AG. It is funded by InnoSuisse. The goal of the project is to build a platform for processing job application documents that automates the extraction of relevant information from a candidate’s CV and the validation of this information with the candidate’s references and certificates. Our processing pipeline uses the most recent advances in the field of both image document processing and Natural Language Processing (NLP).
The first step in the processing of any document is to extract the text from it while taking the proper reading order into consideration. As CVs have very diverse layouts, none of the existing tools could correctly extract the text from them. To detect these complex layouts, we train a CV layout model using Layout Parser (layout-parser.github.io), a unified toolkit for deep learning-based document image analysis.
The next step is an NLP component for information extraction. We use Named Entity Recognition (NER) with a pretrained multilingual BERT model, which we fine-tune on our dataset with custom labels.
The last step in our processing pipeline is the information validation. We calculate semantic similarity between word phrases using output feature vectors from the BERT model and mean pooling. Our model achieves more than 80% accuracy on skills extraction.
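A hedged sketch of this mean-pooling similarity step, assuming a multilingual BERT checkpoint (the exact model and the phrases here are illustrative):

```python
# Hedged sketch of the validation step: mean-pool BERT output vectors over
# the attention mask and compare phrases by cosine similarity.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(text: str) -> torch.Tensor:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state          # (1, seq_len, dim)
    mask = enc["attention_mask"].unsqueeze(-1)           # (1, seq_len, 1)
    return (hidden * mask).sum(1) / mask.sum(1)          # mean pooling

a, b = embed("project management"), embed("Projektleitung")
print(torch.cosine_similarity(a, b).item())
```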
Classification of long documents is still a domain for classical machine learning techniques such as TF-IDF or BM25 with Support Vector Machines. Transformers and LSTMs do not scale well with the document length at training and inference time. For patents, this is a critical handicap since the key innovation is often described towards the end of the patent description, which varies in structure and length and can be relatively long.
Furthermore, because the class ontology for patents is very deep, specific classification can only be performed by looking at the differences that might be named in any part of the document. Therefore, it is advantageous to process the whole patent and not only specific parts.
We investigate hierarchical approaches that break down documents into smaller parts, as well as other heuristics such as summarization and hotspot detection, for BERT and PatentBERT, and compare them to classical methods. The dataset was downloaded from the European Patent Office (EPO).
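A simplified sketch of the hierarchical idea, with `classify_chunk()` standing in for a fine-tuned (Patent)BERT classifier and the chunking parameters chosen for illustration only:

```python
# Simplified sketch of the hierarchical idea: split a long patent into
# overlapping chunks, classify each chunk, and aggregate the chunk scores.
from collections import defaultdict

def chunk_text(words, size=400, overlap=50):
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def classify_chunk(chunk: str) -> dict[str, float]:
    raise NotImplementedError("return class probabilities from a fine-tuned model")

def classify_patent(text: str) -> str:
    scores = defaultdict(float)
    chunks = chunk_text(text.split())
    for chunk in chunks:
        for label, prob in classify_chunk(chunk).items():
            scores[label] += prob / len(chunks)          # average over chunks
    return max(scores, key=scores.get)
```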
Automatic Speech Recognition (ASR) has numerous applications, including chatbots, transcription of meetings, subtitling of TV shows, or automatic translation of conference presentations. For this reason, Speech-to-Text (STT) is a very active field of research, and tremendous progress has been made in the past years, in particular by using pretrained language models such as wav2vec and its derivatives. On the other hand, several ready-to-use solutions exist, from international corporations such as Google or IBM, to specialized providers such as Trint or Speechmatics, to open-source frameworks such as Fairseq or DeepSpeech.
But how do you find the “best” ASR engine? Grounded decisions in this respect typically require an in-depth comparison of the performance of ASR engines on various annotated corpora. In order to simplify this process, we have developed a framework that allows to easily run and evaluate benchmarks on arbitrary ASR engines.
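The core of such a benchmark can be sketched in a few lines, with `transcribe()` standing in for whichever ASR engine is under test; this is an illustration, not the framework itself.

```python
# Minimal sketch of the benchmarking core: run an ASR engine over an annotated
# corpus and report the corpus-level word error rate with jiwer.
import jiwer

def transcribe(audio_path: str) -> str:
    raise NotImplementedError("call the ASR engine under test")

def benchmark(samples: list[tuple[str, str]]) -> float:
    """samples: list of (audio_path, reference_transcript) pairs."""
    references = [ref for _, ref in samples]
    hypotheses = [transcribe(path) for path, _ in samples]
    return jiwer.wer(references, hypotheses)
```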
In this presentation, we introduce the framework itself as well as insights from our research based on extensive benchmark experiments with various ASR engines. Among other things, we answer the following questions: How well do ASR engines perform on different types of speech, e.g. spontaneous vs. read-aloud? Can you combine several engines to achieve better results? How can you automatically distinguish between minor errors (e.g. singular vs. plural) and semantically significant errors (e.g. “cat” instead of “car”)?
Large Language Models (LLMs) have led to large improvements in state-of-the-art results across language and image understanding and generation tasks. However, while they are largely capable to produce human-quality text in terms of grammaticality and, increasingly, coherence and relevance, it is often hard to distinguish whether the output of such models is grounded in actual knowledge or hallucinated. This talk will describe some recent work for knowledge infusion with the aim of improving the factuality of LLMs for question-answering tasks, as well as retrieval-based approaches leveraging LLMs for the same purpose.
Social media listening (SML) has the potential to help in many stages of the drug development process in the quest for patient-centric therapies that are fit-for-purpose and meaningful to patients. To fulfill this potential, however, it requires the leveraging of new quantitative approaches and analytical methods that draw from developments in NLP and real-world data (RWD) analysis applied to the real-world text (RWT) of social media. These approaches can be described under the umbrella term of quantitative SML (QSML) to distinguish them from the qualitative methods that have been commonly used. In this talk, I will describe what QSML is, why it is used and how it can support drug development, as well as ethical and legal considerations.
In my talk, I will discuss the issue of sparsity and separation of linguistic resources, showing how it can be overcome by following the practices developed by the Linguistic Linked Open Data (LLOD) community. After introducing the principles of the Linked Data paradigm, I will report a number of benefits of applying such principles to linguistic resources, also in order to meet the FAIR guiding principles for scientific data management. A use case of resources currently published as LLOD will then be presented, namely the LiLa Knowledge Base, i.e. a collection of multifarious linguistic resources for Latin described and interlinked with the same vocabulary of knowledge description, using common data categories and ontologies. The talk will detail the architecture of LiLa, whose core component consists of a large collection of Latin lemmas, serving as the backbone to achieve interoperability between the resources by linking all those entries in lexical resources and tokens in corpora that point to the same lemma. In particular, the talk will focus on how lexical and textual resources are interlinked in the Knowledge Base. Three online services developed by LiLa will be presented as well, namely: a user-friendly interface to query the (meta)data interlinked in the Knowledge Base, the SPARQL endpoint of LiLa, and a tool to automatically link a raw Latin text to LiLa. Finally, the talk will discuss a number of challenges and open issues concerning interoperability between linguistic resources in infrastructures like CLARIN.