About

The project “KnEDLe - Knowledge Extraction from Documents of Legal content” is a partnership among FAPDF (Fundação de Apoio à Pesquisa do Distrito Federal), UnB (the University of Brasília) and Finatec (Fundação de Empreendimentos Científicos e Tecnológicos), sponsored by FAPDF. The project takes official publications as its research object and extracts knowledge from them. Its objective is to develop intelligent tools that extract structured information from such publications, in order to facilitate the search and retrieval of information, increase government transparency, support audit tasks and help detect problems related to the use of public resources.

Official publications such as the Diário Oficial do Distrito Federal (DODF) are sources of information on all official government acts. Although these documents are rich in knowledge, manual analysis by specialists is complex and becomes unfeasible given the growing volume of documents, which results from the high frequency of publications in this communication vehicle of the Distrito Federal Government (GDF).



This scenario is well suited to computational techniques based on text mining and information visualization, which aim to discover implicit and relevant knowledge in large textual data sets. These techniques typically expect data in a structured format. However, DODF editions are originally published in an unstructured format and in natural language, so the text must first be prepared and adapted before the techniques can be applied.
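As a rough illustration of the gap between unstructured gazette text and the structured input that text mining techniques expect, the sketch below turns free text into typed records using a regular expression. It is not the project's extraction method: the `Act` schema, the `ACT_HEADING` pattern, the `extract_acts` function and the sample text are all hypothetical and were invented for this example.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Act:
    """One structured record recovered from free text (hypothetical schema)."""
    kind: str
    number: int
    day: int
    month: str
    year: int


# Hypothetical pattern for headings shaped like
# "PORTARIA Nº 123, DE 4 DE MAIO DE 2020"; real editions may differ.
ACT_HEADING = re.compile(
    r"(?P<kind>PORTARIA|DECRETO)\s+Nº\s+(?P<number>[\d.]+),\s+"
    r"DE\s+(?P<day>\d{1,2})º?\s+DE\s+(?P<month>[A-ZÇ]+)\s+DE\s+(?P<year>\d{4})"
)


def extract_acts(text: str) -> List[Act]:
    """Return one Act per heading found in the text, in order of appearance."""
    acts = []
    for match in ACT_HEADING.finditer(text):
        acts.append(
            Act(
                kind=match["kind"],
                # Drop the thousands separator before converting, e.g. "40.512".
                number=int(match["number"].replace(".", "")),
                day=int(match["day"]),
                month=match["month"],
                year=int(match["year"]),
            )
        )
    return acts


# Invented sample text, not taken from a real DODF edition.
SAMPLE = (
    "PORTARIA Nº 123, DE 4 DE MAIO DE 2020\n"
    "O SECRETARIO, no uso de suas atribuicoes, resolve ...\n"
    "DECRETO Nº 40.512, DE 1º DE JUNHO DE 2020\n"
    "O GOVERNADOR decreta ...\n"
)

acts = extract_acts(SAMPLE)
for act in acts:
    print(act)
```

Once the text is in this form, each record can be filtered, counted or joined with other data, which is what makes search, retrieval and auditing tractable. A hand-written pattern like this one is brittle, so it should be read only as a picture of the target representation.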

Publications

Inferring the source of official texts: can SVM beat ULMFiT?

• Pedro H. Luz de Araujo • Teófilo E. de Campos • Marcelo M. Silva de Sousa

PROPOR 2020 - source code and resources available

"Official Gazettes are a rich source of relevant information to the public. Their careful examination may lead to the detection of frauds and irregularities that may prevent mismanagement of public funds. This paper presents a dataset composed of documents from the Official Gazette of the Federal District, containing both samples with document source annotation and unlabeled ones. We train, evaluate and compare a transfer learning based model..."

Victor: a Dataset for Brazilian Legal Documents Classification

• Pedro Henrique Luz de Araujo • Teófilo Emídio de Campos • Fabricio Ataides Braz • Nilton Correia da Silva

Language Resources and Evaluation Conference 2020

This paper describes VICTOR, a novel dataset built from Brazil’s Supreme Court digitalized legal documents, composed of more than 45 thousand appeals, which includes roughly 692 thousand documents—about 4.6 million pages.

KnEDLe/NIDO Project Partial Technical Report 1

• Teófilo E. de Campos • Thiago de Paulo Faleiros • Vinícius R. P. Borges • Isaque Alves • Carolina Alves Okimoto

Partial Technical Report

This document presents a summary of the results produced in the KnEDLe Research Project - Extraction of Information from Official Publications using Artificial Intelligence (also known as NIDO). In particular, it is a report based on the activities and results produced in the first phase (Release) of the Project, from 01/01/2020 to 06/30/2020.

The information bottleneck theory of deep learning

MSc Qualification Exam
Author • Frederico Guth
Supervisor • Teófilo E. de Campos, University of Brasilia
Committee • John Shawe-Taylor (UCL) • Moacir Ponti (USP) • Thiago de Paulo Faleiros (alternate member, UnB)

Text, slides and other resources are available from https://cic.unb.br/~teodecampos/fred_guth/
The meeting was held using MS Teams, from 10am Brasilia time (2pm London time), on Monday 13th July 2020.

From Documents to Entities: A journey through Natural Language Processing tasks and domains

MSc Qualification Exam
Author • Pedro Luz
Supervisor • Teófilo E. de Campos, University of Brasilia
Committee • Alexandre Rademaker (FGV and IBM) • Thiago de Paulo Faleiros (UnB)

Text, slides and other resources are available from https://cic.unb.br/~teodecampos/peluz/
The meeting was held using MS Teams, from 2pm Brasilia time, on Friday 7th August 2020.

Data Extraction Library



A library for extracting data from the PDF documents in which the Official Gazette of the Federal District is published.







Documentation Repository

Members


Partners