INFORMATION
Tel : 06 83 10 82 97
e-mail: mdr at irif.fr
Michel de Rougemont

Seminars

DATABASES and DATASCIENCE

2019-2020

  • Wednesday 24 June 2020, 10h-12h, BBB platform
  • Wednesday 18 December 2019, CRED, 14h-17h

2018-2019

  • Tuesday 21 May 2019, CRED, 14h-17h
    • 14h, N. Spyratos, University Paris South. Data Exploration in the HIFUN Language (ongoing work)

Abstract. When big data sets are stored in databases and data warehouses, data exploration usually involves ad hoc querying and data visualization to identify potential relationships or insights that may be hidden in the data. The objective of this work is to provide support for these activities in the context of HIFUN, a high-level functional language of analytic queries proposed recently by the authors. Our contributions are: (a) we show that HIFUN queries can be partially ordered, which allows the analyst to drill down or roll up from a given query during data exploration, and (b) we introduce a visualization algebra that allows the analyst to create desirable visualizations of query results.

  • 14h45, D. Laurent, Université de Cergy-Pontoise. Four-Valued Semantics for Data Integration under the OWA: a Deductive Database Approach

In this work, we consider the standard scenario whereby several data sources are to be integrated into a single database. In this scenario, all data sources are seen as consistent sets of facts meant to be either true or false. Integrating such data sources means for us combining their content so as to build a consistent database over which queries are answered using rules a la Datalog. In this scenario four cases are possible for a candidate fact ϕ: (1) all data sources that contain ϕ state that ϕ is true, (2) all data sources that contain ϕ state that ϕ is false, (3) some data sources containing ϕ state that ϕ is true and the other data sources containing ϕ state that ϕ is false, (4) no data source contains ϕ.

The most intuitive and well-known way to take this situation into account is the four-valued logic introduced by Belnap, where the four truth values are t (expressing truth), f (expressing falsity), b (expressing contradiction) and n (expressing absence of knowledge).

In our approach, the result of an integration for a fact ϕ is a pair (ϕ, v) where v is one of the truth values listed above, thus reflecting the OWA in the sense that the absence of information does not mean falsity. The integrated database D is then a pair (E, R) where E is a set of pairs such that the truth value is different from n and where R is a set of generalized DatalogNeg rules, that is, rules whose body is a set of positive or negative atoms and whose head is a positive or negative atom. Since these rules allow deriving both true facts and false facts, contradictory derivations are possible and are taken into account in our semantics.
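The integration step described above, before any rule of R is applied, can be sketched as follows. This is only an illustration under stated assumptions: the encoding of a source as a Python dict from facts to True/False, and the names `integrate` and `extensional_part`, are hypothetical and do not come from the talk.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical encoding: a source maps each fact it contains to True or False.
Source = Dict[str, bool]


def integrate(fact: str, sources: List[Source]) -> str:
    """Return the Belnap truth value (t, f, b or n) of `fact` across the sources."""
    votes = {src[fact] for src in sources if fact in src}
    if not votes:
        return "n"  # case (4): no source contains the fact
    if votes == {True}:
        return "t"  # case (1): every source containing the fact says true
    if votes == {False}:
        return "f"  # case (2): every source containing the fact says false
    return "b"      # case (3): the sources disagree


def extensional_part(sources: List[Source]) -> Set[Tuple[str, str]]:
    """Build the set E of pairs (fact, value) with value different from n.

    Only facts mentioned by at least one source are considered, so the
    value n never appears in the result (the OWA leaves it implicit).
    """
    facts: Set[str] = set()
    for src in sources:
        facts.update(src)
    return {(fact, integrate(fact, sources)) for fact in facts}
```

For instance, with two sources `{"p": True, "r": True}` and `{"p": True, "r": False}`, the fact `p` is integrated as `t` and `r` as `b`, while any fact absent from both sources has value `n` and is not stored in E.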

In this setting, we show that every database D = (E, R) has minimal models (with respect to set inclusion) that are in general not unique, and we define an operator (inspired by the immediate consequence operator of Datalog) for computing one of these minimal models, which we call the semantics of D. Our choice is based on the intuition that, for an implication to hold, whenever possible, truth (resp. contradiction) should imply truth (resp. contradiction).

  • 15h30. Achraf Lassoued, Univ. Paris II. Topic modelling with LDA and social networks.
  • 16h15. Michel de Rougemont, Univ. Paris II and IRIF. Some useful decompositions.
  • Tuesday 27 November 2018: CRED, 14h-17h
    • 14h, Nicolas Spyratos, "DROLL: A Formal Framework for Visualizing and Exploring Analytic Query Results"

In recent work we introduced HIFUN, a functional query language for studying analytic queries in the abstract, and provided algorithms for mapping HIFUN queries to queries in lower-level evaluation mechanisms (SQL, SPARQL, MapReduce platforms, etc.). In our current work we show that the HIFUN queries form a complete lattice and we use this lattice to propose a formal framework that allows a user to perform three operations:

  - visualize: focus on a query Q, evaluate it and visualize its result or part thereof
  - drill-down: move to a query more informative (finer) than Q
  - roll-up: move to a query less informative (coarser) than Q

 A sequence of the above operations is called a visual exploration and the set of all visual explorations is the DROLL language. 
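The three operations can be illustrated with a toy model. The sketch below is not HIFUN or DROLL: it assumes, purely for illustration, that a query is identified by its set of grouping attributes, that "finer" means "groups by more attributes", and that the result is a sum of one measure per group. All names (`visualize`, `drill_down`, `roll_up`, `is_finer`) are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, FrozenSet, List, Tuple

Row = Dict[str, Any]
Query = FrozenSet[str]  # assumption: a query is its set of grouping attributes


def visualize(rows: List[Row], query: Query, measure: str) -> Dict[Tuple, float]:
    """Evaluate the query: total of `measure` for each group of `query`."""
    attrs = sorted(query)  # fixed attribute order for the group keys
    totals: Dict[Tuple, float] = defaultdict(float)
    for row in rows:
        key = tuple(row[a] for a in attrs)
        totals[key] += row[measure]
    return dict(totals)


def drill_down(query: Query, attr: str) -> Query:
    """Move to a finer query by grouping on one more attribute."""
    return query | {attr}


def roll_up(query: Query, attr: str) -> Query:
    """Move to a coarser query by dropping one grouping attribute."""
    return query - {attr}


def is_finer(q1: Query, q2: Query) -> bool:
    """In this toy model the order is set inclusion, which forms a lattice."""
    return q1 >= q2
```

A visual exploration is then a sequence such as: evaluate the query grouping by `store`, drill down by adding `product`, evaluate again, roll up by removing `store`, and so on.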
  • 14h45-15h. Guillaume Vimont (PhD candidate): The Content Correlation of Streaming Edges (15 mins)
  • 15h15-16h. Michel de Rougemont: Recommendation Systems.

Abstract: Let A be a large m × n matrix whose entry (i, j) specifies whether customer i likes product j. Assume A is close to a low-rank matrix and that we only know 1% of the values of A; the others are "null". Can we recover A? More precisely, can we sample an unknown value of a row A(i)? This is precisely what Amazon and Netflix do! In Data Science, knowing only a tiny fraction of the data is enough, in contrast to Databases.
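To see why a low-rank assumption makes recovery possible at all, consider the simplest special case, which is an assumption of this sketch and not the method of the talk: A has rank exactly 1 and no noise. Then every 2 × 2 minor vanishes, so A(i,j) · A(k,l) = A(i,l) · A(k,j), and an unknown entry follows from three known ones. The function name `recover_rank_one_entry` is hypothetical.

```python
from typing import List, Optional

# None stands for a "null" (unknown) entry.
Matrix = List[List[Optional[float]]]


def recover_rank_one_entry(A: Matrix, i: int, j: int) -> Optional[float]:
    """Recover A[i][j] assuming A has rank exactly 1.

    Looks for a row k and a column l such that A[i][l], A[k][j] and
    A[k][l] are known and A[k][l] != 0, then uses the vanishing minor:
    A[i][j] = A[i][l] * A[k][j] / A[k][l].
    Returns None when no such triple of known entries exists.
    """
    if A[i][j] is not None:
        return A[i][j]
    for k, row in enumerate(A):
        if k == i or row[j] is None:
            continue
        for l, pivot in enumerate(row):
            if l == j or pivot is None or pivot == 0 or A[i][l] is None:
                continue
            return A[i][l] * row[j] / pivot
    return None
```

Real recommendation data is only close to low rank, so practical methods rely on approximate techniques rather than this exact identity.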

2017-2018

DATABASES and DATASCIENCE

  • Monday 18 December 2017: CRED, 14h-17h
    • 14h, Nicolas Spyratos, Query rewriting for MapReduce jobs and Group-by queries.
    • 15h, Dominique Laurent, Mining Multi-Dimensional and Multi-Level Sequential Patterns.

Since panel data on the one hand, and multi-dimensional and multi-level sequential patterns on the other hand, deal with data with similar features, I’ll present previous work on sequential pattern mining. The question is then to see how these frequent patterns could help in achieving better predictions in the context of panel data.

  • 16h, Michel de Rougemont, Prediction & Integration of Time Series.

I'll review basic techniques to make predictions in the context of panel data. I will then argue for the integration of panel data with external information. Twitter can be viewed as a global filter which aggregates the outside world and provides such external information. The challenge is to show how to achieve better predictions. A case study is the time series of cryptocurrencies such as Bitcoin or Ripple, correlated with #Bitcoin, #Ripple and #xrp on Twitter.

2016-2017

DATABASES versus DATASCIENCE

  • Wednesday 31 May 2017: room 06 at Assas, 13h15-15h30
    • 13h15-14h, N. Spyratos. Formal Approach to Rewriting Map-Reduce Jobs and SQL Group-by Queries.

We use the HiFun language presented in our previous meeting as a common formal framework for studying the rewriting of analytic queries, whether they are expressed as map-reduce jobs or as SQL Group-by queries. To do this, we first show that there are 1-1 mappings between HiFun and map-reduce jobs on the one hand, and between HiFun and SQL Group-by queries on the other. We then derive a rewriting scheme that works as follows:

  1. We abstract the map-reduce job or the group-by query as a HiFun query.
  2. We rewrite the HiFun abstraction (using the rewriting rules of HiFun).
  3. We encode the rewritten HiFun abstraction in map-reduce or in SQL.
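The idea of one abstract query with two encodings can be sketched as follows. This is an illustration, not HiFun itself: the `AnalyticQuery` triple (grouping attribute, measured attribute, reduction operation) and the functions `to_sql`, `run_map_reduce` and `run_sql` are assumptions made for this example, and the rewriting step (step 2) is not shown because the rewriting rules are not given here. SQLite from the Python standard library stands in for the SQL engine.

```python
import sqlite3
from collections import defaultdict
from typing import Any, Dict, List, NamedTuple

Row = Dict[str, Any]


class AnalyticQuery(NamedTuple):
    """Hypothetical abstraction of an analytic query."""
    grouping: str  # attribute to group by
    measure: str   # attribute to aggregate
    op: str        # one of "SUM", "MIN", "MAX", "COUNT"


REDUCERS = {"SUM": sum, "MIN": min, "MAX": max, "COUNT": len}


def to_sql(q: AnalyticQuery, table: str) -> str:
    """Encode the abstract query as a SQL Group-by query.

    Identifiers are interpolated directly, so this is only safe for
    trusted attribute and table names.
    """
    return (f"SELECT {q.grouping}, {q.op}({q.measure}) "
            f"FROM {table} GROUP BY {q.grouping}")


def run_map_reduce(q: AnalyticQuery, rows: List[Row]) -> Dict[Any, Any]:
    """Encode the abstract query as a map, shuffle and reduce pipeline."""
    groups: Dict[Any, List[Any]] = defaultdict(list)
    for row in rows:
        # map: emit (grouping value, measure value); shuffle: collect by key
        groups[row[q.grouping]].append(row[q.measure])
    reduce_fn = REDUCERS[q.op]
    return {key: reduce_fn(values) for key, values in groups.items()}  # reduce


def run_sql(q: AnalyticQuery, rows: List[Row], table: str = "t") -> Dict[Any, Any]:
    """Load the rows into an in-memory SQLite table and run the SQL encoding."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} ({q.grouping}, {q.measure})")
    conn.executemany(
        f"INSERT INTO {table} VALUES (?, ?)",
        [(r[q.grouping], r[q.measure]) for r in rows],
    )
    result = dict(conn.execute(to_sql(q, table)).fetchall())
    conn.close()
    return result
```

On data without null values, both encodings of the same `AnalyticQuery` return the same answer, which is the property a rewriting scheme relies on: a rewrite performed on the abstraction carries over to either target.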

  • 14h-14h45, D. Laurent. Update Semantics for Databases satisfying Tuple Generating Dependencies

In many cases, satisfying tuple-generating dependencies (TGDs) requires allowing null values in the database instance. Whereas query answering in this framework has been the subject of significant research work (for instance in the domain of graph databases with blank nodes or in the domain of data transfer), the problem of updates is still open. In this presentation, we first motivate our work with examples and then we outline our approach for inserting or deleting sets of tuples in this framework.

  • 14h45-15h30, M. de Rougemont. Approximate integration of streaming graph edges

For a stream of graph edges from a Social Network, we approximate the communities as the large connected components of the edges kept in a reservoir sample, without storing all the edges of the graph. We show that for a model of random graphs which follows a power law degree distribution, the community detection algorithm is a good approximation. Given two streams of graph edges from two Sources, we define the Community Correlation as the fraction of the nodes in communities in both streams. Although we do not store the edges of the streams, we can approximate the Community Correlation and define the Integration of two streams. We illustrate this approach with Twitter streams. We then study the extension to spanning trees.
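A minimal sketch of this pipeline is given below, under several assumptions not stated in the abstract: the reservoir uses the classical uniform scheme (Algorithm R), "large" means at least `min_size` nodes, and the Community Correlation is normalised by the number of nodes that belong to a community in at least one stream. The function names are hypothetical.

```python
import random
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def reservoir_sample(stream: Iterable[Edge], k: int, rng: random.Random) -> List[Edge]:
    """Keep a uniform sample of at most k edges from a stream (Algorithm R)."""
    reservoir: List[Edge] = []
    for t, edge in enumerate(stream):
        if t < k:
            reservoir.append(edge)
        else:
            j = rng.randint(0, t)  # uniform over 0..t, both ends included
            if j < k:
                reservoir[j] = edge
    return reservoir


def large_components(edges: Iterable[Edge], min_size: int) -> List[Set[Hashable]]:
    """Connected components with at least min_size nodes, via union-find."""
    parent: Dict[Hashable, Hashable] = {}

    def find(x: Hashable) -> Hashable:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)

    groups: Dict[Hashable, Set[Hashable]] = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return [g for g in groups.values() if len(g) >= min_size]


def community_correlation(c1: List[Set[Hashable]], c2: List[Set[Hashable]]) -> float:
    """Fraction of community nodes common to both streams (assumed normalisation)."""
    n1 = set().union(*c1) if c1 else set()
    n2 = set().union(*c2) if c2 else set()
    union = n1 | n2
    return len(n1 & n2) / len(union) if union else 0.0
```

Each stream is processed with its own reservoir of k edges, so the memory used depends on k and not on the length of the stream.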

  • Monday 27 February: 14h-17h at CRED: Centre de Recherche en Economie du Droit, http://cred.u-paris2.fr/ 21, Rue Valette (next to the Panthéon), 75005 PARIS
    • 14h-14h40, Nicolas Spyratos (University Paris South): Big Data Analytics - From SQL to Hadoop and beyond.

I will present HiFun, a high-level functional query language for data analytics, and will show how one can use this language to give the formal definition of a recursive analytic query, as well as of result visualization and visual result exploration. This is joint work with Tsuyoshi Sugibuchi of Customer Matrix.

  • 14h40-15h20, T. S. (Buchi): TBD
  • 15h20-16h, Dominique Laurent (University Cergy-Pontoise, joint work with Mirian Halfeld Ferrari Alves (LIFO)): On Updates of RDF(S) Databases

Among the new database models known as NoSQL, RDF(S) is one of the most widespread. This model is therefore the subject of much research, in particular regarding its semantics and the associated query languages. However, updating RDF(S) databases raises new challenges tied to the model. The talk will be devoted to our current work on updates in the context of deductive databases, which generalizes that of RDF(S) databases. After presenting the main characteristics of our approach, we will show how it applies to the case of RDF(S) and then discuss some possible extensions of our work.

  • 16h-16h40, Michel de Rougemont (University Paris II and IRIF): Uncertainty in Data: probabilistic databases versus probabilistic algorithms.

Uncertainty can be treated in many different ways. I'll review the recent results of D. Suciu (VLDB 2012) on probabilistic databases and contrast the approach with approximate randomized algorithms which guarantee the quality of the approximation.

UP2