Session Big Data

Schedule: November 29

Keynote Big data

Speaker: Jim Walker, Hortonworks
Schedule: Thursday Nov 29, 12:00 - 12:15pm

SpagoBI and Big Data: next Open Source Information Management suite

Speaker: Monica Franceschini, Engineering
Schedule: Thursday Nov 29, 12:15 - 12:30pm
Abstract: Organizations adopt Business Intelligence tools to analyze tons of data; nonetheless, several business leaders do not have the information they actually need. This happens because the information management scenario is evolving. Alongside structured information, which is supported by well-known processes, tools and practices, various new kinds of content are emerging, including information coming from social computing. This new content will be managed by disparate processes, fragmented tools and new practices, and it will combine with the various contents of enterprise systems: documents, transactional data, databases and data warehouses, images, audio, texts and videos. This huge amount of content is called “big data”, even though the term is not just about a large amount of data. It refers to the capability of managing data that grow along three dimensions (volume, velocity and variety) while preserving the simplicity of the user interface. The talk describes the SpagoBI approach to the “big data” scenario and presents the SpagoBI suite roadmap, which is two-fold. It aims to address existing emerging analytical areas and domains, providing the suite with new capabilities (including big data and open data support, in-memory analysis, real-time and mobile BI), and it follows a research path towards the realization of a new generation of the SpagoBI suite.

Talend: The Big Challenge of Big Data and Hadoop Integration

Speaker: Cédric Carbone, Talend
Schedule: Thursday Nov 29, 12:30 - 12:45pm
Abstract: Enterprises can't close their doors just because integration tools won't cope with the volume of information that their systems produce. As each day goes by, their information becomes larger and more complicated, and enterprises must constantly struggle to manage the integration of dozens (or hundreds) of systems. Apache Hadoop has quickly become the technology of choice for enterprises that need to perform complex analysis of petabytes of data, but few are aware of its potential to handle large-scale integration work. By using effective tools, integrators can process the required complex transformation, synchronization, and orchestration tasks in a high-performance, low-cost, infinitely scalable way. In this talk, Cédric Carbone will discuss how Hadoop can be used to integrate disparate systems and services, and will demonstrate the process of designing and deploying common integration tasks.

BPMconseil: Using Vanilla to manage Hadoop database

Speaker: Patrick Beaucamp, Bpm-Conseil
Schedule: Thursday Nov 29, 12:45 - 01:00pm
Abstract: This presentation will demo how to use Vanilla to read and write data in a Hadoop database, using big data databases such as HBase or Cassandra, along with the Hadoop-ready Solr/Lucene search engine embedded in Vanilla to run clustered searches on Hadoop data.

PKU: Tracking code evolution for open source universe

Speaker: Minghui Zhou, Peking University
Schedule: Thursday Nov 29, 02:00 - 02:15pm
Abstract: The large body of existing OSS artifacts provides abundant material for understanding how code is reused in the open source universe: in particular, which pieces of code are reused most, under what circumstances people reuse code, and so forth. Understanding this process could help with legacy software maintenance and help to explore best practices in software development. Targeting the change history data of thousands of open source projects, we try to answer the following questions. First, how is code reused by other projects? Second, how are code files organized in a project, and how does this organizational structure change over time? To answer these questions, we have to overcome several technical difficulties. For example, because there are different kinds of version control systems (VCSs), it is hard to devise a uniform model that can represent the evolution of the code files stored in them. Also, each VCS may have its own data format, so extracting data from them is a big challenge. Furthermore, using current algorithms and hardware platforms to analyze the version iteration and reuse information of about a billion code files is another challenge.