I build software prototypes for machine learning applications (sometimes with the help of very proficient people), mostly to solve Information Retrieval, Text Mining and Natural Language Processing tasks. The software described on this page was presented in various national and international scientific evaluation campaigns.
Until recently, I mostly coded in Perl and Java and used SVM Lib, BoosTexter, CRF++ and Weka for machine learning purposes. I am currently working with Python, scikit-learn, TensorFlow and Keras. When I work in an academic context, we try as much as possible to make the experimental code available online (through GitHub) for reproducibility purposes.
Numerous other applications were developed in industrial contexts with my teams, using ML technologies like Markov Logic Networks and Perceptrons, and frameworks like Solr, Lucene, Elastic or Hadoop 🙂
Evaluation Campaigns
| Year | Task | Result |
|------|------|--------|
| 2013 | Collaborative tagging | 1st system (task 1) |
| 2008 | Classification | 1st system |
| 2007 | Classification | 2nd system (student) |
| 2014 | Entity Linking | 10th system |
| 2013 | Entity Linking | 13th overall / 3rd no-wiki |
| 2011 | Co-reference | 9th system (after metric corrections decided by the revision committee in Dec. 2013; updated results to appear on the website and in a future paper, see metric updates and news here) |
| 2008 | Named Entities | 1st system (3 tasks / 4) |
Software
- Text Miner for DEFT 2013: The DEFT challenge is an annual French-speaking text mining evaluation campaign. Its 9th edition focused on the automatic analysis of recipes in French. This system obtained the best results on task 1 and the second-best results on task 2 of the DEFT 2013 campaign.
- SemLinker: a system built for the NIST TAC KBP 2013 evaluation campaign. SemLinker is an experimental platform intended to study and solve various aspects of semantic annotation. An improved version developed by the CSFG team was deployed for the NIST 2014 evaluation.
- Poly-co: a co-reference solver. The system integrates a multilayer perceptron classifier in a pipeline approach: heuristics select the pairs of coreference candidates fed to the network for training, together with our feature selection method. The features are based on similarity and identity measures, filtering information such as gender and number, and other syntactic information. Evaluated in the CoNLL 2011 Shared Task; a minimal sketch of this kind of pipeline is given below.
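For illustration, here is a minimal sketch of such a pair-based pipeline using scikit-learn's MLPClassifier. The `Mention` fields, feature set and compatibility heuristic are simplified assumptions made for this example, not Poly-co's actual code or features:

```python
# Minimal sketch of a pair-based coreference pipeline with an MLP classifier.
# All names, features and heuristics are illustrative, not Poly-co's own.
from dataclasses import dataclass
from sklearn.neural_network import MLPClassifier

@dataclass
class Mention:
    text: str
    head: str       # syntactic head word
    gender: str     # "m", "f" or "unknown"
    number: str     # "sg", "pl" or "unknown"
    position: int   # index of the mention in the document

def compatible(a: Mention, b: Mention) -> bool:
    """Heuristic filter: discard pairs whose gender or number disagree."""
    ok_gender = "unknown" in (a.gender, b.gender) or a.gender == b.gender
    ok_number = "unknown" in (a.number, b.number) or a.number == b.number
    return ok_gender and ok_number

def features(a: Mention, b: Mention) -> list:
    """Similarity / identity features for one candidate pair."""
    return [
        float(a.text.lower() == b.text.lower()),  # exact string identity
        float(a.head.lower() == b.head.lower()),  # same syntactic head
        float(a.gender == b.gender),              # gender agreement
        float(a.number == b.number),              # number agreement
        abs(a.position - b.position),             # mention distance
    ]

def candidate_pairs(mentions):
    """Antecedent/anaphor pairs that pass the heuristic filter."""
    for j, anaphor in enumerate(mentions):
        for antecedent in mentions[:j]:
            if compatible(antecedent, anaphor):
                yield antecedent, anaphor

# Toy document with three mentions.
mentions = [
    Mention("Marie", "Marie", "f", "sg", 0),
    Mention("the director", "director", "unknown", "sg", 1),
    Mention("she", "she", "f", "sg", 2),
]
X = [features(a, b) for a, b in candidate_pairs(mentions)]
# Toy gold labels, one per generated pair: here "she" refers to
# "Marie" but "the director" is a different entity.
y = [0, 1, 0]

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X, y)
print(clf.predict(X))
```

In a real system of this kind, the gold labels typically come from annotated coreference chains, and the classifier's pairwise decisions are then clustered into entities.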
Deprecated or no longer available (I plan to make the code and the resources available on GitHub soon):
- Wikimeta: The Wikimeta platform was the achievement of 4 years of research in the fields of machine learning, information extraction, natural language processing and semantic annotation. It provided a high-quality information extraction engine, including high-level text mining with unique functionality. Wikimeta's performance was evaluated on standard corpora, and in scientific evaluation campaigns with state-of-the-art metrics. Its Named Entity Recognition module is derived from the one used by the LIA team in the ESTER 2 evaluation campaign for the Named Entity Recognition task, which obtained the best overall performance. Wikimeta's life cycle ended with the TAC 2014 KBP evaluation campaign, where it was used for the last time. During a one-year period in 2013, Wikimeta was marketed by the Wikimeta Ltd startup as an API and was used in numerous production environments (including by Dailymotion).
- NLGbAse: NLGbAse is an architecture that produces metadata and components devoted to Natural Language Processing and to semantic analysis and labeling tasks, from Wikipedia content. NLGbAse transforms encyclopedic text contents into structured knowledge fully integrated with the Linked Data network and the Semantic Web. The last version of NLGbAse was published in 2013 for the NIST KBP evaluation campaign and ended its life cycle. The code is written in Perl.