Sketch Engine
Logo of Sketch Engine | |
Sketch Engine concordance page | |
Original author(s) | Adam Kilgarriff, Pavel Rychlý |
---|---|
Developer(s) | Lexical Computing Ltd. |
Initial release | 23 July 2003[1] |
Development status | Active |
Written in | C++, Python, JavaScript, jQuery |
Operating system | Linux, Mac OS X |
Platform | IA-32, x64 or IA-64 |
Available in | 10 languages: English, Czech, Chinese (Traditional, Simplified), Gaeilge, Slovene, Croatian, Arabic, Spanish, French |
Type | Corpus manager, database management system |
License | Proprietary software; both commercial and freeware editions are available |
Website | www |
Standard(s) | Unicode |
Sketch Engine is a corpus manager and text analysis tool developed by Lexical Computing Limited since 2003. Its purpose is to enable people who study language behaviour (lexicographers, researchers in corpus linguistics, translators and language learners) to search large text collections using complex, linguistically motivated queries. Sketch Engine takes its name from one of its key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour.[2]
History of development
Sketch Engine is a product of Lexical Computing Limited, a company founded in 2003 by the lexicographer and research scientist Adam Kilgarriff.[3] He began collaborating with Pavel Rychlý, the developer of Manatee and Bonito and a computer scientist at the Natural Language Processing Centre at Masaryk University,[4] and introduced the concept of word sketches.
Since then, Sketch Engine has been commercial software. However, all the core features of Manatee and Bonito that had been developed by 2003 (and extended since then) are freely available under the GPL within the NoSketch Engine suite.[5]
Features
- Concordance search
- Word lists
- Collocation extraction
- Word sketches
- Corpus building and management
- Keyword extraction
- Terminology extraction (both monolingual and bilingual)
- Distributional Thesaurus
- Corpus comparison
- Parallel corpus facilities
- n-grams
- Diachronic analysis (Trends)[6]
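As an illustration of what collocation extraction involves, the sketch below scores the right-hand neighbours of a word with the logDice association measure, 14 + log2(2·f(x,y) / (f(x) + f(y))). Both the choice of logDice and the use of plain adjacent word pairs are assumptions made for this example; the feature list above does not name a measure, and word sketches are built from grammatical relations rather than simple adjacency. The function names and the sample text are hypothetical.

```python
import math
from collections import Counter


def log_dice(pair_freq, freq_x, freq_y):
    """logDice association score: 14 + log2(2 * f(x,y) / (f(x) + f(y)))."""
    return 14 + math.log2(2 * pair_freq / (freq_x + freq_y))


def collocations(tokens, node, min_freq=1):
    """Rank the words that directly follow `node`, best score first.

    Simplification (assumption): a "collocation" here is just an
    adjacent word pair, not a grammatical relation.
    """
    word_freq = Counter(tokens)
    bigram_freq = Counter(zip(tokens, tokens[1:]))
    scored = [
        (right, log_dice(count, word_freq[left], word_freq[right]))
        for (left, right), count in bigram_freq.items()
        if left == node and count >= min_freq
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy corpus.
tokens = "strong tea and strong coffee and strong tea".split()
ranked = collocations(tokens, "strong")
for word, score in ranked:
    print(f"{word}\t{score:.2f}")
```

Because the score depends only on the three frequencies and not on corpus size, values computed this way can be compared across corpora of different sizes, which is one reason such a measure suits lexicographic work.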
Architecture
Sketch Engine consists of three main components: an underlying database management system called Manatee, a web search front-end called Bonito, and a web interface for corpus building and management called Corpus Architect.[7]
Manatee
Manatee is a database management system devised specifically for indexing large text corpora efficiently. It is based on inverted indexing: for each word, it keeps an index of all positions at which that word occurs in the text. It has been used to index text corpora containing billions of words.[8]
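The idea of a positional inverted index can be sketched in a few lines. This is a toy illustration of the general technique only, not Manatee's actual data structures or API; the function names and the sample sentence are hypothetical.

```python
from collections import defaultdict


def build_index(tokens):
    """Map each word to the ascending list of positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(tokens):
        index[word].append(position)
    return dict(index)


def find_phrase(index, phrase):
    """Return the start positions at which the words of `phrase` occur in a row."""
    if not phrase:
        return []
    # Position sets for the second and later words allow constant-time lookups.
    later = [set(index.get(word, ())) for word in phrase[1:]]
    return [
        start
        for start in index.get(phrase[0], [])
        if all(start + offset in positions
               for offset, positions in enumerate(later, start=1))
    ]


# Hypothetical toy corpus.
tokens = "the cat sat on the mat and the cat slept".split()
index = build_index(tokens)
print(index["the"])                        # positions of "the"
print(find_phrase(index, ["the", "cat"]))  # starts of the phrase "the cat"
```

A lookup touches only the position lists of the words in the query, so its cost depends on how frequent those words are rather than on the total size of the corpus.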
Corpora indexed by Manatee are searched by formulating queries in the Corpus Query Language (CQL).[9]
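A CQL query describes a sequence of tokens, each given in square brackets as constraints on token attributes; for example, `[lemma="go"] [tag="N.*"]` asks for a form of "go" followed by a token whose tag starts with "N". The sketch below evaluates a very small subset of that idea over an in-memory token list. It is an assumption-laden toy, not the CQL implementation: it handles only `attr="regex"` conditions and empty brackets, it scans the tokens linearly instead of using an index, and the attribute names, tags and sample tokens are hypothetical (real tag values depend on the corpus tagset).

```python
import re

# One bracketed token pattern; values containing "]" are not supported here.
TOKEN_PATTERN = re.compile(r'\[([^\]]*)\]')
# One attr="regex" condition inside the brackets.
CONDITION = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def parse_query(query):
    """Turn '[a="x" & b="y"] []' into a list of {attribute: regex} dicts."""
    return [dict(CONDITION.findall(body)) for body in TOKEN_PATTERN.findall(query)]


def token_matches(token, conditions):
    """A token matches if every regex matches the whole attribute value."""
    return all(
        re.fullmatch(pattern, token.get(attr, "")) is not None
        for attr, pattern in conditions.items()
    )


def search(corpus, query):
    """Return the start positions of every token sequence matching the query."""
    patterns = parse_query(query)
    if not patterns:
        return []
    last_start = len(corpus) - len(patterns)
    return [
        start
        for start in range(last_start + 1)
        if all(token_matches(corpus[start + i], conditions)
               for i, conditions in enumerate(patterns))
    ]


# Hypothetical annotated tokens: (word, lemma, tag).
corpus = [
    {"word": w, "lemma": l, "tag": t}
    for w, l, t in [
        ("She", "she", "PRP"), ("goes", "go", "VBZ"), ("home", "home", "NN"),
        ("and", "and", "CC"), ("went", "go", "VBD"), ("to", "to", "TO"),
        ("school", "school", "NN"),
    ]
]
print(search(corpus, '[lemma="go"] [tag="N.*"]'))    # "goes home"
print(search(corpus, '[lemma="go"] [] [tag="NN"]'))  # "went to school"
```

An empty pair of brackets places no constraint on the token, which is how the second query skips over "to".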
Manatee is written in C++ and has an API available for a number of other programming languages, including Python, Java, Perl and Ruby. Recently, it was rewritten in Go for faster processing of corpus queries.[10]
Bonito
Bonito is a web interface for Manatee that provides access to corpus search. It is based on the client–server model, with Bonito as the client part. It is written in Python.[7]
Corpus Architect
Corpus Architect is a web interface providing corpus building and management features. It is written in Python.
Applications
Sketch Engine has been used for dictionary production by major publishing houses in Britain and elsewhere, for example on the Macmillan English Dictionary and by Dictionnaires Le Robert, Oxford University Press and Shogakukan. Four of the UK's five biggest dictionary publishers use Sketch Engine.[11]
See also
- SkELL – a free web service, based on Sketch Engine, for students and teachers of English
References
1. Companies House. Searched on the United Kingdom's registrar of companies (Company name: LEXICAL COMPUTING LIMITED or Company number: 04841901).
2. Kilgarriff, Adam; Baisa, Vít; Bušta, Jan; Jakubíček, Miloš; Kovář, Vojtěch; Michelfeit, Jan; Rychlý, Pavel; Suchomel, Vít (10 July 2014). "The Sketch Engine: ten years on". Lexicography (Springer Berlin Heidelberg) 1 (1): 7–36. doi:10.1007/s40607-014-0009-9. ISSN 2197-4292.
3. Adam Kilgarriff's home page
4. Natural Language Processing Centre, Masaryk University
5. NoSketch Engine
6. Kilgarriff, Adam; Herman, Ondřej; Bušta, Jan; Rychlý, Pavel; Jakubíček, Miloš (2015). "DIACRAN: a framework for diachronic analysis" (PDF). Corpus Linguistics 2015 (Lancaster: UCREL): 65–70.
7. Rychlý, Pavel (2007). "Manatee/bonito–a modular corpus manager" (PDF). 1st Workshop on Recent Advances in Slavonic Natural Language Processing (Masaryk University): 65–70.
8. Pomikálek, Jan; Jakubíček, Miloš; Rychlý, Pavel (2012). "Building a 70 billion word corpus of English from ClueWeb" (PDF). Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12) (Istanbul, Turkey: European Language Resources Association (ELRA)).
9. Corpus Query Language (CQL) documentation
10. Rychlý, Pavel; Rábara, Radoslav (2015). "Concurrent Processing of Text Corpus Queries" (PDF). Workshop on Recent Advances in Slavonic Natural Language Processing (Masaryk University): 49–58.
11. "Using Computational Lexicography for Dictionary Production with the Sketch Engine". REF Impact Case Studies. University of Brighton. Retrieved 18 April 2015.
External links
- Sketch Engine website
- List of corpora available in Sketch Engine
- Discovering English with Sketch Engine – a book by James Thomas for language learners and teachers