Recent Rosette® Cloud and Enterprise releases (1.17.0, 1.16.1) bring expanded language coverage to name translation and semantic similarity, and make the address matching capability within Rosette Name Indexer easier to use. We have also improved Arabic-Arabic and Arabic-English name matching, along with morphological analysis in several languages.
Hebrew name translation
Name translation now supports translation of names from Hebrew to English (Latin) — but not yet from Latin-based names to Hebrew. Rosette includes a panoply of “overrides,” so that well-known foreign names written in Hebrew (such as “ג’ורג’ וושינגטון”) are properly translated (to “George Washington”) and not merely phonetically transliterated (to “G’vrg vvshngtvn” or similar).
The name translation defaults to a “folk transliteration” scheme, but also supports the Hebrew transliteration standard ISO 259-2:1994 and the default Hebrew transliterator implemented by ICU, which is based on the UNGEGN (United Nations Group of Experts on Geographical Names) standard. The folk transliteration scheme, created by Basis Technology, most resembles how people actually write Hebrew names with Latin characters. This folk scheme is intended to be more useful than the other more academic standards, which use diacritics and are less readable. See more about the Basis Technology transliteration scheme in the blog post “Building a More Useful Hebrew Transliteration Scheme.”
Transliteration scheme | Transliteration of רוזלינד פרנקלין |
---|---|
Folk (Basis Tech) | Ruzlind Prenklin |
ISO 259-2:1994 | Rẇzliynd Prnqliyn |
ICU | Rẇzĕliynĕd Pĕrĕnĕqĕliyn |
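To make the idea of overrides concrete, here is a minimal Python sketch of "check the override list first, fall back to letter-by-letter transliteration." It is an illustration only: the `OVERRIDES` and `LETTERS` tables and both function names are hypothetical, and the tiny consonant map is neither the Basis Technology folk scheme nor any of the standards listed above.

```python
# Toy illustration only: these tables are hypothetical, not Rosette data.
OVERRIDES = {
    "ג’ורג’ וושינגטון": "George Washington",
}

# Hypothetical consonant-only mapping for a handful of Hebrew letters.
LETTERS = {
    "ג": "g",
    "ו": "v",
    "ר": "r",
    "ש": "sh",
    "נ": "n",
    "ן": "n",
    "ט": "t",
    " ": " ",
}


def transliterate(name: str) -> str:
    """Crude letter-by-letter rendering; characters not in the toy table are dropped."""
    return "".join(LETTERS.get(ch, "") for ch in name)


def translate_name(name: str) -> str:
    """Return the override if one exists, otherwise fall back to transliteration."""
    if name in OVERRIDES:
        return OVERRIDES[name]
    return transliterate(name)


if __name__ == "__main__":
    for hebrew_name in OVERRIDES:
        print("with override:   ", translate_name(hebrew_name))
        print("without override:", transliterate(hebrew_name))
```

The point of the sketch is the lookup order: a well-known name is resolved by the override table before any phonetic rule gets a chance to produce a consonant-heavy rendering.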
French semantic similarity
Semantic similarity now supports French. It can transform French words into a numerical representation (vectors), which can be used to compare the meaning of French words to each other, or to words in eight supported languages.
For example, here are similar terms generated for the English word “spy” in English, French, and German:
English
"term": "spy", "similarity": 1.0
"term": "spies", "similarity": 0.66227961
"term": "spying", "similarity": 0.65423775
"term": "spymaster", "similarity": 0.60325158
"term": "cia", "similarity": 0.57148194
French
"term": "espion", "similarity": 0.54824299
"term": "espionne", "similarity": 0.49286559
"term": "espionnes", "similarity": 0.41175416
"term": "secrets", "similarity": 0.39606363
"term": "escroc", "similarity": 0.36654109
German
"term": "Deckname", "similarity": 0.51391315
"term": "GRU", "similarity": 0.50809389
"term": "Spion", "similarity": 0.50051737
"term": "KGB", "similarity": 0.49981388
"term": "Informant", "similarity": 0.48774603
It is also possible to return a series of similar terms from any supported language based on a French word or phrase/sentence, or compare the content of French documents.
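For readers new to word vectors, the sketch below shows one common way to compare two vectors: cosine similarity. This post does not say which measure Rosette uses internally, so treat the choice of cosine as an assumption. The three-dimensional vectors are made up for illustration; real embeddings have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(term, vectors):
    """Rank every other term by similarity to `term`, most similar first."""
    target = vectors[term]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != term
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up vectors, chosen only so the example has something to rank.
TOY_VECTORS = {
    "spy": [0.9, 0.1, 0.3],
    "espion": [0.8, 0.2, 0.4],
    "escroc": [0.3, 0.9, 0.1],
}

if __name__ == "__main__":
    for other, score in rank_similar("spy", TOY_VECTORS):
        print(other, round(score, 3))
```

Because the comparison works on vectors and not on spellings, terms from different languages can be ranked against each other as long as their vectors live in the same space.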
Unfielded fuzzy address matching
Address matching within Rosette Name Indexer (SDK) now supports unfielded addresses (whole addresses in one field) and misfielded address components (components in the wrong field).
Example of unfielded address matching:
{
  "address1": "The Book Club 100-106 Leonard St Shoreditch London EC2A 4RH, United Kingdom",
  "address2": {
    "number": "100-106",
    "road": "Leonard St",
    "city": "Shoreditch",
    "postcode": "EC2A 4RH"
  }
}
Example of misfielded address matching:
{
  "address1": {
    "houseNumber": "160",
    "road": "Pennsylvania Ave N.W.",
    "city": "Washington",
    "state": "D.C.",
    "postcode": "20500"
  },
  "address2": {
    "houseNumber": "160",
    "road": "Pennsylvania Ave N.W.",
    "city": "D.C.",
    "state": "Washington",
    "postcode": "20500"
  }
}
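One way to build intuition for why unfielded and misfielded input can still match is to ignore field labels and compare the tokens themselves. The Python sketch below does exactly that with a plain Jaccard overlap. It is a toy, not the algorithm inside Rosette Name Indexer, and the function names `tokens` and `token_overlap` are ours.

```python
import re


def tokens(address):
    """Flatten an address (a string or a dict of fields) into a set of lowercase tokens."""
    text = address if isinstance(address, str) else " ".join(address.values())
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_overlap(a, b):
    """Jaccard overlap of the two token sets, ignoring which field a token came from."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    fielded = {
        "houseNumber": "160",
        "road": "Pennsylvania Ave N.W.",
        "city": "Washington",
        "state": "D.C.",
        "postcode": "20500",
    }
    # Same values, but city and state are swapped.
    misfielded = dict(fielded, city="D.C.", state="Washington")
    print(token_overlap(fielded, misfielded))

    unfielded = "160 Pennsylvania Ave N.W. Washington D.C. 20500"
    print(token_overlap(fielded, unfielded))
```

A field-by-field comparison would penalize the swapped city and state, while a label-agnostic comparison does not. A production matcher has to balance both signals; this sketch shows only the label-agnostic half.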
We have also increased the number of overrides (lists that explicitly map nicknames, cognates, and variants) to improve matching accuracy. The example below shows how adding “UK” and “England” to the overrides list raised the match score for two otherwise identical addresses.
{
  "address1": {
    "house": "Ffrwdgrech Industrial Estate",
    "road": "Ffrwdgrech Rd",
    "city": "Brecon",
    "country": "UK",
    "postcode": "LD3 8LA"
  },
  "address2": {
    "house": "Ffrwdgrech Industrial Estate",
    "road": "Ffrwdgrech Rd",
    "city": "Brecon",
    "country": "England",
    "postcode": "LD3 8LA"
  }
}
Previous score without override: 0.86
New score with “UK=England” override: 0.95
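The effect of an override can be sketched with a toy scorer that counts how many fields agree. Everything here is hypothetical: the `OVERRIDE_GROUPS` list, the function names, and the all-or-nothing field comparison. Rosette's real scoring is fuzzy, so the toy produces 0.8 and 1.0 and will not reproduce the 0.86 and 0.95 above; only the direction of the change carries over.

```python
# Hypothetical override groups: each set lists values treated as equivalent.
OVERRIDE_GROUPS = [{"uk", "england"}]


def equivalent(a, b, groups=()):
    """True if the values are equal ignoring case, or share an override group."""
    a, b = a.strip().lower(), b.strip().lower()
    return a == b or any(a in group and b in group for group in groups)


def field_score(addr1, addr2, groups=()):
    """Fraction of fields (across both addresses) whose values are equivalent."""
    fields = set(addr1) | set(addr2)
    if not fields:
        return 0.0
    hits = sum(
        1
        for f in fields
        if f in addr1 and f in addr2 and equivalent(addr1[f], addr2[f], groups)
    )
    return hits / len(fields)


if __name__ == "__main__":
    base = {
        "house": "Ffrwdgrech Industrial Estate",
        "road": "Ffrwdgrech Rd",
        "city": "Brecon",
        "postcode": "LD3 8LA",
    }
    address1 = dict(base, country="UK")
    address2 = dict(base, country="England")
    print("without override:", field_score(address1, address2))
    print("with override:   ", field_score(address1, address2, OVERRIDE_GROUPS))
```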
Arabic name matching
Arabic-Arabic and Arabic-English name matching have improved through the addition of the following features:
- A name gender mismatch penalty for Arabic names
- Support for initials and initialisms in Arabic names
- Stop words for PERSON and ORGANIZATION names
- Improved name token alignment
- A new Arabic-English statistical model
- Weighting of Arabic name tokens by rarity, so that matching uncommon name components contribute more to the match score than matching common ones
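The last item, rarity weighting, can be illustrated with a short Python sketch. We assume an inverse-frequency weight (the negative log of a token's relative frequency), which is a common choice but not necessarily what Rosette uses. The frequency table, the default for unseen tokens, and the romanized example names are all invented for the illustration.

```python
import math

# Hypothetical relative frequencies of romanized name tokens (made-up numbers).
TOKEN_FREQ = {
    "mohammed": 0.20,
    "ahmed": 0.10,
    "zubaydi": 0.001,
}
UNSEEN_FREQ = 0.0005  # assumption: tokens missing from the table are treated as rare


def weight(token):
    """Rarer tokens get a larger weight (negative log of relative frequency)."""
    return -math.log(TOKEN_FREQ.get(token, UNSEEN_FREQ))


def weighted_match(name1, name2):
    """Share of the total token weight carried by tokens the two names have in common."""
    t1, t2 = set(name1.lower().split()), set(name2.lower().split())
    total = sum(weight(t) for t in t1 | t2)
    if total == 0:
        return 0.0
    return sum(weight(t) for t in t1 & t2) / total


if __name__ == "__main__":
    # Two names sharing a rare component versus two sharing a common one.
    print(weighted_match("Mohammed Zubaydi", "Ahmed Zubaydi"))
    print(weighted_match("Mohammed Zubaydi", "Mohammed Hakimi"))
```

In this toy, the pair that shares the rare component scores higher than the pair that shares only the common given name, which is the behavior the rarity weighting is meant to produce.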
Detection of tables and lists in Rosette Base Linguistics
Rosette Base Linguistics now detects zones of structured text (such as tables and lists) within a body of text, and notes the offsets for each structured region in the ADM (annotated data model JSON). This demarcation lets users process tables and lists differently from ordinary sentences in downstream analyses.
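As a rough picture of what downstream code can do with those offsets, the Python sketch below splits a text into prose and structured segments. It does not use the ADM schema: the zones are passed in as plain `(start, end)` tuples, the end offset is assumed to be exclusive, and the function name `split_by_zones` is ours.

```python
def split_by_zones(text, zones):
    """Split text into (kind, segment) pairs.

    `zones` holds non-overlapping (start, end) character offsets of structured
    regions; `end` is assumed to be exclusive.
    """
    parts = []
    cursor = 0
    for start, end in sorted(zones):
        if cursor < start:
            parts.append(("prose", text[cursor:start]))
        parts.append(("structured", text[start:end]))
        cursor = end
    if cursor < len(text):
        parts.append(("prose", text[cursor:]))
    return parts


if __name__ == "__main__":
    text = "Intro sentence.\n- item one\n- item two\nClosing sentence."
    # Offsets computed here by hand; in practice they would come from the analysis output.
    list_start = text.index("- item one")
    list_end = text.index("Closing")
    for kind, segment in split_by_zones(text, [(list_start, list_end)]):
        print(kind, repr(segment))
```

With the segments separated, a pipeline could, for example, run sentence-level analysis only on the prose parts and hand the structured parts to a table or list parser.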
Morphological analysis
In release 1.17, we have increased accuracy for: Greek POS tags and lemmatization; Russian verb lemmatization; German noun lemmatization; and Hebrew POS tags and tokenization.
Check out the release notes for all the details and bug fixes in this release. We look forward to your feedback!
The post Rosette 1.17.0 Release: Hebrew Name Translation, French Semantic Similarity, Robust Address Matching appeared first on Rosette Text Analytics.