Sep 11, 2010

Experimental Thaana Spell-checking with Hunspell

During our (MOSS) recent meeting with the President's Office about the Zekr translation project, we came to realize that spell checking for Thaana writing is something that people really want - especially in government offices. This gave me the incentive to follow up on my previously failed experiments with this.

[NOTE: I am aware that this feature is available in "Xiosis Scribe"; however, their approach appears to be somewhat lacking and, IMHO, disappointing. Also, the closed-source nature of their system pisses me off.]

I decided to try to implement this by building a custom dictionary for Hunspell. This approach has several advantages too great to be ignored: a) it eliminates the need to write the spell-check engine from the ground up, and b) it provides immediate integration into software like OpenOffice and Firefox (and potentially even MS Word), to name a few. Of course there is the risk that it might not be able to handle Dhivehi's "rules" - but as I always say, if the Arabs and the Hebrews can do it - and it has been done - then so can we.

So I used the scripts I wrote for the Radheef project to generate words, and wrote a simple affix file to munch them down with. Using a dictionary comprising all the "headwords" in the Radheef, together with this affix file, I was able to obtain some promising results - enough to convince me that it just may be possible to achieve reasonable results using Hunspell.
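For anyone who wants to try something similar, here is a rough sketch of what a word-list-plus-affix setup can look like. It is not the actual script or affix file from this experiment: the function name `write_hunspell_dictionary`, the file name `dv_sketch`, the flag `A` and the single suffix rule are all assumptions made up for illustration, and the words are Latin transliterations (a real dictionary would hold Thaana text in UTF-8). The rule strips a final "n" and appends "makee", so "kievun" also covers "kievumakee".

```python
from pathlib import Path
import tempfile


def write_hunspell_dictionary(headwords, out_dir, name="dv_sketch"):
    """Write a minimal Hunspell <name>.dic / <name>.aff pair (illustrative only)."""
    out_dir = Path(out_dir)
    # De-duplicate and sort the headwords, dropping empty entries.
    words = sorted({w.strip() for w in headwords if w.strip()})

    # .dic format: the first line is the (approximate) entry count,
    # followed by one word per line, optionally with "/FLAGS".
    dic_lines = [str(len(words))]
    for word in words:
        # Assumption: every headword ending in "n" takes the suffix rule A.
        dic_lines.append(f"{word}/A" if word.endswith("n") else word)

    # .aff format: a header "SFX flag cross_product rule_count", then
    # one line per rule: "SFX flag strip add condition".
    aff_lines = [
        "SET UTF-8",
        "SFX A Y 1",
        "SFX A n makee n",  # strip "n", add "makee", only if the word ends in "n"
    ]

    dic_path = out_dir / f"{name}.dic"
    aff_path = out_dir / f"{name}.aff"
    dic_path.write_text("\n".join(dic_lines) + "\n", encoding="utf-8")
    aff_path.write_text("\n".join(aff_lines) + "\n", encoding="utf-8")
    return dic_path, aff_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dic, aff = write_hunspell_dictionary(["kievun", "radheef"], tmp)
        print(dic.read_text(encoding="utf-8"))
        print(aff.read_text(encoding="utf-8"))
```

With both files in one directory, something like `hunspell -d ./dv_sketch -l test.txt` should list the words Hunspell does not recognize, assuming Hunspell is installed and `test.txt` is your own test file.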

and the saga continues...

The contents of the Affix file

The test file to run the spell check on

We give hunspell a chance to process it...


and this is the resulting output ("misspelled" words with their suggestions)

I hope I got the attention of some talented young people out there, and I hope that you take an interest in this.

19 Comments:

Anonymous said...

man you are on a roll!

Anonymous said...

Good work, All the best

Shifau

Huxen said...

This is great dude! I'm impressed!

Anonymous said...

I want to see a comparison of this approach vs. Xiosis

Anonymous said...

At the very least if you can implement correction of "thiki jehi thaana" I'll be willing to use it.

Anonymous said...

sounds very interesting.
have been trialing xiosis.
getting spell check for dhivehi would be a very challenging task.
can't wait to see how your project develops :)

Anonymous said...

during a conversation with a friend recently, he pointed out that unavailability of a spell checker for thaana is a feature of thaana writing and dhivehi language itself. english or arabic having a spell checker does not mean that all languages should follow that. sometimes it may make things complicated. in the case of thaana, a very scientific phonetic writing system for eg: any reader understands the words written in different spellings, without limiting to a certain set. it also allows the user to create new words, or import new words from other languages hence defining it on the set of writing. it's more forgiving in that manner without the limitation of a spell checker.
you could do a quantitative (instead of a qualitative) survey with a null hypothesis just to see if its really necessary. you should talk to thaana typists, writers, and some who are not exactly computer programmers or western educated ms office addicts. just a thought to consider.

SoE said...

@anonymous: The reason why it doesn't exist is because people have not bothered to try. Dhivehi, like any language, is based on a set of rules which - however flexible they may be - determine the "proper" way to write and speak it. If somebody types "phat" instead of "fat", or "becoz" instead of "because", it too would automatically be understood by anybody familiar with English. This does not, however, make it correct; nor does it make it entirely wrong. The complication only arises when people forget that spell checkers are "writing aids". They are by no definition the backbone of the language, as you seem to fear they might become. There is no law that says what the spell checker says HAS to be correct.

The necessity of such a system is quite evident in everyday life. Just look at the various online newspapers and magazines. The idea is not to hinder creativity, but to point out mistakes which may otherwise be lost on the proof reader.

In either case, the project is bound to produce valuable data for further applications. Do not be so narrow minded as to think that research into developing a "spell checker" would stop at just that.

If we do not give the language this sort of attention, the true value of its heritage will surely be lost within a decade. A language will not "evolve" unless its people find new and more creative ways to use it... sitting back and simply saying "this is not needed so it should not be done" is not something I'm prepared to do.

Anonymous said...

don't worry about the pessimist SoE I think you are doing a wonderful job

SoE said...

I wouldn't call the guy a pessimist. I do value critique...

Anonymous said...

Interesting...

I've been experimenting with spell checking for Thaana for a while now. Feeding a database of words as above is one approach and simple. My first approach was somewhat similar to yours above: I created a custom dictionary for MS Word using Radheef entries. That performs well at suggesting alternatives for misspelled words that are in the Radheef.

But IMO, the best approach to the task is to have a stemming engine/rules and a segmentation engine/rules, so that the spell checker can take base words and transform them into the various tenses etc. and be able to split up words/phrases (it is very common practice when writing Dhivehi for two words that should be written separately to be combined together and vice versa). My understanding of Dhivehi grammar is rudimentary so work has slowed down to a crawl, but hopefully I can get it moving again soon when I have time.

Best of luck with your work!

SoE said...

that was my intention as well jaa. Hunspell does provide for creating stemming rules (see the SFX in the affix - I was experimenting with stemming the word "kievun" and managed to have "kievumakee" recognized as a stem)

it is however quite a challenge..mostly because like you, my understanding of Dhivehi grammar is quite rudimentary.

as far as segmentation goes... well... let's just say I've personally never gotten the hang of it.

Anonymous said...

semiotically speaking dhivehi & thaana writing can be considered as something evolved beyond the organic rule based languages, and there are many languages like that. with no spell checking necessary. languages and words die due to spell checkers and aids which limit the writer's thinking. but then again the society of google would prefer to google everything up. if it doesn't exist on google, it doesn't exist at all. i would still suggest to do a background study of this subject instead of trying to apply what you perceive as necessary just because ms word has that feature. Trying is ok. you can even try putting icing on rihaakuru, just because cakes got it, but would it do any good? it might. try it with more research i'd suggest.

good luck.

Anonymous said...

very interesting. Best of luck with your research.

AyaBuddy said...

Hello,
I am working on a similar project and would be very happy to get some assistance from you :)
Please tell me a way to contact you :)

SoE said...

hit me up on twitter. link on page nav on top.

AyaBuddy said...

cannot send a message because you're not following me lol.
would it be considered creepy if i send you a message on facebook? haha

SoE said...

not at all. go right ahead.

Unknown said...

font thai
