Noise words


One day I came across a dictionary of noise words. These words are not included in a search index "to improve query performance and prevent unnecessary index growth. These commonly used words are ignored when you run a query. The alphabet file determines how single characters and spaces are handled in a query."

The dictionary is:

a, about, after, all, also, an, another, any, are, as, and, at,
be, because, been, before, being, between, but, both, by,
came, can, come, could,
did, do,
each, even,
for, from, further, furthermore,
get, got,
has, had, he, have, her, here, him, himself, his, how, hi, however,
i, if, in, into, is, it, its, indeed,
just,
like,
made, many, me, might, more, moreover, most, much, must, my,
never, not, now,
of, on, only, other, our, out, or, over,
said, same, see, should, since, she, some, still, such,
take, than, that, the, their, them, then, there, these, therefore, they, this, those, through, to, too, thus,
under, up,
very,
was, way, we, well, were, what, when, where, which, while, who, will, with, would,
you, your

Well, let's write some code to check this:

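package algos

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// noiseWords holds the dictionary loaded from noise_words.txt.
var noiseWords []string

// RemoveNoise reads sample.txt, strips the noise words from it and prints
// both versions together with their sizes.
func RemoveNoise() {
	if err := getNoiseDictionary(); err != nil {
		fmt.Println("cannot load noise dictionary:", err)
		return
	}

	file, err := os.Open(`sample.txt`)
	if err != nil {
		fmt.Println("cannot open sample:", err)
		return
	}
	defer file.Close()

	// Note: lines are joined without a separator, so the last word of one
	// line is glued to the first word of the next one.
	source := ""
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		source += scanner.Text()
	}
	if err := scanner.Err(); err != nil {
		fmt.Println("cannot read sample:", err)
		return
	}
	sourceLen := len(source) // len counts bytes, not characters
	fmt.Printf("\n Noised: (%d)\n", sourceLen)
	fmt.Println(source)

	result := removeNoiseFromString(source)
	resultLen := len(result)
	fmt.Printf("\n Noiseless: (%d) %.2f\n", resultLen, float64(resultLen)/float64(sourceLen))
	fmt.Println(result)
}

// removeNoiseFromString lowercases the text and drops every
// space-separated word that is listed in the noise dictionary.
// Punctuation stays attached to a word, so "now," does not match "now".
func removeNoiseFromString(text string) string {
	var noiseLess strings.Builder
	for _, rawWord := range strings.Split(text, " ") {
		word := strings.ToLower(strings.TrimSpace(rawWord))
		if contains(noiseWords, word) {
			continue
		}
		noiseLess.WriteString(word + " ")
	}
	return noiseLess.String()
}

// getNoiseDictionary fills noiseWords from noise_words.txt, a file with
// comma-separated words on each line.
func getNoiseDictionary() error {
	file, err := os.Open(`noise_words.txt`)
	if err != nil {
		return err
	}
	defer file.Close()

	noiseWords = nil // start fresh if the dictionary is loaded again
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		for _, word := range strings.Split(scanner.Text(), ",") {
			word = strings.TrimSpace(word)
			if word == "" {
				continue
			}
			noiseWords = append(noiseWords, word)
		}
	}
	return scanner.Err()
}

// contains reports whether item is present in arr.
func contains(arr []string, item string) bool {
	for _, a := range arr {
		if a == item {
			return true
		}
	}
	return false
}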
Now let's test this simple idea:

First try

I took the wombat description from Wikipedia:

 Noised: (495)
Wombats are short-legged, muscular quadrupedal marsupials that are native to Australia. They are about 1 m (40 in) in length with small, stubby tails. There are three extant species and they are all members of the family Vombatidae. They are adaptable and habitat tolerant, and are found in forested, mountainous, and heathland areas of south-eastern Australia, including Tasmania, as well as an isolated patch of about 300 ha (740 acres) in Epping Forest National Park[2] in central Queensland.
 Noiseless: (363) 0.73
wombats short-legged, muscular quadrupedal marsupials native australia. 1 m (40 in) length small, stubby tails. three extant species members family vombatidae. adaptable habitat tolerant, found forested, mountainous, heathland areas south-eastern australia, including tasmania, isolated patch 300 ha (740 acres) epping forest national park[2] central queensland.

This text does not really lose its meaning,
though it becomes shorter, at 0.73 of the initial size.
But what if we try another one?

Second try

This time I took some "inspirational" text:

 Noised: (1215)
Sooner or later all people experience the lack of desire and resources to achieve a goal, which was set before. They think that they`ve done everything and haven`t got such a desirable result. The goal seems to be insurmountable… If you have these problems now, then the time of good inspirational quotes and sayings comes!You have to know that it`s very important to control all your thoughts: the way you think now is the way you act later! If you don`t have that kind of faith in yourself, then you don`t have to be surprised that everything you do is largely ineffective. Are you afraid of being a loser? It`s ok! All people have been through this. The thing, which was crucial and helpful to their success includes best inspirational quotes!Don`t believe that simple words can change your life? Of course, they cannot! Motivational quotes will change your attitude to your place in the world, and you`ll change everything yourself! Great inspiring quotes from famous people and uplifting quotes about life will not let you forget how powerful you are!If the problem of the zero motivation isn`t about you, take care of your unfortunate friends with some positive quotes for the day or inspirational messages!
 Noiseless: (789) 0.65
sooner later people experience lack desire resources achieve goal, set before. think they`ve done everything haven`t desirable result. goal seems insurmountable… problems now, time good inspirational quotes sayings comes!you know it`s important control thoughts: think act later! don`t kind faith yourself, don`t surprised everything largely ineffective. afraid loser? it`s ok! people this. thing, crucial helpful success includes best inspirational quotes!don`t believe simple words change life? course, cannot! motivational quotes change attitude place world, you`ll change everything yourself! great inspiring quotes famous people uplifting quotes life let forget powerful are!if problem zero motivation isn`t you, care unfortunate friends positive quotes day inspirational messages!

Thoughts

This text is definitely noisier, since the resulting text is only 0.65 of the initial size.
And that is not all. Obviously, the dictionary needs to be extended with idioms, noisy phrases, and so on.
Look at "Sooner or later": this is pure filler in the text, and it should be dropped. But
the dictionary contains neither "sooner" nor "later", only "or", so "sooner later" remained in the text.
I think I should handle phrases rather than separate words. I want to look for such a dictionary, or collect one myself, for the next part.
The code is also not well suited to this, since it splits the string into words. That approach is weak, and I look forward to improving it.
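
A phrase-level pass could be sketched roughly like this. It is only an illustration under my own assumptions, not the code from this post: the names noisePhrases, buildPhrasePattern and removePhrases are made up, and the three phrases are hand-picked examples rather than a real dictionary. The idea is to compile all phrases into one case-insensitive regular expression and cut the matches out before any word-by-word filtering.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// noisePhrases is a hypothetical, hand-picked phrase dictionary.
var noisePhrases = []string{"sooner or later", "of course", "as well as"}

// buildPhrasePattern compiles one case-insensitive pattern that matches any
// of the phrases as whole words. It assumes at least one non-empty phrase.
func buildPhrasePattern(phrases []string) *regexp.Regexp {
	// Try longer phrases first so that "as well as" wins over "as well".
	sorted := append([]string(nil), phrases...)
	sort.Slice(sorted, func(i, j int) bool { return len(sorted[i]) > len(sorted[j]) })

	quoted := make([]string, 0, len(sorted))
	for _, p := range sorted {
		quoted = append(quoted, regexp.QuoteMeta(p))
	}
	return regexp.MustCompile(`(?i)\b(?:` + strings.Join(quoted, "|") + `)\b`)
}

// removePhrases cuts every matched phrase out of the text and collapses
// the whitespace that is left behind.
func removePhrases(text string, pattern *regexp.Regexp) string {
	stripped := pattern.ReplaceAllString(text, " ")
	return strings.Join(strings.Fields(stripped), " ")
}

func main() {
	pattern := buildPhrasePattern(noisePhrases)
	fmt.Println(removePhrases("Sooner or later all people experience the lack of desire", pattern))
	// prints "all people experience the lack of desire"
}
```

Unlike the word filter above, this sketch keeps the original letter case and punctuation, so the single-word pass could still run on its output afterwards.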
