Word frequency in text
I think the first thing I can do with text is gather some statistics, e.g. count how often each word occurs and rank the words in descending order, so that I can see the most used words in the text. I wrote this code in Go to load a file with sample text, read it line by line, and process each line. I just use a map that keeps words as keys and a counter as the value, which is incremented every time the word appears.
package algos

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strconv"
	"strings"
)

// WordFrequency reads the sample file line by line, counts the words,
// and prints them ranked by frequency.
func WordFrequency() {
	file, err := os.Open(`...`) // real path to sample file must be here
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot open sample file:", err)
		return
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	for scanner.Scan() { // scan line by line
		line := scanner.Text()
		fmt.Println(line) // echo the input line
		processLine(line)
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "cannot read sample file:", err)
		return
	}
	printStat()
}

var keyWords = make(map[string]int) // map for keeping frequencies

func processLine(line string) {
	for _, raw := range strings.FieldsFunc(line, split) { // split line by delimiters
		str := strings.ToLower(strings.TrimSpace(raw))
		if str == "" || isNumber(str) { // skip empty tokens and numbers
			continue
		}
		keyWords[str]++ // increase word counter
	}
}

func isNumber(str string) bool {
	_, err := strconv.Atoi(str)
	return err == nil
}

// naive check for a rune to split by
// this list might grow a lot: any special characters, underscores, anything similar
func split(r rune) bool {
	return r == ' ' ||
		r == '.' ||
		r == ',' ||
		r == ';' ||
		r == ':' ||
		r == '-' ||
		r == '/' ||
		r == '\\' ||
		r == '"' ||
		r == '“' ||
		r == '”' ||
		r == '&' ||
		r == '©' ||
		r == '$' ||
		r == '(' ||
		r == ')' ||
		r == '[' ||
		r == ']'
}

// the code below just makes a sorted presentation of our word map:
// collect the map content into a slice and sort it by counters
func printStat() {
	for _, pair := range rankWords() {
		fmt.Println(pair.cnt, " / ", pair.word)
	}
}

type pair struct {
	word string
	cnt  int
}

type pairList []pair

func (p pairList) Len() int           { return len(p) }
func (p pairList) Less(i, j int) bool { return p[i].cnt < p[j].cnt }
func (p pairList) Swap(i, j int)      { p[i], p[j] = p[j], p[i] }

func rankWords() pairList {
	pairs := make(pairList, len(keyWords))
	i := 0
	for k, v := range keyWords {
		pairs[i] = pair{k, v}
		i++
	}
	sort.Sort(sort.Reverse(pairs)) // highest counter first
	return pairs
}
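The comment on split admits that the delimiter list could grow a lot. One possible way around that, sketched here as an assumption rather than as part of the original code, is to turn the check around and treat every rune that is not a letter or a digit as a delimiter. The names isDelimiter and tokenize are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isDelimiter is a hypothetical replacement for split: any rune that is
// neither a letter nor a digit separates words.
func isDelimiter(r rune) bool {
	return !unicode.IsLetter(r) && !unicode.IsDigit(r)
}

// tokenize lowercases the line and cuts it into words at every delimiter.
func tokenize(line string) []string {
	return strings.FieldsFunc(strings.ToLower(line), isDelimiter)
}

func main() {
	fmt.Println(tokenize("South-eastern Australia, including Tasmania (740 acres)."))
}
```

Numbers still come through as tokens, so a filter like isNumber is still needed. The trade-off is that apostrophes and underscores also become delimiters, so a word like "don't" would be cut in two.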
Sample text
Sample text taken from wiki:
Wombats are short-legged, muscular quadrupedal marsupials that are native to Australia. They are about 1 m (40 in) in length with small, stubby tails. There are three extant species and they are all members of the family Vombatidae. They are adaptable and habitat tolerant, and are found in forested, mountainous, and heathland areas of south-eastern Australia, including Tasmania, as well as an isolated patch of about 300 ha (740 acres) in Epping Forest National Park[2] in central Queensland.
Result
Below is the result of execution. The most frequent words are at the top of the list. The closer to the top, the more useless the words get: "are", "in", "and", "of" look pretty much like noise words.
7 / are
5 / in
4 / and
3 / they
3 / of
2 / as
2 / about
2 / australia
1 / native
1 / tolerant
1 / south
1 / extant
1 / legged
1 / species
1 / vombatidae
1 / mountainous
1 / isolated
1 / short
1 / members
1 / an
1 / acres
1 / stubby
1 / three
1 / with
1 / adaptable
1 / central
1 / wombats
1 / length
1 / forest
1 / found
1 / queensland
1 / forested
1 / quadrupedal
1 / small
1 / tails
1 / the
1 / heathland
1 / eastern
1 / including
1 / that
1 / m
1 / all
1 / areas
1 / national
1 / to
1 / park
1 / marsupials
1 / well
1 / family
1 / habitat
1 / ha
1 / muscular
1 / tasmania
1 / patch
1 / epping
1 / there
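Words with the same count come out in no particular order in the listing above (for example "they" before "of"), because Go does not specify the iteration order of a map and the sort compares counters only. If a reproducible listing is wanted, one option is to break ties alphabetically. This is a minimal sketch; wordCount and rankDeterministic are hypothetical names, and sort.Slice is used in place of the sort.Interface methods from the original code.

```go
package main

import (
	"fmt"
	"sort"
)

// wordCount mirrors the pair type from the original code.
type wordCount struct {
	word string
	cnt  int
}

// rankDeterministic sorts by counter, highest first, and falls back to
// alphabetical order when two counters are equal.
func rankDeterministic(freq map[string]int) []wordCount {
	pairs := make([]wordCount, 0, len(freq))
	for w, c := range freq {
		pairs = append(pairs, wordCount{w, c})
	}
	sort.Slice(pairs, func(i, j int) bool {
		if pairs[i].cnt != pairs[j].cnt {
			return pairs[i].cnt > pairs[j].cnt
		}
		return pairs[i].word < pairs[j].word
	})
	return pairs
}

func main() {
	freq := map[string]int{"are": 7, "in": 5, "they": 3, "of": 3, "wombats": 1}
	for _, p := range rankDeterministic(freq) {
		fmt.Println(p.cnt, "/", p.word)
	}
	// Prints:
	// 7 / are
	// 5 / in
	// 3 / of
	// 3 / they
	// 1 / wombats
}
```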
It's interesting to think about how to rank words by real meaning. What is the text about? How can the main word be extracted?
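A naive first step, and only a guess at a direction, is to drop the noise words before ranking. The sketch below assumes a tiny hand-picked stop list. The names stopWords and dropStopWords are hypothetical, and a real list would be much longer.

```go
package main

import (
	"fmt"
	"sort"
)

// stopWords is a small illustrative list, picked by hand from the result above.
var stopWords = map[string]bool{
	"are": true, "in": true, "and": true, "of": true, "they": true,
	"as": true, "about": true, "the": true, "to": true, "an": true,
}

// dropStopWords returns a copy of freq without the noise words.
func dropStopWords(freq map[string]int) map[string]int {
	out := make(map[string]int, len(freq))
	for w, c := range freq {
		if !stopWords[w] {
			out[w] = c
		}
	}
	return out
}

func main() {
	// A few counters taken from the result above.
	freq := map[string]int{
		"are": 7, "in": 5, "and": 4, "they": 3, "of": 3,
		"about": 2, "australia": 2, "wombats": 1,
	}
	filtered := dropStopWords(freq)

	// Sort the remaining words by counter, then alphabetically.
	words := make([]string, 0, len(filtered))
	for w := range filtered {
		words = append(words, w)
	}
	sort.Slice(words, func(i, j int) bool {
		if filtered[words[i]] != filtered[words[j]] {
			return filtered[words[i]] > filtered[words[j]]
		}
		return words[i] < words[j]
	})
	for _, w := range words {
		fmt.Println(filtered[w], "/", w)
	}
	// Prints:
	// 2 / australia
	// 1 / wombats
}
```

Even with the noise removed, "australia" ranks above "wombats" for this sample, so raw frequency alone does not answer the question of what the text is about.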