Language detection in Go – Calling cld with cgo

A common task in natural language processing is detecting the (human) language of an input text. An easy way to accomplish this is to use the Chromium Compact Language Detector library [1], which offers the language detection functionality extracted from the Chromium browser.

The library works pretty well in a lot of cases, but has some problems distinguishing similar dialects and languages. Norwegian is often detected as Danish, for instance, even when some unique features of the language are present [2]. I personally do not have a lot of experience with other libraries or services, but I have heard from several software engineers that this kind of problem is pretty common.

Wrapping the library

We first have to build another library that wraps the C++ library, as we cannot call C++ functions directly from Go using cgo. We would not want to anyway, because the detector takes a lot of parameters that are easier to set in native code.

We just call the compact language detector from another function which handles most of the details we do not care about in Go. This has to be a C-style function so that the compiler does not mangle its name and the Go linker can find it. The declaration goes into a header (langdet.h) that we will later include from Go:

#ifdef __cplusplus
extern "C" {
#endif

const char* detect_language(const char *text);

#ifdef __cplusplus
}
#endif


/* Implementation of the wrapper (compiled as C++, e.g. langdet.cc).
 * Besides our own header it needs <string.h> for strlen() and the cld
 * headers that declare CompactLangDet, Language, UTF8 and LanguageCode();
 * their exact include paths depend on how the cld sources are laid out. */
#include <string.h>

#include "langdet.h"

const char* detect_language(const char *text)
{
	size_t size = strlen(text);

	/* detection flags and hints: treat the text as plain UTF-8 and give
	 * cld no hint about the expected language or top-level domain */
	bool is_plain_text = true;
	bool do_allow_extended_languages = true;
	bool do_pick_summary_language = false;
	bool do_remove_weak_matches = false;
	bool is_reliable;
	const char* tld_hint = NULL;
	int encoding_hint = UTF8;
	Language language_hint = UNKNOWN_LANGUAGE;

	/* output parameters: the three most likely languages together with
	 * their percentages and normalized scores */
	double normalized_score3[3];
	Language language3[3];
	int percent3[3];
	int text_bytes;

	Language lang;
	lang = CompactLangDet::DetectLanguage(0,
	                                      text,
	                                      size,
	                                      is_plain_text,
	                                      do_allow_extended_languages,
	                                      do_pick_summary_language,
	                                      do_remove_weak_matches,
	                                      tld_hint,
	                                      encoding_hint,
	                                      language_hint,
	                                      language3,
	                                      percent3,
	                                      normalized_score3,
	                                      &text_bytes,
	                                      &is_reliable);

	/* for more methods to extract information take a look
	 * at /include/cld/languages/public/languages.h
	 */
	return LanguageCode(lang);
}

The Go package

We can then start writing the Go part – we declare a language type and call our new C library. The package name langdet below is an arbitrary choice that mirrors the wrapper library:

package langdet

// #cgo CFLAGS: -I. -fpic
// #cgo LDFLAGS: -lstdc++ -L. liblangdet.a
// #include "langdet.h"
// #include <stdlib.h>
import "C"

import (
	"unsafe"
)

// Language is a language code such as "en".
type Language string

// DetectLanguage returns the code of the most likely language of text.
func DetectLanguage(text string) (language Language) {
	cStr := C.CString(text)
	defer C.free(unsafe.Pointer(cStr)) // release the copy made by C.CString
	language = Language(C.GoString(C.detect_language(cStr)))
	return
}

Note the CFLAGS and LDFLAGS cgo directives: as we are linking against a static library, we have to compile the package with -fpic and link it against liblangdet.a. We have to provide the full name of our library here – using -llangdet will not work!
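With the static library in place, using the package looks like this – a minimal sketch, assuming the package is called langdet as above and using a made-up import path:

package main

import (
	"fmt"

	"example.com/langdet" // hypothetical import path for the package shown above
)

func main() {
	// Detect the language of a short Norwegian sentence.
	lang := langdet.DetectLanguage("Dette er en kort setning på norsk.")
	fmt.Println(lang) // prints a language code such as "no" ("da" is possible for very short texts)
}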

Getting more detailed information

We may not want to just detect the language; we may also want to know which dialect the text is written in, and it would be convenient to know whether the language detector is confident that its result is correct. We therefore write a second cld call that returns some extra information.

To store this data I defined some additional types:

type LanguageDialect string

type LanguageInfo struct {
	Language Language         // language code, e.g. "en"
	Dialect  LanguageDialect  // language code + dialect, e.g. "en-uk"
	Scores   [3]LanguageScore // the 3 most likely languages
	Reliable bool             // is the result reliable?
}

type LanguageScore struct {
	Dialect LanguageDialect
	Percent int // probability/"confidence"
}

The corresponding C++ variables are the arrays language3 and percent3, together with the reliability boolean is_reliable. As it is only possible to return a single value from a C/C++ function, I used another approach – a Go callback function, called from within the C++ library, that takes the values as parameters and assigns them to a Go struct.
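Below is a minimal sketch of how the Go side of this could look. The extended wrapper entry point detect_language_info and the exported callback reportDetection are hypothetical names, and their parameter lists are assumptions – the real signatures depend on how the second cld call is written – but the mechanism is the interesting part: a function marked with //export can be called from C/C++ and fills the Go struct before the wrapper call returns.

package langdet

// This file belongs to the same package as the code above; the #cgo
// CFLAGS/LDFLAGS declared there apply to the whole package.

// /* hypothetical extended wrapper entry point; it invokes the exported
//    Go function reportDetection() exactly once before returning */
// void detect_language_info(const char *text);
// #include <stdlib.h>
import "C"

import (
	"sync"
	"unsafe"
)

var (
	mu     sync.Mutex   // guards result against concurrent callers
	result LanguageInfo // filled by reportDetection during the C call
)

//export reportDetection
func reportDetection(lang, dialect *C.char,
	d0 *C.char, p0 C.int,
	d1 *C.char, p1 C.int,
	d2 *C.char, p2 C.int,
	reliable C.int) {
	result = LanguageInfo{
		Language: Language(C.GoString(lang)),
		Dialect:  LanguageDialect(C.GoString(dialect)),
		Scores: [3]LanguageScore{
			{LanguageDialect(C.GoString(d0)), int(p0)},
			{LanguageDialect(C.GoString(d1)), int(p1)},
			{LanguageDialect(C.GoString(d2)), int(p2)},
		},
		Reliable: reliable != 0,
	}
}

// DetectLanguageInfo returns the detailed detection result for text.
func DetectLanguageInfo(text string) LanguageInfo {
	mu.Lock()
	defer mu.Unlock()

	cStr := C.CString(text)
	defer C.free(unsafe.Pointer(cStr))

	// The wrapper calls reportDetection synchronously before returning,
	// so result is complete once this call finishes.
	C.detect_language_info(cStr)
	return result
}

On the C++ side the wrapper would declare the callback as an extern "C" function and invoke it with the values from language3, percent3 and is_reliable; the symbol is resolved when the Go toolchain links the final binary.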

Conclusion

Getting language detection working in Go is pretty straightforward. The longer the text, the better the detection rate, because more distinctive features are present. It will be hard to detect the correct language dialect of a tweet – sometimes even the language itself, if very similar ones exist (Danish <-> Norwegian). But combined with other knowledge, such as the user's location, which helps to distinguish the fine details, the library enables a very useful way of dealing with human language.

Footnotes


  1. short: cld

  2. Whenever the Norwegian word “av” is found, detection always seems to be accurate, though.

First published on March 3, 2013