Sophia NLU Engine v1.0 Released

We're proud to release the Sophia NLU Engine v1.0, a lightweight, robust NLU (natural language understanding) engine developed in Rust and built for efficiency and performance. Compact and self-contained, it processes up to 20,000 words/sec and includes a highly accurate POS tagger, a phrase interpreter, automated spelling correction, and more.

Its privacy-focused, self-contained design, free of external dependencies, API calls, or bulky PyTorch models, enables instant setup for robust in-house NLU solutions. It requires only a small Rust library and a 79MB (base) or 177MB (full) vocabulary data store, making it suitable even for wearables.

If you deploy AI agents of any kind and want a better way to understand what your users are saying, with no API calls or monthly bills, Sophia may be for you. It boasts:

  • Extensive vocabulary of 914k (full) / 145k (base) words, including 65k MWEs (multi-word entities), 79k named entities, along with stems, plurals, synonyms, a vast multi-hierarchical categorization system, and more.
  • Highly accurate POS tagger with custom model architecture.
  • Automated spelling corrections.
  • Advanced phrase interpreter that segments input into digestible phrases of verb / noun clauses.

Premium features are also available, including:

  • Binary application with localhost RPC server for instant setup across languages.
  • Easily import custom named entities into the vocabulary.
  • Query selectors providing intent classification by algorithmically matching user input to pre-defined phrases with optional LLM fallback.
  • Pipeline for the seamless collection of user feedback.
  • Detailed usage statistics, logging, and analytics.
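
The query selectors above are a premium feature and their API is not public, so the following is only a rough sketch of the underlying idea, with all names hypothetical: intent classification by algorithmically matching input against pre-defined phrases can be as simple as token-overlap scoring, with a threshold below which an LLM fallback would kick in.

```rust
use std::collections::HashSet;

// Hypothetical sketch, not Sophia's actual API: classify intent by scoring
// token overlap (Jaccard similarity) between the input and each intent's
// pre-defined example phrases.
fn classify_intent<'a>(input: &str, intents: &'a [(&'a str, Vec<&'a str>)]) -> Option<&'a str> {
    fn tokens(s: &str) -> HashSet<String> {
        s.to_lowercase()
            .split_whitespace()
            .map(|t| t.trim_matches(|c: char| !c.is_alphanumeric()).to_string())
            .filter(|t| !t.is_empty())
            .collect()
    }
    let input_tokens = tokens(input);
    let mut best: Option<(&str, f64)> = None;
    for (intent, phrases) in intents {
        for phrase in phrases {
            let p = tokens(phrase);
            let inter = input_tokens.intersection(&p).count() as f64;
            let union = input_tokens.union(&p).count() as f64;
            let score = if union > 0.0 { inter / union } else { 0.0 };
            if best.map_or(true, |(_, s)| score > s) {
                best = Some((*intent, score));
            }
        }
    }
    // Below this threshold a real system would fall back to the optional LLM.
    best.filter(|&(_, s)| s >= 0.3).map(|(intent, _)| intent)
}

fn main() {
    let intents = vec![
        ("check_weather", vec!["what is the weather", "weather forecast today"]),
        ("set_alarm", vec!["set an alarm", "wake me up at"]),
    ];
    assert_eq!(classify_intent("what's the weather forecast?", &intents), Some("check_weather"));
    assert_eq!(classify_intent("tell me a joke", &intents), None);
    println!("ok");
}
```

A production selector would of course use Sophia's phrase interpretation rather than raw token overlap; the sketch only illustrates the match-then-fallback flow.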

You may view the full feature list, test the online demo, and download the open source code including SDK at: https://cicero.sh/sophia/

Future Roadmap

A major upgrade coming soon will bring advanced contextual awareness to Sophia, transforming it into a world-leading NLU engine. This upgrade has three main components:

  1. The existing categorization system will be replaced with a vector-based scoring system, resulting in superior word clustering and more granular word filtering.
  2. The phrase interpreter will be transformed from a heuristics-based interpreter into a hybrid that adds many small, efficient custom models, each architected for a specific English language construct (e.g. anaphora resolution, phrase boundary detection, classification, negation), resulting in an exceptionally accurate and robust phrase interpreter.
  3. Contextual awareness training, which upon completion will allow Sophia to differentiate between, for example, "visit google.com", "visit Mark's idea", "visit the school", "visit my parents", and "visit the magical kingdom in my dream". Because novel methodologies are being used, full training details will not be divulged until the open source release.

POS Tagger

Over the coming days and weeks, multiple upgrades to the vocabulary data store will be released that will enhance the POS tagger until 100% accuracy is achieved with full confidence. All upgrades will be open source and available to the public.

Multi-Lingual Support

In the near future, full multi-lingual support will be developed, with vocabulary data stores curated and trained for each language. Romance languages will come first due to their similarity to English, but all languages will follow. Please bear with us, as tonal languages such as those found throughout SE Asia, right-to-left scripts, and others will take additional resources to perfect.

Get Started with Sophia Today!

For full feature list, online demo, and open source download, please visit: https://cicero.sh/sophia/

Although Sophia is open source and free for individual use, if you will be using it commercially, please consider acquiring a premium license and show your support for Cicero's mission of dropping our dependence on big tech through open source innovation, as outlined in the mission statement and the Origins and End Goals posts.

If you have any questions or concerns, please complete the contact form for a prompt response.

POS Tagger Inadequacies

Despite significant progress, a few remaining inaccuracies persist within the POS tagger. This post outlines the current challenge, the solution, and the path forward to achieve near-perfect accuracy—expected to be completed within the next week.

Problem - Noun Bias

Quality labelled data for a POS tagger is extremely difficult to come by. Nonetheless, the training data comprises 229 million high-quality, carefully curated tokens:

  • A distribution of 40% Wikipedia, 30% Project Gutenberg, and 30% Reddit for a balanced corpus.
  • All data was processed through 4 different POS taggers (3 PyTorch-based, plus the popular NLTK Python package), and only sentences where every ambiguous word met a 3-of-4 consensus score were added to the training data.
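
As a minimal sketch of the filter described above (tag names and data shapes are illustrative, not the actual pipeline's): a sentence is kept only if, for every ambiguous word, at least 3 of the 4 taggers agree on that word's tag.

```rust
use std::collections::HashMap;

// Illustrative sketch of the 3-of-4 consensus filter: do at least
// `required` of the taggers agree on this word's POS tag?
fn meets_consensus(votes: &[&str], required: usize) -> bool {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &tag in votes {
        *counts.entry(tag).or_insert(0) += 1;
    }
    counts.values().any(|&c| c >= required)
}

// One inner Vec of tag votes per ambiguous word in the sentence; the
// sentence enters the training data only if every word passes.
fn sentence_passes(ambiguous_votes: &[Vec<&str>], required: usize) -> bool {
    ambiguous_votes.iter().all(|votes| meets_consensus(votes, required))
}

fn main() {
    // "visit": three of four taggers say VERB -> the sentence passes.
    assert!(sentence_passes(&[vec!["VERB", "VERB", "VERB", "NOUN"]], 3));
    // A 2-2 split on any ambiguous word rejects the whole sentence.
    assert!(!sentence_passes(&[vec!["VERB", "VERB", "VERB", "NOUN"],
                               vec!["VERB", "VERB", "NOUN", "NOUN"]], 3));
    println!("ok");
}
```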

On the surface this seems like a solid plan, but the result was a heavy noun bias, as shown in the table below:

Word    | Expected                   | Actual
--------|----------------------------|-------------------
visit   | verb 70%, noun 25%, adj 5% | verb 40%, noun 60%
fishing | verb 60%, noun 35%, adj 5% | verb 10%, noun 90%
sets    | verb 60%, noun 40%         | verb 34%, noun 66%
camping | verb 60%, noun 35%, adj 5% | verb 28%, noun 72%

With Wikipedia and Gutenberg taking up 70% of the corpus, the data is formal and reference-heavy, which naturally favors nouns. In addition, the three PyTorch-based POS taggers are all BERT-based, and may even have been trained on the same data, which would further aggravate the bias.

Solution

Considering the problem, the solution is rather straightforward.

  1. Drop one PyTorch POS tagger and add Stanza and Flair, then use a 3-of-5 consensus score to include sentences in the training data, allowing at most one PyTorch-based tagger to count toward the score. This diversifies the scoring across architectures, giving stronger and more balanced results.
  2. Further diversify the data origins by including more dialogue / conversational sets such as OpenSubtitles, Switchboard, DailyDialog, Cornell Movie Dialogues, and others, providing a more verb-heavy balance.
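
Sketching the revised rule (names and data shapes invented for illustration): five taggers vote, but at most one PyTorch-based vote counts toward the 3-of-5 tally, so agreement among BERT-style taggers alone can no longer carry a sentence.

```rust
use std::collections::HashMap;

// Illustrative sketch of the revised rule: count votes toward a 3-of-5
// consensus, but let at most one PyTorch-based tagger contribute.
// Each vote is (tag, is_pytorch_based).
fn capped_consensus(votes: &[(&str, bool)], required: usize) -> Option<String> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    let mut pytorch_counted = false;
    for &(tag, is_pytorch) in votes {
        if is_pytorch {
            if pytorch_counted {
                continue; // only one PyTorch vote may be included
            }
            pytorch_counted = true;
        }
        *counts.entry(tag).or_insert(0) += 1;
    }
    counts.into_iter()
        .find(|&(_, c)| c >= required)
        .map(|(tag, _)| tag.to_string())
}

fn main() {
    // Three PyTorch taggers agreeing no longer pass on their own:
    // only one counts, so VERB gets 1, NOUN gets 2 -> no consensus.
    let votes = [("VERB", true), ("VERB", true), ("VERB", true),
                 ("NOUN", false), ("NOUN", false)];
    assert_eq!(capped_consensus(&votes, 3), None);

    // Three architecturally diverse taggers agreeing still pass.
    let votes = [("VERB", false), ("VERB", false), ("VERB", false),
                 ("NOUN", true), ("NOUN", true)];
    assert_eq!(capped_consensus(&votes, 3), Some("VERB".to_string()));
    println!("ok");
}
```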

Those two changes should clear up the discrepancies within the data, along with any potential biases in the POS tagger consensus.

Model Architecture

The model architecture itself is designed for resilience and should result in a near-100% accurate POS tagger. Importantly, instead of assigning a POS tag to every token as PyTorch and other taggers do, it concentrates only on the 28,221 identified ambiguous words, leaving out the single-tag words that make up the majority of the vocabulary.

The model utilizes the 8 preceding and 4 following words around each ambiguous word for context, and consists of three tiers:

  1. tag -> tag: Predicts the POS tag from the surrounding context, and is used only to autocorrect spelling mistakes, since knowing the likely POS tag of a typo reduces the search space.
  2. tag -> word: Essentially a small model for each ambiguous word that handles the vast majority of resolutions (~95%) by identifying the correct POS tag for the ambiguous word based on the tags surrounding it.
  3. word -> word: After validation, a set of edge cases that cannot be handled by the tag -> word tier will remain. For these instances, this more comprehensive tier takes effect, looking at the specific words around the ambiguous word instead of just the POS tags, along with any other custom heuristics needed to ensure near-flawless ambiguity resolution.
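
To make the tag -> word tier concrete, here is a hedged sketch with the context shape and scores invented for illustration (the real model uses 8 preceding and 4 following tags; a single preceding tag is used here for brevity): a per-word table maps a surrounding-tag signature to scored candidate tags, and resolution picks the highest-scoring candidate.

```rust
use std::collections::HashMap;

// Hypothetical sketch of the tag -> word tier for one ambiguous word.
struct AmbiguousWordModel {
    scores: HashMap<String, Vec<(String, f64)>>, // context -> (tag, score)
}

impl AmbiguousWordModel {
    // Resolve the ambiguous word's tag from its context signature, or
    // return None so a later tier can handle the edge case.
    fn resolve(&self, prev_tag: &str) -> Option<&str> {
        self.scores.get(prev_tag).and_then(|candidates| {
            candidates.iter()
                .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
                .map(|(tag, _)| tag.as_str())
        })
    }
}

fn main() {
    // Invented scores for "visit": after a pronoun ("we visit") it is
    // almost always a verb; after a determiner ("a visit") it is a noun.
    let mut scores = HashMap::new();
    scores.insert("PRON".to_string(),
                  vec![("VERB".to_string(), 0.92), ("NOUN".to_string(), 0.08)]);
    scores.insert("DET".to_string(),
                  vec![("NOUN".to_string(), 0.95), ("VERB".to_string(), 0.05)]);
    let visit = AmbiguousWordModel { scores };

    assert_eq!(visit.resolve("PRON"), Some("VERB"));
    assert_eq!(visit.resolve("DET"), Some("NOUN"));
    assert_eq!(visit.resolve("ADP"), None); // unseen context -> next tier
    println!("ok");
}
```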

Context is currently stored as bigrams, but may be expanded to trigrams and quadgrams during training, then pruned and aggregated as necessary depending on which statistical scores yield the best precision.

Exact Match Tries: Designed into the model but not yet implemented due to the noun bias issue. After training, the context scores will be aggregated and an optional trie structure generated for each word, containing only occurrences where smaller contexts of POS tags resolve to a single POS tag 100% of the time. This will serve as an efficient first-line filter before scoring the bigram contexts, and will also allow pruning of the bigrams, keeping the model lean and efficient.
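
An illustrative sketch of that trie idea (structure and tag names invented, not the actual implementation): each node stores a tag only when every training occurrence of that context agreed on it, and a lookup miss simply falls through to the statistical bigram scoring.

```rust
use std::collections::HashMap;

// Illustrative sketch of the exact-match trie: it stores only tag contexts
// that resolved to a single POS tag 100% of the time in training, and is
// consulted before any statistical (bigram) scoring.
#[derive(Default)]
struct TrieNode {
    children: HashMap<String, TrieNode>,
    certain_tag: Option<String>, // set only for 100%-unambiguous contexts
}

impl TrieNode {
    fn insert(&mut self, context: &[&str], tag: &str) {
        let mut node = self;
        for t in context {
            node = node.children.entry(t.to_string()).or_default();
        }
        node.certain_tag = Some(tag.to_string());
    }

    // Walk as deep as the context allows; return the deepest certain tag.
    fn lookup(&self, context: &[&str]) -> Option<&str> {
        let mut node = self;
        let mut found = node.certain_tag.as_deref();
        for t in context {
            match node.children.get(*t) {
                Some(next) => {
                    node = next;
                    if node.certain_tag.is_some() {
                        found = node.certain_tag.as_deref();
                    }
                }
                None => break,
            }
        }
        found
    }
}

fn main() {
    // Invented example: a determiner immediately before the ambiguous word
    // always resolved to NOUN in training ("a visit", "the visit").
    let mut trie = TrieNode::default();
    trie.insert(&["DET"], "NOUN");

    assert_eq!(trie.lookup(&["DET", "ADJ"]), Some("NOUN")); // trie hit
    assert_eq!(trie.lookup(&["PRON"]), None); // miss -> fall back to bigrams
    println!("ok");
}
```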

Along with the above, the upcoming contextual awareness upgrade will naturally be incorporated into the POS tagger, strengthening it even further. Using a bulky PyTorch-based model for a POS tagger is impractical and, as showcased above, results in an inefficient and inaccurate model due to the need for a consensus score to obtain quality labeled training data. The custom model described above will yield a compact, highly efficient, world-leading POS tagger with near-flawless accuracy.

Moving Forward

The above solution is straightforward and will deliver a world class POS tagger within a week, forming a critical component of the Sophia NLU engine. Shortly after, the contextual awareness upgrade will be released, built with the same quality and precision.

Cicero's general release is well within reach, including its reliable text-to-action pipeline, helping ensure our personal data and attention remain private to us and away from big tech. However, the project is currently out of runway, with the only available compute being an RTX 3050 (4GB vRAM), which is insufficient to process large data sets through the Stanza and Flair models.

Help propel Cicero with its open source, privacy focused mission, while also acquiring a world leading NLU engine. If you're building tools that need high-quality NLU, or if you believe in the Cicero mission, support development by purchasing a premium license for the Sophia NLU engine. You will get:

  • Instant access to the software
  • Updated POS tagger within a week
  • The contextual awareness upgrade at no additional cost

The license price will triple once the contextual awareness upgrade is released, making now an ideal time to purchase.

Explore full details including online demo of the Sophia NLU engine at: https://cicero.sh/sophia/

Full details regarding the upcoming contextual awareness upgrade available at: https://cicero.sh/sophia/future
