# Text Similarity

Perform a similarity analysis on the given sentence pair, using either syntactic or semantic analysis.

Endpoint: POST /text/similarity

Version: 2025-08-18T18:01:56Z

Security: api_key

## Request fields (application/json):

- `text1` (string, required) The text content with UTF-8 text representation
- `lang1` (string, required) The two-letter language code. Enum: "en", "fr", "es"
- `text2` (string, required) The text content with UTF-8 text representation
- `lang2` (string, required) The two-letter language code. Enum: "en", "fr", "es"
- `algo` (string)

# Similarity Algorithms

## Syntactic Similarity

The syntactic similarity algorithms focus exclusively on the representational features of text. The most dominant of these features is the set of tokens (characters and words) used. Different syntactic similarity algorithms exploit these features in different ways to provide a measure of similarity between an input text pair. Similarity is measured on a scale of 0 to 1, where 1 represents the best possible match and 0 indicates no match. In addition to the base algorithms, we also apply character- and/or word-based [shingles](https://en.wikipedia.org/wiki/W-shingling) to add context and increase similarity accuracy.

The following syntactic similarity algorithms are supported:

1. `syn.cosine-with-shingles`: The combination of character-based [shingles](https://en.wikipedia.org/wiki/W-shingling) with the classic [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) algorithm.
2. `syn.sorensen_dice-shingles`: The combination of character-based shingles with the classic [Sørensen–Dice coefficient](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) algorithm.
3. `syn.jw-shingles`: The combination of character-based shingles with the classic [Jaro–Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) algorithm, which is similar in nature to edit-distance-based measures.
4. `syn.cosine-word`: The combination of word-based shingles with the classic cosine similarity algorithm. Compared to `syn.cosine-with-shingles`, this algorithm produces fewer false positives for larger pieces of text.
5. `syn.simple`: A Semantax proprietary algorithm optimized for comparison speed and accuracy. It is based on the cosine similarity algorithm and combines both character- and word-based shingles.
6. `syn.weighted-word`: A Semantax proprietary algorithm optimized for comparison speed and accuracy. It is based on the classic [Jaro–Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) algorithm. Character- and word-based shingles are combined in a weighted capacity to increase the impact of word-based shingles.
7. `syn.sentence`: A Semantax proprietary algorithm derived from the classic cosine similarity algorithm. Its main feature is the inclusion of NLP (natural language processing) primitives, such as lemmatization/stemming and term normalization, for more accurate similarity comparisons. This algorithm is best suited to a single sentence, or a couple of short sentences, as input.
8. `syn.paragraph`: A Semantax proprietary algorithm that extends `syn.sentence` to compare a pair of input paragraphs (sets of sentences). In addition to the `syn.sentence` features, this algorithm also includes a weighted [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index) score of the overlapping sentences across the input pair.
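As a rough illustration of how the shingle-based measures above work, here is a minimal sketch (not the production algorithms) that builds character k-shingle sets and scores them with cosine and Sørensen–Dice set similarity; the function names and the choice of k=3 are illustrative assumptions:

```python
import math


def char_shingles(text: str, k: int = 3) -> set[str]:
    """All overlapping character k-shingles of the input text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def cosine_set_similarity(a: set[str], b: set[str]) -> float:
    """Cosine similarity over binary shingle sets: |A∩B| / sqrt(|A|·|B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def dice_similarity(a: set[str], b: set[str]) -> float:
    """Sørensen–Dice coefficient over shingle sets: 2·|A∩B| / (|A|+|B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))
```

Identical texts score 1.0, texts with no shingles in common score 0.0, and near-duplicates land in between, matching the 0-to-1 scale described above.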
## Semantic Similarity

The semantic similarity algorithms compare the input text pair based on the main concepts present in the text, regardless of the words used to express those concepts. Roughly speaking, this is similar to comparing the meaning of the two sentences independently of the words used. See [here](https://en.wikipedia.org/wiki/Semantic_similarity) for more details.

Our semantic similarity algorithms are built on modern deep-learning-based [word embeddings](https://en.wikipedia.org/wiki/Word_embedding) trained on an enterprise corpus of sample documents. The models are trained on single sentences and/or short paragraphs as input, and therefore work best for content in that size range. All of our semantic similarity algorithms support multilingual and cross-lingual scenarios, where the input text pair can be expressed in any combination of the supported languages (for example en-en, en-fr, en-es, fr-fr, fr-es, etc.).

The following semantic similarity algorithms are supported:

1. `sem.ssm`: The default semantic similarity algorithm, offering the best combination of speed and accuracy with an emphasis on English-to-English common-language input pairs.
2. `sem.ssm14`: This model is trained on data from the government, insurance, and banking industry verticals. It is optimized for speed but provides a good level of overall accuracy.
3. `sem.ssm20`: Similar to `sem.ssm14`, but built on a much larger input corpus.
4. `sem.ssm28`: Builds on the same approach as the previous two models but also includes basic support for higher-order semantic relationships.
5. `sem.ssm30`: Similar to `sem.ssm28`, with a better similarity score distribution.
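Conceptually, the embedding-based approach maps each sentence to a dense vector and compares the vectors; texts with the same meaning land close together regardless of language. The sketch below uses tiny hand-made vectors purely for illustration (real embeddings come from the trained models and are much higher-dimensional):

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two dense embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy 3-dimensional "embeddings" -- purely illustrative, not real model output.
emb_en = [0.9, 0.1, 0.2]     # e.g. an English sentence
emb_fr = [0.85, 0.15, 0.25]  # e.g. its French translation: close in vector space
score = cosine(emb_en, emb_fr)
```

This is why the cross-lingual pairs listed above (en-fr, en-es, fr-es, etc.) can be scored directly: the comparison happens in the shared embedding space, not on the surface tokens.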
Enum: "syn.weighted-word", "syn.simple", "syn.cosine-with-shingles", "syn.sorensen_dice-shingles", "syn.cosine-word", "syn.jw-shingles", "syn.paragraph", "syn.sentence", "sem.ssm", "sem.ssm14", "sem.ssm20", "sem.ssm28", "sem.ssm30"

## Response 200 fields (application/json):

- `status` (object) Response status
- `status.success` (boolean) Whether the request succeeded
- `status.code` (integer) Response status code
- `result` (object) Response body
- `result.text1` (string) The first input sentence
- `result.text2` (string) The second input sentence
- `result.score` (number) The syntactic similarity score (syntactic algorithms only)
- `result.prediction` (object) The semantic similarity result (semantic algorithms only)
- `result.prediction.match` (boolean) Indicates whether the input sentence pair is semantically similar
- `result.prediction.conf` (number) The confidence score of the semantic similarity prediction
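Putting the request and response shapes together, the sketch below builds a request body for POST /text/similarity and parses an illustrative 200 response. The example texts and the response values (`conf`, `match`) are placeholders, and transport details (host, `api_key` header) are omitted:

```python
import json

# Request body for POST /text/similarity (host and API-key handling omitted).
payload = {
    "text1": "The contract was signed yesterday.",
    "lang1": "en",
    "text2": "Le contrat a été signé hier.",
    "lang2": "fr",
    # A cross-lingual pair calls for a semantic algorithm such as "sem.ssm".
    "algo": "sem.ssm",
}
body = json.dumps(payload)

# Illustrative 200 response for a *semantic* algorithm:
# `result.prediction` is populated and `result.score` is absent.
semantic_response = json.loads("""{
  "status": {"success": true, "code": 200},
  "result": {
    "text1": "The contract was signed yesterday.",
    "text2": "Le contrat a été signé hier.",
    "prediction": {"match": true, "conf": 0.93}
  }
}""")

# For a *syntactic* algorithm the roles flip: `result.score` carries the
# 0-to-1 similarity score and `result.prediction` is absent.
```

Note that exactly one of `result.score` and `result.prediction` is present, depending on whether `algo` is a `syn.*` or `sem.*` value.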