Sperduti G., Moreo A., Sebastiani F.
Garbled-Word Embeddings, Garbled Words, Misspellings, Distributional Semantic Models
"Aoccdrnig to a reasrech at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny itmopnrat tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe". We investigate the extent to which this phenomenon applies to computers as well. Our hypothesis is that computers are able to learn distributed word representations that are resilient to character reshuffling, without incurring a significant loss in performance in tasks that use these representations. If our hypothesis is confirmed, this may form the basis for a new and more efficient way of encoding character-based representations of text in deep learning, and one that may prove especially robust to misspellings, or to corruption of text due to OCR. This paper discusses some fundamental psycho-linguistic aspects that lie at the basis of the phenomenon we investigate, and reports on a preliminary proof of concept of the above idea.
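The character reshuffling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `garble_word`, `garble_text`, and `shuffle_invariant_key` are hypothetical, tokenization is assumed to be plain whitespace splitting with no punctuation handling, and the "sorted inner letters" key is only one simple assumption about what a reshuffling-resilient word identity could look like, not the embedding method the paper studies.

```python
import random


def garble_word(word: str, rng: random.Random) -> str:
    """Shuffle the inner letters of a word, keeping the first and last in place."""
    if len(word) <= 3:
        return word  # at most one inner letter, so nothing to shuffle
    inner = list(word[1:-1])
    rng.shuffle(inner)
    return word[0] + "".join(inner) + word[-1]


def garble_text(text: str, seed: int = 0) -> str:
    """Garble every whitespace-separated token (assumption: no punctuation handling)."""
    rng = random.Random(seed)
    return " ".join(garble_word(token, rng) for token in text.split())


def shuffle_invariant_key(word: str) -> str:
    """Hypothetical canonical form: first letter, sorted inner letters, last letter.

    Any reshuffling of the inner letters maps to the same key, so the key could
    index a shared representation for a word and all of its garbled variants.
    """
    if len(word) <= 3:
        return word
    return word[0] + "".join(sorted(word[1:-1])) + word[-1]


if __name__ == "__main__":
    print(garble_text("the human mind does not read every letter by itself"))
    # A garbled form and its source word share the same key.
    print(shuffle_invariant_key("aoccdrnig") == shuffle_invariant_key("according"))
```

Note that distinct words with the same first letter, last letter, and multiset of inner letters (for example "form" and "from") collide under this key, which is one reason such a scheme might lose some performance relative to exact spellings.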
Source: IIR 2021 - 11th Italian Information Retrieval Workshop, Bari, Italy, 13-15/09/21
@inproceedings{oai:it.cnr:prodotti:457946,
  title = {Garbled-word embeddings for jumbled text},
  author = {Sperduti G. and Moreo A. and Sebastiani F.},
  booktitle = {IIR 2021 - 11th Italian Information Retrieval Workshop, Bari, Italy, 13-15/09/21},
  year = {2021}
}