How to use tau/tavbert-ar with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("fill-mask", model="tau/tavbert-ar")

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("tau/tavbert-ar")
model = AutoModelForMaskedLM.from_pretrained("tau/tavbert-ar")

An Arabic BERT-style masked language model operating over characters, pre-trained by masking spans of characters, similarly to SpanBERT (Joshi et al., 2020).
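To make the idea of masking spans of characters concrete, here is a minimal, model-free sketch. It is not the authors' pre-training code: the function name `mask_char_spans`, the 15% masking budget, the geometric span-length distribution (`p=0.2`), and the 10-character cap are all assumptions borrowed from the general SpanBERT recipe for illustration only.

```python
import random

MASK = "[MASK]"

def mask_char_spans(text, mask_ratio=0.15, p=0.2, max_span=10, seed=None):
    """Replace random, non-overlapping character spans with MASK.

    Hypothetical helper: the hyperparameters are illustrative assumptions,
    not the values used to pre-train tau/tavbert-ar.
    Returns (tokens, masked_positions).
    """
    chars = list(text)
    if not chars:
        return [], []
    rng = random.Random(seed)
    # Number of characters to mask in total (at least one).
    budget = max(1, round(len(chars) * mask_ratio))
    masked = set()
    attempts = 0
    while len(masked) < budget and attempts < 1000:
        attempts += 1
        # Geometric span length: keep extending with probability 1 - p.
        span_len = 1
        while span_len < max_span and rng.random() > p:
            span_len += 1
        # Never exceed the remaining budget or the text length.
        span_len = min(span_len, budget - len(masked), len(chars))
        start = rng.randrange(0, len(chars) - span_len + 1)
        span = range(start, start + span_len)
        if any(i in masked for i in span):
            continue  # overlaps an earlier span, so resample
        masked.update(span)
    tokens = [MASK if i in masked else c for i, c in enumerate(chars)]
    return tokens, sorted(masked)

if __name__ == "__main__":
    tokens, positions = mask_char_spans("مرحبا بالعالم العربي", seed=0)
    print("".join(tokens))
    print(positions)
```

Each character is treated as one token here, which mirrors the character-level setup described above; the real tokenizer may handle whitespace or special characters differently.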
import numpy as np
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("tau/tavbert-ar")
tokenizer = AutoTokenizer.from_pretrained("tau/tavbert-ar")

def mask_sentence(sent, span_len=5):
    # Requires len(sent) > span_len; pick a random start for the masked span.
    start_pos = np.random.randint(0, len(sent) - span_len)
    masked_sent = sent[:start_pos] + '[MASK]' * span_len + sent[start_pos + span_len:]
    print("Masked sentence:", masked_sent)
    # Drop the first and last positions (assumed to be the special tokens added
    # by the tokenizer) so that logits line up with character positions.
    with torch.no_grad():
        output = model(**tokenizer(masked_sent, return_tensors='pt'))['logits'][0][1:-1]
    # Most likely character id at each masked position.
    preds = [int(x) for x in torch.argmax(output, dim=1)[start_pos:start_pos + span_len]]
    pred_sent = sent[:start_pos] + ''.join(tokenizer.convert_ids_to_tokens(preds)) + sent[start_pos + span_len:]
    print("Model's prediction:", pred_sent)

Training data: the Arabic section of OSCAR (Ortiz, 2019), comprising 32 GB of text (67 million sentences).