Benchmark and compare LLMs on Hebrew reasoning, comprehension, sentiment, translation, and Israeli cultural knowledge. Wraps the HuggingFace Open Hebrew LLM Leaderboard tasks (HeQ, HebrewSentiment, Hebrew Winograd, translation) plus DictaLM 3.0 benchmark tasks (Summarization, Nikud, Israeli Trivia) into a reproducible evaluation harness. Runs evals against Claude, GPT, Gemini, AI21 Jamba, DictaLM, Llama, and local HuggingFace models. Produces comparison scorecards in JSON and markdown. Use when choosing an LLM for a Hebrew product, answering procurement questions about Hebrew performance, validating a fine-tuned Hebrew model, or tracking Hebrew regressions after a model upgrade. Do NOT use for Arabic NLP, ASR benchmarking, or general English benchmarks.
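A JSON comparison scorecard might look roughly like the following. This is an illustrative sketch only: the field names, suite name, and scores are assumptions for the example, not the suite's documented output schema.

```json
{
  "suite": "hebrew-core",
  "samples": 1000,
  "runs": 3,
  "models": {
    "claude-sonnet": {"HeQ": 81.4, "HebrewSentiment": 88.0, "Winograd": 72.5},
    "dictalm-3.0-24b": {"HeQ": 78.2, "HebrewSentiment": 85.1, "Winograd": 70.9}
  }
}
```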
Trust score 85/100 (Trusted) · 7+ installs · 3 GitHub contributors · MIT license
Israeli product teams pick LLMs blind. There is no standardized Hebrew benchmark that a PM can run in an afternoon to compare Claude against GPT against DictaLM against AI21 Jamba on their actual use case. The HuggingFace Open Hebrew LLM Leaderboard is built for base models and few-shot prompts, not for API-hosted chat models. DictaLM publishes benchmark results but only for its own suite. Teams end up guessing, testing informally, or trusting marketing claims.
npx skills-il add skills-il/developer-tools --skill hebrew-llm-eval-suite -a claude-code

We are building a Hebrew news summarization feature and need to pick between Claude Sonnet, GPT-5, and DictaLM-3.0-24B. Run the relevant benchmarks (HeQ, DictaLM Summarization, Winograd) with 1000 samples and 3 runs, and recommend a model with reasoning.
Anthropic released a new version of claude-sonnet. Run the hebrew-core suite on the new and previous versions and tell me if there was any regression over 2 points on any benchmark.
I am building a Hebrew chatbot and deciding between Claude Haiku and AI21 Jamba 1.5 Mini. Compare them on HeQ, HebrewSentiment, and HebNLI with 500 samples and 3 runs, and provide a scorecard with a recommendation.
We have a data residency constraint requiring a local model. Run Hebrew benchmarks on DictaLM-3.0-Nemotron-12B-Instruct and compare to Claude Sonnet quality. How much quality am I giving up?
Build full-stack apps on the Base44 platform using the JavaScript SDK. Covers CRUD operations, authentication, AI agents, backend functions, integrations, and real-time subscriptions.
Build Telegram bots with grammY, Telegraf, or python-telegram-bot. Covers Bot API v9.5 webhooks vs polling, inline keyboards, commands, middleware patterns, payments API, Mini Apps, and Hebrew message handling with RTL support. Use when building a Telegram bot, setting up webhooks, handling Hebrew messages in a bot, or integrating Telegram payments. Do NOT use for WhatsApp bots (use israeli-whatsapp-business), voice bots (use hebrew-voice-bot-builder), or general chatbot design patterns (use hebrew-chatbot-builder).
Navigate the fragmented landscape of Hebrew and Yiddish ML datasets and models. Covers ivrit.ai (22K+ hours of Hebrew audio, whisper-large-v3 ASR variants, Yiddish models), Dicta (DictaLM 3.0 LLM family, DictaBERT variants, HeQ reading comprehension), the Israeli National NLP Program / NNLP-IL (HebrewSentiment, HebNLI), AlephBERT, and Knesset Plenums. Helps researchers and ML engineers pick the right dataset for a task by use case, license (commercial vs research), Hebrew register coverage, and model-dataset pairing. Use when choosing training data for a Hebrew NLP or ASR project, verifying license compatibility for a commercial product, finding a baseline model for a Hebrew downstream task, or exploring Yiddish ML resources. Do NOT use for Arabic NLP, general HuggingFace dataset discovery, or Hebrew OCR dataset selection (use hebrew-ocr-forms).