LLamaSharp
In addition to the previous article, where I was trying to fine-tune a model, here is one more note.
We have figured out how to fine-tune models and, more importantly, walked through the pipeline of converting a model between different formats.
So I was wondering whether it would be possible to run the model from within C#, without relying on a third-party server (i.e. no Ollama, no MLX, etc.).
And it is possible: there is LLamaSharp, which can run GGUF models.
And we did prepare a GGUF model, so technically we are ready to go. Indeed it is quite simple and literally works out of the box:
dotnet new console
dotnet add package LLamaSharp
dotnet add package LLamaSharp.Backend.Cpu

and our Program.cs:
using LLama;
using LLama.Common;
using LLama.Native;
using LLama.Sampling;
using Microsoft.Extensions.Logging.Abstractions;
// dotnet add package LLamaSharp
// dotnet add package LLamaSharp.Backend.Cpu
// Trick to disable stderr logging from LLamaSharp
NativeLibraryConfig.All.WithLogCallback((level, message) => { });
var modelPath = "/Users/mac/Desktop/finetun/model.gguf";
// For Apple Silicon (M1/M2/M3/M4):
// - GpuLayerCount = -1 means offload ALL layers to GPU (Metal), 0 - means CPU only
// - ContextSize = context window (2048 is safe, model supports up to 32768)
// - FlashAttention = true for faster inference on Metal
var parameters = new ModelParams(modelPath)
{
    ContextSize = 2048,    // Can increase up to 32768 if needed
    GpuLayerCount = -1,    // -1 = all layers to GPU (Metal on macOS)
    FlashAttention = true, // Faster attention on Metal
};
using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters, NullLogger.Instance);
var executor = new InteractiveExecutor(context, NullLogger.Instance);
var input = "Write a function that calculates the average of two numbers.";
var inferenceParams = new InferenceParams
{
    MaxTokens = 256,
    SamplingPipeline = new DefaultSamplingPipeline
    {
        Temperature = 0.7f,
    },
    AntiPrompts = ["<|im_end|>", "<|endoftext|>"] // Stop tokens for Qwen
};
await foreach (var token in executor.InferAsync(input, inferenceParams))
{
    Console.Write(token);
}
Console.WriteLine();

so we can dotnet run and it will print the model output.
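One note on the prompt: the AntiPrompts above are the ChatML stop tokens used by Qwen models, but the input itself is passed as raw text. If the fine-tuned model is an instruction-tuned Qwen (which those stop tokens suggest), it tends to answer more reliably when the prompt is wrapped in the same ChatML template it was trained on. Here is a minimal sketch of that, reusing the executor and inferenceParams from above; the exact template should match whatever was used during fine-tuning:

// Minimal sketch: wrap the user message in the ChatML template
// (system / user / assistant blocks) before passing it to InferAsync.
// Assumes the fine-tuned model expects the standard Qwen/ChatML format.
var systemPrompt = "You are a helpful coding assistant.";
var userMessage = "Write a function that calculates the average of two numbers.";
var prompt =
    $"<|im_start|>system\n{systemPrompt}<|im_end|>\n" +
    $"<|im_start|>user\n{userMessage}<|im_end|>\n" +
    $"<|im_start|>assistant\n";

await foreach (var token in executor.InferAsync(prompt, inferenceParams))
{
    Console.Write(token);
}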
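And if we want a multi-turn chat instead of a single completion, LLamaSharp also provides a higher-level ChatSession that keeps the conversation history for us. A rough sketch, again reusing executor and inferenceParams from above; the ChatSession/ChatHistory signatures may differ slightly between LLamaSharp versions:

// Rough sketch of a multi-turn chat loop with ChatSession.
// Treat this as a starting point; check the LLamaSharp docs for your version.
var history = new ChatHistory();
history.AddMessage(AuthorRole.System, "You are a helpful coding assistant.");

var session = new ChatSession(executor, history);

Console.Write("User: ");
var userInput = Console.ReadLine() ?? "";
while (userInput != "exit")
{
    await foreach (var token in session.ChatAsync(
        new ChatHistory.Message(AuthorRole.User, userInput),
        inferenceParams))
    {
        Console.Write(token);
    }
    Console.Write("\nUser: ");
    userInput = Console.ReadLine() ?? "";
}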